Commit Graph

135 Commits (a31725d9891ba9bbcda796c7e79d325ccb4678b7)

Author SHA1 Message Date
Paul Mackerras 0fb8967290 core: Implement the TAR register and the bctar instruction
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 36297d35f8 decode1: Fix formatting
Commit d5c8c33bae ("decode1: Reformat to 4-space indentation") resulted
in some rows of major_decode_rom_array being misaligned.  This fixes it.
No code change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 893d2bc6a2 core: Don't generate logic for log data when LOG_LENGTH = 0
This adds "if LOG_LENGTH > 0 generate" to the places in the core
where log output data is latched, so that when LOG_LENGTH = 0 we
don't create the logic to collect the data which won't be stored.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 74062195ca execute1: Do forwarding of the CR result to the next instruction
This adds a path to allow the CR result of one instruction to be
forwarded to the next instruction, so that sequences such as
cmp; bc can avoid having a 1-cycle bubble.

Forwarding is not available for dot-form (Rc=1) instructions,
since the CR result for them is calculated in writeback.  The
decode.output_cr field is used to identify those instructions
that compute the CR result in execute1.

For some reason, the multiply instructions incorrectly had
output_cr = 1 in the decode tables.  This fixes that.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras c2da82764f core: Implement CFAR register
This implements the CFAR SPR as a slow SPR stored in 'ctrl'.  Taken
branches and rfid update it to the address of the branch or rfid
instruction.

To simplify the logic, this makes rfid use the branch logic to
generate its redirect (requiring SRR0 to come in to execute1 on
the B input and SRR1 on the A input), and the masking of the bottom
2 bits of NIA is moved to fetch1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 6687aae4d6 core: Implement a simple branch predictor
This implements a simple branch predictor in the decode1 stage.  If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target.  The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong.  Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.

Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.

For those branches that update LR but don't write any other result
(i.e. that don't decrementer CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 09ae2ce58d decode1: Improve timing for slow SPR decode path
This makes the logic that works out decode.unit and decode.sgl_pipe
for mtspr/mfspr to/from slow SPRs detect the fact that the
instruction is mtspr/mfspr based on a match with the instruction
word rather than looking at v.decode.insn_type.  This improves timing
substantially, as the ROM lookup to get v.decode is relatively slow.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras b3799c432b decode1: Add a stash buffer to the output
This means that the busy signal from execute1 (which can be driven
combinatorially from mmu or dcache) now stops at decode1 and doesn't
go on to icache or fetch1.  This helps with timing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 65a36cc0fc decode: Work out ispr1/ispr2 in parallel with decode ROM lookup
This makes the logic that calculates which SPRs are being accessed
work in parallel with the instruction decode ROM lookup instead of
being dependent on the opcode found in the decode ROM.  The reason
for doing that is that the path from icache through the decode ROM
to the ispr1/ispr2 fields has become a critical path.

Thus we are now using only a very partial decode of the instruction
word in the logic for isp1/isp2, and we therefore can no longer rely
on them being zero in all cases where no SPR is being accessed.
Instead, decode2 now ignores ispr1/ispr2 in all cases except when the
relevant decode.input_reg_a/b or decode.output_reg_a is set to SPR.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras b5a7dbb78d core: Remove fetch2 pipeline stage
The fetch2 stage existed primarily to provide a stash buffer for the
output of icache when a stall occurred.  However, we can get the same
effect -- of having the input to decode1 stay unchanged on a stall
cycle -- by using the read enable of the BRAMs in icache, and by
adding logic to keep the outputs unchanged on a clock cycle when
stall_in = 1.  This reduces branch and interrupt latency by one
cycle.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 49a4d9f67a Add core logging
This logs 256 bits of data per cycle to a ring buffer in BRAM.  The
data collected can be read out through 2 new SPRs or through the
debug interface.

The new SPRs are LOG_ADDR (724) and LOG_DATA (725).  LOG_ADDR contains
the buffer write pointer in the upper 32 bits (in units of entries,
i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of
doublewords, i.e. 8 bytes).  Reading LOG_DATA gives the doubleword
from the buffer at the read pointer and increments the read pointer.
Setting bit 31 of LOG_ADDR inhibits the trace log system from writing
to the log buffer, so the contents are stable and can be read.

There are two new debug addresses which function similarly to the
LOG_ADDR and LOG_DATA SPRs.  The log is frozen while either or both of
the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set.

The buffer defaults to 2048 entries, i.e. 64kB.  The size is set by
the LOG_LENGTH generic on the core_debug module.  Software can
determine the length of the buffer because the length is ORed into the
buffer write pointer in the upper 32 bits of LOG_ADDR.  Hence the
length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)).

There is a program to format the log entries in a somewhat readable
fashion in scripts/fmt_log/fmt_log.c.  The log_entry struct in that
file describes the layout of the bits in the log entries.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras d5c8c33bae decode1: Reformat to 4-space indentation
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras af909840e6 decode1: Make ld/std and lwa not be single-issue
These were missed earlier when the single-issue flag was turned off on
the other loads and stores by commit 1a244d3470 ("Remove single-issue
constraint for most loads and stores").

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 4a4a98d4b9
core: Do addpcis using the main adder (#189)
By adding logic to decode2 to be able to send the instruction address
down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add
4 to the immediate value (easy since the bottom 16 bits were zero),
we can do addpcis using the main adder.  This reduces the width of the
result mux and frees up one value in insn_type_t, since we can now use
OP_ADD for addpcis.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Shawn Anastasio e606772aeb Implement the addpcis instruction
This commit adds support for the addpcis instruction from ISA 3.0.

A new input_reg_b_t type, CONST_DX_HI, was added to support the
shifted immediate value used in DX-Form instructions.

Signed-off-by: Shawn Anastasio <shawn@anastas.io>
5 years ago
Paul Mackerras 2843c99a71 MMU: Implement reading of the process table
This adds the PID register and repurposes SPR 720 as the PRTBL
register, which points to the base of the process table.  There
doesn't seem to be any point to implementing the partition table given
that we don't have hypervisor mode.

The MMU caches entry 0 of the process table internally (in pgtbl3)
plus the entry indexed by the value in the PID register (pgtbl0).
Both caches are invalidated by a tlbie[l] with RIC=2 or by a move to
PRTBL.  The pgtbl0 cache is invalidated by a move to PID.  The dTLB
and iTLB are cleared by a move to either PRTBL or PID.

Which of the two page table root pointers is used (pgtbl0 or pgtbl3)
depends on the MSB of the address being translated.  Since the segment
checking ensures that address(63) = address(62), this is sufficient to
map quadrants 0 and 3.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras a658766fcf Implement slbia as a dTLB/iTLB flush
Slbia (with IH=7) is used in the Linux kernel to flush the ERATs
(our iTLB/dTLB), so make it do that.

This moves the logic to work out whether to flush a single entry
or the whole TLB from dcache and icache into mmu.  We now invalidate
all dTLB and iTLB entries when the AP (actual pagesize) field of
RB is non-zero on a tlbie[l], as well as when IS is non-zero.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras f54a65f8cf Decode tlbiel as tlbie
The Linux kernel contains tlbiel instructions, which we can treat
identically to tlbie.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 3d4712ad43 Add TLB to icache
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches.  This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.

The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB.  The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address.  TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.

If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST.  That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.

One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3).  If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED.  Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.

Tlbie operations get sent from mmu to icache over a new connection.

Unfortunately the privileged instruction tests are broken for now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 4e6fc6811a MMU: Implement radix page table machinery
This adds the necessary machinery to the MMU for it to do radix page
table walks.  The core elements are a shifter that can shift the
address right by between 0 and 47 bits, a mask generator that can
generate a mask of between 5 and 16 bits, a final mask generator,
and new states in the state machine.

(The final mask generator is used for transferring bits of the
original address into the resulting TLB entry when the leaf PTE
corresponds to a page size larger than 4kB.)

The hardware does not implement a partition table or a process table.
Software is expected to load the appropriate process table entry
into a new SPR called PGTBL0, SPR 720.  The contents should be
formatted as described in Book III section 5.7.6.2 of the Power ISA
v3.0B.  PGTBL0 is set to 0 on hard reset.  At present, the top two bits
of the address (the quadrant) are ignored.

There is currently no caching of any step in the translation process
or of the final result, other than the entry created in the dTLB.
That entry is a 4k page entry even if the leaf PTE found in the walk
corresponds to a larger page size.

This implementation can handle almost any page table layout and any
page size.  The RTS field (in PGTBL0) can have any value between 0
and 31, corresponding to a total address space size between 2^31
and 2^62 bytes.  The RPDS field of PGTBL0 can be any value between
5 and 16, except that a value of 0 is taken to disable radix page
table walking (for use when one is using software loading of TLB
entries).  The NLS field of the page directory entries can have any
value between 5 and 16.  The minimum page size is 4kB, meaning that
the sum of RPDS and the NLS values of the PDEs found on the path to
a leaf PTE must be less than or equal to RTS + 31 - 12.

The PGTBL0 SPR is in the mmu module; thus this adds a path for
loadstore1 to read and write SPRs in mmu.  This adds code in dcache
to service doubleword read requests from the MMU, as well as requests
to write dTLB entries.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 750b3a8e28 dcache: Implement data TLB
This adds a TLB to dcache, providing the ability to translate
addresses for loads and stores.  No protection mechanism has been
implemented yet.  The MSR_DR bit controls whether addresses are
translated through the TLB.

The TLB is a fixed-pagesize, set-associative cache.  Currently
the page size is 4kB and the TLB is 2-way set associative with 64
entries per set.

This implements the tlbie instruction.  RB bits 10 and 11 control
whether the whole TLB is invalidated (if either bit is 1) or just
a single entry corresponding to the effective page number in bits
12-63 of RB.

As an extension until we get a hardware page table walk, a tlbie
instruction with RB bits 9-11 set to 001 will load an entry into
the TLB.  The TLB entry value is in RS in the format of a radix PTE.

Currently there is no proper handling of TLB misses.  The load or
store will not be performed but no interrupt is generated.

In order to make timing at 100MHz on the Arty A7-100, we compare
the real address from each way of the TLB with the tag from each way
of the cache in parallel (requiring # TLB ways * # cache ways
comparators).  Then the result is selected based on which way hit in
the TLB.  That avoids a timing path going through the TLB EA
comparators, the multiplexer that selects the RA, and the cache tag
comparators.

The hack where addresses of the form 0xc------- are marked as
cache-inhibited is kept for now but restricted to real-mode accesses.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 635e316f9b Pass mtspr/mfspr to MMU-related SPRs down to loadstore1
This arranges for some mfspr and mtspr to get sent to loadstore1
instead of being handled in execute1.  In particular, DAR and DSISR
are handled this way.  They are therefore "slow" SPRs.

While we're at it, fix the spelling of HEIR and remove mention of
DAR and DSISR from the comments in execute1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 5d282a950c Improve architectural compliance of mfspr and mtspr
Mfspr from an unimplemented SPR should be a no-op in privileged state,
so in this case we need to write back whatever was previously in the
destination register.  For problem state, both mtspr and mfspr to
unimplemented SPRs should cause a program interrupt.

There are special cases in the architecture for SPRs 0, 4 5 and 6
which we still don't implement.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 517a91ce5e decode1: Implement eieio as a nop
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 8a0a907e2f Implement the extswsli instruction
This mainly required the addition of an entry to the opcode 31 decode
table and a 32-bit sign-extender in the rotator.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 041d6bef60 dcache: Implement the dcbz instruction
This adds logic to dcache and loadstore1 to implement dcbz.  For now
it zeroes a single cache line (by default 64 bytes), not 128 bytes
like IBM Power processors do.

The dcbz operation is performed much like a load miss, except that
we are writing zeroes to memory instead of reading.  As each ack
comes back, we write zeroes to the BRAM instead of data from memory.
In this way we zero the line in memory and also zero the line of
cache memory, establishing the line in the cache if it wasn't already
resident.  If it was already resident then we overwrite the existing
line in the cache.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 56577827d4 Decode attn in the major opcode decode table
This decodes attn using entry 0 of the major_decode_rom_array table
instead of a special case in the decode1_1 process.  This means that
only the major opcode (the top 6 bits) is checked at decode time.
To make sure the instruction is attn not some random illegal pattern,
we now check bits 1-10 of the instruction at execute time and
generate an illegal instruction interrupt if those bits are not
0100000000.

This reduces LUT consumption by 42 LUTs on the Arty A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 6f7ef8b1b9 Decode sc in the major opcode decode table
This decodes sc using entry 17 of the major_decode_rom_array table
instead of a special case in the decode1_1 process.  This means that
only the major opcode (the top 6 bits) is checked at decode time.
To make sure that the instruction is sc not scv, we now check bit
1 of the instruction at execute time and generate an illegal
instruction interrupt if it is 0 (indicating scv).  The level field
of the sc instruction is now ignored.

This reduces LUT consumption by 31 LUTs on the Arty A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 278ac5e0eb Remove sim_config instruction
It's not used any more, and it's not in the ISA.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 381149b2cc Consolidate trap variants under a single OP_TRAP
This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP,
distinguishing the cases by the input_reg_b and is_32bit fields of
the decode ROM.  This adds the twi and td cases to the decode tables.

For now we make all of the trap instructions unconditionally generate
a trap-type program interrupt if the TO field of the instruction is
all ones, and do nothing otherwise.

This reduces the number of values in insn_type_t from 65 to 62,
meaning that an insn_type_t can now be encoded in 6 bits rather
than 7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras fe077a116a Rename OP_MCRF to OP_CROP and trim insn_type_t
OP_MCRF covers the CR logical ops as well as mcrf since commit
c05441bf47 ("Implement CRNOR and friends"), so this renames
OP_MCRF to OP_CROP.  The OP_* values for the individual CR logical
ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t.

No functional change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Michael Neuling 5ef5604b65 Add sc, illegal and decrementer exceptions and some supervisor state
This adds the following exceptions:
 - 0x700 program check (for illegal instructions)
 - 0x900 decrementer
 - 0xc00 system call

This also adds some supervisor state:
 - decremeter
 - msr
(SPRG0/1 and SRR0/1 already exist as fast SPRs)

It also adds some supporting instructions:
 - rfid
 - mtmsrd
 - mfmsr
 - sc

MSR state is added but only EE is used in this patch set. Other bits
are read/written but are not used at all.

This adds a 2 stage state machine to execute1.vhdl. This state machine
allows fast SPRS SRR0/1 to be written in different cycles. This state
machine can be extended later to add DAR and DSISR SPR writing for
more complex exceptions like page faults.

Signed-off-by: Michael Neuling <mikey@neuling.org>
5 years ago
Michael Neuling 594a19de37 Plumb attn instruction through to execute1
Currently we decode attn but we just mark it as an illegal.

This adds a separate case statement in execute 1 for attn to terminate
the core. Illegals also do this currently but we are soon implementing
a 0x700 execption for them.

Signed-off-by: Michael Neuling <mikey@neuling.org>
5 years ago
Paul Mackerras 81369187c0 loadstore1: Add support for cache-inhibited load and store instructions
This adds support for lbzcix, lhzcix, lwzcix, ldcix, stbcix, sthcix,
stwcix and stdcix.  The temporary hack where accesses to addresses of
the form 0xc??????? are made non-cacheable is left in for now to avoid
making existing programs non-functional.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 5d85ede97d dcache: Implement load-reserve and store-conditional instructions
This involves plumbing the (existing) 'reserve' and 'rc' bits in
the decode tables down to dcache, and 'rc' and 'store_done' bits
from dcache to writeback.

It turns out that we had 'RC' set in the 'rc' column for several
ordinary stores and for the attn instruction.  This corrects them
to 'NONE', and sets the 'rc' column to 'ONE' for the conditional
stores.

In writeback we now have logic to set CR0 when the input from dcache
has rc = 1.

In dcache we have the reservation itself, which has a valid bit
and the address down to cache line granularity.  We don't currently
store the reservation length.  For a store conditional which fails,
we set a 'cancel_store' signal which inhibits the write to the
cache and prevents the state machine from starting a bus cycle or
going to the STORE_WAIT_ACK state.  Instead we set r1.stcx_fail
which causes the instruction to complete in the next cycle with
rc=1 and store_done=0.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 1a244d3470 Remove single-issue constraint for most loads and stores
This removes the constraint that loads and stores are single-issue,
at the expense of a stall of at least 2 cycles for every load and
store.

To do this, we plumb the existing stall signal that was generated
in dcache to core, where it gets ORed with the stall signal from
execute1.  Execute1 generates a stall signal for the first two
cycles of each load and store, and dcache generates the stall
signal in the 3rd and subsequent cycles if it needs to.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 2661b9b985 decode1: Mark subfic as pipelined
This seems just to have been missed in commit f291efa266 ("decode1:
Mark ALU ops using carry as pipelined").

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 0c714f1be6 execute: Move popcnt and prty instructions into the logical unit
This implements logic in the logical entity to calculate the results
of the popcnt* and prty* instructions.  We now have one insn_type_t
value for the 3 popcnt variants and one for the two prty variants,
using the length field of the decode_rom_t to distinguish between
them.  The implementations in logical.vhdl using recursive
algorithms rather than the simple functions in ppc_fx_insns.vhdl.

This gives a saving of about 140 slice LUTs on the A7-100 and
improves timing slightly.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d2ca625b3b execute: Do comparisons using the main adder
This handles OP_CMP like a subtraction; the main adder computes
~RA + RB + 1, and the condition codes are computed from the results.
A direct comparison of the two input operands is used to calculate the
EQ bit of the condition result.  The LT and GT bits are computed from
the MSB of the subtraction result, the carry out from the subtraction,
and the MSBs of the operands.  For a 32-bit comparison, the 32-bit
carry and bit 31 of the result and input operands are used; for a
64-bit comparison, the 64-bit carry and bit 63 of the operands and
result are used.

It turns out to be more convenient to use the 'signed' field of
the decode table to distinguish signed from unsigned comparisons,
rather than the insn_type.  Therefore this uses OP_CMP for both
cmp and cmpl, which also has the benefit of reducing the number
of values in insn_type_t.

Doing this saves over 200 slice LUTs on the Arty A7-100 and improves
timing slightly as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 39d18d2738 Make divider hang off the side of execute1
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback.  Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider.  Divide and modulus instructions are no longer marked as
single-issue.

The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1.  We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 2167186b5f Make multiplier hang off the side of execute1
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback.  Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.

This all means that we no longer need to mark the multiply
instructions as single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Tom Vijlbrief c05441bf47 Implement CRNOR and friends
Signed-off-by: Tom Vijlbrief <tvijlbrief@gmail.com>
5 years ago
Benjamin Herrenschmidt e4f475e17f sprs: Store common SPRs in register file
This stores the most common SPRs in the register file.

This includes CTR and LR and a not yet final list of others.

The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.

On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.

Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.

This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.

[paulus@ozlabs.org - fix direction of decode2.stall_in]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d04887fdcd decode1: Add OE=1 forms of add/sub, mul and div instructions
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt f291efa266 decode1: Mark ALU ops using carry as pipelined
There is no reason not to that I can think of

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 742b21480e insn: Simplistic implementation of icbi
We don't yet have a proper snooper for the icache, so for now make
icbi just flush the whole thing

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt a0d95e791e insn: Implement isync instruction
The instruction works by redirecting fetch to nia+4 (hopefully using
the same adder used to generate LR) and doing a backflush. Along with
being single issue, this should guarantee that the next instruction
only gets fetched after the pipe's been emptied.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard 4433118c91
Merge pull request #105 from paulusmack/writeback
Writeback
5 years ago
Paul Mackerras 9646fe28b0 Do sign-extension instructions in writeback instead of execute1
This makes the exts[bhw] instructions do the sign extension in the
writeback stage using the sign-extension logic there instead of
having unique sign extension logic in execute1.  This requires
passing the data length and sign extend flag from decode2 down
through execute1 and execute2 and into writeback.  As a side bonus
we reduce the number of values in insn_type_t by two.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 86c53aa3f7 Implement neg using OP_ADD
We have all the machinery in place to implement the neg instruction
as OP_ADD.  Doing that means we can ditch OP_NEG, and saves about
66 slice LUTs on the A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard 813f834012 Add CR hazard detection
To keep things simple we treat the CR as a single entity.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard bb65d0b899 Remove issue restrictions on a number of instructions
Anything that isn't a load or store and anything that doesn't read the
CR can go as soon as its inputs are ready.

While we could also allow SPR read/write and carry read/write, we plan
to change them to be read in decode2 and written in writeback soon and
they will need separate hazard detection to be added.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 10a990bba8 mod* doesn't have an RC form
The RC bit should be ignored for mod* instructions.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard c41da84226 decode: Handle icbi
We will need a proper handler for icbi, but in the meantime treat it
as a nop.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras 24a4a796ce execute: Consolidate count-leading/trailing-zeroes implementations
This adds combinatorial logic that does 32-bit and 64-bit count
leading and trailing zeroes in one unit, and consolidates the
four instructions under a single OP_CNTZ opcode.

This saves 84 slice LUTs on the Arty A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard b8fb721b81 Consolidate logical instructions
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras f7c393ba7e Add a rotate/mask/shift unit and use it in execute1
This adds a new entity 'rotator' which contains combinatorial logic
for rotating and masking 64-bit values.  It implements the operations
of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl,
rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions.
It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at
each stage, two mask generators, output logic and control logic.

The insn_type_t values used for these instructions have been reduced
to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask
instructions (clear both left and right, clear left, clear right
variants), OP_SHL for left shifts, and OP_SHR for right shifts.
The control signals for the rotator are derived from the opcode
and from the is_32bit and is_signed fields of the decode_rom_t.

The rotator is instantiated as an entity in execute1 so that we can
be sure we only have one of it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 90b6e27380 Generalize the mul_32bit and mul_signed fields of decode_rom_t
This changes the names of the mul_32bit and mul_signed fields of
decode_rom_t to is_32bit and is_signed, so they can be used with
other types of operations besides multiplies.

This plumbs the is_32bit and is_signed flags down into execute1,
though they are not used at this point.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 7fe84220a5 decode: Avoid multiplexing from instruction reg fields to regfile address ports
This aims to simplify the logic between the instruction image and
the register file read address ports and reduce the size of the decode
tables.  With this patch, the input_reg_a column of the decode tables
can only select RA or zeroes, the input_reg_b column can only select
RB or a constant (0, -1, or an immediate value from the instruction),
and the input_reg_c columns can only select RS or zeroes.

That means that the rotate/shift/logical ops now have their first
input coming in via the input_reg_c column.  That means we need to
add a read_data3 field to the Decode2ToExecuteType record, but that
will go away again when we split out the rotate/mask/logical ops to
their own unit.

As a related but not tightly connected change, this patch also sets
the read1_enable signal to the register file be 0 when RA=0 and the
input_reg_a for the instruction is RA_OR_ZERO (previously it was 1).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 96b402a4bf Consolidate add/subtract instructions into a single op
All of the PPC add and subtract instructions, including carrying
and extended versions, do much the same arithmetic operation:

	result = (I xor A) + B + C

where A is the value from RA, I provides a logical inversion of A
(i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1,
and C is 0, 1 or the carry bit from XER (CA).

To consolidate all the add/subtract instructions into a single
OP_ADD, we add a column to decode_rom_t to indicate when A should
be inverted, and change the input_carry field to a 3-state selector
to select C in the equation above.

This also adds a new "CONST_M1" value for input_reg_b_t to indicate
that B is a constant -1.  This allows us to implement addme and
subfme.

The addex instruction appears not to exist, so the comments referring
to it are removed.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras b0f302ecf4 decode: Make all update-form indexed loads and stores use RA_OR_ZERO
Experimentation on POWER9 indicates that the invalid form of lbzux
with RA=0 uses just RB as the address, not R0 + RB.  Extrapolating
this to all update-form loads and stores with RA=0, change all the
update-form loads and stores to use RA_OR_ZERO rather than RA.

This then means that all decode ROM entries with insn_type = LDST
have input_reg_a = RA_OR_ZERO.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 58b06eb5f3 decode: Remove const fields from decode_rom_t
The const* fields of decode_rom_t drove multiplexers in decode2 that
picked out various instruction fields and put them into the const*
fields of the Decode2ToExecute1Type record, from where they were
used in execute1.  However, the code in execute1 can just as easily
use the appropriate fields of the original instruction word, since
that is now available in execute1.  This therefore changes the
code to do that, resulting in smaller decode tables.

Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 143d0ae9e4 decode: Fix larx/stcx instructions to use RA_OR_ZERO not RA
The l?arx and st?cx. instructions are defined to use the normal indexed
mode address calculations, i.e. (RA|0) + RB.  Fix their entries in the
decode table to say RA_OR_ZERO rather than RA.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras bbae2d1eda decode: Index minor op table with insn bits for opcode 31
This changes decode_op_31_array from being indexed by a ppc_insn_t
(which is derived from the instruction word by a whole series of
if/elsif statements) to being indexed directly by bits 10...1 of
the instruction word.  With this we no longer need ppc_insn.

This then means that the decode1 stage doesn't distinguish between
mfcr and mfocrf, or between mtcrf and mtocrf, since those are
distinguished by the value in bit 20 of the instruction.  To
accommodate that, execute1 changes so that the one op value (OP_MFCR)
does either the mfcr or the mfocrf behaviour depending on bit 20
of the instruction word; and similarly for mtcrf/mtocrf.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 21d3f8a5ed decode: Index minor op table with insn bits for opcode 30
This comprises the 64-bit rotate and mask instructions.  In order to
reduce the table index to 3 bits, we combine rldcl and rdlcr into a
single op (OP_RLDCX), and choose the right mask at execute time based
on bit 1 of the instruction word.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 00e9f801f6 decode: Index minor op table with insn bits for opcode 19
This changes the decoding of major opcode 19 from using the ppc_insn_t
index to using bits of the instruction word directly.  Opcode 19 has
a 10-bit minor opcode field (bits 10..1) but the space is sparsely
filled.  Therefore we index a table of single-bit entries with the
10-bit minor opcode to filter out the illegal minor opcodes, and
index a table using just 3 bits -- 5, 3 and 2 -- of the instruction
to get the decode entry.  This groups together all the instructions
in 4 columns of the opcode map as a single entry.  That means that
mcrf and all the CR logical ops get grouped together, and bcctr, bclr
and bctar get grouped together.  At present the CR logical ops are not
implemented, so their grouping has no impact.

The code for bclr and bcctr in execute1 is now common, using a single
op, and it now determines the branch address by looking at bit 10 of
the instruction word at execute time.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras e30a87593a decode: Start moving towards decoding by major opcode first
With this, we have a table for most major opcodes and separate
tables for each major opcode that has further decoding required.
These tables are still mostly indexed by the ppc_insn_t values,
however.

A few things are still decoded completely at the top level: nop,
attn and sim_config.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras c9e92483b8 decode: Push mtspr/mfspr register decoding down into execute1
Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops,
just pass down mfspr and mtspr ops with the spr number and let
execute1 decode which SPR we're addressing.  This will help reduce
the number of instruction bits decode1 needs to look at.

In fact we now pass down the whole instruction from decode2 to
execute1.  We will need more bits of the instruction in future,
and the tools should just optimize away any that we don't end
up using.  Since the 'aa' bit was just a copy of an instruction
bit, we can now remove it from the record.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 3e6f656a90 Add MCRF instruction
Hopefully it's not too timing catastrophic. The variable newcrf will
be handy for the other CR ops when we implement them I suspect.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt 554ae88540 Implement absolute branches
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard b57325ce29 Merge branch 'divider' of https://github.com/paulusmack/microwatt 5 years ago
Anton Blanchard 5a6f8d26d1 Rename OP_SUBFC -> OP_SUBFE, OP_ADDC -> OP_ADDE
These were somewhat badly named.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Paul Mackerras d5bc6c8824 Add a divider unit and a testbench for it
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected.  The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).

The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions.  Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 98f0994698 Add core debug module
This module adds some simple core controls:

  reset, stop, start, step

along with icache clear and reading the NIA and core
status bits

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org
5 years ago
Anton Blanchard 7bb88d5321
Merge pull request #59 from antonblanchard/trap-decode
Fix make check
5 years ago
Anton Blanchard 427effdaa9 Fix make check
We need to finish support for all the trap instructions, but for now
we at least need a decode entry for tw, so we know to stall until the
previous instruction completes. Some of our test cases were failing
because the trap executed before the previous instruction completed.

All these trap instructions need to be resolved at completion, not
in execute.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 9867fb6149 Add a decode for the nop instruction
We want these to go out without any GPR dependencies, so add
a specific entry in decode for them.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard acdb2ea157 No need to gate nia or insn in decode1
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard b9e28598b4 Explicitly check against '1' in if statements
nvc doesn't like what I think is a VHDL 2008 construct. Lets just
check against '1' explicitly.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard a2df2a10a2 Remove sim console
We can force all existing code to use the UART console
by passing 0 in bit zero of the sim config register.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 92a7152370 Rework pipeline, add stall and flush signals
This adds stall and flush signals to the pipeline.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 9687034d78 Add a decode bit to mark an instruction as single through the pipeline
This is used by the pipelining patches. Mark everyone as single through
the pipeline to start.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Benjamin Herrenschmidt b0ade2857f decode1 array fix header
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard 9fbaea6f08 Rework CR file and add forwarding
Handle the CR as a single field with per nibble enables. Forward any
writes in the same cycle.

If this proves to be an issue for timing, we may want to revisit
this in the future. For now, it keeps things simple.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 5a29cb4699 Initial import of microwatt
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago