This gets rid of the adder in writeback that computes redirect_nia.
Instead, the main adder in the ALU is used to compute the branch
target for relative branches. We now decode b and bc differently
depending on the AA field, generating INSN_brel, INSN_babs, INSN_bcrel
or INSN_bcabs as appropriate. Each one has a separate entry in the
decode table in decode1; the *rel versions use CIA as the A input.
The bclr/bcctr/bctar and rfid instructions now select ramspr_result
for the main result mux to get the redirect address into
ex1.e.write_data.
For branches which are predicted taken but not actually taken, we need
to redirect to the following instruction. We also need to do that for
isync. We do this in the execute2 stage since whether or not to do it
depends on the branch result. The next_nia computation is moved to
the execute2 stage and comes in via a new leg on the secondary result
multiplexer, making next_nia available ultimately in ex2.e.write_data.
This also means that the next_nia leg of the primary result
multiplexer is gone. Incrementing last_nia by 4 for sc (so that SRR0
points to the following instruction) is also moved to execute2.
Writing CIA+4 to LR was previously done through the main result
multiplexer. Now it comes in explicitly in the ramspr write logic.
Overall this removes the br_offset and abs_br fields and the logic to
add br_offset and next_nia, and one leg of the primary result
multiplexer, at the cost of a few extra control signals between
execute1 and execute2 and some multiplexing for the ramspr write side
and an extra input on the secondary result multiplexer.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
By not assigning to interrupt_out.srr1 in some circumstances, the
writeback_1 process creates an inferred latch, which is not
desirable. Eliminate it by restructuring the code so
interrupt_out.srr1 is always set, to zeroes if nothing else.
Fixes: bc4d02cb0d ("Start removing SPRs from register file", 2022-07-12)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This starts the process of removing SPRs from the register file by
moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file
and putting them into execute1. They are stored in a pair of small
RAM arrays, referred to as "even" and "odd". The reason for having
two arrays is so that two values can be read and written in each
cycle. For example, SRR0 and SRR1 can be written in parallel by an
interrupt and read in parallel by the rfid instruction.
The addresses in the RAM which will be accessed are determined in the
decode2 stage. We have one write address for both sides, but two read
addresses, since in future we will want to be able to read CTR at the
same time as either LR or TAR.
We now have a connection from writeback to execute1 which carries the
partial SRR1 value for an interrupt. SRR0 comes from the execute
pipeline; we no longer need to carry instruction addresses along the
LSU and FPU pipelines. Since SRR0 and SRR1 can be written in the same
cycle now, we don't need the little state machine in writeback any
more.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- Arrange for XER to be written for OE=1 forms
- Arrange for condition codes to be set for RC=1 forms
(including correct handling for 32-bit mode)
- Don't instantiate the divider if we have an FPU.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present the busy/stall signal going to decode1 depends on whether
control thinks it can issue the current instruction, and that depends
on completion and bypass signals coming from execute1 and writeback.
To improve the timing of stall_out, this rearranges decode2 so that
stall_out is asserted when we have a valid instruction that couldn't
be issued in the previous cycle. This means that decode1 could give
us a new instruction when we haven't issued the previous instruction.
This in turn means that we can only use d_in in the first cycle of
processing an instruction. After the first cycle, we get register
addresses etc. from dc2 rather than d_in.
Then, to avoid the need to read register operands from register_file
in each cycle until the instruction issues, we bring the bypass path
for data being written to the register file into decode2 explicitly
rather than having it in register_file.
A new process called decode2_addrs does the process of calling
decode_input_reg_* and decode_output_reg and sets up the register file
addresses. This was split out (and decode_input_reg_* reworked) to
try to reduce the number of passes through the decode2_1 process that
need to be done in simulation.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This lets us forward the CR0 result to following instructions that
use CR, meaning they get to issue one cycle earlier.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements most of the architected PMU events. The ones missing
are mostly the ones that depend on which level of the cache hierarchy
data is fetched from. The events implemented here, and their raw
event codes, are:
Floating-point operation completed (100f4)
Load completed (100fc)
Store completed (200f0)
Icache miss (200fc)
ITLB miss (100f6)
ITLB miss resolved (400fc)
Dcache load miss (400f0)
Dcache load miss resolved (300f8)
Dcache store miss (300f0)
DTLB miss (300fc)
DTLB miss resolved (200f6)
No instruction available and none being executed (100f8)
Instruction dispatched (200f2, 300f2, 400f2)
Taken branch instruction completed (200fa)
Branch mispredicted (400f6)
External interrupt taken (200f8)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This is the start of an implementation of a PMU according to PowerISA
v3.0B. Things not implemented yet include most architected events,
the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the logic for redirecting fetching and writing SRR0 and
SRR1 to writeback. The aim is that ultimately units other than
execute1 can send their interrupts to writeback along with their
instruction completions, so that there can be multiple instructions
in flight without needing execute1 to keep track of the address
of each outstanding instruction.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the way GPR hazards are detected and tracked. Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.
For now, the forwarding paths are disabled. That gives about a 8%
reduction in coremark performance.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.
Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.
We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.
The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file. It
defaults to true for all except the A7-35 boards.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In 32-bit mode, effective addresses are truncated to 32 bits, both for
instruction fetches and data accesses, and CR0 is set for Rc=1 (record
form) instructions based on the lower 32 bits of the result rather
than all 64 bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal. This will
enable us to signal busy cycles only when we need to from loadstore1.
The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle". That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it. So instructions
are issued by decode2 somewhat earlier than they used to be.
Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing. This gives a small boost in
performance.
This also adds dependency tracking for rA updates by update-form
load/store instructions.
The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary. Some of the code now really
only applies to PIPELINE_DEPTH=1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes some simplifications to the interrupt logic which will
help with later commits.
- When irq_valid is set, don't set exception to 1 until we have a
valid instruction. That means we can remove the if e_in.valid = '1'
test from the exception = '1' block.
- Don't assert stall_out on the first cycle of delivering an
interrupt. If we do get another instruction in the next cycle,
nothing will happen because we have ctrl.irq_state set and we
will just continue writing the interrupt registers.
- Make sure we deliver as many completions as we got instructions,
otherwise the outstanding instruction count in control.vhdl gets
out of sync.
- In writeback, make sure all of the other write enables are ignored
when e_in.exc_write_enable is set.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds separate fields in Execute1ToWritebackType for use in
writing SRR0/1 (and in future other SPRs) on an interrupt. With
this, we make timing once again on the Arty A7-100 -- previously
we were missing by 0.2ns, presumably due to the result mux being
wider than before.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This puts all the data formatting (byte rotation based on lowest three
bits of the address, byte reversal, sign extension, zero extension)
in loadstore1. Writeback now simply sends the data provided to the
register files.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
So that the dcache could in future be used by an MMU, this moves
logic to do with data formatting, rA updates for update-form
instructions, and handling of unaligned loads and stores out of
dcache and into loadstore1. For now, dcache connects only to
loadstore1, and loadstore1 now has the connection to writeback.
Dcache generates a stall signal to loadstore1 which indicates that
the request presented in the current cycle was not accepted and
should be presented again. However, loadstore1 doesn't currently
use it because we know that we can never hit the circumstances
where it might be set.
For unaligned transfers, loadstore1 generates two requests to
dcache back-to-back, and then waits to see two acks back from
dcache (cycles where d_in.valid is true).
Loadstore1 now has a FSM for tracking how many acks we are
expecting from dcache and for doing the rA update cycles when
necessary. Handling for reservations and conditional stores is
still in dcache.
Loadstore1 now generates its own stall signal back to decode2,
so we no longer need the logic in execute1 that generated the stall
for the first two cycles.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This involves plumbing the (existing) 'reserve' and 'rc' bits in
the decode tables down to dcache, and 'rc' and 'store_done' bits
from dcache to writeback.
It turns out that we had 'RC' set in the 'rc' column for several
ordinary stores and for the attn instruction. This corrects them
to 'NONE', and sets the 'rc' column to 'ONE' for the conditional
stores.
In writeback we now have logic to set CR0 when the input from dcache
has rc = 1.
In dcache we have the reservation itself, which has a valid bit
and the address down to cache line granularity. We don't currently
store the reservation length. For a store conditional which fails,
we set a 'cancel_store' signal which inhibits the write to the
cache and prevents the state machine from starting a bus cycle or
going to the STORE_WAIT_ACK state. Instead we set r1.stcx_fail
which causes the instruction to complete in the next cycle with
rc=1 and store_done=0.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
For an unaligned load or store, we do the first doubleword (dword) of
the transfer as normal, but then go to a new NEXT_DWORD state of the
state machine to do the cache tag lookup for the second dword of the
transfer. From the NEXT_DWORD state we have much the same transitions
to other states as from the IDLE state (the transitions for OP_LOAD_HIT
are a bit different but almost identical for the other op values).
We now do the preparation of the data to be written in loadstore1,
that is, byte reversal if necessary and rotation by a number of
bytes based on the low 3 bits of the address. We do rotation not
shifting so we have the bytes that need to go into the second
doubleword in the right place in the low bytes of the data sent to
dcache. The rotation and byte reversal are done in a single step
with one multiplexer per byte by setting the select inputs for each
byte appropriately.
This also fixes writeback to not write the register value until it
has received both pieces of an unaligned load value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Diamond doesn't like the "" & method of converting std_logic to a single bit
std_logic_vector. Thanks to Olof Kindgren for this patch.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This moves the sign extension done by the extsb, extsh and extsw
instructions back into execute1. This means that we no longer need
any data formatting in writeback for results coming from execute1,
so this modifies writeback so the data formatter inputs come
directly from the loadstore unit output. The condition code
updates for RC=1 form instructions are now done on the value from
execute1 rather than the output of the data formatter, which should
help timing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback. Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider. Divide and modulus instructions are no longer marked as
single-issue.
The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1. We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback. Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.
This all means that we no longer need to mark the multiply
instructions as single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This stores the most common SPRs in the register file.
This includes CTR and LR and a not yet final list of others.
The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.
On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.
Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.
This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.
[paulus@ozlabs.org - fix direction of decode2.stall_in]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The carry is currently internal to execute1. We don't handle any of
the other XER fields.
This creates type called "xer_common_t" that contains the commonly
used XER bits (CA, CA32, SO, OV, OV32).
The value is stored in the CR file (though it could be a separate
module). The rest of the bits will be implemented as a separate
SPR and the two parts reconciled in mfspr/mtspr in latter commits.
We always read XER in decode2 (there is little point not to)
and send it down all pipeline branches as it will be needed in
writeback for all type of instructions when CR0:SO needs to be
updated (such forms exist for all pipeline branches even if we don't
yet implement them).
To avoid having to track XER hazards, we forward it back in EX1. This
assumes that other pipeline branches that can modify it (mult and div)
are running single issue for now.
One additional hazard to beware of is an XER:SO modifying instruction
in EX1 followed immediately by a store conditional. Due to our writeback
latency, the store will go down the LSU with the previous XER value,
thus the stcx. will set CR0:SO using an obsolete SO value.
I doubt there exist any code relying on this behaviour being correct
but we should account for it regardless, possibly by ensuring that
stcx. remain single issue initially, or later by adding some minimal
tracking or moving the LSU into the same pipeline as execute.
Missing some obscure XER affecting instructions like addex or mcrxrx.
[paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of
arguments to set_ov]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The CR update currently depends on the complete data formatting
mux chain. This makes it source its inputs from a bit earlier in
the chian, thus improving timing a bit
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This replaces loadstore2 with a dcache
The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.
The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Initialize to 0 forces the mux to have an extra leg fed with zeros.
Instead initialize data_in to one of the mux inputs
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle. This removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the exts[bhw] instructions do the sign extension in the
writeback stage using the sign-extension logic there instead of
having unique sign extension logic in execute1. This requires
passing the data length and sign extend flag from decode2 down
through execute1 and execute2 and into writeback. As a side bonus
we reduce the number of values in insn_type_t by two.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to writeback to format data and test the result
against zero for the purpose of setting CR0. The data formatter
is able to shift and mask by bytes and do byte reversal and sign
extension. It can also put together bytes from two input
doublewords to support unaligned loads (including unaligned
byte-reversed loads).
The data formatter starts with an 8:1 multiplexer that is able
to direct any byte of the input to any byte of the output. This
lets us rotate the data and simultaneously byte-reverse it.
The rotated/reversed data goes to a register for the unaligned
cases that overlap two doublewords. Then there is per-byte logic
that does trimming, sign extension, and splicing together bytes
from a previous input doubleword (stored in data_latched) and the
current doubleword. Finally the 64-bit result is tested to set
CR0 if rc = 1.
This removes the RC logic from the execute2, multiply and divide
units, and the shift/mask/byte-reverse/sign-extend logic from
loadstore2.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The pipeline had a cycle in writeback. Writeback is pretty
simple and unlikely to be a bottleneck, so lets remove it.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
We want to make sure we never complete more than one
instruction per cycle, or write back more than one GPR
or CR per cycle.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
We only need two write ports for load with update instructions.
Having two write ports just for this instruction is expensive.
For now we will force them to be the only instruction in the
pipeline, and take two cycles of writeback.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Handle the CR as a single field with per nibble enables. Forward any
writes in the same cycle.
If this proves to be an issue for timing, we may want to revisit
this in the future. For now, it keeps things simple.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>