This implements the lq, stq, lqarx and stqcx. instructions.
These instructions all access two consecutive GPRs; for example the
"lq %r6,0(%r3)" instruction will load the doubleword at the address
in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope
with having two GPR sources or destinations, the instruction gets
repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx.
coming in from decode1, two instructions get sent out to execute1.
For these instructions, the RS or RT register gets modified on one
of the iterations by setting the LSB of the register number. In LE
mode, the first iteration uses RS|1 or RT|1 and the second iteration
uses RS or RT. In BE mode, this is done the other way around. In
order for decode2 to know what endianness is currently in use, we
pass the big_endian flag down from icache through decode1 to decode2.
This is always in sync with what execute1 is using because only rfid
or an interrupt can change MSR[LE], and those operations all cause
a flush and redirect.
There is now an extra column in the decode tables in decode1 to
indicate whether the instruction needs to be repeated. Decode1 also
enforces the rule that lq with RT = RT and lqarx with RA = RT or
RB = RT are illegal.
Decode2 now passes a 'repeat' flag and a 'second' flag to execute1,
and execute1 passes them on to loadstore1. The 'repeat' flag is set
for both iterations of a repeated instruction, and 'second' is set
on the second iteration. Execute1 does not take asynchronous or
trace interrupts on the second iteration of a repeated instruction.
Loadstore1 uses 'next_addr' for the second iteration of a repeated
load/store so that we access the second doubleword of the memory
operand. Thus loadstore1 accesses the doublewords in increasing
memory order. For 16-byte loads this means that the first iteration
writes GPR RT|1. It is possible that RA = RT|1 (this is a legal
but non-preferred form), meaning that if the memory operand was
misaligned, the first iteration would overwrite RA but then the
second iteration might take a page fault, leading to corrupted state.
To avoid that possibility, 16-byte loads in LE mode take an
alignment interrupt if the operand is not 16-byte aligned. (This
is the case anyway for lqarx, and we enforce it for lq as well.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.
We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.
The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file. It
defaults to true for all except the A7-35 boards.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a path to allow the CR result of one instruction to be
forwarded to the next instruction, so that sequences such as
cmp; bc can avoid having a 1-cycle bubble.
Forwarding is not available for dot-form (Rc=1) instructions,
since the CR result for them is calculated in writeback. The
decode.output_cr field is used to identify those instructions
that compute the CR result in execute1.
For some reason, the multiply instructions incorrectly had
output_cr = 1 in the decode tables. This fixes that.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal. This will
enable us to signal busy cycles only when we need to from loadstore1.
The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle". That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it. So instructions
are issued by decode2 somewhat earlier than they used to be.
Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing. This gives a small boost in
performance.
This also adds dependency tracking for rA updates by update-form
load/store instructions.
The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary. Some of the code now really
only applies to PIPELINE_DEPTH=1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The icache would still spit out an instruction which could
cause a 0x700 instead of a reset.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This enables back-to-back execution of integer instructions where
the first instruction writes a GPR and the second reads the same
GPR. This is done with a set of multiplexers at the start of
execute1 which enable any of the three input operands to be taken
from the output of execute1 (i.e. r.e.write_data) rather than the
input from decode2 (i.e. e_in.read_data[123]).
This also requires changes to the hazard detection and handling.
Decode2 generates a signal indicating that the GPR being written
is available for bypass, which is true for instructions that are
executed in execute1 (rather than loadstore1/dcache). The
gpr_hazard module stores this "bypassable" bit, and if the same
GPR needs to be read by a subsequent instruction, it outputs a
"use_bypass" signal rather than generating a stall. The
use_bypass signal is then latched at the output of decode2 and
passed down to execute1 to control the input multiplexer.
At the moment there is no bypass on the inputs to loadstore1, but that
is OK because all load and store instructions are marked as
single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
I'm hitting an issue with the Fedora 31 version of GHDL that
appears to be fixed upstream:
control.vhdl:105:39:error: actual expression must be globally static
Add a signal to get rid of error.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This stores the most common SPRs in the register file.
This includes CTR and LR and a not yet final list of others.
The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.
On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.
Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.
This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.
[paulus@ozlabs.org - fix direction of decode2.stall_in]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
To match our one stage execute.
This might change back if we end up adding 2 stages to match the
LSU, but in that case we'll want forwards as well.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anything that isn't a load or store and anything that doesn't read the
CR can go as soon as its inputs are ready.
While we could also allow SPR read/write and carry read/write, we plan
to change them to be read in decode2 and written in writeback soon and
they will need separate hazard detection to be added.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Check GPRs against any writers in the pipeline.
All instructions are still marked single in pipeline at
this stage.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>