This implements all the sync variants (sync, lwsync, ptesync, etc.) as
a LSU op that gets sent down to the dcache and completes once the
dcache state machine is idle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements dcbf, dcbt and dcbtst in the dcache. The dcbst (data
cache block store) instruction remains a no-op because our dcache is
write-through and therefore never has modified data that could need to
be written back.
Dcbt (data cache block touch) and dcbtst (data cache block touch for
store) behave similarly except that dcbtst is a no-op on a readonly
page. Neither instruction ever causes an interrupt. If they miss in
the cache and the page is cacheable, they are handled like a load miss
except that they complete immediately the state machine starts
handling the load miss rather than waiting for any data.
Dcbf (data cache block flush) can cause a data storage interrupt. If
it hits in the cache, the state machine goes to a new FLUSH_CYCLE
state in which the cache line valid bit is cleared.
In order to avoid having more than 8 values in op_t, this combines
OP_STORE_MISS and OP_STORE_HIT into a single state. A new OP_NOP
state is used for operations which can complete immediately without
changing any dcache state (now used for dcbt/dcbtst causing access
exception or on a non-cachable page, or dcbf that misses the cache).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This restructures the reservation machinery so that the reservation is
cleared when a snooped store by another agent is done to reservation
address. The reservation address is now a real address rather than an
effective address.
For store-conditional, it is possible that a snooped store to the
reservation address could come in even after we have asserted cyc and
stb on the wishbone to do the store, and that should cause the store
not to be performed. To achieve this, store-conditional now uses a
separate state in the r1 state machine, which is set up so that losing
the reservation due to a snooped store cause cyc and stb to be dropped
immediately, and the store-conditional fails.
For load-reserve, the reservation address is set at the end of cycle 1
and the reservation is made valid when the data is available. For
lqarx, the reservation is made valid when the first doubleword of data
is available.
For the case where a snooped write comes in on cycle 0 of a larx and
hits the same cache line, we detect that the index and way of the
snooped write are the same as the index and way of the larx; it is
done this way because reservation.addr is not set until the real
address is available at the end of cycle 1. A hit on the same index
and way causes reservation.valid to be set to 0 at the end of cycle 1.
For a write in cycle 1, we compare the latched address in cycle 2 with
the reservation address and clear reservation.valid at the end of
cycle 2 if they match. In other words we compare the reservation
address with both the address being written this cycle and the address
being written in the previous cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds implementations of lq, plq, stq, pstq, lqarx and stqcx.
Because register file addresses are now computed in decode1 before we
have the decode table entry for the instruction, we have to check the
icode directly to know when to read register RS|1 before RS (i.e. for
stq and stqcx in LE mode, but not pstq).
For the second instance of the instruction, loadstore1 uses the EA
from the first instance + 8. It generates an alignment interrupt for
unaligned lqarx and stqcx and for lq in LE mode with an unaligned
address. (The reason for the latter case is that it writes RT|1
before RT, and if we have RA = RT|1 and the second instance traps, we
will have overwritten RA.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This arranges to generate an illegal instruction type program
interrupt for illegal prefixed instructions, that is, those where the
suffix is not a legal value given the prefix, or the prefix has a
reserved value in the subtype field. This implementation doesn't
generate an interrupt for the invalid 8LS:D and MLS:D instruction
forms where R = 1 and RA != 0. (In those cases it uses (RA) as the
addend, i.e. it ignores the R bit.)
This detects the case where the address of an instruction prefix is
equal mod 64 to 60, and generates an alignment interrupt in that case.
This also arranges to set bit 34 of SRR1 when an interrupt occurs due
to a prefixed instruction, for those interrupts where that is required
(i.e. trace, alignment, floating-point unavailable, data storage, data
segment, and most cases of program interrupt).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of having a multiplexer in loadstore1 in order to be able to
put the instruction address into v.addr, we now set decode.input_reg_a
to CIA in the decode table entry for OP_FETCH_FAILED. That means that
the operand selection machinery in decode2 will supply the instruction
address to loadstore1 on the lv.addr1 input and no special case is
needed in loadstore1. This saves a few LUTs (~40 on the Artix-7).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes some logic that was previously added for the 16-byte
loads and stores (lq, lqarx, stq, stqcx.) and not completely removed
in commit c9e838b656 ("Remove support for lq, stq, lqarx and
stqcx.", 2022-06-04).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This eliminates one leg of the output value multiplexer, and seems
to improve timing slightly on the A7-100.
Since SPR values are written in stage 3 and read in stage 2, an mfspr
immediately following an mtspr to the same SPR won't give the correct
value. To avoid this, we make mtspr to the load/store SPRs single
issue in decode1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the register file now contains 64 entries, for 32 GPRs and
32 FPRs, rather than the 128 it had previously. Several things get
simplified - decode1 no longer has to work out the ispr{1,2,o} values,
decode_input_reg_{a,b,c} no longer have the t = SPR case, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This starts the process of removing SPRs from the register file by
moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file
and putting them into execute1. They are stored in a pair of small
RAM arrays, referred to as "even" and "odd". The reason for having
two arrays is so that two values can be read and written in each
cycle. For example, SRR0 and SRR1 can be written in parallel by an
interrupt and read in parallel by the rfid instruction.
The addresses in the RAM which will be accessed are determined in the
decode2 stage. We have one write address for both sides, but two read
addresses, since in future we will want to be able to read CTR at the
same time as either LR or TAR.
We now have a connection from writeback to execute1 which carries the
partial SRR1 value for an interrupt. SRR0 comes from the execute
pipeline; we no longer need to carry instruction addresses along the
LSU and FPU pipelines. Since SRR0 and SRR1 can be written in the same
cycle now, we don't need the little state machine in writeback any
more.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
They are optional in SFFS (scalar fixed-point and floating-point
subset), are not needed for running Linux, and add complexity, so
remove them.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the FPU able to stall other units at execute stage 2 and be
stalled by other units (specifically the LSU).
This means that the completion and writeback for an instruction can
now end up being deferred until the second cycle of a following
instruction, i.e. the cycle when the state machine has gone through
IDLE state into one of the DO_* states, which means we need to latch
the destination FPR number, CR mask, etc. from the previous
instruction so that we present the correct information to writeback.
The advantage of this is that we can get rid of the in_progress signal
from the LSU.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Execute1 and loadstore1 now send each other stall signals that
indicate that a valid instruction in stage 2 can't complete in this
cycle, and hence any valid instruction in stage 1 in the other unit
can't move to stage 2. With this in place, an ALU instruction can
move into stage 1 while a LSU instruction is in stage 2.
Since the FPU doesn't yet have a way to stall completion, we can't yet
start FPU instructions while any LSU or ALU instruction is in
progress.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Simplify the flow control by stalling the whole upstream pipeline when
a stage can't proceed, instead of trying to let each stage progress
independently when it can.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
While these signals should only be read when valid is true, they
are only a small number of bits and we want to reduce the amount of
U/X state bouncing around the chip.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Some critical path reports showed r1.req.addr depending on l_in.valid,
which then depended ultimately on the dcache's r1.ls_valid. In fact
we can update r1.req.addr (and other fields of r1.req, except for
r1.req.valid) independently of l_in.valid as long as busy = 0.
We do also need to preserve r1.req.addr0 when l_in.valid = 0, so we
pull it out of r1.req and store it separately in r1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements most of the architected PMU events. The ones missing
are mostly the ones that depend on which level of the cache hierarchy
data is fetched from. The events implemented here, and their raw
event codes, are:
Floating-point operation completed (100f4)
Load completed (100fc)
Store completed (200f0)
Icache miss (200fc)
ITLB miss (100f6)
ITLB miss resolved (400fc)
Dcache load miss (400f0)
Dcache load miss resolved (300f8)
Dcache store miss (300f0)
DTLB miss (300fc)
DTLB miss resolved (200f6)
No instruction available and none being executed (100f8)
Instruction dispatched (200f2, 300f2, 400f2)
Taken branch instruction completed (200fa)
Branch mispredicted (400f6)
External interrupt taken (200f8)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present the logic prevents any interrupts from being handled while
there is a load/store instruction (one that has unit=LDST) being
executed. However, load/store instructions can still get sent to
loadstore1. Thus an instruction which should generate an interrupt
such as a floating-point unavailable interrupt will instead get
executed.
To fix this, when we detect that an interrupt should be generated but
loadstore1 is still executing a previous instruction, we don't execute
any new instructions, and set a new r.intr_pending flag. That results
in busy_out being asserted (meaning that no further instructions will
come in from decode2). When loadstore1 has finished the instructions
it has, the interrupt gets sent to writeback. If one of the
instructions in loadstore1 generates an interrupt in the meantime, the
l_in.interrupt signal gets asserted and that clears r.intr_pending, so
the interrupt we detected gets discarded.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a 1-entry partition table, so that instead of getting
the process table base address from the PRTBL SPR, the MMU now reads
the doubleword pointed to by the PTCR register plus 8 to get the
process table base address. The partition table entry is cached.
Having the PTCR and the vestigial partition table reduces the amount
of software change required in Linux for Microwatt support.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
If the DAR and DSISR are read before they are written, we assert with:
register_file.vhdl:55:25:@60195ns:(report note): Writing GPR 09 00000000XXXXXXXX
register_file.vhdl:61:17:@60195ns:(assertion failure): Assertion violation
This initialises DAR/DSISR to avoid this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
The idea here is that we can have multiple instructions in progress at
the same time as long as they all go to the same unit, because that
unit will keep them in order. If we get an instruction for a
different unit, we wait for all the previous instructions to finish
before executing it. Since the loadstore unit is the only one that is
currently pipelined, this boils down to saying that loadstore
instructions can go ahead while l_in.in_progress = 1 but other
instructions have to wait until it is 0.
This gives a 2% increase on coremark performance on the Arty A7-100
(from ~190 to ~194).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes loadstore use a 3-stage pipeline. For now, only one
instruction goes through the pipe at a time. Completion and writeback
are still combinatorial off the valid signal back from the dcache, so
performance should be the same as before. In future it should be able
to sustain one load or store per cycle provided they hit in the
dcache.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes two bugs which show up when multiple operations are in
flight in the dcache, and adds a 'hold' input which will be needed
when loadstore1 is pipelined.
The first bug is that dcache needs to sample the data for a store on
the cycle after the store request comes in even if the store request
is held up because of a previous request (e.g. if the previous request
is a load miss or a dcbz).
The second bug is that a load request coming in for a cache line being
refilled needs to be handled immediately in the case where it is for
the row whose data arrives on the same cycle. If it is not, then it
will be handled as a separate cache miss and the cache line will be
refilled again into a different way, leading to two ways both being
valid for the same tag. This can lead to data corruption, in the
scenario where subsequent writes go to one of the ways and then that
way gets displaced but the other way doesn't. This bug could in
principle show up even without having multiple operations in flight in
the dcache.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the way GPR hazards are detected and tracked. Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.
For now, the forwarding paths are disabled. That gives about a 8%
reduction in coremark performance.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This uses the instruction-doubling machinery to send load with update
instructions down to loadstore1 as two separate ops, rather than
one op with two destinations. This will help to simplify the value
tracking mechanisms.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes timing easier and also means that store floating-point
single precision instructions no longer need to take an extra cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes it simpler to work out when to deliver a FPU unavailable
interrupt. This also means we can get rid of the OP_FPLOAD and
OP_FPSTORE insn_type values.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the lq, stq, lqarx and stqcx. instructions.
These instructions all access two consecutive GPRs; for example the
"lq %r6,0(%r3)" instruction will load the doubleword at the address
in R3 into R7 and the doubleword at address R3 + 8 into R6. To cope
with having two GPR sources or destinations, the instruction gets
repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx.
coming in from decode1, two instructions get sent out to execute1.
For these instructions, the RS or RT register gets modified on one
of the iterations by setting the LSB of the register number. In LE
mode, the first iteration uses RS|1 or RT|1 and the second iteration
uses RS or RT. In BE mode, this is done the other way around. In
order for decode2 to know what endianness is currently in use, we
pass the big_endian flag down from icache through decode1 to decode2.
This is always in sync with what execute1 is using because only rfid
or an interrupt can change MSR[LE], and those operations all cause
a flush and redirect.
There is now an extra column in the decode tables in decode1 to
indicate whether the instruction needs to be repeated. Decode1 also
enforces the rule that lq with RT = RT and lqarx with RA = RT or
RB = RT are illegal.
Decode2 now passes a 'repeat' flag and a 'second' flag to execute1,
and execute1 passes them on to loadstore1. The 'repeat' flag is set
for both iterations of a repeated instruction, and 'second' is set
on the second iteration. Execute1 does not take asynchronous or
trace interrupts on the second iteration of a repeated instruction.
Loadstore1 uses 'next_addr' for the second iteration of a repeated
load/store so that we access the second doubleword of the memory
operand. Thus loadstore1 accesses the doublewords in increasing
memory order. For 16-byte loads this means that the first iteration
writes GPR RT|1. It is possible that RA = RT|1 (this is a legal
but non-preferred form), meaning that if the memory operand was
misaligned, the first iteration would overwrite RA but then the
second iteration might take a page fault, leading to corrupted state.
To avoid that possibility, 16-byte loads in LE mode take an
alignment interrupt if the operand is not 16-byte aligned. (This
is the case anyway for lqarx, and we enforce it for lq as well.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to loadstore1 to convert between single-precision and
double-precision formats, and implements the lfs* and stfs*
instructions. The conversion processes are described in Power ISA
v3.1 Book 1 sections 4.6.2 and 4.6.3.
These conversions take one cycle, so lfs* and stfs* are one cycle
slower than lfd* and stfd*.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.
We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.
The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file. It
defaults to true for all except the A7-35 boards.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Load-and-reserve and store-conditional instructions are required to
generate an alignment interrupt (0x600 vector) if their EA is not
aligned. Implement this.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In 32-bit mode, effective addresses are truncated to 32 bits, both for
instruction fetches and data accesses, and CR0 is set for Rc=1 (record
form) instructions based on the lower 32 bits of the result rather
than all 64 bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Big-endian mode affects both instruction fetches and data accesses.
For instruction fetches, we byte-swap each word read from memory when
writing it into the icache data RAM, and use a tag bit to indicate
whether each cache line contains instructions in BE or LE form.
For data accesses, we simply need to invert the existing byte_reverse
signal in BE mode. The only thing to be careful of is to get the sign
bit from the correct place when doing a sign-extending load that
crosses two doublewords of memory.
For now, interrupts unconditionally set MSR[LE]. We will need some
sort of interrupt-little-endian bit somewhere, perhaps in LPCR.
This also fixes a debug report statement in fetch1.vhdl.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This rearranges the code used for store data formatting so that the
"for i in 0 to 7" loop indexes the output bytes rather than the
input bytes. The new expression is formally identical to the old
but is easier to synthesize. This reduces the number of LUTs by
about 250 on the Artix-7 and improves timing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reworks the way that the busy and done signals are generated in
loadstore in order to work around some problems where yosys/nextpnr
are reporting combinatorial loops (not in fact on the current code but
on minor variations needed for supporting the FPU). It seems that
yosys has problems with the case statement on v.state.
This also lifts the maddr and byte_sel generation out of the case
statement. The overall result is a slight reduction in resource usage
(~30 6-input LUTs on the A7-100).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This computes the address sent to the MMU separately from that sent
to the dcache. This means that the address sent to the MMU doesn't
have the delay through the lsu_sum adder, making it available earlier.
The path through the lsu_sum adder and through the MMU to the MMU
done and err outputs showed up as a critical path on some builds.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the calculation of busy as simple as possible and dependent
only on register outputs. The timing of busy is critical, as it gates
the valid signal for the next instruction, and therefore any delays
in dropping busy at the end of a load or store directly impact the
timing of a host of other paths.
This also separates the 'done without error' and 'done with error'
cases from the MMU into separate signals that are both driven directly
from registers.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds "if LOG_LENGTH > 0 generate" to the places in the core
where log output data is latched, so that when LOG_LENGTH = 0 we
don't create the logic to collect the data which won't be stored.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The computation of two_dwords from r.second_bytes has shown up as
part of a critical path at times. Instead we add a 'last_dword'
flag to the reg_stage_t record which tells us more directly
whether a valid flag coming in from dcache means that the
instruction is done, thereby shortening the path to the busy output
back to execute1.
This also simplifies some of the trim_ctl logic. The two_dwords = 0
case could never have use_second(i) = 1 for any of the bytes being
transferred, so "not use_second(i)" is always 1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reduces the number of cycles where loadstore1 asserts its busy
output, leading to increased throughput of loads and stores. Loads
that hit in the cache can now be executed at the rate of one every two
cycles. Stores take 4 cycles assuming the wishbone slave responds
with an ack the cycle after we assert strobe.
To achieve this, the state machine code is split into two parts, one
for when we have an existing instruction in progress, and one for
starting a new instruction. We can now combinatorially clear busy and
start a new instruction in the same cycle that we get a done signal
from the dcache; in other words we are completing one instruction and
potentially writing back results in the same cycle that we start a new
instruction and send its address and data to the dcache.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>