This adds an optional 16-bit x 16-bit signed multiplier and uses it
for multiply instructions that return the low 64 bits of the product
(mull[dw][o] and mulli, but not maddld) when both operands are in the
range -2^15 .. 2^15 - 1. The "short" 16-bit multiplier produces its
result combinatorially, so a multiply that uses it executes in one
cycle. This improves the CoreMark result by about 4%, since CoreMark
does quite a lot of multiplies and almost all of them have operands
that fit into 16 bits.
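A minimal sketch of the operand check and the combinational product
(illustrative only, not the actual multiply unit; the entity and signal
names are invented):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity short_mult_sketch is
    port (
        a, b    : in  std_ulogic_vector(63 downto 0);
        fits    : out std_ulogic;                     -- both operands fit in 16 bits
        product : out std_ulogic_vector(31 downto 0)  -- sign-extend for the 64-bit result
    );
end entity;

architecture behave of short_mult_sketch is
    -- a 64-bit value is in -2^15 .. 2^15 - 1 iff bits 63..16 all equal bit 15
    function fits16(x : std_ulogic_vector(63 downto 0)) return boolean is
    begin
        for i in 16 to 63 loop
            if x(i) /= x(15) then
                return false;
            end if;
        end loop;
        return true;
    end function;
begin
    fits <= '1' when fits16(a) and fits16(b) else '0';
    -- 16x16 signed multiply with no register stage, so the result is
    -- available in the same cycle
    product <= std_ulogic_vector(signed(a(15 downto 0)) * signed(b(15 downto 0)));
end architecture;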
The presence of the short multiplier is controlled by a generic at the
execute1, SOC, core and top levels. For now, it defaults to off for
all platforms, and can be enabled using the --has_short_mult flag to
fusesoc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some critical path reports showed r1.req.addr depending on l_in.valid,
which then depended ultimately on the dcache's r1.ls_valid. In fact
we can update r1.req.addr (and other fields of r1.req, except for
r1.req.valid) independently of l_in.valid as long as busy = 0.
We also need to preserve r1.req.addr0 when l_in.valid = 0, so we pull
it out of r1.req and store it separately in r1.
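A hedged sketch of the idea (signal names are illustrative, not the
exact dcache record fields): only the valid bit is qualified by
l_in.valid, while the address is latched whenever the dcache is not
busy.

library ieee;
use ieee.std_logic_1164.all;

entity req_latch_sketch is
    port (
        clk        : in  std_ulogic;
        busy       : in  std_ulogic;
        l_in_valid : in  std_ulogic;
        l_in_addr  : in  std_ulogic_vector(63 downto 0);
        req_valid  : out std_ulogic;
        req_addr   : out std_ulogic_vector(63 downto 0)
    );
end entity;

architecture behave of req_latch_sketch is
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if busy = '0' then
                -- the address (and other non-valid request fields) no longer
                -- waits for l_in_valid, shortening that critical path
                req_addr  <= l_in_addr;
                -- only the valid bit still depends on l_in_valid
                req_valid <= l_in_valid;
            end if;
        end if;
    end process;
end architecture;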
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a bit to the BTC to store whether the corresponding branch
instruction was taken the last time it was encountered. That lets us
pass a not-taken prediction down to decode1, which, for backwards
direct branches, inhibits it from redirecting fetch to the target of
the branch. This increases the CoreMark result by about 2%.
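A minimal sketch of the decode1-side decision (the port names are
invented for illustration, and it assumes the taken bit defaults to
predicting taken when there is no BTC entry): fetch is only redirected
to the target of a backwards direct branch when the BTC predicts taken.

library ieee;
use ieee.std_logic_1164.all;

entity br_redirect_sketch is
    port (
        is_direct_branch  : in  std_ulogic;  -- direct (immediate-target) branch
        offset_negative   : in  std_ulogic;  -- displacement sign bit: backwards branch
        btc_predict_taken : in  std_ulogic;  -- taken bit read from the BTC ('1' on no hit)
        redirect          : out std_ulogic   -- redirect fetch to the branch target
    );
end entity;

architecture behave of br_redirect_sketch is
begin
    -- a not-taken prediction from the BTC inhibits the static
    -- "backwards branches are taken" redirect
    redirect <= is_direct_branch and offset_negative and btc_predict_taken;
end architecture;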
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes s0 to use the P register rather than the A/B/C input
registers, thus improving the timing of the multiplier output. The
m00, m02 and m03 multipliers now use their P registers rather than the
M registers, moving the addition they do from the second cycle to the
first.
Also, the XOR that inverts the 32 LSBs is moved before the output
register.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A non-cacheable load should only load the data requested and no more. We
do the right thing for real-mode cache-inhibited storage instructions,
but when loading through a non-cacheable PTE we load the entire 64 bits
regardless of the requested size.
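A sketch of the intended behaviour (a standalone fragment, not the
actual loadstore1/dcache code, and it assumes naturally aligned
accesses): derive the byte enables from the access length and the low
address bits rather than selecting all eight bytes.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity nc_byte_sel_sketch is
    port (
        length : in  unsigned(3 downto 0);           -- access size: 1, 2, 4 or 8 bytes
        addr   : in  std_ulogic_vector(2 downto 0);  -- byte offset within the doubleword
        sel    : out std_ulogic_vector(7 downto 0)   -- wishbone-style byte selects
    );
end entity;

architecture behave of nc_byte_sel_sketch is
begin
    process(all)
        variable mask : unsigned(7 downto 0);
    begin
        -- 2**length - 1 gives 'length' contiguous ones (e.g. length 4 -> 0x0f)
        mask := to_unsigned(2 ** to_integer(length) - 1, 8);
        -- align the mask with the byte offset of the access
        sel <= std_ulogic_vector(shift_left(mask, to_integer(unsigned(addr))));
    end process;
end architecture;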
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The row in the decode table for isel with BC=0 was inadvertently left
marked as single-issue by commit 813f834012 ("Add CR hazard
detection", 2019-10-15). Fix it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- mcrxrx put the bits in the wrong order
- addpcis was setting CR0 if the instruction bit 0 = 1, which it
shouldn't
- bpermd always produced 0 and additionally had the wrong bit
numbering
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This checks that the store forwarding machinery in the dcache
correctly combines forwarded stores when they are partial stores
(i.e. only writing part of the doubleword, as for a byte store).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We have two stages of forwarding to cover the two cycles of latency
between when something is written to BRAM and when that new data can
be read from BRAM. When the writes to BRAM result from store
instructions, the write may write only some bytes of a row (8 bytes)
and not others, so we have a mask to enable only the written bytes to
be forwarded. However, we only forward written data from either the
first stage of forwarding or the second, not both. So if we have
two stores in succession that write different bytes of the same row,
and then a load from the row, we will only forward the data from the
second store, and miss the data from the first store; thus the load
will get the wrong value.
To fix this, we make the decision on which forward stage to use for
each byte individually. This results in a 4-input multiplexer feeding
r1.data_out, with its inputs being the BRAM, the wishbone, the current
write data, and the 2nd-stage forwarding register. Each byte of the
multiplexer is separately controlled. The code for this multiplexer
is moved to the dcache_fast_hit process since it is used for cache
hits as well as cache misses.
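A minimal sketch of the per-byte selection described above (the entity
and signal names are illustrative, not the exact dcache signals): each
byte of the output independently chooses among the BRAM data, the
wishbone data, the current write data and the second-stage forwarding
register.

library ieee;
use ieee.std_logic_1164.all;

entity fwd_mux_sketch is
    port (
        bram_data  : in  std_ulogic_vector(63 downto 0);
        wb_data    : in  std_ulogic_vector(63 downto 0);
        store_data : in  std_ulogic_vector(63 downto 0);  -- current write data
        fwd2_data  : in  std_ulogic_vector(63 downto 0);  -- 2nd-stage forwarding register
        use_fwd1   : in  std_ulogic_vector(7 downto 0);   -- per-byte stage-1 forward hit
        use_fwd2   : in  std_ulogic_vector(7 downto 0);   -- per-byte stage-2 forward hit
        use_wb     : in  std_ulogic;                      -- take data from the wishbone
        data_out   : out std_ulogic_vector(63 downto 0)
    );
end entity;

architecture behave of fwd_mux_sketch is
begin
    process(all)
    begin
        for i in 0 to 7 loop
            if use_fwd1(i) = '1' then
                data_out(i*8 + 7 downto i*8) <= store_data(i*8 + 7 downto i*8);
            elsif use_fwd2(i) = '1' then
                data_out(i*8 + 7 downto i*8) <= fwd2_data(i*8 + 7 downto i*8);
            elsif use_wb = '1' then
                data_out(i*8 + 7 downto i*8) <= wb_data(i*8 + 7 downto i*8);
            else
                data_out(i*8 + 7 downto i*8) <= bram_data(i*8 + 7 downto i*8);
            end if;
        end loop;
    end process;
end architecture;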
This also simplifies the BRAM code by ensuring that we can use the
same source for the BRAM address and way selection for writes, whether
we are writing store data or cache line refill data from memory.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the way multiplexer for the data from the BRAM, and the
multiplexers for forwarding data from earlier stores or refills,
before a clock edge rather than after, so that now the data output
from the dcache comes from a clean latch. To do this we remove the
extra latch on the output of the data BRAM (i.e. ADD_BUF=false) and
rearrange the logic. The choice of whether to forward or not now
depends not on way comparisons but on tag comparisons, for the sake of
timing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
While verilator finds the correct top level module with the current
setup, if we start adding simulation models it can get confused.
Explicitly specify the top level module.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Recent versions of verilator support the --build option, allowing
us to remove a step.
Also add a Docker image for verilator.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This implements most of the architected PMU events. The ones missing
are mostly the ones that depend on which level of the cache hierarchy
data is fetched from. The events implemented here, and their raw
event codes, are:
Floating-point operation completed (100f4)
Load completed (100fc)
Store completed (200f0)
Icache miss (200fc)
ITLB miss (100f6)
ITLB miss resolved (400fc)
Dcache load miss (400f0)
Dcache load miss resolved (300f8)
Dcache store miss (300f0)
DTLB miss (300fc)
DTLB miss resolved (200f6)
No instruction available and none being executed (100f8)
Instruction dispatched (200f2, 300f2, 400f2)
Taken branch instruction completed (200fa)
Branch mispredicted (400f6)
External interrupt taken (200f8)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The architecture states that when MMCR0[PMCC] = 0b11, PMC5 and PMC6
are not part of the Performance Monitor, meaning that they are not
controlled by bits in MMCRs, and counter negative conditions in PMCs 5
and 6 don't generate Performance Monitor alerts, exceptions or
interrupts. It doesn't say that PMC5 and PMC6 are frozen in this
case, so presumably they should continue to count run instructions and
run cycles.
This implements that behaviour.
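A sketch of the counting condition under these assumptions (the entity
and signal names are invented, and the run-latch and full alert logic
are elided): with MMCR0[PMCC] = 0b11, PMC5/PMC6 ignore the freeze
conditions and their counter-negative state raises no PM alert.

library ieee;
use ieee.std_logic_1164.all;

entity pmc56_count_sketch is
    port (
        pmcc           : in  std_ulogic_vector(1 downto 0);  -- MMCR0[PMCC]
        freeze         : in  std_ulogic;                     -- PM freeze condition (e.g. MMCR0[FC])
        instr_complete : in  std_ulogic;                     -- an instruction completed
        pmc5_negative  : in  std_ulogic;                     -- PMC5 most-significant bit
        pmc5_incr      : out std_ulogic;                     -- increment PMC5 this cycle
        pmc5_alert     : out std_ulogic                      -- PM alert from PMC5
    );
end entity;

architecture behave of pmc56_count_sketch is
    signal outside_pm : std_ulogic;
begin
    -- with PMCC = 0b11, PMC5 and PMC6 are not part of the Performance Monitor
    outside_pm <= '1' when pmcc = "11" else '0';
    -- so they keep counting even when the PM counters are frozen...
    pmc5_incr  <= instr_complete and (outside_pm or not freeze);
    -- ...and their counter-negative conditions do not raise PM alerts
    pmc5_alert <= pmc5_negative and not outside_pm;
end architecture;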
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The verilator build fails with warnings and errors, because NGPIO
is 0 and we do things like:
gpio_out : out std_ulogic_vector(NGPIO - 1 downto 0);
Set NGPIO to something reasonable (e.g. 32) and add HAS_GPIO to avoid
building the macro entirely if it isn't in use.
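A sketch of the shape of the fix (only NGPIO and HAS_GPIO follow the
commit; the wrapper entity and its stub behaviour are invented):
default NGPIO to a usable width and only generate the GPIO macro when
HAS_GPIO is set, so NGPIO = 0 can never produce a null-range port.

library ieee;
use ieee.std_logic_1164.all;

entity gpio_wrap_sketch is
    generic (
        HAS_GPIO : boolean := false;
        NGPIO    : natural := 32         -- non-zero default keeps the port range legal
    );
    port (
        gpio_in  : in  std_ulogic_vector(NGPIO - 1 downto 0);
        gpio_out : out std_ulogic_vector(NGPIO - 1 downto 0)
    );
end entity;

architecture behave of gpio_wrap_sketch is
begin
    gpio_yes: if HAS_GPIO generate
        -- the real GPIO macro would be instantiated here; loop back as a stub
        gpio_out <= gpio_in;
    end generate;

    gpio_no: if not HAS_GPIO generate
        gpio_out <= (others => '0');
    end generate;
end architecture;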
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Experimenting with using ghdl to do VHDL to Verilog conversion (instead
of ghdl+yosys), we find that verilator complains that a signal name is
a SystemVerilog keyword:
%Error: microwatt.v:15013:18: Unexpected 'do': 'do' is a SystemVerilog keyword misused as an identifier.
... Suggest modify the Verilog-2001 code to avoid SV keywords, or use `begin_keywords or --language.
We could probably make this go away by disabling SystemVerilog, but
it's easy to rename the signal in question. Rename di at the same
time.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This is the start of an implementation of a PMU according to PowerISA
v3.0B. Things not implemented yet include most architected events,
the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We've been investigating why the barrel rotator uses an enormous
number of cells on the yosys ECP5 target. Eventually it was narrowed
down to the -abc9 -nowidelut options, which reduce the cell count from
4985 cells to 841 cells.
Using the same options on an Orange Crab build reduces the cell count
from 50864 to 36085. The main differences:
LUT4 31040 -> 25270
PFUMX 6956 -> 0
L6MUX21 1746 -> 0
CCU2C 2066 -> 1759
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds a load before a floating-point load which should generate a
floating-point unavailable interrupt, to test for the bug where
unavailability interrupts can get dropped while loadstore1 is
executing instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present the logic prevents any interrupts from being handled while
there is a load/store instruction (one that has unit=LDST) being
executed. However, load/store instructions can still get sent to
loadstore1. Thus an instruction that should generate an interrupt,
such as a floating-point unavailable interrupt, will instead get
executed.
To fix this, when we detect that an interrupt should be generated but
loadstore1 is still executing a previous instruction, we don't execute
any new instructions, and set a new r.intr_pending flag. That results
in busy_out being asserted (meaning that no further instructions will
come in from decode2). When loadstore1 has finished the instructions
it has, the interrupt gets sent to writeback. If one of the
instructions in loadstore1 generates an interrupt in the meantime, the
l_in.interrupt signal gets asserted and that clears r.intr_pending, so
the interrupt we detected gets discarded.
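A hedged sketch of that sequencing (the entity and signal names are
illustrative, and the surrounding execute1 state is elided): hold off a
detected interrupt while loadstore1 is busy, assert busy to decode2,
and drop the pending interrupt if loadstore1 reports its own first.

library ieee;
use ieee.std_logic_1164.all;

entity intr_pending_sketch is
    port (
        clk              : in  std_ulogic;
        interrupt_detect : in  std_ulogic;  -- e.g. FP unavailable for the incoming instruction
        lsu_busy         : in  std_ulogic;  -- loadstore1 still has instructions in flight
        l_in_interrupt   : in  std_ulogic;  -- loadstore1 raised its own interrupt
        busy_out         : out std_ulogic;  -- stall decode2
        send_interrupt   : out std_ulogic   -- pass our interrupt to writeback
    );
end entity;

architecture behave of intr_pending_sketch is
    signal intr_pending : std_ulogic := '0';
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if l_in_interrupt = '1' then
                -- an interrupt from loadstore1 takes priority; discard ours
                intr_pending <= '0';
            elsif interrupt_detect = '1' and lsu_busy = '1' then
                -- hold the interrupt until loadstore1 drains
                intr_pending <= '1';
            elsif intr_pending = '1' and lsu_busy = '0' then
                -- it is being delivered to writeback this cycle
                intr_pending <= '0';
            end if;
        end if;
    end process;

    busy_out       <= intr_pending;                  -- no new instructions from decode2
    send_interrupt <= intr_pending and not lsu_busy; -- deliver once loadstore1 is idle
end architecture;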
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
litesdcard provides a macro per vendor (e.g. xilinx, lattice) and not per
board, so modify the fusesoc generator to take a vendor. This will make
it easier to add litesdcard to more boards.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Unfortunately the CSR layout has shifted on upstream litex, so this
is built with the following litex patch backed out:
aad56a047a33 ("integration/soc: Use CSR automatic allocation.")
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
csr_data_width is no longer required. Add ntxslots and nrxslots
parameters but set them to the default value.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>