This implements the cfuged, pdepd and pextd instructions in a new unit
called bit_sorter (so called because cfuged and pextd can be viewed as
sorting the bits of the mask).
The cnt* instructions and the popcnt* instructions now use the same
OP_COUNTB insn_type so as to free up an insn_type value to use for the
new instructions.
The new instructions are implemented using a slow and simple algorithm
that takes 64 cycles to compute the result. The ex1 stage is stalled
while this happens, as for a 64-bit multiply, or for a divide when
there is no FPU.
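For reference, the results of the three instructions can be modelled in
software with one loop iteration per bit, which mirrors the 64-step cost.
This is only a sketch of the architected behaviour as described in the
Power ISA; it is not the algorithm used inside bit_sorter, and the Python
function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def pextd(rs, mask):
    """Gather the bits of rs selected by mask into the low-order bits."""
    result, k = 0, 0
    for i in range(64):              # one step per bit position
        if (mask >> i) & 1:
            result |= ((rs >> i) & 1) << k
            k += 1
    return result

def pdepd(rs, mask):
    """Scatter the low-order bits of rs to the positions set in mask."""
    result, k = 0, 0
    for i in range(64):
        if (mask >> i) & 1:
            result |= ((rs >> k) & 1) << i
            k += 1
    return result

def cfuged(rs, mask):
    """Bits under mask 1s go to the right, bits under mask 0s to the left,
    each group keeping its original order."""
    mask &= MASK64
    ones = pextd(rs, mask)
    zeros = pextd(rs, ~mask & MASK64)
    n = bin(mask).count("1")         # number of bits in the right-hand group
    return ((zeros << n) | ones) & MASK64
```

With rs = 0xB6 and mask = 0xCC, pextd selects bits 2, 3, 6 and 7 of rs and
packs them into the bottom four bits of the result.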
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In future we will want to support targets using the same vendor but
running at different clock frequencies. Since the clock frequency is
a parameter to the gateware generation process, we now name the target
directories as "vendor.frequency", e.g., "xilinx.100e6" and
"lattice.48e6" rather than "xilinx" and "lattice".
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Now that the icache tag RAM is accessed synchronously, the free tools
recognize it as block RAM on ECP5-based platforms; thus we no longer
need to force it to a very small value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reworks (and simplifies) plru_tb to use the new plrufn module
instead of the old (and now unused) plru module.
The latter is now removed completely.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rather than having update and decode logic for each individual PLRU
as well as a register to store the current PLRU state, we now put the
PLRU state in a little RAM, which will typically use LUT RAM on FPGAs,
and have just a single copy of the logic to calculate the pseudo-LRU
way and to update the PLRU state. This logic is in the plrufn module
and is just combinatorial logic. A new module was created for this as
other parts of the system are still using plru.vhdl.
The PLRU RAM in the icache is read asynchronously in the cycle
after the cache tag matching is done. At the end of that cycle the
PLRU RAM entry is updated if the access was a cache hit, or a victim
way is calculated and stored if the access was a cache miss and
miss handling is starting in this cycle.
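The split between a small state RAM and a single copy of pure logic can be
sketched as two stateless functions operating on a tree-PLRU state word.
This is a minimal model under stated assumptions: the heap-style node
numbering, the bit polarity and the function names are illustrative and
may not match plrufn.

```python
def plru_victim(tree, way_bits):
    """Follow the tree bits from the root to find the pseudo-LRU way."""
    node, way = 0, 0
    for _ in range(way_bits):
        b = (tree >> node) & 1
        way = (way << 1) | b
        node = 2 * node + 1 + b      # children of node n are 2n+1 and 2n+2
    return way

def plru_update(tree, way, way_bits):
    """Return the new state after an access to `way`: every node on the
    path is made to point away from the way just used."""
    node = 0
    for level in range(way_bits - 1, -1, -1):
        b = (way >> level) & 1
        if b:
            tree &= ~(1 << node)
        else:
            tree |= 1 << node
        node = 2 * node + 1 + b
    return tree

# The state itself lives in a small RAM, one entry per cache index
# (sizes here are arbitrary examples).
WAY_BITS = 2
plru_ram = [0] * 64

def on_hit(index, way):
    plru_ram[index] = plru_update(plru_ram[index], way, WAY_BITS)

def on_miss(index):
    return plru_victim(plru_ram[index], WAY_BITS)
```

Because both functions are stateless, one instance of the logic can serve
every index; only the RAM read and write addresses change per access.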
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We disabled --trace by default, so we need to stop linking verilated_vcd_c.o
as it doesn't exist in that case.
While at it, make a Makefile variable to enable/disable verilator tracing
and add a couple of generics to those test benches to control tracing
in the L2 and in litedram.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It also stores the dirty status so that's known.
This does some Makefile tricks so that we only rebuild when the git
hash changes. This avoids rebuilding the world every time we run
make.
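The underlying idea is to rewrite the generated file only when its
contents would actually change, so that its timestamp stays put and
dependent targets are not rebuilt. A minimal sketch of that technique,
written in Python rather than make and using hypothetical helper names
and a made-up file format:

```python
def write_if_changed(path, content):
    """Write content to path only if it differs from the current file.
    Returns True if the file was (re)written."""
    try:
        with open(path) as f:
            if f.read() == content:
                return False         # unchanged: leave the timestamp alone
    except FileNotFoundError:
        pass
    with open(path, "w") as f:
        f.write(content)
    return True

def format_git_info(git_hash, dirty):
    """Hypothetical one-line record of the hash and dirty status."""
    return f"{git_hash} dirty={int(dirty)}\n"
```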
Also adds a fusesoc generator, so that should continue to work as
before.
Signed-off-by: Dan Horák <dan@danny.cz>
Signed-off-by: Michael Neuling <mikey@neuling.org>
This adds a pipelined 33-bit by 33-bit signed multiplier with one
cycle latency to the execute pipeline, and uses it for the mullw,
mulhw and mulhwu instructions. Because it has one cycle of latency we
can assume that its result is available in the second execute stage
without needing to add busy logic to the second stage.
This adds both a generic version of the multiplier and a
Xilinx-specific version using four DSP slices of the Artix-7.
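The reason for 33 bits is that a single signed multiplier can then serve
both the signed and unsigned word multiplies: each 32-bit operand is
either sign-extended or zero-extended by one bit first. A rough software
model of that idea follows; the helper names are hypothetical and the
model says nothing about how the DSP slices are arranged.

```python
def ext33(x, signed):
    """Extend a 32-bit operand to a 33-bit signed value."""
    x &= 0xFFFFFFFF
    if signed and (x & 0x80000000):
        x -= 1 << 32                 # sign-extend
    return x                         # always within [-2**32, 2**32 - 1]

def mul33(a, b):
    """Signed 33-bit by 33-bit multiply."""
    assert -(1 << 32) <= a < (1 << 32) and -(1 << 32) <= b < (1 << 32)
    return a * b

def mullw(ra, rb):
    return mul33(ext33(ra, True), ext33(rb, True)) & 0xFFFFFFFFFFFFFFFF

def mulhw(ra, rb):
    return (mul33(ext33(ra, True), ext33(rb, True)) >> 32) & 0xFFFFFFFF

def mulhwu(ra, rb):
    return (mul33(ext33(ra, False), ext33(rb, False)) >> 32) & 0xFFFFFFFF
```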
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This splits out the decoding done in the decode0 step into a separate
predecoder, used when writing instructions into the icache. The
icache now holds 36 bits per instruction rather than 32. For valid
instructions, those 36 bits comprise the bottom 26 bits of the
instruction word, a 9-bit insn_code value (which uniquely identifies
the instruction), and a zero in the MSB. For illegal instructions,
the MSB is one and the full instruction word is in the bottom 32 bits.
Having the full instruction word available for illegal instructions
means that it can be printed in the log when simulating, or in future
could be placed in the HEIR register.
If we don't have an FPU, then the floating-point instructions are
regarded as illegal. In that case, the insn_code values would fit
into 8 bits, which could be used in future to reduce the size of
decode_rom from 512 to 256 entries.
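The 36-bit icache entry format can be illustrated with a small
pack/unpack model. The placement of insn_code in the bits directly above
the 26 instruction bits is an assumption made for this sketch, as are the
function names; only the field widths and the meaning of the MSB come
from the description above.

```python
ILLEGAL_FLAG = 1 << 35               # MSB of the 36-bit entry

def pack_valid(insn, insn_code):
    """Valid instruction: MSB 0, 9-bit insn_code, bottom 26 bits of insn."""
    assert 0 <= insn_code < 512
    return (insn_code << 26) | (insn & 0x3FFFFFF)

def pack_illegal(insn):
    """Illegal instruction: MSB 1, full 32-bit word in the bottom bits."""
    return ILLEGAL_FLAG | (insn & 0xFFFFFFFF)

def unpack(entry):
    if entry & ILLEGAL_FLAG:
        return ("illegal", entry & 0xFFFFFFFF)
    return ("valid", (entry >> 26) & 0x1FF, entry & 0x3FFFFFF)
```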
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This uses the JTAGG primitive which is similar to BSCANE2.
The LUT4 delay approach came from Florian and Greg in
https://github.com/enjoy-digital/litex/pull/1087
This has been tested on an OrangeCrab with a 48MHz sysclk and an FT232H
at up to 30MHz (though libusb/urjtag is by far the bottleneck, rather
than the JTAG clock).
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
This moves the calculation of the result for popcnt* into the
countbits unit, renamed from countzero, so that we can take two cycles
to get the result. The motivation for this is that the popcnt*
calculation was showing up as a critical path.
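One way to picture a two-step population count is to compute per-byte
counts first and then sum them per word or per doubleword. The sketch
below models that split; where the real countbits unit places its
pipeline register is not stated here, so the staging is an assumption.

```python
def popcntb(x):
    """Step 1: population count of each byte, kept in that byte."""
    out = 0
    for i in range(8):
        byte = (x >> (8 * i)) & 0xFF
        out |= bin(byte).count("1") << (8 * i)
    return out

def popcntw(x):
    """Step 2 for popcntw: sum the byte counts within each 32-bit word."""
    counts = popcntb(x)
    out = 0
    for w in range(2):
        total = sum((counts >> (8 * (4 * w + i))) & 0xFF for i in range(4))
        out |= total << (32 * w)
    return out

def popcntd(x):
    """Step 2 for popcntd: sum all eight byte counts."""
    counts = popcntb(x)
    return sum((counts >> (8 * i)) & 0xFF for i in range(8))
```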
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
top-orangecrab0.2 is a copy of top-arty with various changes:
USRMCLK is added for the SPI clock, and ethernet is removed.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Yosys changed command line behaviour following the v0.12 release. Work
around this by using read_verilog, which maintains the old behaviour.
This should work fine for current yosys and be compatible with
future releases.
See https://github.com/YosysHQ/yosys/issues/3109
Signed-off-by: Joel Stanley <joel@jms.id.au>
The existing Orange Crab target is for an older board with a
LFE5UM5G-85F device. Newer Orange Crab boards (v0.21) have a
LFE5U-85F device in the -8 speed grade, so make a new target for them
called ORANGE-CRAB-0.21.
Also add flags to ecppack to indicate that the bitstream should be
compressed and can be loaded at 38.8MHz.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
I'm not sure why I set the input frequency for the Orange Crab to 50MHz.
Since we easily make timing now, bump our output frequency to 48MHz as
well.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
While verilator finds the correct top level module with the current
setup, if we start adding simulation models it can get confused.
Explicitly specify the top level module.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Recent versions of verilator support the --build option, allowing
us to remove a step.
Also add a Docker image for verilator.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This is the start of an implementation of a PMU according to PowerISA
v3.0B. Things not implemented yet include most architected events,
the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We've been investigating why the barrel rotator uses an enormous
number of cells on the yosys ECP5 target. Eventually it was narrowed
down to the -abc9 -nowidelut options, which see the cell count go from
4985 cells to 841 cells.
Using the same options on an Orange Crab build reduces the cell count
from 50864 to 36085. The main differences:
  LUT4     31040 -> 25270
  PFUMX     6956 -> 0
  L6MUX21   1746 -> 0
  CCU2C     2066 -> 1759
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The icache RAM is currently LUT RAM, not block RAM. This massively
bloats the icache size. We think this is due to yosys not inferring
the RAM correctly but that's yet to be confirmed.
Work around this for now by reducing the default size of the icache
RAM for the ECP5 builds.
On the ECP5 85K builds, this gets us from 95% down to 76% and helps
our CI to pass.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This commit also removes the dependencies these testbenches have on
VHPIDIRECT. The use of VHPIDIRECT limits the number of available
simulators for the project. Rather than using foreign functions, the
testbenches can be implemented entirely in VHDL where equivalent
functionality exists. For these testbenches the VHPIDIRECT-based
randomization functions were replaced with VHDL-based functions.
The testbenches recognized by VUnit can be executed in parallel threads
for better simulation performance using the -p option to the run.py
script.
Signed-off-by: Lars Asplund <lars.anders.asplund@gmail.com>
This adds a GPIO controller which provides 32 bits of I/O. The
registers are modelled on the set used by the gpio-ftgpio010.c driver
in the Linux kernel. Currently there is no interrupt capability
implemented, though an interrupt line from the GPIO subsystem to the
XICS has been connected.
For the Arty A7 board, GPIO lines 0 to 13 are connected to the pins
labelled IO0 to IO13 on the "shield" connector, GPIO lines 14 to 29
connect to IO26 to IO41, GPIO line 30 connects to the pin labelled A
(aka IO42), and GPIO line 31 is connected to LED 7.
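The Arty A7 mapping described above can be written out as a small lookup,
which may be handy when checking wiring from software. The function name
is hypothetical; the ranges come directly from the description.

```python
def arty_pin_for_gpio(line):
    """Return the Arty A7 label for a GPIO line number (0..31)."""
    if 0 <= line <= 13:
        return f"IO{line}"           # GPIO 0..13  -> IO0..IO13
    if 14 <= line <= 29:
        return f"IO{line + 12}"      # GPIO 14..29 -> IO26..IO41
    if line == 30:
        return "A (IO42)"
    if line == 31:
        return "LED 7"
    raise ValueError(f"no such GPIO line: {line}")
```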
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the way GPR hazards are detected and tracked. Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.
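The tagging scheme can be pictured with a small behavioural model. This
is only a sketch: the class and method names are hypothetical, it assumes
no more than four instructions are in flight at once, and it ignores
everything else the real pipeline does.

```python
class HazardTags:
    def __init__(self, num_tags=4):          # a 2-bit tag gives 4 tags
        self.writes = [None] * num_tags      # tag -> register being written
        self.next_tag = 0

    def issue(self, dest_reg=None):
        """Assign the next tag and record the register (if any) written."""
        tag = self.next_tag
        self.next_tag = (tag + 1) % len(self.writes)
        # A newer writer supersedes an older in-flight writer of the
        # same register: later readers must wait for the newest value.
        for t, reg in enumerate(self.writes):
            if reg is not None and reg == dest_reg:
                self.writes[t] = None
        self.writes[tag] = dest_reg
        return tag

    def pending_tag(self, src_reg):
        """Tag a reader of src_reg must wait for, or None if no hazard."""
        for t, reg in enumerate(self.writes):
            if reg is not None and reg == src_reg:
                return t
        return None

    def writeback(self, tag):
        """The value with this tag is being written to the register file."""
        self.writes[tag] = None
```

An instruction reading a register stalls while pending_tag() returns a
tag for it, and proceeds once writeback() has been called with that tag.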
For now, the forwarding paths are disabled. That gives about an 8%
reduction in coremark performance.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Our Makefiles need some work, but for now create an FPGA target:
    make FPGA_TARGET=verilator microwatt-verilator
ghdl and yosys can be run in containers using the PODMAN=1 or DOCKER=1
options.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
yosys and verilator did not like us passing in the verilog and
exporting it again. Pass the source directly to verilator instead.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>