This adds a new module to implement an MMU. At the moment it doesn't
do very much. Tlbie instructions now get sent by loadstore1 to mmu,
which sends them to dcache, rather than loadstore1 sending them
directly to dcache. TLB misses from dcache now get sent by loadstore1
to mmu, which currently just returns an error. Loadstore1 then
generates a DSI in response to the error return from mmu.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This comes in two parts:
- A generator script which uses LiteX to generate litedram cores
along with their init files for various boards (currently Arty and
Nexys-video). This comes with configs for arty and nexys_video.
- A fusesoc "generator" which uses pre-generated litedram cores
The generation process is manual on purpose. This include pre-generated
cores for the two above boards.
This is done so that one doesn't have to install LiteX to build
microwatt. In addition, the generator script or wrapper vhdl tend to
break when LiteX changes significantly which happens.
This is still rather standalone and hasn't been plumbed into the SoC
or the FPGA toplevel files yet.
At this point LiteDRAM self-initializes using a built-in VexRiscv
"Minimum" core obtained from LiteX and included in this commit. There
is some plumbing to generate and cores that are initialized by Microwatt
directly but this isn't working yet and so isn't enabled yet.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LiteDRAM at the moment pretty much enforces 100Mhz, and our software
isn't quite yet adaptable, so switch out default to 100Mhz accross
the board. Recent timing improvements should make it a non-issue.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This replaces the simple_ram_behavioural and mw_soc_memory modules
with a common wishbone_bram_wrapper.vhdl that interfaces the
pipelined WB with a lower-level RAM module, along with an FPGA
and a sim variants of the latter.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Vivado by default tries to flatten the module hierarchy to improve
placement and timing. However this makes debugging timing issues
really hard as the net names in the timing report can be pretty
bogus.
This adds a generic that can be used to control attributes to stop
vivado from flattening the main core components. The resulting design
will have worst timing overall but it will be easier to understand
what the worst timing path are and address them.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This replaces loadstore2 with a dcache
The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.
The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle. This removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Check GPRs against any writers in the pipeline.
All instructions are still marked single in pipeline at
this stage.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds combinatorial logic that does 32-bit and 64-bit count
leading and trailing zeroes in one unit, and consolidates the
four instructions under a single OP_CNTZ opcode.
This saves 84 slice LUTs on the Arty A7-100.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds support for set associativity to the icache. It can still
be direct mapped by setting NUM_WAYS to 1.
The replacement policy uses a simple tree-PLRU for each set.
This is only lightly tested, tests pass but I have to double check
that we are using the ways effectively and not creating duplicates.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a new entity 'rotator' which contains combinatorial logic
for rotating and masking 64-bit values. It implements the operations
of the rlwinm, rlwnm, rlwimi, rldicl, rldicr, rldic, rldimi, rldcl,
rldcr, sld, slw, srd, srw, srad, sradi, sraw and srawi instructions.
It consists of a 3-stage 64-bit rotator using 4:1 multiplexors at
each stage, two mask generators, output logic and control logic.
The insn_type_t values used for these instructions have been reduced
to just 5: OP_RLC, OP_RLCL and OP_RLCR for the rotate and mask
instructions (clear both left and right, clear left, clear right
variants), OP_SHL for left shifts, and OP_SHR for right shifts.
The control signals for the rotator are derived from the opcode
and from the is_32bit and is_signed fields of the decode_rom_t.
The rotator is instantiated as an entity in execute1 so that we can
be sure we only have one of it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We can now pass both the input clock and target clock frequency
via generics. Add support for both 50Mhz and 100Mhz target freqs
for both cases.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These are a copy of the A7-35 definitions with 35 changed to 100.
The A7-100 uses the same .xdc file (arty_a7-35.xdc) as the A7-35
since the only difference between the two is the FPGA part; the
hardware and connections on the two boards are identical.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a debug module off the DMI (debug) bus which can act as a
wishbone master to generate read and write cycles.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a simple bus that can be mastered from an external
system via JTAG, which will be used to hookup various debug
modules.
It's loosely based on the RiscV model (hence the DMI name).
The module currently only supports hooking up to a Xilinx BSCANE2
but it shouldn't be too hard to adapt it to support different TAPs
if necessary.
The JTAG protocol proper is not exactly the RiscV one at this point,
though I might still change it.
This comes with some sim variants of Xilinx BSCANE2 and BUFG and a
test bench.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(And rename it to mw_soc_memory).
This makes soc.vhdl simpler and provides the same interface as
the simulated memory, which will help when sharing soc.vhdl
with sim later
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This will be useful when we start needing different toplevels for
different boards.
We keep the reset and clock generators in the toplevel as they will
eventually be taken over by litedram when we integrate it, and they
are more likely to change on different system types.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds support for the Digilane Cmod A7-35.
I had to use the MMCM because the clock (12 MHz) is below the PLL
minimum of 19 MHz.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The old reset code was overly complicated and never worked properly.
Replace it with a simpler sequence that uses a couple of shift registers
to assert resets:
- Wait a number of external clock cycles before removing reset from
the PLL.
- After the PLL locks and the external reset button isn't pressed,
wait a number of PLL clock cycles before removing reset from the SOC.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The decode2 stage was spaghetti code and needed cleaning up.
Create a series of functions to pull fields from a ppc instruction
and also a series of helpers to extract values for the execution
units.
As suggested by Paul, we should pass all signals to the execution
units and only set the valid signal conditionally, which should
use less resources.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The synth target can be used to analyze the core after synthesis
without running P&R. Currently, the only edalize backends that
support synthesis without P&R are vivado and icestorm, and icestorm
needs yosys built with verific support to parse vhdl.
To run synthesis only for a part, run
fusesoc run --target=synth --tool=vivado microwatt --part=<part>
where part is a valid Xilinx part such as xc7a100tcsg324-1