While these signals should only be read when valid is true, they
are only a small number of bits and we want to reduce the amount of
U/X state bouncing around the chip.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
decode1 has a lot of logic that uses i_out.insn without first looking at
i_iout.valid. Play it safe and never output X state.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
While we should only look at this when d_out.valid = 1, we may as remove
some U state across interfaces.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
While this is not an issue in VHDL, I noticed this when running
a script over the source and we may as well fix it.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
While trying to reduce U/X state issues, I notice that our BSS is not
being initialised in the hello world test.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
These instructions are similar to those at
https://ozlabs.org/~joel/microwatt/README
except they describe how to build the artifacts from scratch instead of
downloading them.
Signed-off-by: Joel Stanley <joel@jms.id.au>
The SoC defaults to using the uart16550 so provide instructions on how
to fetch that library when seetting up fusesoc.
Also remove the text about a working directory; fusesoc doesn't need
one.
Signed-off-by: Joel Stanley <joel@jms.id.au>
log2ceil() returns the number of bits required to store a value, so we
need to pass in memory_size-1, not memory_size.
Every other user of log2ceil() gets this right.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Setting icache to be privileged and accessing physical memory directly.
And set big_endian to 0 to correspond to the testbench result.
Signed-off-by: Tianrui Wei <tianrui@tianruiwei.com>
Revert to linking dynamically by default, can statically link with
`make STATIC_URJTAG=1`
Fixes#351
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
We've had these for a while now:
- D/I cache
- GPR bypassing
- Supervisor state (and can boot linux)
We still need Vector/VMX/VSX (and probably some other things)
Signed-off-by: Michael Neuling <mikey@neuling.org>
This is necessary for the upcoming Arctic Tern system enablement,
since Arctic Tern uses two DRAM devices and a separate clock line
is routed to each device. LiteX handles this behavior correctly,
therefore we assume other hardware exists that uses a similar
DRAM clock design.
Updates from Mikey to fix some compile issues.
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
At present, the loop in the irq_gen process generates a chain of
comparators and other logic to work out the source number and priority
of the most-favoured (lowest priority number) pending interrupt.
This replaces that chain with (1) logic to generate an array of bits,
one per priority, indicating whether any interrupt is pending at that
priority, (2) a priority encoder to select the most favoured priority
with an interrupt pending, (3) logic to generate an array of bits, one
per source, indicating whether an interrupt is pending at the priority
calculated in step 2, and (4) a priority encoder to work out the
lowest numbered source that has an interrupt pending at the selected
priority. This reduces LUT utilization.
The priority encoder function implemented here uses the optimized
count-leading-zeroes logic from helpers.vhdl.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements an alternative count-leading-zeroes algorithm which
uses less LUTs to generate the higher-order bits (2..5) of the
result.
By doing (v | -v) rather than (v & -v), we get a value which has ones
from the MSB down to the rightmost 1 bit in v and then zeroes down to
the LSB. This means that we can generate the MSB of the result (the
index of the rightmost 1 bit in v) just by looking at bits 63 and 31
of (v | -v), assuming that v is 64 bits. Bit 4 of the result requires
looking at bits 63, 47, 31 and 15. In contrast, each bit of the
result using (v & -v), which has a single 1, requires ORing together
32 bits.
It turns out that the minimum LUT usage comes from using (v & -v) to
generate bits 0 and 1 of the result, and using (v | -v) to generate
bits 2 to 5. This saves almost 60 6-input LUTs on the Artix-7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This generates a series of io_cycle_* signals which are clean latches
and which become the 'cyc' signals of the wishbone buses going to
various peripherals (syscon, uarts, XICS, GPIO, etc.). Effectively
this is done by moving the address decoding into the slave_io_latch
process. The slave_io_type, which drives the multiplexer which
selects which wishbone to look for a response on, is reduced to just 8
values in the expectation that an 8-way multiplexer will use less
logic than one with more than 8 inputs.
With this timing is considerably better on the A7-100T.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
"-b ecp5" will select ECP5 interface that talks to a JTAGG
primitive.
For example with a FT232H JTAG board:
./mw_debug -t 'ft2232 vid=0x0403 pid=0x6014' -s 30000000 -b ecp5 mr ff003888 6
Connected to libftdi driver.
Found device ID: 0x41113043
00000000ff003888: 6d6f636c65570a0a ..Welcom
00000000ff003890: 63694d206f742065 e to Mic
00000000ff003898: 2120747461776f72 rowatt !
00000000ff0038a0: 0000000000000a0a ........
00000000ff0038a8: 67697320636f5320 Soc sig
00000000ff0038b0: 203a65727574616e nature:
Core: running
NIA: c0000000000187f8
MSR: 9000000000001033
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
This uses the JTAGG primitive which is similar to BSCANE2.
The LUT4 delay approach came from Florian and Greg in
https://github.com/enjoy-digital/litex/pull/1087
Has been tested on an OrangeCrab with 48MHz sysclk
FT232H up to 30MHz (though libusb/urjtag is by far the bottleneck vs
the JTAG clock)
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
liburjtag isn't in Debian, so usually we're pointing at a urjtag
build directory when building mw_debug
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
This removes logic that I added some time ago with the thought that it
would enable us to do prefetching in the icache. This logic detects
when the fetch address is an odd multiple of 4 and the next address in
sequence from the previous cycle. In that case the instruction we
want is in the output register of the icache RAM already so there is
no need to do another read or any icache tag or TLB lookup.
However, this logic adds complexity, and removing it improves timing,
so this removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the calculation of the result for popcnt* into the
countbits unit, renamed from countzero, so that we can take two cycles
to get the result. The motivation for this is that the popcnt*
calculation was showing up as a critical path.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Orangecrab missed out on:
Make wishbone addresses be in units of doublewords or words
Author: Paul Mackerras <paulus@ozlabs.org>
Date: Wed Sep 15 18:18:09 2021 +1000
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Modifies litescard generate script to take a clock speed.
Regenerated verilog with latest litesdcard
e52c731 ("Bump year.")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Reduce litedram NUM_LINES 64->8
This allows us to meet timing. Can probably
be improved in future with better BRAM usage.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
top-orangecrab0.2 is a copy of top-arty with various changes.
USRMCLK is added for the SPI clock
ethernet is removed
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Parameters are based on
https://github.com/gregdavill/OrangeCrab-test-sw/blob/main/hw/OrangeCrab-bitstream.py
and litex-boards orangecrab.py
rtt_nom and cmd_delay are overridden for OrangeCrab, we do the same here.
Generated with litedram and litex
62abf9c ("litedram_gen: Add block_until_ready port parameter to control blocking behaviour.")
add2746a ("tools/litex_cli: Rename wb to bus.")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Recent litedram gets stuck at memtest unless block_until_ready=False.
(discussion in https://github.com/enjoy-digital/litedram/pull/292)
This change regenerates with latest litedram and litex
62abf9c ("litedram_gen: Add block_until_ready port parameter to control blocking behaviour.")
add2746a ("tools/litex_cli: Rename wb to bus.")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Yosys changed command line behaviour following the v0.12 release. Work
around this by using read_verilog, which maintains the old behaviour.
This should work fine for current yosys and be compatible with
future releases.
See https://github.com/YosysHQ/yosys/issues/3109
Signed-off-by: Joel Stanley <joel@jms.id.au>
At present, code (such as simple_random) which produces serial port
output during the first few milliseconds of operation produces garbled
output. The reason is that the clock has not yet stabilized and is
running slow, resulting in the bit time of the serial characters being
too long.
The ECP5 data sheet says that the phase detector should be operated
between 10 and 400 MHz. The current code operates it at 2MHz.
Consequently, the PLL lock indication doesn't work, i.e. it is always
zero. The current code works around that by inverting it, i.e. taking
the "not locked" indication to mean "locked".
Instead, we now run it at 12MHz, chosen because the common external
clock inputs on ECP5 boards are 12MHz and 48MHz. Normally this would
mean that the available system clock frequencies would be multiples of
12MHz, but this is a little inconvenient as we use 40MHz on the Orange
Crab v0.21 boards. Instead, by using the secondary clock output for
feedback, we can have any divisor of the PLL frequency as the system
clock frequency.
The ECP5 data sheet says the PLL oscillator can run at 400 to 800
MHz. Here we choose 480MHz since that allows us to generate 40MHz and
48MHz easily and is a multiple of 12MHz.
With this, the lock signal works correctly, and the inversion can be
removed.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The existing orange crab target is for an older board with a
LFE5UM5G-85F device. Newer orange crab boards (v0.21) have a
LFE5U-85F device in the -8 speed grade, so make a new target for them
called ORANGE-CRAB-0.21.
Also add flags to ecppack to indicate that the bitstream should be
compressed and can be loaded at 38.8MHz.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
These convert addresses to/from wishbone addresses, and use them
in parts of the caches, in order to make the code a bit more readable.
Along the way, rename some functions in the caches to make it a bit
clearer what they operate on and fix a bug in the icache STOP_RELOAD state where
the wb address wasn't properly converted.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This moves REAL_ADDR_BITS out of the caches and defines a real_addr_t
type for a real address, along with a addr_to_real() conversion helper.
It makes the vhdl a bit more readable
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have a bug where an store near a dcbz can cause the dcbz to only zero
8 bytes. Add a test case for this.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Note: There are a few patches to upstream to fix an upstream breakage
of litedram standalone generator, and fix some issues with liteeth
in the way it's used on Wukong. All these have pending pull requests.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For now only the V2 of the board (slightly different pinout)
and only the A100T variant. I also haven't added GPIOs or anything
else on the PMODs really.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This fixes a bug where a dcbz can get incorrectly handled as an
ordinary 8-byte store if it arrives while the dcache state machine is
handling other stores with the same tag value (i.e. within the same
set-sized area of memory). The logic that says whether to include a
new store in the current wishbone cycle didn't take into account
whether the new store was a dcbz. This adds a "req.dcbz = '0'" factor
so that it does. This is necessary because dcbz is handled more like
a cache line refill (but writing to memory rather than reading) than
an ordinary store.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes two bugs in the flash invalidation of the icache.
The first is that an instruction could get executed twice. The
i-cache RAM is 2 instructions (64 bits) wide, so one read can supply
results for 2 cycles. The fetch1 stage tells icache when the address
is equal to the address of the previous cycle plus 4, and in cases
where that is true, bit 2 of the address is 1, and the previous cycle
was a cache hit, we just use the second word of the doubleword read
from the cache RAM. However, the cache hit/miss logic also continues
to operate, so in the case where the first word hits but the second
word misses (because of an icache invalidation or a snoop occurring in
the first cycle), we supply the instruction from the data previously
read from the icache RAM but also stall fetch1 and start a cache
reload sequence, and subsequently supply the second instruction
again. This fixes the issue by inhibiting req_is_miss and stall_out
when use_previous is true.
The second bug is that if an icache invalidation occurs while
reloading a line, we continue to reload the line, and make it valid
when the reload finishes, even though some of the data may have been
read before the invalidation occurred. This adds a new state
STOP_RELOAD which we go to if an invalidation happens while we are in
CLR_TAG or WAIT_ACK state. In STOP_RELOAD state we don't request any
more reads from memory and wait for the reads we have previously
requested to be acked, and then go to IDLE state. Data returned is
still written to the icache RAM, but that doesn't matter because the
line is invalid and is never made valid.
Note that we don't have to worry about invalidations due to snooped
writes while reloading a line, because the wishbone arbiter won't
switch to another master once it has started sending our reload
requests to memory. Thus a store to memory will either happen before
any of our reads have got to memory, or after we have finished the
reload (in which case we will no longer be in WAIT_ACK state).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
I'm not sure why I set the input frequency for the Orange Crab to 50MHz.
Since we easily make timing now, bump our output frequency to 48MHz as
well.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This makes the 64-bit wishbone buses have the address expressed in
units of doublewords (64 bits), and similarly for the 32-bit buses the
address is in units of words (32 bits). This is to comply with the
wishbone spec. Previously the addresses on the wishbone buses were in
units of bytes regardless of the bus data width, which is not correct
and caused problems with interfacing with externally-generated logic.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds an optional 16 bit x 16 bit signed multiplier and uses it
for multiply instructions that return the low 64 bits of the product
(mull[dw][o] and mulli, but not maddld) when the operands are both in
the range -2^15 .. 2^15 - 1. The "short" 16-bit multiplier produces
its result combinatorially, so a multiply that uses it executes in one
cycle. This improves the coremark result by about 4%, since coremark
does quite a lot of multiplies and they almost all have operands that
fit into 16 bits.
The presence of the short multiplier is controlled by a generic at the
execute1, SOC, core and top levels. For now, it defaults to off for
all platforms, and can be enabled using the --has_short_mult flag to
fusesoc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some critical path reports showed r1.req.addr depending on l_in.valid,
which then depended ultimately on the dcache's r1.ls_valid. In fact
we can update r1.req.addr (and other fields of r1.req, except for
r1.req.valid) independently of l_in.valid as long as busy = 0.
We do also need to preserve r1.req.addr0 when l_in.valid = 0, so we
pull it out of r1.req and store it separately in r1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a bit to the BTC to store whether the corresponding branch
instruction was taken last time it was encountered. That lets us pass
a not-taken prediction down to decode1, which for backwards direct
branches inhibits it from redirecting fetch to the target of the
branch. This increases coremark by about 2%.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes s0 to use the P register rather than the A/B/C input
registers, thus improving the timing of the multiplier output. The
m00, m02 and m03 multipliers now use their P registers rather than the
M registers, moving the addition they do from the second cycle to the
first.
Also, the XOR that inverts the 32 LSBs is moved before the output
register.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A non-cacheable load should only load the data requested and no more. We
do the right thing for real mode cache inhibited storage instructions,
but when loading through a non-cacheable PTE we load the entire 64 bits
regardless of the size.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The row in the decode table for isel with BC=0 was inadvertently left
marked as single-issue by commit 813f834012 ("Add CR hazard
detection", 2019-10-15). Fix it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- mcrxrx put the bits in the wrong order
- addpcis was setting CR0 if the instruction bit 0 = 1, which it
shouldn't
- bpermd was producing 0 always and additionally had the wrong bit
numbering
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This checks that the store forwarding machinery in the dcache
correctly combines forwarded stores when they are partial stores
(i.e. only writing part of the doubleword, as for a byte store).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We have two stages of forwarding to cover the two cycles of latency
between when something is written to BRAM and when that new data can
be read from BRAM. When the writes to BRAM result from store
instructions, the write may write only some bytes of a row (8 bytes)
and not others, so we have a mask to enable only the written bytes to
be forwarded. However, we only forward written data from either the
first stage of forwarding or the second, not both. So if we have
two stores in succession that write different bytes of the same row,
and then a load from the row, we will only forward the data from the
second store, and miss the data from the first store; thus the load
will get the wrong value.
To fix this, we make the decision on which forward stage to use for
each byte individually. This results in a 4-input multiplexer feeding
r1.data_out, with its inputs being the BRAM, the wishbone, the current
write data, and the 2nd-stage forwarding register. Each byte of the
multiplexer is separately controlled. The code for this multiplexer
is moved to the dcache_fast_hit process since it is used for cache
hits as well as cache misses.
This also simplifies the BRAM code by ensuring that we can use the
same source for the BRAM address and way selection for writes, whether
we are writing store data or cache line refill data from memory.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the way multiplexer for the data from the BRAM, and the
multiplexers for forwarding data from earlier stores or refills,
before a clock edge rather than after, so that now the data output
from the dcache comes from a clean latch. To do this we remove the
extra latch on the output of the data BRAM (i.e. ADD_BUF=false) and
rearrange the logic. The choice whether to forward or not now depends
not on way comparisons but rather on a tag comparisons, for the sake
of timing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
While verilator finds the correct top level module with the current
setup, if we start adding simulation models it can get confused.
Explicitly specify the top level module.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Recent versions of verilator support the --build option, allowing
us to remove a step.
Also add a Docker image for verilator.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This implements most of the architected PMU events. The ones missing
are mostly the ones that depend on which level of the cache hierarchy
data is fetched from. The events implemented here, and their raw
event codes, are:
Floating-point operation completed (100f4)
Load completed (100fc)
Store completed (200f0)
Icache miss (200fc)
ITLB miss (100f6)
ITLB miss resolved (400fc)
Dcache load miss (400f0)
Dcache load miss resolved (300f8)
Dcache store miss (300f0)
DTLB miss (300fc)
DTLB miss resolved (200f6)
No instruction available and none being executed (100f8)
Instruction dispatched (200f2, 300f2, 400f2)
Taken branch instruction completed (200fa)
Branch mispredicted (400f6)
External interrupt taken (200f8)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The architecture states that when MMCR0[PMCC] = 0b11, PMC5 and PMC6
are not part of the Performance Monitor, meaning that they are not
controlled by bits in MMCRs, and counter negative conditions in PMCs 5
and 6 don't generate Performance Monitor alerts, exceptions or
interrupts. It doesn't say that PMC5 and PMC6 are frozen in this
case, so presumably they should continue to count run instructions and
run cycles.
This implements that behaviour.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The verilator build fails with warnings and errors, because NGPIO
is 0 and we do things like:
gpio_out : out std_ulogic_vector(NGPIO - 1 downto 0);
Set NGPIO to something reasonable (eg 32) and add HAS_GPIO to avoid
building the macro entirely if it isn't in use.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Experimenting with using ghdl to do VHDL to Verilog conversion (instead
of ghdl+yosys), verilator complains that a signal is a SystemVerilog
keyword:
%Error: microwatt.v:15013:18: Unexpected 'do': 'do' is a SystemVerilog keyword misused as an identifier.
... Suggest modify the Verilog-2001 code to avoid SV keywords, or use `begin_keywords or --language.
We could probably make this go away by disabling SystemVerilog, but
it's easy to rename the signal in question. Rename di at the same
time.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This is the start of an implementation of a PMU according to PowerISA
v3.0B. Things not implemented yet include most architected events,
the BHRB, event-based branches, thresholding, MMCR0[TBCC] field, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We've been investigating why the barrel rotator uses an enormous
number of cells on the yosys ECP5 target. Eventually it was narrowed
down to the -abc9 -nowidelut options, which see the cell count go from
4985 cells to 841 cells.
Using the same options on an Orange Crab build reduces the cell count
from 50864 to 36085. The main differences:
LUT4 31040 -> 25270
PFUMX 6956 -> 0
L6MUX21 1746 -> 0
CCU2C 2066 -> 1759
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds a load before a floating-point load which should generate a
floating-point unavailable interrupt, to test for the bug where
unavailability interrupts can get dropped while loadstore1 is
executing instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At present the logic prevents any interrupts from being handled while
there is a load/store instruction (one that has unit=LDST) being
executed. However, load/store instructions can still get sent to
loadstore1. Thus an instruction which should generate an interrupt
such as a floating-point unavailable interrupt will instead get
executed.
To fix this, when we detect that an interrupt should be generated but
loadstore1 is still executing a previous instruction, we don't execute
any new instructions, and set a new r.intr_pending flag. That results
in busy_out being asserted (meaning that no further instructions will
come in from decode2). When loadstore1 has finished the instructions
it has, the interrupt gets sent to writeback. If one of the
instructions in loadstore1 generates an interrupt in the meantime, the
l_in.interrupt signal gets asserted and that clears r.intr_pending, so
the interrupt we detected gets discarded.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
litesdcard provides a macro per vendor (eg xilinx, lattice) and not per
board, so modify the fusesoc generator to take a vendor. This will make
it easier to add litesdcard to more boards.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Unfortunately the CSR layout has shifted on upstream litex, so this
is built with the following litex patch backed out:
aad56a047a33 ("integration/soc: Use CSR automatic allocation.")
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
csr_data_width is no longer required. Add ntxslots and nrxslots
parameters but set them to the default value.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This option was added in the commit but is no longer needed for github
CI to work.
commit ef0dcf3bc6
Author: Michael Neuling <mikey@neuling.org>
Date: Thu Jul 2 14:36:14 2020 +1000
Add SYNTH_ECP5_FLAGS option for building
Removing noflatten has the added advantage that it gets our builds
from 75% down to 59% usage on ECP5 85K.
Signed-off-by: Michael Neuling <mikey@neuling.org>
The icache RAM is currently LUT ram not block ram. This massively
bloats the icache size. We think this is due to yosys not inferencing
the RAM correctly but that's yet to be confirmed.
Work around this for now by reducing the default size of the icache
RAM for the ECP5 builds.
On the ECP5 85K builts, this gets us from 95% down to 76% and helps
our CI to pass.
Signed-off-by: Michael Neuling <mikey@neuling.org>
This implements a 1-entry partition table, so that instead of getting
the process table base address from the PRTBL SPR, the MMU now reads
the doubleword pointed to by the PTCR register plus 8 to get the
process table base address. The partition table entry is cached.
Having the PTCR and the vestigial partition table reduces the amount
of software change required in Linux for Microwatt support.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The VUnit log package is a SW style logging framework in VHDL and the check package is an assertion library doing its error reporting with VUnit logging.
These testbenches don't use, and do not need, very advanced logging/checking features but the following was possible to improve
- Checking equality in VHDL can be quite tedious with a lot of type conversions and long message strings to explain the data received and what was expected.
VUnit's check_equal procedure allow comparison between same or similar types and automatically create the error message for you.
- The code has report statements used for testbench progress reporting and debugging. These were replaced with the info and debug procedures.
info logs are visible by default while debug is not. This means that debug logs don't have to be commented, which they are now, when not used.
Instead there is a show procedure making debug messages visible. The show procedure has been commented to hide the debug messages but a more elegant
solution is to control visibility from a generic and then set that generic from the command line. I've left this as a TODO but the run script allow you to
extend the standard CLI of VUnit to add new options and you can also set generics from the run script.
- VUnit log messages are color coded if color codes are supported by the terminal. It makes it quicker to spot messages of different types when there are many log messages.
Error messages will always be made visible on the terminal but you must use the -v (verbose) to see other logs.
- Some tests have a lot of "metvalue detected" warning messages from the numeric_std package and these clutter the logs when using the -v option. VUnit has a simulator independent
option allowing you to suppress those messages. That option has been enabled.
Signed-off-by: Lars Asplund <lars.anders.asplund@gmail.com>
Several of the testbenches have stimuli code divided into sections preceded with a header comment explaining
what is being tested. These sections have been made into VUnit test cases. The default behavior of VUnit is
to run each test case in a separate simulation which comes with a number of benefits:
* A failing test case doesn't prevent other test cases to be executed
* Test cases are independent. A test case cannot fail as a side-effect to a problem with another test case
* Test execution can be more parallelized and the overall test execution time reduced
Signed-off-by: Lars Asplund <lars.anders.asplund@gmail.com>
This commit also removes the dependencies these testbenches have on VHPIDIRECT.
The use of VHPIDIRECT limits the number of available simulators for the project. Rather than using
foreign functions the testbenches can be implemented entirely in VHDL where equivalent functionality exists.
For these testbenches the VHPIDIRECT-based randomization functions were replaced with VHDL-based functions.
The testbenches recognized by VUnit can be executed in parallel threads for better simulation performance using
the -p option to the run.py script
Signed-off-by: Lars Asplund <lars.anders.asplund@gmail.com>
The VUnit run script will find all VHDL files based on given search patterns, figure out their dependencies, and support incremental compile based on the dependencies.
The same script is used for all VUnit supported simulators. Supporting several simulators simplifies the adoption of this project.
At this point only compilation is performed. Coming commits will enable simulation of VHDL testbenches.
Signed-off-by: Lars Asplund <lars.anders.asplund@gmail.com>
This adds litesdcard.v generated from the litex/litesdcard project,
along with logic in top-arty.vhdl to connect it into the system.
There is now a DMA wishbone coming in to soc.vhdl which is narrower
than the other wishbone masters (it has 32-bit data rather than
64-bit) so there is a widening/narrowing adapter between it and the
main wishbone master arbiter.
Also, litesdcard generates a non-pipelined wishbone for its DMA
connection, which needs to be converted to a pipelined wishbone. We
have a latch on both the incoming and outgoing sides of the wishbone
in order to help make timing (at the cost of two extra cycles of
latency).
litesdcard generates an interrupt signal which is wired up to input 3
of the ICS (IRQ 19).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the icache snoop writes to memory in the same way that the
dcache does, thus making DMA cache-coherent for the icache as well as
the dcache.
This also simplifies the logic for the WAIT_ACK state by removing the
stbs_done variable, since is_last_row(r.store_row, r.end_row_ix) can
only be true when stbs_done is true.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Since the expression is_last_row(r1.store_row, r1.end_row_ix) can only
be true when stbs_done is true, there is no need to include stbs_done
in the expression for the reload being completed, and hence no need to
compute stbs_done in the RELOAD_WAIT_ACK state.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a path where the wishbone that goes out to memory and I/O
also gets fed back to the dcache, which looks for writes that it
didn't initiate, and invalidates any cache line that gets written to.
This involves a second read port on the cache tag RAM for looking up
the snooped writes, and effectively a second write port on the cache
valid bit array to clear bits corresponding to snoop hits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Update documentation to reference fusesoc init for Xilinx boards, for
those like me who have never used fusesoc before. Add a reference to the
board files for Digilent boards and comment on perhaps installing them
for other boards as appropriate.
Signed-off-by: Antony Vennard <antony@vennard.ch>
Our SPI controller sends 8 dummy clocks at boot which Ben
added for some Xilinx boards. This should be harmless but
it is confusing the flash testbench in the Caravel project.
Add a parameter so it can be overridden at the top level.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The protocol used by the spi bridge firmware changed as of openocd
v0.11. As this is the version packaged by Debian Bullseye, add the
firmware for convince.
Signed-off-by: Joel Stanley <joel@jms.id.au>
This adds a GPIO controller which provides 32 bits of I/O. The
registers are modelled on the set used by the gpio-ftgpio010.c driver
in the Linux kernel. Currently there is no interrupt capability
implemented, though an interrupt line from the GPIO subsystem to the
XICS has been connected.
For the Arty A7 board, GPIO lines 0 to 13 are connected to the pins
labelled IO0 to IO13 on the "shield" connector, GPIO lines 14 to 29
connect to IO26 to IO41, GPIO line 30 connects to the pin labelled A
(aka IO42), and GPIO line 31 is connected to LED 7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
162 changed files with 112198 additions and 61483 deletions
- Create a working directory and point FuseSoC at microwatt:
- If this is your first time using fusesoc, initialize fusesoc.
This is needed to be able to pull down fussoc library components referenced
by microwatt. Run
```
mkdir microwatt-fusesoc
cd microwatt-fusesoc
fusesoc library add microwatt /path/to/microwatt/
fusesoc init
fusesoc fetch uart16550
fusesoc library add microwatt /path/to/microwatt
```
- Build using FuseSoC. For hello world (Replace nexys_video with your FPGA board such as --target=arty_a7-100):
You may wish to ensure you have [installed Digilent Board files](https://reference.digilentinc.com/vivado/installing-vivado/start#installing_digilent_board_files)
or appropriate files for your board first.
```
fusesoc run --target=nexys_video microwatt --memory_size=16384 --ram_init_file=/path/to/microwatt/fpga/hello_world.hex
@ -118,6 +122,68 @@ You should then be able to see output via the serial port of the board (/dev/tty
fusesoc run --target=nexys_video microwatt
```
## Linux on Microwatt
Mainline Linux supports Microwatt as of v5.14. The Arty A7 is the best tested
platform, but it's also been tested on the OrangeCrab and ButterStick.
1. Use buildroot to create a userspace
A small change is required to glibc in order to support the VMX/AltiVec-less
Microwatt, as float128 support is mandiatory and for this in GCC requires
VSX/AltiVec. This change is included in Joel's buildroot fork, along with a