Commit Graph

111 Commits (9c64f8a98ba63a3ab249f272716ada201fd52aa6)

Author SHA1 Message Date
Matt Johnston 049f0549d8 orangecrab: Fix sdcard wishbone addressing
Orangecrab missed out on:

Make wishbone addresses be in units of doublewords or words
Author: Paul Mackerras <paulus@ozlabs.org>
Date:   Wed Sep 15 18:18:09 2021 +1000

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
3 years ago
Matt Johnston abc6a4f372 orangecrab: use litesdcard
Currently not working (tested in Linux)

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
3 years ago
Matt Johnston d794cc70b1 orangecrab: No BTC, LOG_LENGTH, dram NUM_LINES
Reduce litedram NUM_LINES 64->8
This allows us to meet timing. Can probably
be improved in future with better BRAM usage.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
3 years ago
Matt Johnston a8d9203c5d orangecrab: Use litedram
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
3 years ago
Matt Johnston 57d4c4c117 orangecrab: set HAS_SHORT_MULT
It seems free, generated as a single MULT18X18D

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
3 years ago
Matt Johnston a9b467f43b orangecrab: add Orange Crab r0.2 target
top-orangecrab0.2 is a copy of top-arty with various changes.
USRMCLK is added for the SPI clock
ethernet is removed

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
3 years ago
Paul Mackerras d458b5845c ECP5: Adjust PLL constants so the PLL lock indication works
At present, code (such as simple_random) which produces serial port
output during the first few milliseconds of operation produces garbled
output.  The reason is that the clock has not yet stabilized and is
running slow, resulting in the bit time of the serial characters being
too long.

The ECP5 data sheet says that the phase detector should be operated
between 10 and 400 MHz.  The current code operates it at 2MHz.
Consequently, the PLL lock indication doesn't work, i.e. it is always
zero.  The current code works around that by inverting it, i.e. taking
the "not locked" indication to mean "locked".

Instead, we now run it at 12MHz, chosen because the common external
clock inputs on ECP5 boards are 12MHz and 48MHz.  Normally this would
mean that the available system clock frequencies would be multiples of
12MHz, but this is a little inconvenient as we use 40MHz on the Orange
Crab v0.21 boards.  Instead, by using the secondary clock output for
feedback, we can have any divisor of the PLL frequency as the system
clock frequency.

The ECP5 data sheet says the PLL oscillator can run at 400 to 800
MHz.  Here we choose 480MHz since that allows us to generate 40MHz and
48MHz easily and is a multiple of 12MHz.

With this, the lock signal works correctly, and the inversion can be
removed.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Benjamin Herrenschmidt da0189af1e Add support for QMTech Wukong v2 board
For now only the V2 of the board (slightly different pinout)
and only the A100T variant. I also haven't added GPIOs or anything
else on the PMODs really.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
3 years ago
Benjamin Herrenschmidt 621a0f6b28 fpga/clk_gen_plle2: Add support for 50Mhz->100Mhz
50Mhz clkin, 100Mhz sys_clk, as needed for Wukon v2

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
3 years ago
Anton Blanchard af6bc48d36
Merge pull request #329 from paulusmack/wb-fix
Wishbone addressing fix
3 years ago
Paul Mackerras ca4eb46aea Make wishbone addresses be in units of doublewords or words
This makes the 64-bit wishbone buses have the address expressed in
units of doublewords (64 bits), and similarly for the 32-bit buses the
address is in units of words (32 bits).  This is to comply with the
wishbone spec.  Previously the addresses on the wishbone buses were in
units of bytes regardless of the bus data width, which is not correct
and caused problems with interfacing with externally-generated logic.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 734e4c4a52 core: Add a short multiplier
This adds an optional 16 bit x 16 bit signed multiplier and uses it
for multiply instructions that return the low 64 bits of the product
(mull[dw][o] and mulli, but not maddld) when the operands are both in
the range -2^15 .. 2^15 - 1.   The "short" 16-bit multiplier produces
its result combinatorially, so a multiply that uses it executes in one
cycle.  This improves the coremark result by about 4%, since coremark
does quite a lot of multiplies and they almost all have operands that
fit into 16 bits.

The presence of the short multiplier is controlled by a generic at the
execute1, SOC, core and top levels.  For now, it defaults to off for
all platforms, and can be enabled using the --has_short_mult flag to
fusesoc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 8cdb00652b
Merge pull request #316 from antonblanchard/verilator-fix
Rename 'do' signal to avoid verilator System Verilog warning
3 years ago
Anton Blanchard 591e96d1a2 gpio: Add HAS_GPIO to avoid verilator build errors
The verilator build fails with warnings and errors, because NGPIO
is 0 and we do things like:

        gpio_out : out std_ulogic_vector(NGPIO - 1 downto 0);

Set NGPIO to something reasonable (eg 32) and add HAS_GPIO to avoid
building the macro entirely if it isn't in use.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard bc0f7cf236 Rename 'do' signal to avoid verilator System Verilog warning
Experimenting with using ghdl to do VHDL to Verilog conversion (instead
of ghdl+yosys), verilator complains that a signal is a SystemVerilog
keyword:

%Error: microwatt.v:15013:18: Unexpected 'do': 'do' is a SystemVerilog keyword misused as an identifier.
        ... Suggest modify the Verilog-2001 code to avoid SV keywords, or use `begin_keywords or --language.

We could probably make this go away by disabling SystemVerilog, but
it's easy to rename the signal in question. Rename di at the same
time.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard 7cfbcd5514 litesdcard: Add Nexys Video support
This board has a reset line that needs to be held low to power up the
SD card hardware.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard 458dfe01a6 Add liteeth support to Nexys Video
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Michael Neuling 69a1440204
Merge pull request #309 from antonblanchard/clk-cleanup
Small cleanups to clock definitions
3 years ago
Anton Blanchard 75e06a1e30 Remove -add from xdc files
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard 187199c489 Remove -waveform from xdc files
A 50% duty cycle is the default, so no need to use -waveform.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard 7994b98404 Fix some whitespace issues
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Michael Neuling d7458d5beb
Reduce the size of icache to help yosys ECP5 builds (#303)
The icache RAM is currently LUT ram not block ram. This massively
bloats the icache size. We think this is due to yosys not inferencing
the RAM correctly but that's yet to be confirmed.

Work around this for now by reducing the default size of the icache
RAM for the ECP5 builds.

On the ECP5 85K builts, this gets us from 95% down to 76% and helps
our CI to pass.

Signed-off-by: Michael Neuling <mikey@neuling.org>
3 years ago
Michael Neuling 84473eda1b Merge pull request #277 from paulus/gpio
A few cleanups. GPIO IRQ number is now 4 as 3 is now taken by the SD card.
4 years ago
Paul Mackerras 21ed730514 arty_a7: Add litesdcard interface
This adds litesdcard.v generated from the litex/litesdcard project,
along with logic in top-arty.vhdl to connect it into the system.
There is now a DMA wishbone coming in to soc.vhdl which is narrower
than the other wishbone masters (it has 32-bit data rather than
64-bit) so there is a widening/narrowing adapter between it and the
main wishbone master arbiter.

Also, litesdcard generates a non-pipelined wishbone for its DMA
connection, which needs to be converted to a pipelined wishbone.  We
have a latch on both the incoming and outgoing sides of the wishbone
in order to help make timing (at the cost of two extra cycles of
latency).

litesdcard generates an interrupt signal which is wired up to input 3
of the ICS (IRQ 19).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras f06a0f4e5a arty: Update GPIOs for Boxarty BMC
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras f06ffcf9b7 Add a GPIO controller and use it to drive the shield I/O pins on the Arty
This adds a GPIO controller which provides 32 bits of I/O.  The
registers are modelled on the set used by the gpio-ftgpio010.c driver
in the Linux kernel.  Currently there is no interrupt capability
implemented, though an interrupt line from the GPIO subsystem to the
XICS has been connected.

For the Arty A7 board, GPIO lines 0 to 13 are connected to the pins
labelled IO0 to IO13 on the "shield" connector, GPIO lines 14 to 29
connect to IO26 to IO41, GPIO line 30 connects to the pin labelled A
(aka IO42), and GPIO line 31 is connected to LED 7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 0fb207be60 fetch1: Implement a simple branch target cache
This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.

The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.

If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction.  If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.

In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read.  This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.

This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).

The BTC is optional.  Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 2be2440734 Arty A7: Document pin connections for on-board headers
This adds, as comments, lines which would if uncommented define
properties which associate the pins of the headers on the Arty A7
board with FPGA pins.  It also adds properties for LEDs 1--3, also
commented out for now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Anton Blanchard 80cf489e96 Add LOG_LENGTH to top-generic.vhdl
The other top level files allow LOG_LENGTH to be configured.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
4 years ago
Paul Mackerras 45cd8f4fc3 core: Add support for floating-point loads and stores
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.

We now have the FP, FE0 and FE1 bits in MSR.  FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.

The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file.  It
defaults to true for all except the A7-35 boards.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Michael Neuling 6d6cf59bb7
Merge pull request #235 from paulusmack/master
More instructions and a random number generator
4 years ago
Boris Shingarov 679c547e5f fpga: Add support for Genesys2
Signed-off-by: Boris Shingarov <shingarov@labware.com>
4 years ago
Benjamin Herrenschmidt dbb137437c acorn: Add support for the Acorn CLE 215+
This is a NiteFury based PCIe M2 form-factor board originally
used for mining. It contains a speed grade 2 Artix 7 200T,
1GB of DDR3 and 32MB of flash.

The serial port is routed to pin 2 (RX) and 3 (TX) of the P2
connector (pin 1 is GND).

Note: Only 16MB of flash is currently usable until code is added
to configure the flash controller to use 4-bytes address commands
on that part.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Paul Mackerras 1a7aebeef8 Add random number generator and implement the darn instruction
This adds a true random number generator for the Xilinx FPGAs which
uses a set of chaotic ring oscillators to generate random bits and
then passes them through a Linear Hybrid Cellular Automaton (LHCA) to
remove bias, as described in "High Speed True Random Number Generators
in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in:

https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf

This requires adding a .xdc file to tell vivado that the combinatorial
loops that form the ring oscillators are intentional.  The same
code should work on other FPGAs as well if their tools can be told to
accept the combinatorial loops.

For simulation, the random.vhdl module gets compiled in, which uses
the pseudorand() function to generate random numbers.

Synthesis using yosys uses nonrandom.vhdl, which always signals an
error, causing darn to return 0xffff_ffff_ffff_ffff.

This adds an implementation of the darn instruction.  Darn can return
either raw or conditioned random numbers.  On Xilinx FPGAs, reading a
raw random number gives the output of the ring oscillators, and
reading a conditioned random number gives the output of the LHCA.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt 02abb135a8 litedram: l2: Add support for more geometries
Make the DRAM data lines and user port width configurable, also
don't hard wire dependency on the wishbone data width.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt b0241d9f2d corefile/nexys_video: Parameter fixes
This fixes up a few issues with parameters:

Only arty has "has_uart1" since we haven't added plumbing for a second UART
anywhere else. Also "uart_is_16550" was mixing on one of the nexys_video
targets, and nexys_video toplevel was missing LOG_LENGTH.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt a5fa92f71b fpga: nexys-video: Wire up core_alt_reset
It looks like we left it dangling

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 5449d842dd nexys_video: Fix nexys-video build
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Michael Neuling 5aaa63ee3b Add PLL for ECP5 device
Means we can synthesize at 40Mhz (where we currently make timing) and
our UART still works at 115200 baud.

Tested working hello world unmodified with ECP5 eval board. Orange
Crab is updated but is untested.

Signed-off-by: Michael Neuling <mikey@neuling.org>
4 years ago
Benjamin Herrenschmidt fb5c16d05e uart: Make 16550 the default
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 7575b1e0c2 uart: Import and hook up opencore 16550 compatible UART
This imports via fusesoc a 16550 compatible (ie "standard") UART,
and wires it up optionally in the SoC instead of the potato one.

This also adds support for a second UART (which is always a
16550) to Arty, wired to JC "bottom" port.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt 8366710217 liteeth: Hook up LiteX LiteEth ethernet controller
Currently only generated for Arty.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Joel Stanley 60e5f7b958 spi: Fix dat_i_l constraints
No cells matched 'get_cells -hierarchical -filter {NAME =~*/spi_rxtx/dat_i_l*}'. [build/microwatt_0/src/microwatt_0/fpga/arty_a7.xdc:42]

The signal is in it's own process so the net name ends up being
spi_rxtx/input_delay_1.dat_i_l_reg.

After this change the log shows:

Applied set_property IOB = TRUE for soc0/\spiflash_gen.spiflash /spi_rxtx/\input_delay_1.dat_i_l_reg . (constraint file  fpga/arty_a7.xdc, line 42).
Applied set_property IOB = TRUE for soc0/\spiflash_gen.spiflash /spi_rxtx/\input_delay_1.dat_i_l_reg . (constraint file  fpga/arty_a7.xdc, line 42).
Applied set_property IOB = TRUE for soc0/\spiflash_gen.spiflash /spi_rxtx/\input_delay_1.dat_i_l_reg . (constraint file  fpga/arty_a7.xdc, line 42).
Applied set_property IOB = TRUE for soc0/\spiflash_gen.spiflash /spi_rxtx/\input_delay_1.dat_i_l_reg . (constraint file  fpga/arty_a7.xdc, line 42).

Signed-off-by: Joel Stanley <joel@jms.id.au>
4 years ago
Michael Neuling b90a0a2139
Merge pull request #208 from paulusmack/faster
Make the core go faster

Several major improvements in here:
- Simple branch predictor
- Reduced latency for mispredicted branches and interrupts by removing fetch2 stage
- Cache improvements
  o Request critical dword first on refill
  o Handle hits while refilling, including on line being refilled
  o Sizes doubled for both D and I
- Loadstore improvements: can now do one load or store every two cycles in most cases
- Optimized 2-cycle multiplier for Xilinx 7-series parts using DSP slices
- Timing improvements, including:
  o Stash buffer in decode1
  o Reduced width of execute1 result mux
  o Improved SPR decode in decode1
  o Some non-critical operation take a cycle longer so we can break some long combinatorial chains
- Core logging: logs 256 bits of info every cycle into a ring buffer, to help with debugging and performance analysis

This increases the LUT usage for the "synth" + A35 target from 9182 to 10297 = 12%.
5 years ago
Paul Mackerras 78de4fef72 Make LOG_LENGTH configurable per FPGA variant
This plumbs the LOG_LENGTH parameter (which controls how many entries
the core log RAM has) up to the top level so that it can be set on
the fusesoc command line and have different default values on
different FPGAs.

It now defaults to 512 entries generally and on the Artix-7 35 parts,
and 2048 on the larger Artix-7 FPGAs.  It can be set to 0 if desired.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt f9f18906a3 soc: Rename wb_dram_ctrl to wb_ext_io and rework decoding
This makes the control bus currently going out of "soc" towards
litedram more generic for external IO devices added by the
top-level rather than inside the SoC proper.

This is mostly renaming of signals and a small change on how the
address decoder operates, using a separate "cascaded" decode for
the external IOs.

We make the region 0xc8nn_nnnn be the "external IO" region for
now.

This will make it easier / cleaner to add more external devices.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt bf7def5503 soc: Don't require dram wishbones signals to be wired by toplevel
Currently, when not using litedram, the top level still has to hook
up "dummy" wishbones to the main dram and control dram busses coming
out of the SoC and provide ack signals.

Instead, make the SoC generate the acks internally when not using
litedram and use defaults to make the wiring entirely optional.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt 1ffc89e58b soc: Add defaults for some input signals
That way the top-level's don't need to assign them

Also remove generics that are set to the default anyways

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt 4244b54984 soc: Remove unused RESET_LOW generic
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt e5aa0e9dc9 uart: Remove combinational loops on ack and stall signal
They hurt timing forcing signals to come from the master and back
again in one cycle. Stall isn't sampled by the master unless there
is an active cycle so masking it with cyc is pointless. Masking acks
is somewhat pointless too as we don't handle early dropping of cyc
in any of our slaves properly anyways.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago