microwatt

Commit Graph

Author	SHA1	Message	Date
Paul Mackerras	d1e8e62fee	Remove option for "short" 16x16 bit multiplier Now that we have a 33 bit x 33 bit signed multiplier in execute1, there is really no need for the 16 bit multiplier. The coremark results are just as good without it as with it. This removes the option for the sake of simplicity. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	2 years ago
Paul Mackerras	734e4c4a52	core: Add a short multiplier This adds an optional 16 bit x 16 bit signed multiplier and uses it for multiply instructions that return the low 64 bits of the product (mull[dw][o] and mulli, but not maddld) when the operands are both in the range -2^15 .. 2^15 - 1. The "short" 16-bit multiplier produces its result combinatorially, so a multiply that uses it executes in one cycle. This improves the coremark result by about 4%, since coremark does quite a lot of multiplies and they almost all have operands that fit into 16 bits. The presence of the short multiplier is controlled by a generic at the execute1, SOC, core and top levels. For now, it defaults to off for all platforms, and can be enabled using the --has_short_mult flag to fusesoc. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	3 years ago
Michael Neuling	d7458d5beb	Reduce the size of icache to help yosys ECP5 builds (#303 ) The icache RAM is currently LUT ram not block ram. This massively bloats the icache size. We think this is due to yosys not inferencing the RAM correctly but that's yet to be confirmed. Work around this for now by reducing the default size of the icache RAM for the ECP5 builds. On the ECP5 85K builts, this gets us from 95% down to 76% and helps our CI to pass. Signed-off-by: Michael Neuling <mikey@neuling.org>	3 years ago
Paul Mackerras	0fb207be60	fetch1: Implement a simple branch target cache This implements a cache in fetch1, where each entry stores the address of a simple branch instruction (b or bc) and the target of the branch. When fetching sequentially, if the address being fetched matches the cache entry, then fetching will be redirected to the branch target. The cache has 1024 entries and is direct-mapped, i.e. indexed by bits 11..2 of the NIA. The bus from execute1 now carries information about taken and not-taken simple branches, which fetch1 uses to update the cache. The cache entry is updated for both taken and not-taken branches, with the valid bit being set if the branch was taken and cleared if the branch was not taken. If fetching is redirected to the branch target then that goes down the pipe as a predicted-taken branch, and decode1 does not do any static branch prediction. If fetching is not redirected, then the next instruction goes down the pipe as normal and decode1 does its static branch prediction. In order to make timing, the lookup of the cache is pipelined, so on each cycle the cache entry for the current NIA + 8 is read. This means that after a redirect (from decode1 or execute1), only the third and subsequent sequentially-fetched instructions will be able to be predicted. This improves the coremark value on the Arty A7-100 from about 180 to about 190 (more than 5%). The BTC is optional. Builds for the Artix 7 35-T part have it off by default because the extra ~1420 LUTs it takes mean that the design doesn't fit on the Arty A7-35 board. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Anton Blanchard	80cf489e96	Add LOG_LENGTH to top-generic.vhdl The other top level files allow LOG_LENGTH to be configured. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	4 years ago
Paul Mackerras	45cd8f4fc3	core: Add support for floating-point loads and stores This extends the register file so it can hold FPR values, and implements the FP loads and stores that do not require conversion between single and double precision. We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores cause a FP unavailable interrupt if MSR[FP] = 0. The FPU facilities are optional and their presence is controlled by the HAS_FPU generic passed down from the top-level board file. It defaults to true for all except the A7-35 boards. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>	4 years ago
Benjamin Herrenschmidt	fb5c16d05e	uart: Make 16550 the default Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	7575b1e0c2	uart: Import and hook up opencore 16550 compatible UART This imports via fusesoc a 16550 compatible (ie "standard") UART, and wires it up optionally in the SoC instead of the potato one. This also adds support for a second UART (which is always a 16550) to Arty, wired to JC "bottom" port. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	bf7def5503	soc: Don't require dram wishbones signals to be wired by toplevel Currently, when not using litedram, the top level still has to hook up "dummy" wishbones to the main dram and control dram busses coming out of the SoC and provide ack signals. Instead, make the SoC generate the acks internally when not using litedram and use defaults to make the wiring entirely optional. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	1ffc89e58b	soc: Add defaults for some input signals That way the top-level's don't need to assign them Also remove generics that are set to the default anyways Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	4244b54984	soc: Remove unused RESET_LOW generic Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	cc4dcb3597	spi: Add SPI Flash controller This adds an SPI flash controller which supports direct memory-mapped access to the flash along with a manual mode to send commands. The direct mode can be set via generic to default to single wire or quad mode. The controller supports normal, dual and quad accesses with configurable commands, clock divider, dummy clocks etc... The SPI clock can be an even divider of sys_clk starting at 2 (so max 50Mhz with our typical Arty designs). A flash offset is carried via generics to syscon to tell SW about which portion of the flash is reserved for the FPGA bitfile. There is currently no plumbing to make the CPU reset past that address (TBD). Note: Operating at 50Mhz has proven unreliable without adding some delay to the sampling of the input data. I'm working in improving this, in the meantime, I'm leaving the default set at 25 Mhz. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	4 years ago
Benjamin Herrenschmidt	573b6b4bc4	soc: Rework interconnect This changes the SoC interconnect such that the main 64-bit wishbone out of the processor is first split between only 3 slaves (BRAM, DRAM and a general "IO" bus) instead of all the slaves in the SoC. The IO bus leg is then latched and down-converted to 32 bits data width, before going through a second address decoder for the various IO devices. This significantly reduces routing and timing pressure on the main bus, allowing to get rid of frequent timing violations when synthetizing on small'ish FPGAs such as the Artix-7 35T found on the original Arty board. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago
Anton Blanchard	f96d179f66	Some yosys fixes This gets the yosys build further along, but I'm now chasing what looks like a yosys bug. Signed-off-by: Anton Blanchard <anton@linux.ibm.com>	5 years ago
Benjamin Herrenschmidt	3ac815823c	fpga: Hookup Arty to litedram The old toplevel.vhdl becomes top-generic.vhdl, which is to be used by platforms that do not have a litedram option. Arty has its own top-arty.vhdl which supports litedram and is now hooked up Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	5 years ago

15 Commits (d92af779eb1388268f066d69614ec5658c1f5ccc)