Commit Graph

60 Commits (0af906232f51827d624f570c1dc8eaeee8713725)

Author SHA1 Message Date
Anton Blanchard 2d21b95f87 Pass icache/dcache/tlb parameters down from soc
We want much smaller caches and tlbs when building for sky130, so
allow the toplevel file to override the defaults.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
4 years ago
Paul Mackerras 3cd3449b4b core: Move redirect and interrupt delivery logic to writeback
This moves the logic for redirecting fetching and writing SRR0 and
SRR1 to writeback.  The aim is that ultimately units other than
execute1 can send their interrupts to writeback along with their
instruction completions, so that there can be multiple instructions
in flight without needing execute1 to keep track of the address
of each outstanding instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras ae2afeca5c core: Track CR hazards and bypasses using tags
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras d290d2a9bb core: Restore bypass path from execute1
This changes the bypass path.  Previously it went from after
execute1's output to after decode2's output.  Now it goes from before
execute1's output register to before decode2's output register.  The
reason is that the new path will be simpler to manage when there are
possibly multiple instructions in flight.  This means that the
bypassing can be managed inside decode2 and control.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras c0b45e153b core: Track GPR hazards using tags that propagate through the pipelines
This changes the way GPR hazards are detected and tracked.  Instead of
having a model of the pipeline in gpr_hazard.vhdl, which has to mirror
the behaviour of the real pipeline exactly, we now assign a 2-bit tag
to each instruction and record which GSPR the instruction writes.
Subsequent instructions that need to use the GSPR get the tag number
and stall until the value with that tag is being written back to the
register file.

For now, the forwarding paths are disabled.  That gives about a 8%
reduction in coremark performance.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 0fb207be60 fetch1: Implement a simple branch target cache
This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.

The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.

If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction.  If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.

In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read.  This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.

This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).

The BTC is optional.  Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Anton Blanchard 659be2780f Fully initialize FPU buses when FPU is disabled
Some of the bits in the FPU buses end up as z state. Yosys
flags them, so we may as well clean it up.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
4 years ago
Paul Mackerras 856e9e955f core: Add framework for an FPU
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.

Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 45cd8f4fc3 core: Add support for floating-point loads and stores
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.

We now have the FP, FE0 and FE1 bits in MSR.  FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.

The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file.  It
defaults to true for all except the A7-35 boards.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 893d2bc6a2 core: Don't generate logic for log data when LOG_LENGTH = 0
This adds "if LOG_LENGTH > 0 generate" to the places in the core
where log output data is latched, so that when LOG_LENGTH = 0 we
don't create the logic to collect the data which won't be stored.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 78de4fef72 Make LOG_LENGTH configurable per FPGA variant
This plumbs the LOG_LENGTH parameter (which controls how many entries
the core log RAM has) up to the top level so that it can be set on
the fusesoc command line and have different default values on
different FPGAs.

It now defaults to 512 entries generally and on the Artix-7 35 parts,
and 2048 on the larger Artix-7 FPGAs.  It can be set to 0 if desired.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 6687aae4d6 core: Implement a simple branch predictor
This implements a simple branch predictor in the decode1 stage.  If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target.  The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong.  Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.

Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.

For those branches that update LR but don't write any other result
(i.e. that don't decrementer CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras b3799c432b decode1: Add a stash buffer to the output
This means that the busy signal from execute1 (which can be driven
combinatorially from mmu or dcache) now stops at decode1 and doesn't
go on to icache or fetch1.  This helps with timing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 6701e7346b core: Use a busy signal rather than a stall
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal.  This will
enable us to signal busy cycles only when we need to from loadstore1.

The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle".  That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it.  So instructions
are issued by decode2 somewhat earlier than they used to be.

Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing.  This gives a small boost in
performance.

This also adds dependency tracking for rA updates by update-form
load/store instructions.

The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary.  Some of the code now really
only applies to PIPELINE_DEPTH=1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras f80da65799 core: Double the dcache and icache sizes
This makes the dcache and icache both be 8kB.  This still only uses
one BRAM per way per cache on the Artix-7, since the BRAMs were only
half-used previously.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras b5a7dbb78d core: Remove fetch2 pipeline stage
The fetch2 stage existed primarily to provide a stash buffer for the
output of icache when a stall occurred.  However, we can get the same
effect -- of having the input to decode1 stay unchanged on a stall
cycle -- by using the read enable of the BRAMs in icache, and by
adding logic to keep the outputs unchanged on a clock cycle when
stall_in = 1.  This reduces branch and interrupt latency by one
cycle.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Paul Mackerras 49a4d9f67a Add core logging
This logs 256 bits of data per cycle to a ring buffer in BRAM.  The
data collected can be read out through 2 new SPRs or through the
debug interface.

The new SPRs are LOG_ADDR (724) and LOG_DATA (725).  LOG_ADDR contains
the buffer write pointer in the upper 32 bits (in units of entries,
i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of
doublewords, i.e. 8 bytes).  Reading LOG_DATA gives the doubleword
from the buffer at the read pointer and increments the read pointer.
Setting bit 31 of LOG_ADDR inhibits the trace log system from writing
to the log buffer, so the contents are stable and can be read.

There are two new debug addresses which function similarly to the
LOG_ADDR and LOG_DATA SPRs.  The log is frozen while either or both of
the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set.

The buffer defaults to 2048 entries, i.e. 64kB.  The size is set by
the LOG_LENGTH generic on the core_debug module.  Software can
determine the length of the buffer because the length is ORed into the
buffer write pointer in the upper 32 bits of LOG_ADDR.  Hence the
length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)).

There is a program to format the log entries in a somewhat readable
fashion in scripts/fmt_log/fmt_log.c.  The log_entry struct in that
file describes the layout of the bits in the log entries.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
4 years ago
Benjamin Herrenschmidt b863791e38
icache: Fix icbi potentially clobbering the icache (#192)
icbi currently just resets the icache. This has some nasty side
effects such as also clearing the TLB, but also the wishbone interface.

That means that any ongoing cycle will be dropped.

However, most of our slaves don't handle that well and will continue
sending acks for already issued requests.

Under some circumstances we can thus restart an icache load and get
spurious ack/data from the wishbone left over from the "cancelled"
sequence.

This has broken booting Linux for me.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
4 years ago
Benjamin Herrenschmidt f86fb74bfe irq: Simplify xics->core irq input
Use a simple wire. common.vhdl types are better kept for things
local to the core. We can add more wires later if we need to for
HV irqs etc...

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard 4e78b8078e
Merge branch 'master' into litedram 5 years ago
Benjamin Herrenschmidt acbdd396a5 soc/core: Add reset latches
This adds one-cycle latches to the various resets out of the soc and
into the various core modules. It *seems* to help vivado P&R a bit
and has shown to avoid timing violations under some circumstances.

Interestingly those resets never seem to appear in the bad timing
path. It looks like those long resets simply impose placement
constraints that Vivado satisfies at the expense of timing elsewhere.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Paul Mackerras c164a2f4ea Merge branch 'mmu'
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 3d4712ad43 Add TLB to icache
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches.  This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.

The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB.  The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address.  TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.

If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST.  That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.

One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3).  If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED.  Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.

Tlbie operations get sent from mmu to icache over a new connection.

Unfortunately the privileged instruction tests are broken for now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 8160f4f821 Add framework for implementing an MMU
This adds a new module to implement an MMU.  At the moment it doesn't
do very much.  Tlbie instructions now get sent by loadstore1 to mmu,
which sends them to dcache, rather than loadstore1 sending them
directly to dcache.  TLB misses from dcache now get sent by loadstore1
to mmu, which currently just returns an error.  Loadstore1 then
generates a DSI in response to the error return from mmu.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 42d0fcc511 Implement data storage interrupts
This adds a path from loadstore1 back to execute1 for reporting
errors, and machinery in execute1 for generating data storage
interrupts at vector 0x300.

If dcache is given two requests in successive cycles and the
first encounters an error (e.g. a TLB miss), it will now cancel
the second request.

Loadstore1 now responds to errors reported by dcache by sending
an exception signal to execute1 and returning to the idle state.
Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data
Storage Interrupt vector.  DAR and DSISR are held in loadstore1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras dd2e71930c debug: Provide a way to examine GPRs, fast SPRs and MSR
This provides commands on the debug interface to read the value of
the MSR or any of the 64 GSPR register file entries.  The GSPR values
are read using the B port of the register file in a cycle when
decode2 is not using it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 6853d22203 core: Add alternate reset address
An external signal can control whether the core will start
executing at the standard or the alternate reset address.

This will be used when litedram is initialized by microwatt
itself, to route the reset to the built-in init code secondary
block RAM.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Michael Neuling b4f20c20b9 XICS interrupt controller
New unified ICP and ICS XICS compliant interrupt controller.
Configurable number of hardware sources.

Fixed hardware source number based on hardware line taken. All
hardware interrupts are a fixed priority. Level interrupts supported
only.

Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000).

Signed-off-by: Michael Neuling <mikey@neuling.org>
5 years ago
Paul Mackerras b349cc891a loadstore1: Move logic from dcache to loadstore1
So that the dcache could in future be used by an MMU, this moves
logic to do with data formatting, rA updates for update-form
instructions, and handling of unaligned loads and stores out of
dcache and into loadstore1.  For now, dcache connects only to
loadstore1, and loadstore1 now has the connection to writeback.

Dcache generates a stall signal to loadstore1 which indicates that
the request presented in the current cycle was not accepted and
should be presented again.  However, loadstore1 doesn't currently
use it because we know that we can never hit the circumstances
where it might be set.

For unaligned transfers, loadstore1 generates two requests to
dcache back-to-back, and then waits to see two acks back from
dcache (cycles where d_in.valid is true).

Loadstore1 now has a FSM for tracking how many acks we are
expecting from dcache and for doing the rA update cycles when
necessary.  Handling for reservations and conditional stores is
still in dcache.

Loadstore1 now generates its own stall signal back to decode2,
so we no longer need the logic in execute1 that generated the stall
for the first two cycles.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 1a244d3470 Remove single-issue constraint for most loads and stores
This removes the constraint that loads and stores are single-issue,
at the expense of a stall of at least 2 cycles for every load and
store.

To do this, we plumb the existing stall signal that was generated
in dcache to core, where it gets ORed with the stall signal from
execute1.  Execute1 generates a stall signal for the first two
cycles of each load and store, and dcache generates the stall
signal in the 3rd and subsequent cycles if it needs to.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 5422007f83 Plumb loadstore1 input from execute1 not decode2
This allows us to use the bypass at the input of execute1 for the
address and data operands for loadstore1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras b14d982011 execute: Implement bypass from output of execute1 to input
This enables back-to-back execution of integer instructions where
the first instruction writes a GPR and the second reads the same
GPR.  This is done with a set of multiplexers at the start of
execute1 which enable any of the three input operands to be taken
from the output of execute1 (i.e. r.e.write_data) rather than the
input from decode2 (i.e. e_in.read_data[123]).

This also requires changes to the hazard detection and handling.
Decode2 generates a signal indicating that the GPR being written
is available for bypass, which is true for instructions that are
executed in execute1 (rather than loadstore1/dcache).  The
gpr_hazard module stores this "bypassable" bit, and if the same
GPR needs to be read by a subsequent instruction, it outputs a
"use_bypass" signal rather than generating a stall.  The
use_bypass signal is then latched at the output of decode2 and
passed down to execute1 to control the input multiplexer.

At the moment there is no bypass on the inputs to loadstore1, but that
is OK because all load and store instructions are marked as
single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 39d18d2738 Make divider hang off the side of execute1
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback.  Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider.  Divide and modulus instructions are no longer marked as
single-issue.

The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1.  We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 2167186b5f Make multiplier hang off the side of execute1
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback.  Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.

This all means that we no longer need to mark the multiply
instructions as single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard dcee60a729 Fix a ghdlsynth issue in icache
ghdlsynth doesn't like the debug statement, so wrap it in a generate.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard 467630573c Dump CTR, LR and CR on sim termination, and update our tests
Right now our test cases fold the SPRs into the GPRs. That makes
debugging fails more difficult than it needs to be, so print
out the CTR, LR and CR.

We still need to print the XER, but that is in two spots in microwatt
and will take some more work.

This also adds many instructions to the tests that we have added
lately including overflow instructions, CR logicals and mt/mfxer.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Benjamin Herrenschmidt e4f475e17f sprs: Store common SPRs in register file
This stores the most common SPRs in the register file.

This includes CTR and LR and a not yet final list of others.

The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.

On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.

Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.

This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.

[paulus@ozlabs.org - fix direction of decode2.stall_in]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt d2762e70e5 Add option to not flatten hierarchy
Vivado by default tries to flatten the module hierarchy to improve
placement and timing. However this makes debugging timing issues
really hard as the net names in the timing report can be pretty
bogus.

This adds a generic that can be used to control attributes to stop
vivado from flattening the main core components. The resulting design
will have worst timing overall but it will be easier to understand
what the worst timing path are and address them.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard 247d7d4aa0
Merge pull request #113 from mikey/exec-sim-remove
Remove SIM generic from execute1
5 years ago
Michael Neuling bd4ac06243 Remove SIM generic from execute1
This does nothing, so remove.

Signed-off-by: Michael Neuling <mikey@neuling.org>
5 years ago
Benjamin Herrenschmidt 742b21480e insn: Simplistic implementation of icbi
We don't yet have a proper snooper for the icache, so for now make
icbi just flush the whole thing

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt 265fbf894b icache/dcache: Make both caches 32 lines, 2 ways
Adding lines seems to add only little extra as the BRAMs aren't
full, 2 ways is our current comprimise to limit pressure on small
FPGAs. We could go to 64 lines for a little more, but timing is
becoming a bit too right to my linking on the tags/LRU path of
the icache, so let's leave it at 32 for now.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt b513f0fb48 dcache: Add a dcache
This replaces loadstore2 with a dcache

The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.

The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Paul Mackerras f49a5a99a5 Remove execute2 stage
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle.  This removes it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d4f51e08c8 divider: Return 0 for invalid and overflow cases, like P9 does
This adds logic to detect the cases where the quotient of the
division overflows the range of the output representation, and
return all zeroes in those cases, which is what POWER9 does.
To do this, we extend the dividend register by 1 bit and we do
an extra step in the division process to get a 2^64 bit of the
quotient, which ends up in the 'overflow' signal.  This catches all
the cases where dividend >= 2^64 * divisor, including the case
where divisor = 0, and the divde/divdeu cases where |RA| >= |RB|.

Then, in the output stage, we also check that the result fits in
the representable range, which depends on whether the division is
a signed division or not, and whether it is a 32-bit or 64-bit
division.  If dividend >= 2^64 or the result doesn't fit in the
representable range, write_data is set to 0 and write_cr_data to
0x20000000 (i.e. cr0.eq = 1).

POWER9 sets the top 32 bits of the result to zero for 32-bit signed
divisions, and sets CR0 when RC=1 according to the 64-bit value
(i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the
32-bit result is negative).  However, modsw with a negative result
sets the top 32 bits to all 1s.  We follow suit.

This updates divider_tb to check the invalid cases as well as the
valid case.

This also fixes a small bug where the reset signal for the divider
was driven from rst when it should have been driven from core_rst.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt b56b46b7d1 icache: Set associative icache
This adds support for set associativity to the icache. It can still
be direct mapped by setting NUM_WAYS to 1.

The replacement policy uses a simple tree-PLRU for each set.

This is only lightly tested, tests pass but I have to double check
that we are using the ways effectively and not creating duplicates.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt d40c1c1a25 icache: Use narrower block RAMs
We only ever access the cache memory for at most the wishbone bus
width at a time. So having the BRAMs organized as a cache-line-wide
port is a waste of resources.

Instead, use a wishbone-wide memory and store a line as consecutive
rows in the BRAM.

This significantly improves BRAM usage in the FPGA as we can now use
more rows in the BRAM blocks. It also saves a few LUTs and muxes.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt d415e5544a fetch/icache: Fit icache in BRAM
The goal is to have the icache fit in BRAM by latching the output
into a register. In order to avoid timing issues , we need to give
the BRAM a full cycle on reads, and thus we souce the BRAM address
directly from fetch1 latched NIA.

(Note: This will be problematic if/when we want to hash the address,
we'll probably be better off having fetch1 latch a fully hashed address
along with the normal one, so the icache can use the former to address
the BRAM and pass the latter along)

One difficulty is that we cannot really stall the icache without adding
more combo logic that would break the "one full cycle" BRAM model. This
means that on stalls from decode, by the time we stall fetch1, it has
already gone to the next address, which the icache is already latching.

We work around this by having a "stash" buffer in fetch2 that will stash
away the icache output on a stall, and override the output of the icache
with the content of the stash buffer when unstalling.

This requires a rewrite of the stop/step debug logic as well. We now
do most of the hard work in fetch1 which makes more sense.

Note: Vivado is still not inferring an built-in output register for the
BRAMs. I don't want to add another cycle... I don't fully understand why
it wouldn't be able to treat current_row as such but clearly it won't. At
least the timing seems good enough now for 100Mhz, possibly more.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Benjamin Herrenschmidt ec1868f7d2 register_file: Move GPRs into distributed RAM
The register file is currently implemented as a whole pile of individual
1-bit registers instead of LUT memory which is a huge waste of FPGA
space.

This is caused by the output signal exposing the register file to the
outside world for simulation debug.

This removes that output, and moves the dumping of the register file
to the register file module itself. This saves about 8% of fpga on
the little Arty A7-35T.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Anton Blanchard b57325ce29 Merge branch 'divider' of https://github.com/paulusmack/microwatt 5 years ago