This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.
The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.
If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction. If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.
In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read. This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.
This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).
The BTC is optional. Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some of the bits in the FPU buses end up as z state. Yosys
flags them, so we may as well clean it up.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.
Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.
We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.
The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file. It
defaults to true for all except the A7-35 boards.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds "if LOG_LENGTH > 0 generate" to the places in the core
where log output data is latched, so that when LOG_LENGTH = 0 we
don't create the logic to collect the data which won't be stored.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This plumbs the LOG_LENGTH parameter (which controls how many entries
the core log RAM has) up to the top level so that it can be set on
the fusesoc command line and have different default values on
different FPGAs.
It now defaults to 512 entries generally and on the Artix-7 35 parts,
and 2048 on the larger Artix-7 FPGAs. It can be set to 0 if desired.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a simple branch predictor in the decode1 stage. If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target. The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong. Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.
Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.
For those branches that update LR but don't write any other result
(i.e. that don't decrementer CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This means that the busy signal from execute1 (which can be driven
combinatorially from mmu or dcache) now stops at decode1 and doesn't
go on to icache or fetch1. This helps with timing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal. This will
enable us to signal busy cycles only when we need to from loadstore1.
The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle". That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it. So instructions
are issued by decode2 somewhat earlier than they used to be.
Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing. This gives a small boost in
performance.
This also adds dependency tracking for rA updates by update-form
load/store instructions.
The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary. Some of the code now really
only applies to PIPELINE_DEPTH=1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the dcache and icache both be 8kB. This still only uses
one BRAM per way per cache on the Artix-7, since the BRAMs were only
half-used previously.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The fetch2 stage existed primarily to provide a stash buffer for the
output of icache when a stall occurred. However, we can get the same
effect -- of having the input to decode1 stay unchanged on a stall
cycle -- by using the read enable of the BRAMs in icache, and by
adding logic to keep the outputs unchanged on a clock cycle when
stall_in = 1. This reduces branch and interrupt latency by one
cycle.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This logs 256 bits of data per cycle to a ring buffer in BRAM. The
data collected can be read out through 2 new SPRs or through the
debug interface.
The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains
the buffer write pointer in the upper 32 bits (in units of entries,
i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of
doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword
from the buffer at the read pointer and increments the read pointer.
Setting bit 31 of LOG_ADDR inhibits the trace log system from writing
to the log buffer, so the contents are stable and can be read.
There are two new debug addresses which function similarly to the
LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of
the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set.
The buffer defaults to 2048 entries, i.e. 64kB. The size is set by
the LOG_LENGTH generic on the core_debug module. Software can
determine the length of the buffer because the length is ORed into the
buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the
length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)).
There is a program to format the log entries in a somewhat readable
fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that
file describes the layout of the bits in the log entries.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
icbi currently just resets the icache. This has some nasty side
effects such as also clearing the TLB, but also the wishbone interface.
That means that any ongoing cycle will be dropped.
However, most of our slaves don't handle that well and will continue
sending acks for already issued requests.
Under some circumstances we can thus restart an icache load and get
spurious ack/data from the wishbone left over from the "cancelled"
sequence.
This has broken booting Linux for me.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use a simple wire. common.vhdl types are better kept for things
local to the core. We can add more wires later if we need to for
HV irqs etc...
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds one-cycle latches to the various resets out of the soc and
into the various core modules. It *seems* to help vivado P&R a bit
and has shown to avoid timing violations under some circumstances.
Interestingly those resets never seem to appear in the bad timing
path. It looks like those long resets simply impose placement
constraints that Vivado satisfies at the expense of timing elsewhere.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches. This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.
The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB. The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address. TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.
If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST. That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.
One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3). If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.
Tlbie operations get sent from mmu to icache over a new connection.
Unfortunately the privileged instruction tests are broken for now.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a new module to implement an MMU. At the moment it doesn't
do very much. Tlbie instructions now get sent by loadstore1 to mmu,
which sends them to dcache, rather than loadstore1 sending them
directly to dcache. TLB misses from dcache now get sent by loadstore1
to mmu, which currently just returns an error. Loadstore1 then
generates a DSI in response to the error return from mmu.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a path from loadstore1 back to execute1 for reporting
errors, and machinery in execute1 for generating data storage
interrupts at vector 0x300.
If dcache is given two requests in successive cycles and the
first encounters an error (e.g. a TLB miss), it will now cancel
the second request.
Loadstore1 now responds to errors reported by dcache by sending
an exception signal to execute1 and returning to the idle state.
Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data
Storage Interrupt vector. DAR and DSISR are held in loadstore1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This provides commands on the debug interface to read the value of
the MSR or any of the 64 GSPR register file entries. The GSPR values
are read using the B port of the register file in a cycle when
decode2 is not using it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
An external signal can control whether the core will start
executing at the standard or the alternate reset address.
This will be used when litedram is initialized by microwatt
itself, to route the reset to the built-in init code secondary
block RAM.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
New unified ICP and ICS XICS compliant interrupt controller.
Configurable number of hardware sources.
Fixed hardware source number based on hardware line taken. All
hardware interrupts are a fixed priority. Level interrupts supported
only.
Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000).
Signed-off-by: Michael Neuling <mikey@neuling.org>
So that the dcache could in future be used by an MMU, this moves
logic to do with data formatting, rA updates for update-form
instructions, and handling of unaligned loads and stores out of
dcache and into loadstore1. For now, dcache connects only to
loadstore1, and loadstore1 now has the connection to writeback.
Dcache generates a stall signal to loadstore1 which indicates that
the request presented in the current cycle was not accepted and
should be presented again. However, loadstore1 doesn't currently
use it because we know that we can never hit the circumstances
where it might be set.
For unaligned transfers, loadstore1 generates two requests to
dcache back-to-back, and then waits to see two acks back from
dcache (cycles where d_in.valid is true).
Loadstore1 now has a FSM for tracking how many acks we are
expecting from dcache and for doing the rA update cycles when
necessary. Handling for reservations and conditional stores is
still in dcache.
Loadstore1 now generates its own stall signal back to decode2,
so we no longer need the logic in execute1 that generated the stall
for the first two cycles.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes the constraint that loads and stores are single-issue,
at the expense of a stall of at least 2 cycles for every load and
store.
To do this, we plumb the existing stall signal that was generated
in dcache to core, where it gets ORed with the stall signal from
execute1. Execute1 generates a stall signal for the first two
cycles of each load and store, and dcache generates the stall
signal in the 3rd and subsequent cycles if it needs to.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This allows us to use the bypass at the input of execute1 for the
address and data operands for loadstore1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This enables back-to-back execution of integer instructions where
the first instruction writes a GPR and the second reads the same
GPR. This is done with a set of multiplexers at the start of
execute1 which enable any of the three input operands to be taken
from the output of execute1 (i.e. r.e.write_data) rather than the
input from decode2 (i.e. e_in.read_data[123]).
This also requires changes to the hazard detection and handling.
Decode2 generates a signal indicating that the GPR being written
is available for bypass, which is true for instructions that are
executed in execute1 (rather than loadstore1/dcache). The
gpr_hazard module stores this "bypassable" bit, and if the same
GPR needs to be read by a subsequent instruction, it outputs a
"use_bypass" signal rather than generating a stall. The
use_bypass signal is then latched at the output of decode2 and
passed down to execute1 to control the input multiplexer.
At the moment there is no bypass on the inputs to loadstore1, but that
is OK because all load and store instructions are marked as
single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback. Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider. Divide and modulus instructions are no longer marked as
single-issue.
The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1. We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback. Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.
This all means that we no longer need to mark the multiply
instructions as single-issue.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Right now our test cases fold the SPRs into the GPRs. That makes
debugging fails more difficult than it needs to be, so print
out the CTR, LR and CR.
We still need to print the XER, but that is in two spots in microwatt
and will take some more work.
This also adds many instructions to the tests that we have added
lately including overflow instructions, CR logicals and mt/mfxer.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This stores the most common SPRs in the register file.
This includes CTR and LR and a not yet final list of others.
The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.
On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.
Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.
This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.
[paulus@ozlabs.org - fix direction of decode2.stall_in]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Vivado by default tries to flatten the module hierarchy to improve
placement and timing. However this makes debugging timing issues
really hard as the net names in the timing report can be pretty
bogus.
This adds a generic that can be used to control attributes to stop
vivado from flattening the main core components. The resulting design
will have worst timing overall but it will be easier to understand
what the worst timing path are and address them.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We don't yet have a proper snooper for the icache, so for now make
icbi just flush the whole thing
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Adding lines seems to add only little extra as the BRAMs aren't
full, 2 ways is our current comprimise to limit pressure on small
FPGAs. We could go to 64 lines for a little more, but timing is
becoming a bit too right to my linking on the tags/LRU path of
the icache, so let's leave it at 32 for now.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This replaces loadstore2 with a dcache
The dcache unit is losely based on the icache one (same basic cache
layout), but has some significant logic additions to deal with stores,
loads with update, non-cachable accesses and other differences due to
operating in the execution part of the pipeline rather than the fetch
part.
The cache is store-through, though a hit with an existing line will
update the line rather than invalidate it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since the condition setting got moved to writeback, execute2 does
nothing aside from wasting a cycle. This removes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds logic to detect the cases where the quotient of the
division overflows the range of the output representation, and
return all zeroes in those cases, which is what POWER9 does.
To do this, we extend the dividend register by 1 bit and we do
an extra step in the division process to get a 2^64 bit of the
quotient, which ends up in the 'overflow' signal. This catches all
the cases where dividend >= 2^64 * divisor, including the case
where divisor = 0, and the divde/divdeu cases where |RA| >= |RB|.
Then, in the output stage, we also check that the result fits in
the representable range, which depends on whether the division is
a signed division or not, and whether it is a 32-bit or 64-bit
division. If dividend >= 2^64 or the result doesn't fit in the
representable range, write_data is set to 0 and write_cr_data to
0x20000000 (i.e. cr0.eq = 1).
POWER9 sets the top 32 bits of the result to zero for 32-bit signed
divisions, and sets CR0 when RC=1 according to the 64-bit value
(i.e. CR0.LT is always 0 for 32-bit signed divisions, even if the
32-bit result is negative). However, modsw with a negative result
sets the top 32 bits to all 1s. We follow suit.
This updates divider_tb to check the invalid cases as well as the
valid case.
This also fixes a small bug where the reset signal for the divider
was driven from rst when it should have been driven from core_rst.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds support for set associativity to the icache. It can still
be direct mapped by setting NUM_WAYS to 1.
The replacement policy uses a simple tree-PLRU for each set.
This is only lightly tested, tests pass but I have to double check
that we are using the ways effectively and not creating duplicates.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We only ever access the cache memory for at most the wishbone bus
width at a time. So having the BRAMs organized as a cache-line-wide
port is a waste of resources.
Instead, use a wishbone-wide memory and store a line as consecutive
rows in the BRAM.
This significantly improves BRAM usage in the FPGA as we can now use
more rows in the BRAM blocks. It also saves a few LUTs and muxes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The goal is to have the icache fit in BRAM by latching the output
into a register. In order to avoid timing issues , we need to give
the BRAM a full cycle on reads, and thus we souce the BRAM address
directly from fetch1 latched NIA.
(Note: This will be problematic if/when we want to hash the address,
we'll probably be better off having fetch1 latch a fully hashed address
along with the normal one, so the icache can use the former to address
the BRAM and pass the latter along)
One difficulty is that we cannot really stall the icache without adding
more combo logic that would break the "one full cycle" BRAM model. This
means that on stalls from decode, by the time we stall fetch1, it has
already gone to the next address, which the icache is already latching.
We work around this by having a "stash" buffer in fetch2 that will stash
away the icache output on a stall, and override the output of the icache
with the content of the stash buffer when unstalling.
This requires a rewrite of the stop/step debug logic as well. We now
do most of the hard work in fetch1 which makes more sense.
Note: Vivado is still not inferring an built-in output register for the
BRAMs. I don't want to add another cycle... I don't fully understand why
it wouldn't be able to treat current_row as such but clearly it won't. At
least the timing seems good enough now for 100Mhz, possibly more.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The register file is currently implemented as a whole pile of individual
1-bit registers instead of LUT memory which is a huge waste of FPGA
space.
This is caused by the output signal exposing the register file to the
outside world for simulation debug.
This removes that output, and moves the dumping of the register file
to the register file module itself. This saves about 8% of fpga on
the little Arty A7-35T.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This gets the CI going again, but we will want to fix the test
harness since it's useful to be able to debug the core after it
executes an illegal instruction.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
I'm seeing an issue on my version of ghdl:
core.vhdl:137:24:error: actual expression must be globally static
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This module adds some simple core controls:
reset, stop, start, step
along with icache clear and reading the NIA and core
status bits
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org