This resolves various warnings and critical warnings from Vivado.
In particular, the asynchronous loops in the Xilinx hardware RNG were
giving a lot of critical warnings which proved difficult to suppress,
so this instead makes all the Xilinx platforms use the
'nonrandom.vhdl' implementation, which always returns an error.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes the following warning:
fetch1.vhdl:293:18:warning: declaration of "eaa_priv" hides signal "eaa_priv" [-Whide]
    variable eaa_priv : std_ulogic;
             ^
In fact the signal "eaa_priv" is unused, so remove it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Now that we translate the fetch effective address to a real address
one cycle earlier, we can use the real address to index the icache
array.
This has the benefit that the set size can be larger than a page,
enabling us to configure the icache to be larger without having to
increase its associativity. Previously the set size was limited to
the page size to avoid aliasing problems. Thus for example a 32kB
icache would need to be 8-way associative, resulting in large numbers
of LUTs being used for tag comparisons in FPGA implementations, and
poor timing. With this change, a 32kB icache can be 1- or 2-way
associative, which means deeper and narrower tag and data RAMs and
fewer tag comparators.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the address translation step for instruction fetches one
cycle earlier, so that it now happens in the fetch1 stage. There is
now a 2-entry mini translation cache ("ERAT", or effective to real
address translation cache) which operates on the output of the
multiplexer that selects the instruction address for the next cycle.
The ERAT consists of two effective address registers and two
corresponding real address registers. They store the page number part
of the addresses for a 4kB page size, which is the smallest page size
supported by the architecture.
If the effective address doesn't match either of the EA registers, and
address translation is enabled, then i_out.req goes low for two cycles
while the iTLB is looked up. Experimentally, this delay results in a
0.1% drop in coremark performance; allowing two cycles for the lookup
results in better timing. The result from the iTLB is placed into the
least recently used ERAT entry and then used to translate the address
as normal. If address translation is not enabled then the EA is used
directly as the real address.
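For illustration, here is a minimal VHDL sketch of the ERAT lookup.
All names here are hypothetical rather than the actual fetch1.vhdl
signals, and the refill and LRU logic is omitted:

    library ieee;
    use ieee.std_logic_1164.all;

    entity erat_sketch is
        port (
            next_ea : in  std_ulogic_vector(63 downto 0);
            hit     : out std_ulogic;
            real_pn : out std_ulogic_vector(63 downto 12)
        );
    end entity erat_sketch;

    architecture rtl of erat_sketch is
        type pn_array_t is array (0 to 1) of std_ulogic_vector(63 downto 12);
        signal erat_ea : pn_array_t;                 -- effective page numbers
        signal erat_ra : pn_array_t;                 -- real page numbers
        signal valid   : std_ulogic_vector(1 downto 0) := "00";
    begin
        -- Compare the page-number part of the next fetch address
        -- (bits 63..12, for 4kB pages) against both entries.
        process(all)
        begin
            hit     <= '0';
            real_pn <= erat_ra(0);
            for i in 0 to 1 loop
                if valid(i) = '1' and erat_ea(i) = next_ea(63 downto 12) then
                    hit     <= '1';
                    real_pn <= erat_ra(i);
                end if;
            end loop;
        end process;
        -- On a miss with translation enabled, i_out.req is held low for
        -- two cycles while the iTLB is read, and the result replaces the
        -- least recently used entry (not shown).
    end architecture rtl;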
The iTLB structure is the same as it was before; direct mapped,
indexed using a hashed EA.
The "fetch failed" signal, which indicates a TLB miss or protection
violation, is now generated in fetch1 and passed through icache.
When it is asserted, fetch1 goes into a stalled state until a PTE
arrives from the MMU (which gets put into both the iTLB and the ERAT),
or an interrupt or redirect occurs.
Any TLB invalidations from the MMU invalidate the whole ERAT.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reduces the number of possible sources for the next NIA from 4
down to 3, by routing interrupt vector addresses through the
r_int.next_nia register, as is already done for reset. This adds one
extra cycle of latency when taking interrupts. During this extra cycle,
i_out.req is 0.
Writeback now no longer combines redirects (branches, rfid, isync)
with interrupts; they are presented separately to fetch1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a next_nia field to the Fetch1ToIcacheType record, which
provides an indication of what will be in the nia field on the next
non-stalled cycle. This is intended to be as fast as possible, being
a selection from two redirect addresses (from writeback and decode1)
or an internal register (r_int.next_nia). Reset addresses and
predicted branch targets come through this internal register.
The rearrangement here has the side effect that we can now use the BTC
on the first instruction after a taken branch, whereas previously the
BTC was only active starting with the second instruction after a taken
branch. This provides a slight improvement in performance.
This also fixes a buglet in icache where it would assert its stall
output when i_in.req was false.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of sending down the predicted taken/not-taken bits with the
target of the branch, we now send them down with the branch itself.
Previously icache adjusted for this by sending the prediction bits to
decode1 without a 1-clock delay while everything else had a 1-clock
delay. Now icache keeps the prediction bits with the rest of the
attributes for the request.
Also fix a buglet in fetch1 where the first address sent out after
reset didn't have .req set. Currently this doesn't cause a problem
because icache doesn't really look at .req.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes a bug in the BTC where entries created for a given address
when MSR[IR] = 0 are used when MSR[IR] = 1 and vice-versa. The fix is
to include r.virt_mode (which mirrors MSR[IR]) in the tag portion of
the BTC.
Fixes: 0fb207be60 ("fetch1: Implement a simple branch target cache", 2020-12-19)
Reported-by: Anton Blanchard <anton@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This fixes a bug which prevents the core from stopping properly. The
same bug was previously fixed in commit e41cb01bca ("fetch1: Fix
debug stop", 2020-12-19) and reintroduced by commit 0fb207be60
("fetch1: Implement a simple branch target cache", 2020-12-19).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes logic that I added some time ago with the thought that it
would enable us to do prefetching in the icache. This logic detects
when the fetch address is an odd multiple of 4 and the next address in
sequence from the previous cycle. In that case the instruction we
want is in the output register of the icache RAM already so there is
no need to do another read or any icache tag or TLB lookup.
However, this logic adds complexity, and removing it improves timing.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a bit to the BTC to store whether the corresponding branch
instruction was taken last time it was encountered. That lets us pass
a not-taken prediction down to decode1, which for backwards direct
branches inhibits it from redirecting fetch to the target of the
branch. This increases coremark by about 2%.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the logic for redirecting fetching and writing SRR0 and
SRR1 to writeback. The aim is that ultimately units other than
execute1 can send their interrupts to writeback along with their
instruction completions, so that there can be multiple instructions
in flight without needing execute1 to keep track of the address
of each outstanding instruction.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a cache in fetch1, where each entry stores the address
of a simple branch instruction (b or bc) and the target of the branch.
When fetching sequentially, if the address being fetched matches the
cache entry, then fetching will be redirected to the branch target.
The cache has 1024 entries and is direct-mapped, i.e. indexed by bits
11..2 of the NIA.
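For illustration, here is a VHDL sketch of the lookup with
hypothetical names (the real implementation keeps the BTC in RAM and
pipelines the read, as described below):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity btc_sketch is
        port (
            nia        : in  std_ulogic_vector(63 downto 0);
            btc_hit    : out std_ulogic;
            btc_target : out std_ulogic_vector(63 downto 2)
        );
    end entity btc_sketch;

    architecture rtl of btc_sketch is
        constant BTC_BITS : natural := 10;             -- 2**10 = 1024 entries
        type btc_entry_t is record
            valid  : std_ulogic;
            tag    : std_ulogic_vector(63 downto 12);  -- NIA bits above the index
            target : std_ulogic_vector(63 downto 2);   -- branch target address
        end record;
        type btc_array_t is array (0 to 2 ** BTC_BITS - 1) of btc_entry_t;
        signal btc : btc_array_t;
    begin
        process(all)
            variable e : btc_entry_t;
        begin
            -- Direct-mapped: NIA bits 11..2 select the entry.
            e := btc(to_integer(unsigned(nia(11 downto 2))));
            -- Hit when the entry is valid and the rest of the NIA matches.
            if e.valid = '1' and e.tag = nia(63 downto 12) then
                btc_hit <= '1';
            else
                btc_hit <= '0';
            end if;
            btc_target <= e.target;
        end process;
    end architecture rtl;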
The bus from execute1 now carries information about taken and
not-taken simple branches, which fetch1 uses to update the cache.
The cache entry is updated for both taken and not-taken branches, with
the valid bit being set if the branch was taken and cleared if the
branch was not taken.
If fetching is redirected to the branch target then that goes down the
pipe as a predicted-taken branch, and decode1 does not do any static
branch prediction. If fetching is not redirected, then the next
instruction goes down the pipe as normal and decode1 does its static
branch prediction.
In order to make timing, the lookup of the cache is pipelined, so on
each cycle the cache entry for the current NIA + 8 is read. This
means that after a redirect (from decode1 or execute1), only the third
and subsequent sequentially-fetched instructions will be able to be
predicted.
This improves the coremark value on the Arty A7-100 from about 180 to
about 190 (more than 5%).
The BTC is optional. Builds for the Artix 7 35-T part have it off by
default because the extra ~1420 LUTs it takes mean that the design
doesn't fit on the Arty A7-35 board.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The ability to stop the core using the debug interface has been broken
since commit bb4332b7e6b5 ("Remove fetch2 pipeline stage"), which
removed a statement that cleared the valid bit on instructions when
their stop_mark was 1.
Fix this by clearing r.req coming out of fetch1 when r.stop_mark = 1.
This has the effect of making i_out.valid be 0 from the icache. We
also fix a bug in icache.vhdl where it was not honouring i_in.req when
use_previous = 1.
It turns out that the logic in fetch1.vhdl to handle stopping and
restarting was not correct, with the effect that stopping the core
would leave NIA pointing to the last instruction executed, not the
next instruction to be executed. In fact the state machine is
unnecessary and the whole thing can be simplified enormously: we just
need to increment NIA whenever stop_in was 0 in the previous cycle, as
sketched below.
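A minimal sketch of that rule, with hypothetical names and ignoring
redirects and reset:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity stop_sketch is
        port (
            clk     : in  std_ulogic;
            stop_in : in  std_ulogic;
            nia     : out std_ulogic_vector(63 downto 0)
        );
    end entity stop_sketch;

    architecture rtl of stop_sketch is
        signal r_nia   : unsigned(63 downto 0) := (others => '0');
        signal stopped : std_ulogic := '1';
    begin
        process(clk)
        begin
            if rising_edge(clk) then
                -- Advance only if we were running in the previous cycle,
                -- so NIA always points at the next instruction to execute.
                if stopped = '0' then
                    r_nia <= r_nia + 4;
                end if;
                stopped <= stop_in;
            end if;
        end process;
        nia <= std_ulogic_vector(r_nia);
    end architecture rtl;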
Fixes: bb4332b7e6b5 ("Remove fetch2 pipeline stage")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In 32-bit mode, effective addresses are truncated to 32 bits, both for
instruction fetches and data accesses, and CR0 is set for Rc=1 (record
form) instructions based on the lower 32 bits of the result rather
than all 64 bits.
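A sketch of both effects, using hypothetical names (msr_sf = 1 means
64-bit mode; the XER.SO contribution to CR0 is omitted):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mode32_sketch is
        port (
            msr_sf : in  std_ulogic;                   -- 1 = 64-bit mode
            ea_in  : in  std_ulogic_vector(63 downto 0);
            result : in  std_ulogic_vector(63 downto 0);
            ea_out : out std_ulogic_vector(63 downto 0);
            cr0    : out std_ulogic_vector(3 downto 0) -- LT & GT & EQ & SO
        );
    end entity mode32_sketch;

    architecture rtl of mode32_sketch is
    begin
        process(all)
            variable v : signed(63 downto 0);
        begin
            -- Truncate effective addresses to 32 bits in 32-bit mode.
            ea_out <= ea_in;
            if msr_sf = '0' then
                ea_out(63 downto 32) <= (others => '0');
            end if;
            -- Rc=1 comparison uses only the low 32 bits in 32-bit mode.
            if msr_sf = '1' then
                v := signed(result);
            else
                v := resize(signed(result(31 downto 0)), 64);
            end if;
            cr0 <= "0010";                             -- EQ
            if v < 0 then
                cr0 <= "1000";                         -- LT
            elsif v > 0 then
                cr0 <= "0100";                         -- GT
            end if;
        end process;
    end architecture rtl;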
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Big-endian mode affects both instruction fetches and data accesses.
For instruction fetches, we byte-swap each word read from memory when
writing it into the icache data RAM, and use a tag bit to indicate
whether each cache line contains instructions in BE or LE form.
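A sketch of the per-word swap on 64 bits of refill data, with
hypothetical names:

    library ieee;
    use ieee.std_logic_1164.all;

    entity wswap_sketch is
        port (
            wb_data    : in  std_ulogic_vector(63 downto 0); -- from memory
            big_endian : in  std_ulogic;                     -- line's tag bit
            ram_data   : out std_ulogic_vector(63 downto 0)  -- to icache RAM
        );
    end entity wswap_sketch;

    architecture rtl of wswap_sketch is
        function swap32(w : std_ulogic_vector(31 downto 0))
            return std_ulogic_vector is
        begin
            return w(7 downto 0) & w(15 downto 8) &
                   w(23 downto 16) & w(31 downto 24);
        end function;
    begin
        -- Swap each 32-bit instruction word so that decode always sees
        -- instructions in the same byte order regardless of memory
        -- endianness.
        ram_data <= wb_data when big_endian = '0' else
                    swap32(wb_data(63 downto 32)) & swap32(wb_data(31 downto 0));
    end architecture rtl;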
For data accesses, we simply need to invert the existing byte_reverse
signal in BE mode. The only thing to be careful of is to get the sign
bit from the correct place when doing a sign-extending load that
crosses two doublewords of memory.
For now, interrupts unconditionally set MSR[LE]. We will need some
sort of interrupt-little-endian bit somewhere, perhaps in LPCR.
This also fixes a debug report statement in fetch1.vhdl.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken
branches and rfid update it to the address of the branch or rfid
instruction.
To simplify the logic, this makes rfid use the branch logic to
generate its redirect (requiring SRR0 to come in to execute1 on
the B input and SRR1 on the A input), and the masking of the bottom
2 bits of NIA is moved to fetch1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a simple branch predictor in the decode1 stage. If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target. The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong. Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.
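A sketch of the prediction rule, with hypothetical names (the opcode
values are the architected ones: 18 for b, 16 for bc; bclr/bcctr are
under opcode 19 and fall through as not taken):

    library ieee;
    use ieee.std_logic_1164.all;

    entity predict_sketch is
        port (
            insn       : in  std_ulogic_vector(31 downto 0);
            pred_taken : out std_ulogic
        );
    end entity predict_sketch;

    architecture rtl of predict_sketch is
    begin
        process(all)
        begin
            pred_taken <= '0';
            case insn(31 downto 26) is
                when "010010" =>                -- b: always predict taken
                    pred_taken <= '1';
                when "010000" =>                -- bc: taken iff the BD
                    pred_taken <= insn(15);     -- displacement is negative
                when others =>                  -- bclr/bcctr, non-branches
                    null;
            end case;
        end process;
    end architecture rtl;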
Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.
For those branches that update LR but don't write any other result
(i.e. that don't decrement CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The icache can now detect a hit on a line being refilled from memory,
as we have an array of individual valid bits per row for the line
that is currently being loaded. This enables the request that
initiated the refill to be satisfied earlier, and also enables
following requests to the same cache line to be satisfied before the
line is completely refilled. Furthermore, the refill now starts
at the row that is needed. This should reduce the latency for an
icache miss.
We now get a 'sequential' indication from fetch1, and use that to know
when we can deliver an instruction word using the other half of the
64-bit doubleword that was read last cycle. This doesn't make much
difference at the moment, but it frees up cycles where we could test
whether the next line is present in the cache so that we could
prefetch it if not.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This logs 256 bits of data per cycle to a ring buffer in BRAM. The
data collected can be read out through 2 new SPRs or through the
debug interface.
The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains
the buffer write pointer in the upper 32 bits (in units of entries,
i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of
doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword
from the buffer at the read pointer and increments the read pointer.
Setting bit 31 of LOG_ADDR inhibits the trace log system from writing
to the log buffer, so the contents are stable and can be read.
There are two new debug addresses which function similarly to the
LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of
the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set.
The buffer defaults to 2048 entries, i.e. 64kB. The size is set by
the LOG_LENGTH generic on the core_debug module. Software can
determine the length of the buffer because the length is ORed into the
buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the
length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)).
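As a sketch (with hypothetical names), the 64-bit value returned for
LOG_ADDR could be composed like this:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity log_addr_sketch is
        generic (
            LOG_LENGTH : natural := 2048        -- entries of 32 bytes each
        );
        port (
            wptr     : in  unsigned(31 downto 0);  -- write ptr, in entries
            rptr     : in  unsigned(31 downto 0);  -- read ptr, in doublewords
            log_addr : out std_ulogic_vector(63 downto 0)
        );
    end entity log_addr_sketch;

    architecture rtl of log_addr_sketch is
    begin
        -- The buffer length is ORed into the write-pointer half, so
        -- software can recover it from the highest set bit as described
        -- above.
        log_addr <= std_ulogic_vector(wptr or to_unsigned(LOG_LENGTH, 32)) &
                    std_ulogic_vector(rptr);
    end architecture rtl;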
There is a program to format the log entries in a somewhat readable
fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that
file describes the layout of the bits in the log entries.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches. This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.
The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB. The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address. TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.
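A sketch of the hash for the default 64-entry configuration, with
hypothetical names:

    library ieee;
    use ieee.std_logic_1164.all;

    entity itlb_hash_sketch is
        port (
            ea        : in  std_ulogic_vector(63 downto 0);
            tlb_index : out std_ulogic_vector(5 downto 0)
        );
    end entity itlb_hash_sketch;

    architecture rtl of itlb_hash_sketch is
    begin
        -- XOR three 6-bit fields of the EA above the 4kB page offset.
        tlb_index <= ea(17 downto 12) xor ea(23 downto 18) xor ea(29 downto 24);
    end architecture rtl;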
If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST. That will get
sent down to loadstore1, which currently just raises an Instruction
Storage Interrupt (0x400) exception.
One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3). If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.
Tlbie operations get sent from mmu to icache over a new connection.
Unfortunately the privileged instruction tests are broken for now.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
An external signal can control whether the core will start
executing at the standard or the alternate reset address.
This will be used when litedram is initialized by microwatt
itself, to route the reset to the built-in init code secondary
block RAM.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The goal is to have the icache fit in BRAM by latching the output
into a register. In order to avoid timing issues, we need to give
the BRAM a full cycle on reads, and thus we source the BRAM address
directly from fetch1's latched NIA.
(Note: This will be problematic if/when we want to hash the address,
we'll probably be better off having fetch1 latch a fully hashed address
along with the normal one, so the icache can use the former to address
the BRAM and pass the latter along)
One difficulty is that we cannot really stall the icache without adding
more combo logic that would break the "one full cycle" BRAM model. This
means that on stalls from decode, by the time we stall fetch1, it has
already gone to the next address, which the icache is already latching.
We work around this by having a "stash" buffer in fetch2 that will stash
away the icache output on a stall, and override the output of the icache
with the content of the stash buffer when unstalling.
This requires a rewrite of the stop/step debug logic as well. We now
do most of the hard work in fetch1 which makes more sense.
Note: Vivado is still not inferring a built-in output register for the
BRAMs. I don't want to add another cycle... I don't fully understand
why it wouldn't be able to treat current_row as such, but clearly it
won't. At least the timing seems good enough now for 100MHz, possibly
more.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Do the +4 in a single place. This shouldn't cause any difference
in behaviour as these are sequential variable assignments.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This module adds some simple core controls:
reset, stop, start, step
along with icache clear and reading the NIA and core
status bits.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>