We don't care what the values of TB and DECR are after reset, but we
don't want the X state to propagate to other parts of the chip.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Random execution testcases showed that a bdnzl which doesn't branch,
followed immediately by a bdnz, uses the wrong value for CTR for the
bdnz. Decode2 detects the read-after-write hazard on CTR and tells
execute1 to use the bypass path. However, the bdnzl takes two cycles
because it has to write back both CTR and LR, meaning that by the time
the bdnz starts to execute, r.e.write_data no longer contains the CTR
value, but instead contains zero.
To fix this, we make execute1 maintain the written-back value of CTR
in r.e.write_data across the cycle where LR is written back (this is
possible because the LR writeback uses the exc_write_data path).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Branch instructions which do a redirect and write both CTR and LR were
not doing the write to LR due to a logic error. This fixes it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
If an instruction fetch results in an instruction TLB miss, an
OP_FETCH_FAILED instruction is sent down the pipe. If the MSR[TE]
field is set for instruction tracing, the core currently considers
that executing the OP_FETCH_FAILED counts as having executed one
instruction and so generates a trace interrupt on the next valid
instruction, meaning that the trace interrupt happens before the
desired instruction rather than after it.
Fix this by not tracing OP_FETCH_FAILED instructions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.
Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to loadstore1 to convert between single-precision and
double-precision formats, and implements the lfs* and stfs*
instructions. The conversion processes are described in Power ISA
v3.1 Book 1 sections 4.6.2 and 4.6.3.
These conversions take one cycle, so lfs* and stfs* are one cycle
slower than lfd* and stfd*.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.
We now have the FP, FE0 and FE1 bits in MSR. FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.
The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file. It
defaults to true for all except the A7-35 boards.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Trace interrupts occur when the MSR[TE] field is non-zero and an
instruction other than rfid has been successfully completed. A trace
interrupt occurs before the next instruction is executed or any
asynchronous interrupt is taken.
Since the trace interrupt is defined to set SRR1 bits depending on
whether the traced instruction is a load or an instruction treated as
a load, or a store or an instruction treated as a store, we need to
make sure the treated-as-a-load instructions (icbi, icbt, dcbt, dcbst,
dcbf) and the treated-as-a-store instructions (dcbtst, dcbz) have the
correct opcodes in decode1. Several of them were previously marked as
OP_NOP.
We don't yet implement the SIAR or SDAR registers, which should be set
by trace interrupts.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In the cases where we need to override the values from the decode ROMs,
we now do that overriding after the clock edge (eating into decode2's
cycle) rather than before. This helps timing a little.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Load-and-reserve and store-conditional instructions are required to
generate an alignment interrupt (0x600 vector) if their EA is not
aligned. Implement this.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
In 32-bit mode, effective addresses are truncated to 32 bits, both for
instruction fetches and data accesses, and CR0 is set for Rc=1 (record
form) instructions based on the lower 32 bits of the result rather
than all 64 bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Big-endian mode affects both instruction fetches and data accesses.
For instruction fetches, we byte-swap each word read from memory when
writing it into the icache data RAM, and use a tag bit to indicate
whether each cache line contains instructions in BE or LE form.
For data accesses, we simply need to invert the existing byte_reverse
signal in BE mode. The only thing to be careful of is to get the sign
bit from the correct place when doing a sign-extending load that
crosses two doublewords of memory.
For now, interrupts unconditionally set MSR[LE]. We will need some
sort of interrupt-little-endian bit somewhere, perhaps in LPCR.
This also fixes a debug report statement in fetch1.vhdl.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
To avoid adding too much logic, this moves the adder used by OP_ADD
out of the case statement in execute1.vhdl so that the result can
be used by OP_ADDG6S as well.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The addex instruction is like adde but uses the XER[OV] bit for the
carry in and out rather than XER[CA].
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a true random number generator for the Xilinx FPGAs which
uses a set of chaotic ring oscillators to generate random bits and
then passes them through a Linear Hybrid Cellular Automaton (LHCA) to
remove bias, as described in "High Speed True Random Number Generators
in Xilinx FPGAs" by Catalin Baetoniu of Xilinx Inc., in:
https://pdfs.semanticscholar.org/83ac/9e9c1bb3dad5180654984604c8d5d8137412.pdf
This requires adding a .xdc file to tell vivado that the combinatorial
loops that form the ring oscillators are intentional. The same
code should work on other FPGAs as well if their tools can be told to
accept the combinatorial loops.
For simulation, the random.vhdl module gets compiled in, which uses
the pseudorand() function to generate random numbers.
Synthesis using yosys uses nonrandom.vhdl, which always signals an
error, causing darn to return 0xffff_ffff_ffff_ffff.
This adds an implementation of the darn instruction. Darn can return
either raw or conditioned random numbers. On Xilinx FPGAs, reading a
raw random number gives the output of the ring oscillators, and
reading a conditioned random number gives the output of the LHCA.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
These instructions use major opcode 4 and have a third GPR input
operand, so we need a decode table for major opcode 4 and some
plumbing to get the RC register operand read.
The multiply-add instructions use the same insn_type_t values as the
regular multiply instructions, and we distinguish in execute1 by
looking at the major opcode. This turns out to be convenient because
we don't have to add any cases in the code that handles the output of
the multiplier, and it frees up some insn_type_t values.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This also removes OP_MCRXR, as the mcrxr instruction was removed in
version 3.0B of the Power ISA, having been phased-out for the server
architecture since v2.02.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We now expect the overflow signal from the multiplier to come along
one cycle later than the product.
This breaks up a long combinatorial path and improves timing.
This also changes some uses of v.<field> to r.<field> in the slow
op logic, which should help timing as well.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This makes the interface to the multiplier more general so an instance
of it can be used in the FPU. It now has a 128-bit addend that is
added on to the product. Instead of an input to negate the output,
it now has a "not_result" input to complement the output. Execute1
uses not_result=1 and addend=-1 to get the effect of negating the
output. The interface is defined this way because this is what can
be done easily with the Xilinx DSP slices in xilinx-mult.vhdl.
This also adds clock enable signals to the DSP slices, mostly for the
sake of reducing power consumption.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds "if LOG_LENGTH > 0 generate" to the places in the core
where log output data is latched, so that when LOG_LENGTH = 0 we
don't create the logic to collect the data which won't be stored.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This eliminates a dependency of r.f.redirect_nia on the carry out
from the main adder in the case of a conditional trap instruction.
We can set r.f.redirect_nia unconditionally, even if no interrupt
is generated.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a path to allow the CR result of one instruction to be
forwarded to the next instruction, so that sequences such as
cmp; bc can avoid having a 1-cycle bubble.
Forwarding is not available for dot-form (Rc=1) instructions,
since the CR result for them is calculated in writeback. The
decode.output_cr field is used to identify those instructions
that compute the CR result in execute1.
For some reason, the multiply instructions incorrectly had
output_cr = 1 in the decode tables. This fixes that.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This latches the redirect signal inside execute1, so that it is sent
a cycle later to fetch1 (and to decode/icache as flush). This breaks
a long combinatorial chain from the branch and interrupt detection
in execute1 through the redirect/flush signals all the way back to
fetch1, icache and decode.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the CFAR SPR as a slow SPR stored in 'ctrl'. Taken
branches and rfid update it to the address of the branch or rfid
instruction.
To simplify the logic, this makes rfid use the branch logic to
generate its redirect (requiring SRR0 to come in to execute1 on
the B input and SRR1 on the A input), and the masking of the bottom
2 bits of NIA is moved to fetch1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reduces the number of different things that are assigned to
the result variable.
- The computations for the popcnt, prty, cmpb and exts instruction
families are moved into the logical unit.
- The result of mfspr from the slow SPRs is computed in 'spr_val'
before being assigned to 'result'.
- Writes to LR as a result of a blr or bclr instruction are done
through the exc_write path to writeback.
This eases timing considerably.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements a simple branch predictor in the decode1 stage. If it
sees that the instruction is b or bc and the branch is predicted to be
taken, it sends a flush and redirect upstream (to icache and fetch1)
to redirect fetching to the branch target. The prediction is sent
downstream with the branch instruction, and execute1 now only sends
a flush/redirect upstream if the prediction was wrong. Unconditional
branches are always predicted to be taken, and conditional branches
are predicted to be taken if and only if the offset is negative.
Branches that take the branch address from a register (bclr, bcctr)
are predicted not taken, as we don't have any way to predict the
branch address.
Since we can now have a mflr being executed immediately after a bl
or bcl, we now track the update to LR in the hazard tracker, using
the second write register field that is used to track RA updates for
update-form loads and stores.
For those branches that update LR but don't write any other result
(i.e. that don't decrementer CTR), we now write back LR in the same
cycle as the instruction rather than taking a second cycle for the
LR writeback.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reduces the number of cycles where loadstore1 asserts its busy
output, leading to increased throughput of loads and stores. Loads
that hit in the cache can now be executed at the rate of one every two
cycles. Stores take 4 cycles assuming the wishbone slave responds
with an ack the cycle after we assert strobe.
To achieve this, the state machine code is split into two parts, one
for when we have an existing instruction in progress, and one for
starting a new instruction. We can now combinatorially clear busy and
start a new instruction in the same cycle that we get a done signal
from the dcache; in other words we are completing one instruction and
potentially writing back results in the same cycle that we start a new
instruction and send its address and data to the dcache.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This changes the instruction dependency tracking so that we can
generate a "busy" signal from execute1 and loadstore1 which comes
along one cycle later than the current "stall" signal. This will
enable us to signal busy cycles only when we need to from loadstore1.
The "busy" signal from execute1/loadstore1 indicates "I didn't take
the thing you gave me on this cycle", as distinct from the previous
stall signal which meant "I took that but don't give me anything
next cycle". That means that decode2 proactively gives execute1
a new instruction as soon as it has taken the previous one (assuming
there is a valid instruction available from decode1), and that then
sits in decode2's output until execute1 can take it. So instructions
are issued by decode2 somewhat earlier than they used to be.
Decode2 now only signals a stall upstream when its output buffer is
full, meaning that we can fill up bubbles in the upstream pipe while a
long instruction is executing. This gives a small boost in
performance.
This also adds dependency tracking for rA updates by update-form
load/store instructions.
The GPR and CR hazard detection machinery now has one extra stage,
which may not be strictly necessary. Some of the code now really
only applies to PIPELINE_DEPTH=1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This puts the logic that selects which bits of the multiplier result
get written into the destination GPR into execute1, moved out from
multiply.
The multiplier is now expected to do an unsigned multiplication of
64-bit operands, optionally negate the result, detect 32-bit
or 64-bit signed overflow of the result, and return a full 128-bit
result.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This logs 256 bits of data per cycle to a ring buffer in BRAM. The
data collected can be read out through 2 new SPRs or through the
debug interface.
The new SPRs are LOG_ADDR (724) and LOG_DATA (725). LOG_ADDR contains
the buffer write pointer in the upper 32 bits (in units of entries,
i.e. 32 bytes) and the read pointer in the lower 32 bits (in units of
doublewords, i.e. 8 bytes). Reading LOG_DATA gives the doubleword
from the buffer at the read pointer and increments the read pointer.
Setting bit 31 of LOG_ADDR inhibits the trace log system from writing
to the log buffer, so the contents are stable and can be read.
There are two new debug addresses which function similarly to the
LOG_ADDR and LOG_DATA SPRs. The log is frozen while either or both of
the LOG_ADDR SPR bit 31 or the debug LOG_ADDR register bit 31 are set.
The buffer defaults to 2048 entries, i.e. 64kB. The size is set by
the LOG_LENGTH generic on the core_debug module. Software can
determine the length of the buffer because the length is ORed into the
buffer write pointer in the upper 32 bits of LOG_ADDR. Hence the
length of the buffer can be calculated as 1 << (31 - clz(LOG_ADDR)).
There is a program to format the log entries in a somewhat readable
fashion in scripts/fmt_log/fmt_log.c. The log_entry struct in that
file describes the layout of the bits in the log entries.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
By adding logic to decode2 to be able to send the instruction address
down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add
4 to the immediate value (easy since the bottom 16 bits were zero),
we can do addpcis using the main adder. This reduces the width of the
result mux and frees up one value in insn_type_t, since we can now use
OP_ADD for addpcis.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This commit adds support for the addpcis instruction from ISA 3.0.
A new input_reg_b_t type, CONST_DX_HI, was added to support the
shifted immediate value used in DX-Form instructions.
Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Use a simple wire. common.vhdl types are better kept for things
local to the core. We can add more wires later if we need to for
HV irqs etc...
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
- Changing use of others in core files to satisfy VCS
- Adding workaround for VCS subtype constraint inconsistencies in common.vhdl
Signed-off-by: Jonathan Balkind <jbalkind@princeton.edu>
Slbia (with IH=7) is used in the Linux kernel to flush the ERATs
(our iTLB/dTLB), so make it do that.
This moves the logic to work out whether to flush a single entry
or the whole TLB from dcache and icache into mmu. We now invalidate
all dTLB and iTLB entries when the AP (actual pagesize) field of
RB is non-zero on a tlbie[l], as well as when IS is non-zero.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This hooks up the connections so that an OP_FETCH_FAILED coming down
to loadstore1 will get sent to the MMU for it to do a radix tree walk
for the instruction address. The MMU then sends the resulting PTE to
the icache module to be installed in the iTLB. If no valid PTE can
be found, the MMU sends an error signal back to loadstore1 which sends
it on to execute1 to generate an ISI.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches. This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.
The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB. The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address. TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.
If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST. That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.
One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3). If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED. Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.
Tlbie operations get sent from mmu to icache over a new connection.
Unfortunately the privileged instruction tests are broken for now.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A data segment interrupt (DSegI) occurs when an address to be
translated by the MMU is outside the range of the radix tree
or the top two bits of the address (the quadrant) are 01 or 10.
This is detected in a new state of the MMU state machine, and
is sent back to loadstore1 as an error, which sends it on to
execute1 to generate an interrupt to the 0x380 vector.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds logic to the dcache to check the permissions encoded in
the PTE that it gets from the dTLB. The bits that are checked are:
R must be 1
C must be 1 for a store
EAA(0) - if this is 1, MSR[PR] must be 0
EAA(2) must be 1 for a store
EAA(1) | EAA(2) must be 1 for a load
In addition, ATT(0) is used to indicate a cache-inhibited access.
This now implements DSISR bits 36, 38 and 45.
(Bit numbers above correspond to the ISA, i.e. using big-endian
numbering.)
MSR[PR] is now conveyed to loadstore1 for use in permission checking.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a path from loadstore1 back to execute1 for reporting
errors, and machinery in execute1 for generating data storage
interrupts at vector 0x300.
If dcache is given two requests in successive cycles and the
first encounters an error (e.g. a TLB miss), it will now cancel
the second request.
Loadstore1 now responds to errors reported by dcache by sending
an exception signal to execute1 and returning to the idle state.
Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data
Storage Interrupt vector. DAR and DSISR are held in loadstore1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a TLB to dcache, providing the ability to translate
addresses for loads and stores. No protection mechanism has been
implemented yet. The MSR_DR bit controls whether addresses are
translated through the TLB.
The TLB is a fixed-pagesize, set-associative cache. Currently
the page size is 4kB and the TLB is 2-way set associative with 64
entries per set.
This implements the tlbie instruction. RB bits 10 and 11 control
whether the whole TLB is invalidated (if either bit is 1) or just
a single entry corresponding to the effective page number in bits
12-63 of RB.
As an extension until we get a hardware page table walk, a tlbie
instruction with RB bits 9-11 set to 001 will load an entry into
the TLB. The TLB entry value is in RS in the format of a radix PTE.
Currently there is no proper handling of TLB misses. The load or
store will not be performed but no interrupt is generated.
In order to make timing at 100MHz on the Arty A7-100, we compare
the real address from each way of the TLB with the tag from each way
of the cache in parallel (requiring # TLB ways * # cache ways
comparators). Then the result is selected based on which way hit in
the TLB. That avoids a timing path going through the TLB EA
comparators, the multiplexer that selects the RA, and the cache tag
comparators.
The hack where addresses of the form 0xc------- are marked as
cache-inhibited is kept for now but restricted to real-mode accesses.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This arranges for some mfspr and mtspr to get sent to loadstore1
instead of being handled in execute1. In particular, DAR and DSISR
are handled this way. They are therefore "slow" SPRs.
While we're at it, fix the spelling of HEIR and remove mention of
DAR and DSISR from the comments in execute1.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>