Data being written to an SPR by mtspr now comes into execute2 via
ex1.write_spr_data (renamed from ex1.ramspr_odd_data) rather than
ex1.e.write_data. This eliminates the need for the main result mux in
execute1 to be able to pass the c_in value through. For mfspr, the
no-op behaviour is obtained by selecting ex1.write_spr_data as
spr_result in execute2. We already had ex1.write_spr_data being set
from c_in, so no new logic is required there.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Also select the RS passthrough in the logical unit by default for
mfspr, which is needed for the no-op SPRs and the no-op behaviour
of privileged mfspr to unimplemented SPRs. For slow SPRs the RS
passthrough value is carried from execute1 to execute2 and
replaced by the correct result in execute2's result mux.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of working out result_sel and subresult_sel in decode2 from
the insn_type, they now come directly from the main decode table in
decode1. This reduces the need for distinct insn_type values and
should enable us to avoid expanding insn_type beyond 6 bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
At various points we need to set the X bit if any bit of R which would
be shifted out by a right shift of N bits is a 1. We can do this by
computing R | -R, which contains a 1 in the position of the right-most
1-bit in R and in all positions to the left, and zeroes to the right.
That means we can test for the least-significant N bits being non-zero
by testing whether bit N-1 of (R | -R) is a 1. Doing this uses fewer
LUTs and has better timing than the old method of generating a mask,
ANDing it with R, and testing whether the result is non-zero.
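The trick can be checked with a small Python model (a sketch only; the 64-bit width and function name are assumptions, not the core's actual signals):

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1

def low_bits_nonzero(r: int, n: int) -> bool:
    """True iff any of the least-significant n bits of r is 1.

    Computes r | -r, which has 1s from the right-most 1-bit of r
    upward and 0s to the right of it, then tests bit n-1.
    """
    spread = (r | (-r & MASK)) & MASK  # 1s from the lowest set bit of r leftward
    return ((spread >> (n - 1)) & 1) == 1
```

For r = 0b1000 and n = 3 this returns False (nothing shifted out), while n = 4 returns True, matching the mask-and-test method it replaces.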
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The path from execute_to_loadstore.valid through to the read enable of
the cache RAM has shown up as a critical path. In fact we can
simplify this by always asserting read enable when not stalled.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, decode2 computes register addresses from the input_reg
fields in the decode table entry and the instruction word. This
duplicates a computation that decode1 has already done based on the
insn_code value. Instead of doing this redundant computation, just
use the register addresses supplied by decode1. This means that the
decode_input_reg_* functions merely compute whether the register
operand is used or not.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes the cases in the decode stages which allowed the C
register address to come from the RB field for the hash instructions
(hashst[p], hashchk[p]), and generated a negative immediate value for
the B operand. The motivation is to simplify the logic for the C
register address. Instead the unusual construction of the address for
the hash instructions is handled in the loadstore1_in process, and the
hash computation uses the A and B operands rather than A and C.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
It seems that the Linux kernel executes cpabort on any CPU that
implements ISA v3.1 or later, despite cpabort being optional.
To cope with this, implement cpabort as a no-op.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The run status LED is off when the core is held in reset (e.g. when
the second core hasn't been started yet).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Popcount takes two cycles to execute. The computation of the final
popcount value in the second cycle has shown up as a critical path on
the Artix-7, so move one stage of the summation back into the first
cycle.
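The retiming can be sketched in software terms. This Python model is illustrative only (the VHDL may partition the adder tree differently): the per-byte counts, three summation stages deep, are produced in cycle 1, leaving just one summation stage for cycle 2:

```python
def popcount_2cycle(x: int) -> int:
    """Illustrative 2-cycle popcount of a 64-bit value."""
    assert 0 <= x < 1 << 64
    # --- cycle 1: three halving stages, ending in eight per-byte counts ---
    c2 = [((x >> i) & 1) + ((x >> (i + 1)) & 1) for i in range(0, 64, 2)]
    c4 = [c2[i] + c2[i + 1] for i in range(0, 32, 2)]
    c8 = [c4[i] + c4[i + 1] for i in range(0, 16, 2)]  # per-byte counts
    # --- register boundary ---
    # --- cycle 2: only the final summation of the eight byte counts ---
    return sum(c8)
```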
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reworks the dcache to try and simplify the logic and alleviate
some of the paths that have been showing up as critical paths in
synthesis. An example is a dependency of the req_is_hit signal on
the wishbone ack, which this series removes. Overall this seems to
have reduced LUT usage and improved timing.
This reworks the FPU logic to try and get closer to the point where the
big state machine could be converted into microcode. This means that
as far as possible the state machine should just set control lines, ideally
with as little conditional logic in each state as possible, and that anything
that is considered data should be manipulated outside of the state
machine. This also improves architecture compliance in the area of
exception handling, and alleviates some critical paths.
This implements [U]SIER2, [U]SIER3, [U]MMCR3, HMER and HMEER as
SPRs which return zero when read, and ignore writes. The zero value
is provided via the slow SPR read multiplexer. To avoid increasing
the size of the selector from 4 bits to 5, the (implementation
specific) LOG_ADDR and LOG_DATA SPRs now share a single selector
value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes a dependency of req_is_hit and similar signals on the
wishbone ack input, by removing use_forward_rl, and making idx_reload
not dependent on wr_row_match and wishbone_in.ack. Previously if a
load in r0 hit the doubleword being supplied from memory, that was
treated as a hit and the data was forwarded via a multiplexer
associated with the cache RAM. Now it is called a miss and completed
by the logic in the RELOAD_WAIT_ACK state of the state machine.
The only downside is that now the selection of data source in the
dcache_fast_hit process depends on req_is_hit rather than r1.full.
Overall this change seems to reduce the number of LUTs, and make
timing easier on the ECP-5.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of having TLB invalidation and TLB load requests come through
the dcache main path, these operations are now done in one cycle
entirely based on signals from the MMU, and don't involve the TLB read
path or the dcache state machine at all. So that we know which way of
the TLB to affect for invalidations, loadstore1 now sends down a "TLB
probe" operation for tlbie instructions which goes through the dcache
pipeline and sets the r1.tlb_hit_* fields which are used in the
subsequent invalidation operation from the MMU (if it is a single-page
invalidation). TLB load operations write to the way identified by
r1.victim_way, which was set on the TLB miss that triggered the TLB
reload.
Since we are writing just one way of the TLB tags now, rather than
writing all ways with one way's value changed, we now pad each way to
a multiple of 8 bits so that byte write-enables can be used to select
which way gets written.
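The padding scheme can be sketched as follows; TAG_BITS and NUM_WAYS here are hypothetical values for illustration, not the core's actual parameters:

```python
TAG_BITS = 45                    # hypothetical per-way TLB tag width
WAY_BYTES = (TAG_BITS + 7) // 8  # each way padded to a whole number of bytes
NUM_WAYS = 4                     # hypothetical associativity

def write_one_way(tag_ram_row: bytearray, way: int, tag: int) -> None:
    """Write a single way's tag using only byte write-enables.

    Because each way occupies a whole number of bytes, only that way's
    bytes are enabled; the other ways need not be read and rewritten.
    """
    tag_ram_row[way * WAY_BYTES:(way + 1) * WAY_BYTES] = \
        tag.to_bytes(WAY_BYTES, "little")
```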
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the LPCR (Logical Partition Control Register) with 5
read/write bits. The other 59 bits are read-only; two (HR and UPRT)
read as 1 and the rest as 0.
The bits that are implemented are:
* HAIL - enables taking interrupts with relocation on
* LD - enables large decrementer mode
* HEIC - disables external interrupts when set
* LPES - controls how external interrupts are delivered
* HVICE - does nothing at present since there is no source of
Hypervisor Virtualization Interrupts.
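The read/write-mask-plus-forced-ones behaviour can be modelled as below. The bit positions used here are placeholders, not the architected LPCR field positions (those come from the Power ISA):

```python
# Hypothetical bit positions for the five read/write LPCR bits and the
# two read-only-1 bits (HR, UPRT); the real positions come from the ISA.
LPCR_HAIL, LPCR_LD, LPCR_HEIC, LPCR_LPES, LPCR_HVICE = 4, 46, 59, 60, 62
RW_MASK = sum(1 << b for b in
              (LPCR_HAIL, LPCR_LD, LPCR_HEIC, LPCR_LPES, LPCR_HVICE))
LPCR_HR, LPCR_UPRT = 20, 22
FORCED_ONES = (1 << LPCR_HR) | (1 << LPCR_UPRT)

def lpcr_after_mtspr(value_written: int) -> int:
    """Model an mtspr to LPCR: only the read/write bits take effect;
    HR and UPRT always read as 1, every other bit reads as 0."""
    return (value_written & RW_MASK) | FORCED_ONES
```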
This also fixes a bug where MSR[RI] was getting cleared by the
delivery of hypervisor interrupts.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
HFSCR is associated with the LPAR (Logical Partitioning) feature,
which is not required for SFFS designs, so remove it and the
associated logic.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This rearranges the multiplexing of cache read data with forwarded
store data with the aim of shortening the path from the req_hit_ways
signal to the r1.data_out register. The forwarding decisions are now
made for each way independently and the results then combined
according to which way detected a cache hit.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With some slight arrangement of the state machine in the dcache_slow
process, we can remove one of the two comparators that detect writes
by other entities to the reservation granule. The state machine now
sets the wishbone cyc signal on the transition from IDLE to DO_STCX
state. Once we see the wishbone stall signal at 0, we consider we
have the wishbone and we can assert stb to do the write provided that
the stcx is to the reservation address and we haven't seen another
write to the reservation granule. We keep the comparator that
compares the snoop address delayed by one cycle, in order to make
timing easier, and the one (or more) cycle delay between cyc and stb
covers that one cycle delay in the kill_rsrv signal.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The reset was originally added to reduce metavalue warnings in
simulation; it is not necessary for correct operation, and it showed
up as a critical path in synthesis for the Xilinx Artix-7. Remove it when
doing synthesis; for simulation we set the value read to X rather than
0 in order to catch any use of the previously reset value.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This gets rid of some largish comparators in the dcache_request
process by matching index and way that hit in the cache tags instead
of comparing tag values. That is, some tag comparisons can be
replaced by seeing if both tags hit in the same cache way.
When reloading a cache line, we now set it valid at the beginning of
the reload, so that we get hits to compare. While the reload is still
occurring, accesses to doublewords that haven't yet been read are
indicated with req_is_hit = 0 and req_hit_reload = 1 (i.e. are
considered to be a miss, at least for now).
For the comparison of whether a subsequent access is to the same page
as stores already being performed, in virtual mode (TLB being used) we
now compare the way and index of the hit in the TLB, and in real mode
we compare the effective address. If any new entry has been loaded
into the TLB since the access we're comparing against, then it is
considered to be a different page.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
A dcbz operation to memory that is mapped as non-cacheable in the page
tables doesn't cause an alignment interrupt, but neither was it
implemented properly in the dcache. It does do 8 writes to memory but
it also creates a zero-filled line in the cache.
This fixes it so that dcbz to memory mapped non-cacheable doesn't
write the cache tag or set any line valid. We now have r1.reloading
which is 1 only in RELOAD_WAIT_ACK state, but only if the memory is
cacheable and therefore the cache should be updated (i.e. it is zero
in RELOAD_WAIT_ACK state if we are doing a non-cacheable dcbz).
We can now also remove the code in loadstore1 that checks for
non-cacheable dcbz, which only triggered when doing dcbz in real mode
to an address in the Cxxxxxxx range.
Also remove some unused variables and signals.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Rather than combining the results of the per-way comparators into
an encoded 'hit_way' variable, use the individual results directly
using AND-OR type networks where possible, in order to reduce
utilization and improve timing.
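The AND-OR network can be modelled in Python as below (a sketch; the function name and 64-bit data width are assumptions). Each way's data is ANDed with its replicated hit bit and the results are ORed together, so no priority encoder for hit_way is needed:

```python
MASK64 = (1 << 64) - 1

def select_hit_data(way_hits, way_data):
    """AND-OR selection across cache ways.

    With at-most-one hit (hits are one-hot or all zero), gating each
    way's data with its hit bit and OR-reducing is equivalent to a mux
    driven by an encoded way number, but shallower in logic depth.
    """
    result = 0
    for hit, data in zip(way_hits, way_data):
        result |= data & (-int(hit) & MASK64)  # replicate hit across 64 bits
    return result
```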
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds some extra states and transitions so that opsel_a becomes
a function only of the current state.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The timing path from r.a.class to result showed up as a critical path
on the Artix-7, apparently because of transfers of A, B or C to R in
special cases (e.g. NaN inputs) and the fsel instruction. To
alleviate this, we provide a path via the miscellaneous value
multiplexer from A, B and C to R, selected via opsel_R = RES_MISC and
misc_sel = 111. A new selector opsel_sel selects which of A, B or C
to transfer, using the same encoding as opsel_a. This new selector is
now also used for the result class when rcls_op = RCLS_SEL and for the
result sign when rsgn_op = RSGN_SEL. This reduces the number of
things that opsel_a depends on and eases timing in the main adder
path.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This creates a new fpu_specialcases process that handles most of the
logic that was previously in the DO_NAN_INF and DO_ZERO_DEN states.
What remains of those states, i.e. the handling of denormalized
inputs, is in a new DO_SPECIAL state. The state machine goes into
DO_SPECIAL state after IDLE for any arithmetic operation where an
input is a NaN, infinity, zero or denormalized value. Doing this
means that the rest of the state machine won't try to start any
computation which would need to be overridden by the logic to produce
the result value selected by the fpu_specialcases process.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of calculating v.cr_result in the state machine, we now have
the state machine set a 'cr_op' variable which then controls what
computation the CR data path does to set v.cr_result. The CR data
path also handles updating the XERC result bits for integer operations
(division and modulus).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the A input no longer has R as an option but now takes the
rounding constants and the low-order bits of P (used as an adjustment
in the square root algorithm). The B input has either R or zero.
Both inputs can be optionally inverted for subtraction. The select
inputs to the multiplexers now have 3 bits in opsel_a and 1 bit in
opsel_b.
The states which need R to be set now explicitly have set_r := 1 even
though that is the default, essentially for documentation reasons.
Similarly some states set opsel_b <= BIN_R even though that is the
default.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The various states choose one of four operations (including no-op) to
be done on result_class. Some operations have side-effects on
arith_done or FPSCR. The DO_NAN_INF and DO_ZERO_DEN states still set
result_class directly since their logic is expected to move out to a
separate process later.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
For the various arithmetic operators, we only get to the DO_* states
when the inputs are finite (not zero, infinity or NaN), so we can
replace setting of v.result_class to r.a.class or r.b.class with an
overall setting of it to FINITE in cycle 1 of all those operations.
Also, integer division doesn't need to set the result class since the
result is integer.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of having the various DO_* states (DO_FMUL, DO_FDIV, etc.)
handle checking for denormalized inputs, we now have DO_ZERO_DEN state
check for denormalized inputs and branch to RENORM_{A,B,C} to handle
them.
This also meant some changes were needed in how fsqrt and frsqrte
handled inputs with odd exponent. The DO_FSQRT and DO_FRSQRTE states
were very similar and have been combined into one.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This lets us remove r.opsel_a and is a step towards moving the
handling of exceptional cases out to a separate process.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Most states set opsel_a directly to select the operand for the A input
of the main adder. The exception is the EXC_RESULT state, which uses
r.opsel_a set by the previous cycle to indicate which input operand to
use as the result.
In order to make timing, ensure that the controls that select the
inputs to the main adder (opsel_*, etc.) don't depend on any
complicated functions of the data (such as px_nz, pcmpb_eq, pcmpb_lt,
etc.), but are as far as possible constant for each state. There is
now a control called set_r for whether the result is written to r.r,
which enables us to avoid setting opsel_b or opsel_r conditionally in
some cases.
Also, to avoid a data-dependent setting of msel_2 in IDIV_DODIV state,
the IDIV_NR1 and IDIV_NR2 states have been reworked so that completion
of the required number of iterations is checked in IDIV_NR1 state, and
at that point, if the inverse estimate is < 0.5, we go to IDIV_USE0_5
state in order to use 0.5 as the estimate. This means that in the
normal case, the inverse estimate is already in Y when we get to
IDIV_DODIV state. IDIV_USE0_5 has been reworked to put R (which will
contain 0.5) into Y as the inverse estimate. That means that
IDIV_DODIV state doesn't have any data-dependent logic to put either P
or R into Y.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Since r.x is mostly set from the value in r.r and only once from
anything else (r.b.mantissa), move the check to before the input
multiplexer for the main adder, so it works on r.r rather than
whatever is selected by r.opsel_a.
For the case in DO_FRSP where we have B selected by r.opsel_a, we add
a new state so that we now get B into R and then check the low bits of
R.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead use things derived from the instruction in the first cycle,
such as r.is_multiply, r.is_addition, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The architecture specifies that an invalid operation exception for
signalling NaN (VXSNAN) can occur in the same instructions as an
invalid operation exception for infinity times zero (VXIMZ) in the
case of a multiply-add instruction where B is a signalling NaN, and
one of A and C is infinity and the other is zero. This moves the
invalid operation tests around so as to handle this case correctly.
It also restructures the infinity and NaN cases to simplify the logic
a little.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
By starting out with result_sign = +/- sign of B, we avoid the need to
flip the result sign in a few places.
This also simplifies DO_FMADD state a bit by having DO_ZERO_DEN go to
DO_FMUL state for floating multiply-add where B is zero. (The
RENORM_A2 and RENORM_C2 states already do this.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>