Commit Graph

200 Commits (722f239c025e55bb45e477ca70f8f6500d7801b8)

Author SHA1 Message Date
Paul Mackerras 722f239c02 Reimplement quadword loads and stores
This adds implementations of lq, plq, stq, pstq, lqarx and stqcx.

Because register file addresses are now computed in decode1 before we
have the decode table entry for the instruction, we have to check the
icode directly to know when to read register RS|1 before RS (i.e. for
stq and stqcx in LE mode, but not pstq).

For the second instance of the instruction, loadstore1 uses the EA
from the first instance + 8.  It generates an alignment interrupt for
unaligned lqarx and stqcx and for lq in LE mode with an unaligned
address.  (The reason for the latter case is that it writes RT|1
before RT, and if we have RA = RT|1 and the second instance traps, we
will have overwritten RA.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 weeks ago
Paul Mackerras fa9df33f7e Implement cfuged, pdepd and pextd
This implements the cfuged, pdepd and pextd instructions in a new unit
called bit_sorter (so called because cfuged and pextd can be viewed as
sorting the bits of the mask).

The cnt* instructions and the popcnt* instructions now use the same
OP_COUNTB insn_type so as to free up an insn_type value to use for the
new instructions.

The new instructions are implemented using a slow and simple algorithm
that takes 64 cycles to compute the result.  The ex1 stage is stalled
while this happens, as for a 64-bit multiply, or for a divide when
there is no FPU.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras d7d7a3afd4 Implement VRSAVE SPR
VRSAVE is a 32-bit software-use SPR accessible in user mode.  It is
stored in the SPR RAM.  The value read from the RAM is trimmed to 32
bits at the ramspr_read process.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras d112a7ad94 Implement scv and rfscv
The main quirk here is that scv sets LR and CTR instead of SRR0 and
SRR1, and likewise rfscv uses LR and CTR.  Also, scv uses a set of 128
interrupt vectors starting at 0x17000.  Fortunately, the layout of the
SPR RAM was already such that LR and CTR were in the even and odd
halves respectively at the same index, so reading or writing LR and
CTR instead of SRR0 and SRR1 is quite easy.

Use of scv is subject to an FSCR bit but not an HFSCR bit.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras a88fa9c459 Implement DSCR
The DSCR (Data Stream Control Register) is a user-accessible SPR that
controls aspects of data prefetching.  It has 25 bits of state defined
in the ISA.  This implements the register as a 25 read/write bits that
do nothing, since we don't have any prefetching.

The DSCR is accessible at two SPR numbers, 3 (unprivileged) and 17
(privileged).  Access via these SPR numbers is controlled by an FSCR
bit and an HFSCR bit.  The FSCR bit controls access via SPR 3 in user
mode.  The HFSCR bit controls access via SPR 3 in user mode and either
SPR number in privileged non-hypervisor mode, but since we don't
implement privileged non-hypervisor mode, it does essentially the same
thing as the FSCR bit.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 205c0e2c78 Implement the wait instruction
This implements the behaviour of the 'wait 0' instruction of pausing
execution of instructions until an exception arises.  The exceptions
that terminate a wait are a pending trace exception, external
interrupt request, PMU interrupt request, or decrementer negative
exception.  These exception conditions terminate a wait even if not
enabled to generate an interrupt (e.g. if MSR[EE] is zero).

This is implemented by having execute1 assert its busy_out signal
while the wait state exists.  The wait state is set by the completion
of the wait instruction and cleared by a pending exception.

If the WC operand of the wait instruction is non-zero, indicating wait
for reservation loss or wait for a short period, then the wait
instruction does not wait, but just acts as a no-op.

In order to make space in the insn_type_t type without going over 64
elements, this combines OP_DCBT and OP_ICBT into a single OP_XCBT,
since they were both no-ops (except for their influence on how SRR1 is
set on a trace interrupt, where they were identical).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 7bc7f335f1 Implement CTRL register
The CTRL register has a single bit called RUN.  It has some unusual
behaviours:

- It can only be written via SPR number 152, which is privileged
- It can only be read via SPR number 136, which is non-privileged
- Reading in problem state (user mode) returns the RUN bit in bit 0,
  but reading in privileged state (hypervisor mode) returns the RUN
  bit in bits 0 and 15.
- Reading SPR 152 in problem state causes a HEAI (illegal instruction)
  interrupt, but reading in privileged state is a no-op; this is the
  same as for an unimplemented SPR.

The RUN bit goes to the PMU and is also plumbed out to drive a LED on
the Arty board.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras ff0744b795 execute1: Make CFAR able to be written using mtspr and read using DMI debug
mtspr to CFAR is currently a no-op, which is not what should happen.
Make it set the contents of CFAR.

Also provide access to CFAR via the DMI debug interface as register 0x31.

Fixes: c2da82764f ("core: Implement CFAR register", 2020-06-15)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras d2777dd1dd Generate Hypervisor Emulation Assistance Interrupt for illegal instructions
This implements the HEIR register (Hypervisor Emulation Instruction
Register) and arranges for an illegal instruction to cause a
Hypervisor Emulation Assistance Interrupt (HEAI) at vector 0xE40, and
set HEIR to the illegal instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras e3f4ccedec Implement facility unavailable and hypervisor facility unavailable interrupts
This adds the FSCR and HFSCR registers and implements the associated
behaviours of taking a facility unavailable or hypervisor facility
unavailable interrupt if certain actions are attempted while the
relevant [H]FSCR bit is zero.

At present, two FSCR enable bits and three HFSCR enable bits are
implemented.  FSCR has bits for prefixed instructions and accesses to
the TAR register, and HFSCR has those plus a bit that enables access
to floating-point registers and instructions.

FSCR and HFSCR can be accessed through the debug interface using
register addresses 0x2e and 0x2f.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 12a3d76217 Implement hrfid and make MSR[HV] always 1
Implementations without hypervisor/LPAR support are permitted by the
architecture, but should have MSR[HV] forced to be 1 at all times, not
0, and should implement various instructions and registers that are
only accessible in hypervisor mode.

This commit implements MSR[HV] as a constant 1 bit and adds the hrfid
instruction, which behaves exactly the same as rfid except that it
reads HSRR0/1 instead of SRR0/1.  We already have HSRR0/1 and HSPRG0/1
implemented.

When HV=1, Linux expects external interrupts to arrive as hypervisor
interrupts, so this adds support for hypervisor interrupts (i.e.,
those that set HSRR0/1) and makes the external interrupt be a
hypervisor interrupt.  (If we had an LPCR register, the LPES bit would
control this, but we don't.)  The xics test is updated to read HSRR0/1
after an external interrupt.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 6ef9395f10
Remove vestiges of the short (16-bit) multiplier option (#432)
These aren't needed, and should have been removed in d1e8e62fee
("Remove option for "short" 16x16 bit multiplier", 2022-07-19), but
were missed.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 month ago
Paul Mackerras 1c4b5def36 Improve timing of redirect_nia going from writeback to fetch1
This gets rid of the adder in writeback that computes redirect_nia.
Instead, the main adder in the ALU is used to compute the branch
target for relative branches.  We now decode b and bc differently
depending on the AA field, generating INSN_brel, INSN_babs, INSN_bcrel
or INSN_bcabs as appropriate.  Each one has a separate entry in the
decode table in decode1; the *rel versions use CIA as the A input.
The bclr/bcctr/bctar and rfid instructions now select ramspr_result
for the main result mux to get the redirect address into
ex1.e.write_data.

For branches which are predicted taken but not actually taken, we need
to redirect to the following instruction.  We also need to do that for
isync.  We do this in the execute2 stage since whether or not to do it
depends on the branch result.  The next_nia computation is moved to
the execute2 stage and comes in via a new leg on the secondary result
multiplexer, making next_nia available ultimately in ex2.e.write_data.
This also means that the next_nia leg of the primary result
multiplexer is gone.  Incrementing last_nia by 4 for sc (so that SRR0
points to the following instruction) is also moved to execute2.

Writing CIA+4 to LR was previously done through the main result
multiplexer.  Now it comes in explicitly in the ramspr write logic.

Overall this removes the br_offset and abs_br fields and the logic to
add br_offset and next_nia, and one leg of the primary result
multiplexer, at the cost of a few extra control signals between
execute1 and execute2 and some multiplexing for the ramspr write side
and an extra input on the secondary result multiplexer.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 year ago
Paul Mackerras b50170cd1d Implement byte reversal instructions
This implements the byte-reverse halfword, word and doubleword
instructions: brh, brw, and brd.  These instructions were added to the
ISA in version 3.1.  They use a new OP_BREV insn_type value.  The
logic for these instructions is implemented in logical.vhdl.

In order to avoid going over 64 insn_type values, OP_AND and OP_OR
were combined into OP_LOGIC, which is like OP_AND except that the RS
input can be inverted as well as the RB input.  The various forms of
OR instruction are then implemented using the identity

    a OR b = NOT (NOT a AND NOT b)

The 'is_signed' field of the instruction decode table is used to
indicate that RS should be inverted.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 year ago
Paul Mackerras fd8c0000c0 Implement set[n]bc[r] instructions
This implements the setbc, setnbc, setbcr and setnbcr instructions.
Because the insn_type_t type already has 64 elements, this uses the
existing OP_SETB for the new instructions, and has execute1 compute
different results depending on bits 6-9 of the instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
1 year ago
Paul Mackerras c4492c843a Implement interrupts for prefixed instructions
This arranges to generate an illegal instruction type program
interrupt for illegal prefixed instructions, that is, those where the
suffix is not a legal value given the prefix, or the prefix has a
reserved value in the subtype field.  This implementation doesn't
generate an interrupt for the invalid 8LS:D and MLS:D instruction
forms where R = 1 and RA != 0.  (In those cases it uses (RA) as the
addend, i.e. it ignores the R bit.)

This detects the case where the address of an instruction prefix is
equal mod 64 to 60, and generates an alignment interrupt in that case.

This also arranges to set bit 34 of SRR1 when an interrupt occurs due
to a prefixed instruction, for those interrupts where that is required
(i.e. trace, alignment, floating-point unavailable, data storage, data
segment, and most cases of program interrupt).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 7af0e001ad Move insn_codes for mcrfs, mtfsb0/1 and mtfsfi
This moves the insn_code values for mcrfs, mtfsb0/1 and mtfsfi into
the region used for floating-point instructions.  This means that in
no-FPU implementations, they will get turned into illegal instructions
in predecode.  We then don't need the code in execute1 that makes FP
instructions illegal in no-FPU implementations.

We also remove the NONE value for unit_t, since it was only ever used
with insn_type = OP_ILLEGAL, and the check for unit = NONE was
redundant with the check for insn_type = OP_ILLEGAL.  Thus the check
for unit = NONE is no longer needed and is removed here.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 6fa468ca3d execute1: Reduce metavalue warnings
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras b0aa5340b8 execute1: Make it clear that divide logic is not included when HAS_FPU=true
This adds a "not HAS_FPU" condition in a few places to make it obvious
that logic to interface to the divide unit is not included when we
have an FPU.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras d1e8e62fee Remove option for "short" 16x16 bit multiplier
Now that we have a 33 bit x 33 bit signed multiplier in execute1,
there is really no need for the 16 bit multiplier.  The coremark
results are just as good without it as with it.  This removes the
option for the sake of simplicity.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras e02d8060ed Change the multiplier interface to support signed multipliers
This adds an 'is_signed' signal to MultiplyInputType to indicate
whether the data1 and data2 fields are to be interpreted as signed or
unsigned numbers.

The 'not_result' field is replaced by a 'subtract' field which
provides a more intuitive interface for requesting that the product be
subtracted from the addend rather than added, i.e. subtract = 1 gives
C - A * B, vs. subtract = 0 giving C + A * B.  (Previously the users
of the multipliers got the same effect by complementing the addend and
setting not_result = 1.)

The is_32bit field is removed because it is no longer used now that we
have a separate 32-bit multiplier.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 595a758400 execute1: Add a pipelined 33-bit signed multiplier
This adds a pipelined 33-bit by 33-bit signed multiplier with one
cycle latency to the execute pipeline, and uses it for the mullw,
mulhw and mulhwu instructions.  Because it has one cycle of latency we
can assume that its result is available in the second execute stage
without needing to add busy logic to the second stage.

This adds both a generic version of the multiplier and a
Xilinx-specific version using four DSP slices of the Artix-7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 6db626d245 icache: Log 36 bits of instruction rather than 32
This expands the field in the log buffer that stores the instruction
fetched from the icache to 36 bits, so that we get the insn_code and
illegal instruction indication.  To do this, we reclaim 3 unused bits
from execute1's portion and one other unused bit (previously just set
to 0 in core.vhdl).

This also alters the trigger behaviour to stop after one quarter of
the log buffer has been filled with samples after the trigger, or 256
entries, whichever is less.  This is to ensure that the trigger event
doesn't get overwritten when the log buffer is small.

This updates fmt_log to the new log format.  Valid instructions are
printed as a decimal insn_code value followed by the bottom 26 bits of
the instruction.  Illegal instructions are printed as "ill" followed
by the full 32 bits of the instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 26dc1e879c Eliminate use of primary opcode outside of decode1
This changes code that previously looked at the primary opcode (bits
26 to 31) of the instruction to use other methods, in places other
than in stage0 of decode1.

* Extend rc_t to have a new value, RCOE, indicating that the
  instruction has both Rc and OE bits.

* Decode2 now tells execute1 whether the instruction has a third
  operand, used for distinguishing between multiply and multiply-add
  instructions.

* The invert_a field of the decode ROM is overloaded for load/store
  instructions to indicate cache-inhibited loads and stores.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 932da4c114 FPU: Simplify IDLE state code
Do more decoding of the instruction ahead of the IDLE state
processing so that the IDLE state code becomes much simpler.
To make the decoding easier, we now use four insn_type_t codes for
floating-point operations rather than two.  This also rearranges the
insn_type_t values a little to get the 4 FP opcode values to differ
only in the bottom 2 bits, and put OP_DIV, OP_DIVE and OP_MOD next to
them.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 7a60c118ed loadstore1: Simplify address generation in OP_FETCH_FAILED case
Instead of having a multiplexer in loadstore1 in order to be able to
put the instruction address into v.addr, we now set decode.input_reg_a
to CIA in the decode table entry for OP_FETCH_FAILED.  That means that
the operand selection machinery in decode2 will supply the instruction
address to loadstore1 on the lv.addr1 input and no special case is
needed in loadstore1.  This saves a few LUTs (~40 on the Artix-7).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Paul Mackerras 939c7e39dd execute1: Fix trace interrupt on sc instruction
This fixes a bug which causes a trace interrupt to store the wrong
value in SRR0 in the case where the instruction that has just
completed is followed by a sc (system call) instruction.  What happens
is that first the traced instruction sets ex1.trace_next.  Then, when
the sc instruction following it comes in, the execute1_actions process
sets v.e.last_nia to next_nia because it is an sc instruction, even
though it is not going to be executed -- we are going to take the
trace interrupt instead.  Then when the trace interrupt is taken, we
incorrectly set SRR0 to the incremented address (the address of the
instruction following the sc).

To fix this, we have execute1_actions set a new flag if the current
instruction is sc, and only set v.e.last_nia to next_nia if we
actually execute the sc (in the "if go = '1'" case).

Fixes: 813e2317bf ("execute1: Restructure to separate out execution of side effects", 2022-06-18)
Reported-by: Anton Blanchard <anton@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2 years ago
Michael Neuling e440db13d7 Metavalue cleanup for execute1.vhdl
Signed-off-by: Michael Neuling <mikey@neuling.org>
3 years ago
Michael Neuling caf458be37 Metavalue cleanup for common.vhdl
This affects other files which have been included here.

Signed-off-by: Michael Neuling <mikey@neuling.org>
3 years ago
Paul Mackerras d0f319290f Restore debug access to SPRs
This provides access to the SPRs via the JTAG DMI interface.  For now
they are still accessed as if they were GPR/FPRs using the same
numbering as before (GPRs at 0 - 0x1f, SPRs at 0x20 - 0x2d, FPRs at
0x40 - 0x5f).

For XER, debug reads now report the full value, not just the bits that
were previously stored in the register file.  The "slow" SPR mux is
not used for debug reads.

Decode2 determines on each cycle whether a debug SPR access will
happen next cycle, based on whether there is a request and whether the
current instruction accesses the SPR RAM.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras fdb3ef6874 Finish off taking SPRs out of register file
With this, the register file now contains 64 entries, for 32 GPRs and
32 FPRs, rather than the 128 it had previously.  Several things get
simplified - decode1 no longer has to work out the ispr{1,2,o} values,
decode_input_reg_{a,b,c} no longer have the t = SPR case, etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 337b104250 Move LR, CTR and TAR out of the register file
By putting CTR on the odd side and LR and TAR on the even side, we can
read and write CTR for bdnz-style instructions in parallel with
reading LR or TAR for indirect branches and writing LR for branches
with LK=1.  Thus we don't need to double up any of these instructions,
giving a simplification in decode2.

We now have logic for printing LR and CTR at the end of a simulation
in execute1, in addition to the similar logic in register_file and
cr_file.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras bc4d02cb0d Start removing SPRs from register file
This starts the process of removing SPRs from the register file by
moving SRR0/1, SPRG0-3, HSRR0/1 and HSPRG0/1 out of the register file
and putting them into execute1.  They are stored in a pair of small
RAM arrays, referred to as "even" and "odd".  The reason for having
two arrays is so that two values can be read and written in each
cycle.  For example, SRR0 and SRR1 can be written in parallel by an
interrupt and read in parallel by the rfid instruction.

The addresses in the RAM which will be accessed are determined in the
decode2 stage.  We have one write address for both sides, but two read
addresses, since in future we will want to be able to read CTR at the
same time as either LR or TAR.

We now have a connection from writeback to execute1 which carries the
partial SRR1 value for an interrupt.  SRR0 comes from the execute
pipeline; we no longer need to carry instruction addresses along the
LSU and FPU pipelines.  Since SRR0 and SRR1 can be written in the same
cycle now, we don't need the little state machine in writeback any
more.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 73cc5167ec Use FPU for division instructions if we have an FPU
- Arrange for XER to be written for OE=1 forms
- Arrange for condition codes to be set for RC=1 forms
  (including correct handling for 32-bit mode)
- Don't instantiate the divider if we have an FPU.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras a95f8aab38 FPU: Add integer division logic to FPU
This adds logic to the FPU to accomplish 64-bit integer divisions.
No instruction actually uses this yet.

The algorithm used is to obtain an estimate of the reciprocal of the
divisor using the lookup table and refine it by one to three
iterations of the Newton-Raphson algorithm (the number of iterations
depends on the number of significant bits in the dividend).  Then the
reciprocal is multiplied by the dividend to get the quotient estimate.
The remainder is calculated as dividend - quotient * divisor.  If the
remainder is greater than or equal to the divisor, the quotient is
incremented, or if a modulo operation is being done, the divisor is
subtracted from the remainder.  The inverse estimate after refinement
is good enough that the quotient estimate is always equal to or one
less than the true quotient.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras d1850fea29 Track hazards explicitly for XER overflow bits
This provides a mechanism for tracking updates to the XER overflow
bits (SO, OV, OV32) and stalling instructions which need current
values of those bits (mfxer, integer compare instructions, integer
Rc=1 instructions, addex) or which writes carry bits (since all the
XER common bits are written together, if we are writing CA/CA32 we
need up-to-date values of SO/OV/OV32).

This will enable updates to SO/OV/OV32 to be done at other places
besides the ex1 stage.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 9a8a8e50f8 FPU: Add stage-2 stall ability to FPU
This makes the FPU able to stall other units at execute stage 2 and be
stalled by other units (specifically the LSU).

This means that the completion and writeback for an instruction can
now end up being deferred until the second cycle of a following
instruction, i.e. the cycle when the state machine has gone through
IDLE state into one of the DO_* states, which means we need to latch
the destination FPR number, CR mask, etc. from the previous
instruction so that we present the correct information to writeback.

The advantage of this is that we can get rid of the in_progress signal
from the LSU.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras ef122868d5 Do CR0 setting for Rc=1 instructions in execute2 instead of writeback
This lets us forward the CR0 result to following instructions that
use CR, meaning they get to issue one cycle earlier.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras e030a500e8 Allow integer instructions and load/store instructions to execute together
Execute1 and loadstore1 now send each other stall signals that
indicate that a valid instruction in stage 2 can't complete in this
cycle, and hence any valid instruction in stage 1 in the other unit
can't move to stage 2.  With this in place, an ALU instruction can
move into stage 1 while a LSU instruction is in stage 2.

Since the FPU doesn't yet have a way to stall completion, we can't yet
start FPU instructions while any LSU or ALU instruction is in
progress.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 4b6148ada6 Add a bypass path from the execute2 stage
This enables some instructions to issue earlier and thus improves
performance, at the cost of some extra multiplexers in decode2.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 3510071d9a Add a second execute stage to the pipeline
This adds a second execute stage to the pipeline, in order to match up
the length of the pipeline through loadstore and dcache with the
length through execute1.  This will ultimately enable us to get rid of
the 1-cycle bubble that we currently have when issuing ALU
instructions after one or more LSU instructions.

Most ALU instructions execute in the first stage, except for
count-zeroes and popcount instructions (which take two cycles and do
some of their work in the second stage) and mfspr/mtspr to "slow" SPRs
(TB, DEC, PVR, LOGA/LOGD, CFAR).  Multiply and divide/mod instructions
take several cycles but the instruction stays in the first stage (ex1)
and ex1.busy is asserted until the operation is complete.

There is currently a bypass from the first stage but not the second
stage.  Performance is down somewhat because of that and because this
doesn't yet eliminate the bubble between LSU and ALU instructions.

The forwarding of XER common bits has been changed somewhat because
now there is another pipeline stage between ex1 and the committed
state in cr_file.  The simplest thing for now is to record the last
value written and use that, unless there has been a flush, in which
case the committed state (obtained via e_in.xerc) is used.

Note that this fixes what was previously a benign bug in control.vhdl,
where it was possible for control to forget an instructions dependency
on a value from a previous instruction (a GPR or the CR) if this
instruction writes the value and the instruction gets to the point
where it could issue but is blocked by the busy signal from execute1.
In that situation, control may incorrectly not indicate that a bypass
should be used.  That didn't matter previously because, for ALU and
FPU instructions, there was only one previous instruction in flight
and once the current instruction could issue, the previous instruction
was completing and the correct value would be obtained from
register_file or cr_file.  For loadstore instructions there could be
two being executed, but because there are no bypass paths, failing to
indicate use of a bypass path is fine.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 521a5403a9 execute1: Rename 'r' to 'ex1'
Maybe this will give us slightly better names in critical path reports
and the like.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 813e2317bf execute1: Restructure to separate out execution of side effects
We now have a record that represents the actions taken in executing an
instruction, and a process that computes that for the incoming
instruction.  We no longer have 'current' or 'r.cur_instr', instead
things like the destination register are put into r.e in the first
cycle of an instruction and not reinitialized in subsequent busy
cycles.

For mfspr and mtspr, we now decode "slow" SPR numbers (those SPRs that
are not stored in the register file) to a new "spr_selector" record
in decode1 (excluding those in the loadstore unit).  With this, the
result for mfspr is determined in the data path.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 204fedc63f Move XER low bits out of register file
Besides the overflow and status carry bits, XER has 18 bits which need
to retain the value written by mtxer (in case software wants to
emulate the move-assist instructions (lswi, lswx, stswi, stswx).
Until now these bits (and others) have been stored in the GPR file as
a "fast" SPR, but this causes complications because XER is not really
a fast SPR.

Instead, we now store these 18 bits in the 'ctrl' signal, which exists
in execute1.  This will enable us to simplify the data path in future,
and has the added bonus that with a little bit of plumbing, we can get
the full XER value printed when dumping registers at the end of a
simulation.

Therefore this changes scripts/run_test.sh to remove the greps which
exclude XER from the comparison of actual and expected register
results.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Anton Blanchard 843361f2be execute1: sub_mux_sel and result_mux_sel are unused
Remove them.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Anton Blanchard a750365ffa Remove some FPGA style signal inits
These don't work on the ASIC flow, so remove them and initialise
them explicitly where required.

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
3 years ago
Paul Mackerras 2491aa7fc5 core: Make popcnt* take two cycles
This moves the calculation of the result for popcnt* into the
countbits unit, renamed from countzero, so that we can take two cycles
to get the result.  The motivation for this is that the popcnt*
calculation was showing up as a critical path.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 734e4c4a52 core: Add a short multiplier
This adds an optional 16 bit x 16 bit signed multiplier and uses it
for multiply instructions that return the low 64 bits of the product
(mull[dw][o] and mulli, but not maddld) when the operands are both in
the range -2^15 .. 2^15 - 1.   The "short" 16-bit multiplier produces
its result combinatorially, so a multiply that uses it executes in one
cycle.  This improves the coremark result by about 4%, since coremark
does quite a lot of multiplies and they almost all have operands that
fit into 16 bits.

The presence of the short multiplier is controlled by a generic at the
execute1, SOC, core and top levels.  For now, it defaults to off for
all platforms, and can be enabled using the --has_short_mult flag to
fusesoc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras a68921edca core: Fix mcrxrx, addpcis and bpermd
- mcrxrx put the bits in the wrong order

- addpcis was setting CR0 if the instruction bit 0 = 1, which it
  shouldn't

- bpermd was producing 0 always and additionally had the wrong bit
  numbering

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago
Paul Mackerras 65c43b488b PMU: Add several more events
This implements most of the architected PMU events.  The ones missing
are mostly the ones that depend on which level of the cache hierarchy
data is fetched from.  The events implemented here, and their raw
event codes, are:

    Floating-point operation completed (100f4)
    Load completed (100fc)
    Store completed (200f0)
    Icache miss (200fc)
    ITLB miss (100f6)
    ITLB miss resolved (400fc)
    Dcache load miss (400f0)
    Dcache load miss resolved (300f8)
    Dcache store miss (300f0)
    DTLB miss (300fc)
    DTLB miss resolved (200f6)
    No instruction available and none being executed (100f8)
    Instruction dispatched (200f2, 300f2, 400f2)
    Taken branch instruction completed (200fa)
    Branch mispredicted (400f6)
    External interrupt taken (200f8)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
3 years ago