Commit Graph

89 Commits (a9178ed0c151e6096e378777f31739556e0eeb9b)

Author SHA1 Message Date
Benjamin Herrenschmidt f86fb74bfe irq: Simplify xics->core irq input
Use a simple wire. common.vhdl types are better kept for things
local to the core. We can add more wires later if we need to for
HV irqs etc...

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
5 years ago
Jonathan Balkind cc532dd065 Changes for compilation with VCS:
- Changing use of others in core files to satisfy VCS
- Adding workaround for VCS subtype constraint inconsistencies in common.vhdl

Signed-off-by: Jonathan Balkind <jbalkind@princeton.edu>
5 years ago
Paul Mackerras a658766fcf Implement slbia as a dTLB/iTLB flush
Slbia (with IH=7) is used in the Linux kernel to flush the ERATs
(our iTLB/dTLB), so make it do that.

This moves the logic to work out whether to flush a single entry
or the whole TLB from dcache and icache into mmu.  We now invalidate
all dTLB and iTLB entries when the AP (actual pagesize) field of
RB is non-zero on a tlbie[l], as well as when IS is non-zero.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 01046527ba MMU: Do radix page table walks on iTLB misses
This hooks up the connections so that an OP_FETCH_FAILED coming down
to loadstore1 will get sent to the MMU for it to do a radix tree walk
for the instruction address.  The MMU then sends the resulting PTE to
the icache module to be installed in the iTLB.  If no valid PTE can
be found, the MMU sends an error signal back to loadstore1 which sends
it on to execute1 to generate an ISI.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 3d4712ad43 Add TLB to icache
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches.  This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.

The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB.  The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address.  TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.

If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST.  That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.

One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3).  If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED.  Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.

Tlbie operations get sent from mmu to icache over a new connection.

Unfortunately the privileged instruction tests are broken for now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras f6a0d7f9da MMU: Implement data segment interrupts
A data segment interrupt (DSegI) occurs when an address to be
translated by the MMU is outside the range of the radix tree
or the top two bits of the address (the quadrant) are 01 or 10.
This is detected in a new state of the MMU state machine, and
is sent back to loadstore1 as an error, which sends it on to
execute1 to generate an interrupt to the 0x380 vector.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d47fbf88d1 Implement access permission checks
This adds logic to the dcache to check the permissions encoded in
the PTE that it gets from the dTLB.  The bits that are checked are:

R must be 1
C must be 1 for a store
EAA(0) - if this is 1, MSR[PR] must be 0
EAA(2) must be 1 for a store
EAA(1) | EAA(2) must be 1 for a load

In addition, ATT(0) is used to indicate a cache-inhibited access.

This now implements DSISR bits 36, 38 and 45.

(Bit numbers above correspond to the ISA, i.e. using big-endian
numbering.)

MSR[PR] is now conveyed to loadstore1 for use in permission checking.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 42d0fcc511 Implement data storage interrupts
This adds a path from loadstore1 back to execute1 for reporting
errors, and machinery in execute1 for generating data storage
interrupts at vector 0x300.

If dcache is given two requests in successive cycles and the
first encounters an error (e.g. a TLB miss), it will now cancel
the second request.

Loadstore1 now responds to errors reported by dcache by sending
an exception signal to execute1 and returning to the idle state.
Execute1 then writes SRR0 and SRR1 and jumps to the 0x300 Data
Storage Interrupt vector.  DAR and DSISR are held in loadstore1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 750b3a8e28 dcache: Implement data TLB
This adds a TLB to dcache, providing the ability to translate
addresses for loads and stores.  No protection mechanism has been
implemented yet.  The MSR_DR bit controls whether addresses are
translated through the TLB.

The TLB is a fixed-pagesize, set-associative cache.  Currently
the page size is 4kB and the TLB is 2-way set associative with 64
entries per set.

This implements the tlbie instruction.  RB bits 10 and 11 control
whether the whole TLB is invalidated (if either bit is 1) or just
a single entry corresponding to the effective page number in bits
12-63 of RB.

As an extension until we get a hardware page table walk, a tlbie
instruction with RB bits 9-11 set to 001 will load an entry into
the TLB.  The TLB entry value is in RS in the format of a radix PTE.

Currently there is no proper handling of TLB misses.  The load or
store will not be performed but no interrupt is generated.

In order to make timing at 100MHz on the Arty A7-100, we compare
the real address from each way of the TLB with the tag from each way
of the cache in parallel (requiring # TLB ways * # cache ways
comparators).  Then the result is selected based on which way hit in
the TLB.  That avoids a timing path going through the TLB EA
comparators, the multiplexer that selects the RA, and the cache tag
comparators.

The hack where addresses of the form 0xc------- are marked as
cache-inhibited is kept for now but restricted to real-mode accesses.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 635e316f9b Pass mtspr/mfspr to MMU-related SPRs down to loadstore1
This arranges for some mfspr and mtspr to get sent to loadstore1
instead of being handled in execute1.  In particular, DAR and DSISR
are handled this way.  They are therefore "slow" SPRs.

While we're at it, fix the spelling of HEIR and remove mention of
DAR and DSISR from the comments in execute1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras dd2e71930c debug: Provide a way to examine GPRs, fast SPRs and MSR
This provides commands on the debug interface to read the value of
the MSR or any of the 64 GSPR register file entries.  The GSPR values
are read using the B port of the register file in a cycle when
decode2 is not using it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 5d282a950c Improve architectural compliance of mfspr and mtspr
Mfspr from an unimplemented SPR should be a no-op in privileged state,
so in this case we need to write back whatever was previously in the
destination register.  For problem state, both mtspr and mfspr to
unimplemented SPRs should cause a program interrupt.

There are special cases in the architecture for SPRs 0, 4 5 and 6
which we still don't implement.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 8a0a907e2f Implement the extswsli instruction
This mainly required the addition of an entry to the opcode 31 decode
table and a 32-bit sign-extender in the rotator.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 102fbcfe9a execute1: Fix interrupt delivery during slow instructions
During slow instructions such as multiply or divide, if a decrementer
(or other asynchronous) interrupt becomes pending, it disrupts the
logic that keeps stall asserted until the end of the slow
instruction, and the interrupt logic starts trying to deliver the
interrupt before the slow instruction has finished.

To fix that, make the interrupt logic wait until it sees e_in.valid
set before setting exception to 1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 102b304db7 Merge remote-tracking branch 'remotes/origin/master' 5 years ago
Paul Mackerras 167e37d667 Plumb insn_type through to loadstore1
In preparation for adding a TLB to the dcache, this plumbs the
insn_type from execute1 through to loadstore1, so that we can have
other operations besides loads and stores (e.g. tlbie) going to
loadstore1 and thence to the dcache.  This also plumbs the unit field
of the decode ROM from decode2 through to execute1 to simplify the
logic around which ops need to go to loadstore1.

The load and store data formatting are now not conditional on the
op being OP_LOAD or OP_STORE.  This eliminates the inferred latches
clocked by each of the bits of r.op that we were getting previously.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 74db071067 execute1: Generate privileged instruction interrupts when MSR[PR] = 1
This adds logic to execute1 to check, when MSR[PR] = 1, whether each
instruction arriving to be executed is a privileged instruction.
If it is, a privileged-instruction type program interrupt is generated.
For the mtspr and mfspr instructions, we need to look at bit 20 of the
instruction (bit 4 of the SPR number) to determine if the SPR is
privileged.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras b55c9cc298 execute1: Improve architecture compliance of MSR and related instructions
This makes our treatment of the MSR conform better with the ISA.

- On reset, initialize the MSR to have the SF and LE bits set and
  all the others reset.  For good measure initialize r properly too.

- Fix the bit numbering in msr_copy (the code was using big-endian
  bit numbers, not little-endian).

- Use constants like MSR_EE to index MSR bits instead of expressions
  like '63 - 48', for readability.

- Set MSR[SF, LE] and clear MSR[PR, IR, DR, RI] on interrupts.

- Copy the relevant fields for rfid instead of using msr_copy, because
  the partial function fields of the MSR should be left unchanged,
  not zeroed.  Our implementation of rfid is like the architecture
  description of hrfid, because we don't implement hypervisor mode.

- Return the whole MSR for mfmsr.

- Implement the L field for mtmsrd (L=1 copies just EE and RI).

- For mtmsrd with L=0, leave out the HV, ME and LE bits as per the arch.

- For mtmsrd and rfid, if PR ends up set, then also set EE, IR and DR
  as per the arch.

- A few other minor tidyups (no semantic change).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Michael Neuling b4f20c20b9 XICS interrupt controller
New unified ICP and ICS XICS compliant interrupt controller.
Configurable number of hardware sources.

Fixed hardware source number based on hardware line taken. All
hardware interrupts are a fixed priority. Level interrupts supported
only.

Hardwired to 0xc0004000 in SOC (UART is kept at 0xc0002000).

Signed-off-by: Michael Neuling <mikey@neuling.org>
5 years ago
Paul Mackerras dc6b1df653 execute1: Don't execute ld/st instruction when taking interrupt
This fixes a bug in the logic where we would still send a load
or store instruction to loadstore1 even though we have decided
to take an asynchronous interrupt.

Reported-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 56577827d4 Decode attn in the major opcode decode table
This decodes attn using entry 0 of the major_decode_rom_array table
instead of a special case in the decode1_1 process.  This means that
only the major opcode (the top 6 bits) is checked at decode time.
To make sure the instruction is attn not some random illegal pattern,
we now check bits 1-10 of the instruction at execute time and
generate an illegal instruction interrupt if those bits are not
0100000000.

This reduces LUT consumption by 42 LUTs on the Arty A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 6f7ef8b1b9 Decode sc in the major opcode decode table
This decodes sc using entry 17 of the major_decode_rom_array table
instead of a special case in the decode1_1 process.  This means that
only the major opcode (the top 6 bits) is checked at decode time.
To make sure that the instruction is sc not scv, we now check bit
1 of the instruction at execute time and generate an illegal
instruction interrupt if it is 0 (indicating scv).  The level field
of the sc instruction is now ignored.

This reduces LUT consumption by 31 LUTs on the Arty A7-100.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 278ac5e0eb Remove sim_config instruction
It's not used any more, and it's not in the ISA.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras f5f17c24fd execute1: Implement trap instructions properly
This implements the trap instructions (tw, twi, td, tdi) using
much of the same code as is used for the cmp/cmpl instructions.
A 5-bit comparison value is generated, and for cmp/cmpl, the
appropriate 3 bits are used to update the destination CR, and for
trap instructions, the comparison value is ANDed with the TO
field, and an exception is generated if any bit of the result
is 1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 381149b2cc Consolidate trap variants under a single OP_TRAP
This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP,
distinguishing the cases by the input_reg_b and is_32bit fields of
the decode ROM.  This adds the twi and td cases to the decode tables.

For now we make all of the trap instructions unconditionally generate
a trap-type program interrupt if the TO field of the instruction is
all ones, and do nothing otherwise.

This reduces the number of values in insn_type_t from 65 to 62,
meaning that an insn_type_t can now be encoded in 6 bits rather
than 7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d77033aa92 execute1: Simplify the interrupt logic a little
This makes some simplifications to the interrupt logic which will
help with later commits.

- When irq_valid is set, don't set exception to 1 until we have a
  valid instruction.  That means we can remove the if e_in.valid = '1'
  test from the exception = '1' block.

- Don't assert stall_out on the first cycle of delivering an
  interrupt.  If we do get another instruction in the next cycle,
  nothing will happen because we have ctrl.irq_state set and we
  will just continue writing the interrupt registers.

- Make sure we deliver as many completions as we got instructions,
  otherwise the outstanding instruction count in control.vhdl gets
  out of sync.

- In writeback, make sure all of the other write enables are ignored
  when e_in.exc_write_enable is set.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras fe077a116a Rename OP_MCRF to OP_CROP and trim insn_type_t
OP_MCRF covers the CR logical ops as well as mcrf since commit
c05441bf47 ("Implement CRNOR and friends"), so this renames
OP_MCRF to OP_CROP.  The OP_* values for the individual CR logical
ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t.

No functional change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras fb8f3da128 Give exceptions a separate path to writeback
This adds separate fields in Execute1ToWritebackType for use in
writing SRR0/1 (and in future other SPRs) on an interrupt.  With
this, we make timing once again on the Arty A7-100 -- previously
we were missing by 0.2ns, presumably due to the result mux being
wider than before.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Michael Neuling 5ef5604b65 Add sc, illegal and decrementer exceptions and some supervisor state
This adds the following exceptions:
 - 0x700 program check (for illegal instructions)
 - 0x900 decrementer
 - 0xc00 system call

This also adds some supervisor state:
 - decremeter
 - msr
(SPRG0/1 and SRR0/1 already exist as fast SPRs)

It also adds some supporting instructions:
 - rfid
 - mtmsrd
 - mfmsr
 - sc

MSR state is added but only EE is used in this patch set. Other bits
are read/written but are not used at all.

This adds a 2 stage state machine to execute1.vhdl. This state machine
allows fast SPRS SRR0/1 to be written in different cycles. This state
machine can be extended later to add DAR and DSISR SPR writing for
more complex exceptions like page faults.

Signed-off-by: Michael Neuling <mikey@neuling.org>
5 years ago
Michael Neuling 594a19de37 Plumb attn instruction through to execute1
Currently we decode attn but we just mark it as an illegal.

This adds a separate case statement in execute 1 for attn to terminate
the core. Illegals also do this currently but we are soon implementing
a 0x700 execption for them.

Signed-off-by: Michael Neuling <mikey@neuling.org>
5 years ago
Paul Mackerras 81369187c0 loadstore1: Add support for cache-inhibited load and store instructions
This adds support for lbzcix, lhzcix, lwzcix, ldcix, stbcix, sthcix,
stwcix and stdcix.  The temporary hack where accesses to addresses of
the form 0xc??????? are made non-cacheable is left in for now to avoid
making existing programs non-functional.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras b349cc891a loadstore1: Move logic from dcache to loadstore1
So that the dcache could in future be used by an MMU, this moves
logic to do with data formatting, rA updates for update-form
instructions, and handling of unaligned loads and stores out of
dcache and into loadstore1.  For now, dcache connects only to
loadstore1, and loadstore1 now has the connection to writeback.

Dcache generates a stall signal to loadstore1 which indicates that
the request presented in the current cycle was not accepted and
should be presented again.  However, loadstore1 doesn't currently
use it because we know that we can never hit the circumstances
where it might be set.

For unaligned transfers, loadstore1 generates two requests to
dcache back-to-back, and then waits to see two acks back from
dcache (cycles where d_in.valid is true).

Loadstore1 now has a FSM for tracking how many acks we are
expecting from dcache and for doing the rA update cycles when
necessary.  Handling for reservations and conditional stores is
still in dcache.

Loadstore1 now generates its own stall signal back to decode2,
so we no longer need the logic in execute1 that generated the stall
for the first two cycles.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 5d85ede97d dcache: Implement load-reserve and store-conditional instructions
This involves plumbing the (existing) 'reserve' and 'rc' bits in
the decode tables down to dcache, and 'rc' and 'store_done' bits
from dcache to writeback.

It turns out that we had 'RC' set in the 'rc' column for several
ordinary stores and for the attn instruction.  This corrects them
to 'NONE', and sets the 'rc' column to 'ONE' for the conditional
stores.

In writeback we now have logic to set CR0 when the input from dcache
has rc = 1.

In dcache we have the reservation itself, which has a valid bit
and the address down to cache line granularity.  We don't currently
store the reservation length.  For a store conditional which fails,
we set a 'cancel_store' signal which inhibits the write to the
cache and prevents the state machine from starting a bus cycle or
going to the STORE_WAIT_ACK state.  Instead we set r1.stcx_fail
which causes the instruction to complete in the next cycle with
rc=1 and store_done=0.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 1a244d3470 Remove single-issue constraint for most loads and stores
This removes the constraint that loads and stores are single-issue,
at the expense of a stall of at least 2 cycles for every load and
store.

To do this, we plumb the existing stall signal that was generated
in dcache to core, where it gets ORed with the stall signal from
execute1.  Execute1 generates a stall signal for the first two
cycles of each load and store, and dcache generates the stall
signal in the 3rd and subsequent cycles if it needs to.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 441160d865 execute1: Use truth table embedded in instruction for CR logical ops
It turns out that CR logical instructions have the truth table of
the operation embedded in the instruction word.  This means that we
can collect the two input operand bits into a 2-bit value and use
that as the index to select the appropriate bit from the instruction
word.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras e08ca4ab8e countzero: Add a register to help make timing
This adds a register in the middle of the countzero computation,
so that we now have two cycles to count leading or trailing zeroes
instead of just one.  Execute1 now outputs a one-cycle stall signal
when it encounters a cntlz* or cnttz* instruction.  With this,
the countzero path no longer fails timing on the Artix-7 at 100MHz.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 5422007f83 Plumb loadstore1 input from execute1 not decode2
This allows us to use the bypass at the input of execute1 for the
address and data operands for loadstore1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras b14d982011 execute: Implement bypass from output of execute1 to input
This enables back-to-back execution of integer instructions where
the first instruction writes a GPR and the second reads the same
GPR.  This is done with a set of multiplexers at the start of
execute1 which enable any of the three input operands to be taken
from the output of execute1 (i.e. r.e.write_data) rather than the
input from decode2 (i.e. e_in.read_data[123]).

This also requires changes to the hazard detection and handling.
Decode2 generates a signal indicating that the GPR being written
is available for bypass, which is true for instructions that are
executed in execute1 (rather than loadstore1/dcache).  The
gpr_hazard module stores this "bypassable" bit, and if the same
GPR needs to be read by a subsequent instruction, it outputs a
"use_bypass" signal rather than generating a stall.  The
use_bypass signal is then latched at the output of decode2 and
passed down to execute1 to control the input multiplexer.

At the moment there is no bypass on the inputs to loadstore1, but that
is OK because all load and store instructions are marked as
single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 0c714f1be6 execute: Move popcnt and prty instructions into the logical unit
This implements logic in the logical entity to calculate the results
of the popcnt* and prty* instructions.  We now have one insn_type_t
value for the 3 popcnt variants and one for the two prty variants,
using the length field of the decode_rom_t to distinguish between
them.  The implementations in logical.vhdl using recursive
algorithms rather than the simple functions in ppc_fx_insns.vhdl.

This gives a saving of about 140 slice LUTs on the A7-100 and
improves timing slightly.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d2ca625b3b execute: Do comparisons using the main adder
This handles OP_CMP like a subtraction; the main adder computes
~RA + RB + 1, and the condition codes are computed from the results.
A direct comparison of the two input operands is used to calculate the
EQ bit of the condition result.  The LT and GT bits are computed from
the MSB of the subtraction result, the carry out from the subtraction,
and the MSBs of the operands.  For a 32-bit comparison, the 32-bit
carry and bit 31 of the result and input operands are used; for a
64-bit comparison, the 64-bit carry and bit 63 of the operands and
result are used.

It turns out to be more convenient to use the 'signed' field of
the decode table to distinguish signed from unsigned comparisons,
rather than the insn_type.  Therefore this uses OP_CMP for both
cmp and cmpl, which also has the benefit of reducing the number
of values in insn_type_t.

Doing this saves over 200 slice LUTs on the Arty A7-100 and improves
timing slightly as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras d956846667 execute1: Move EXTS* instruction back into execute1
This moves the sign extension done by the extsb, extsh and extsw
instructions back into execute1.  This means that we no longer need
any data formatting in writeback for results coming from execute1,
so this modifies writeback so the data formatter inputs come
directly from the loadstore unit output.  The condition code
updates for RC=1 form instructions are now done on the value from
execute1 rather than the output of the data formatter, which should
help timing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras c9a2076dd3 execute1: Remember dest GPR, RC, OE, XER for slow operations
For multiply and divide operations, execute1 now records the
destination GPR number, RC and OE from the instruction, and the
XER value.  This means that the multiply and divide units don't
need to record those values and then send them back to execute1.
This makes the interface to those units a bit simpler.  They
simply report an overflow signal along with the result value, and
execute1 takes care of updating XER if necessary.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 39d18d2738 Make divider hang off the side of execute1
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback.  Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider.  Divide and modulus instructions are no longer marked as
single-issue.

The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1.  We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras 2167186b5f Make multiplier hang off the side of execute1
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback.  Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.

This all means that we no longer need to mark the multiply
instructions as single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Anton Blanchard ad3db18dce Fix a ghdysynth inferred latch error in execute
It should never happen in practise, but ghdlsynth is complaining about
an inferred latch here. Fix it

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Anton Blanchard cc8a9e7893 Upper 32 bits of XER should read as 0s
From the architecture:

  bits 0:31 and 35:43 are treated as reserved and return 0s when read
  using mfxer

Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
5 years ago
Tom Vijlbrief c05441bf47 Implement CRNOR and friends
Signed-off-by: Tom Vijlbrief <tvijlbrief@gmail.com>
5 years ago
Benjamin Herrenschmidt e4f475e17f sprs: Store common SPRs in register file
This stores the most common SPRs in the register file.

This includes CTR and LR and a not yet final list of others.

The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.

On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.

Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.

This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.

[paulus@ozlabs.org - fix direction of decode2.stall_in]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Paul Mackerras ec9b27660f execute: Copy XER[SO] to CR for cmp[i] and cmpl[i] instructions
We were copying in XER[SO] for the dot-form instructions but not the
explicit compare instructions.  Fix this.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago
Benjamin Herrenschmidt 501b6daf9b Add basic XER support
The carry is currently internal to execute1. We don't handle any of
the other XER fields.

This creates type called "xer_common_t" that contains the commonly
used XER bits (CA, CA32, SO, OV, OV32).

The value is stored in the CR file (though it could be a separate
module). The rest of the bits will be implemented as a separate
SPR and the two parts reconciled in mfspr/mtspr in latter commits.

We always read XER in decode2 (there is little point not to)
and send it down all pipeline branches as it will be needed in
writeback for all type of instructions when CR0:SO needs to be
updated (such forms exist for all pipeline branches even if we don't
yet implement them).

To avoid having to track XER hazards, we forward it back in EX1. This
assumes that other pipeline branches that can modify it (mult and div)
are running single issue for now.

One additional hazard to beware of is an XER:SO modifying instruction
in EX1 followed immediately by a store conditional. Due to our writeback
latency, the store will go down the LSU with the previous XER value,
thus the stcx. will set CR0:SO using an obsolete SO value.

I doubt there exist any code relying on this behaviour being correct
but we should account for it regardless, possibly by ensuring that
stcx. remain single issue initially, or later by adding some minimal
tracking or moving the LSU into the same pipeline as execute.

Missing some obscure XER affecting instructions like addex or mcrxrx.

[paulus@ozlabs.org - fix CA32 and OV32 for OP_ADD, fix order of
 arguments to set_ov]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
5 years ago