Data being written to an SPR by mtspr now comes in to execute2 via
ex1.write_spr_data (renamed from ex1.ramspr_odd_data) rather than
ex1.e.write_data. This eliminates the need for the main result mux in
execute1 to be able to pass the c_in value through. For mfspr, the
no-op behaviour is obtained by selecting ex1.write_spr_data as
spr_result in execute2. We already had ex1.write_spr_data being set
from c_in, so no new logic is required there.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Also select the RS passthrough in the logical unit by default for
mfspr, which is needed for the no-op SPRs and the no-op behaviour
of privileged mfspr to unimplemented SPRs. For slow SPRs the RS
behaviour gets passed through from execute1 to execute2 and
replaced by the correct result in execute2's result mux.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This does bperm in the bitsort unit instead of the logical unit, and
no longer tries to do it in a single cycle with eight 64-to-1
multiplexers. Instead it is now a state machine in the bitsort unit,
takes 8 cycles, and only has one 64-to-1 multiplexer. This helps
improve timing and reduces LUT usage.
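
For illustration only, here is a small Python behavioural model of the
multi-cycle approach. It is not the VHDL from this commit. It assumes the
bpermd semantics from the Power ISA pseudocode, and the class and function
names are hypothetical. Each call to clock() does a single bit select,
standing in for the one 64-to-1 multiplexer, so a full result takes 8 steps.

```python
MASK64 = (1 << 64) - 1


def bpermd_reference(rs, rb):
    """Single-pass reference following the ISA pseudocode for bpermd."""
    result = 0
    for i in range(8):
        index = (rs >> (56 - 8 * i)) & 0xFF  # byte i of RS, big-endian byte order
        perm = (rb >> (63 - index)) & 1 if index < 64 else 0
        result |= perm << (7 - i)  # perm_0 is the MSB of the low result byte
    return result


class BpermStateMachine:
    """Hypothetical model: one bit select per clock, 8 clocks per result."""

    def __init__(self, rs, rb):
        self.rs = rs & MASK64
        self.rb = rb & MASK64
        self.step = 0
        self.result = 0
        self.done = False

    def clock(self):
        if self.done:
            return
        index = (self.rs >> (56 - 8 * self.step)) & 0xFF
        # The only wide select in the model: pick one of 64 bits of RB.
        bit = (self.rb >> (63 - index)) & 1 if index < 64 else 0
        self.result |= bit << (7 - self.step)
        self.step += 1
        self.done = self.step == 8


def bpermd_multicycle(rs, rb):
    """Run the state machine to completion; returns (result, cycles)."""
    sm = BpermStateMachine(rs, rb)
    cycles = 0
    while not sm.done:
        sm.clock()
        cycles += 1
    return sm.result, cycles
```

The model trades latency for area in the same way the text describes: the
reference loop corresponds to eight selects in parallel, the state machine
to one select reused eight times.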
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements the byte-reverse halfword, word and doubleword
instructions: brh, brw, and brd. These instructions were added to the
ISA in version 3.1. They use a new OP_BREV insn_type value. The
logic for these instructions is implemented in logical.vhdl.
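
As a rough sketch of what the three instructions compute (a Python model,
not the code in logical.vhdl; the helper name byte_reverse is made up for
this example), each one reverses the bytes within every halfword, word or
doubleword of the 64-bit RS value:

```python
MASK64 = (1 << 64) - 1


def byte_reverse(value, chunk_bytes):
    """Reverse the byte order inside each chunk_bytes-sized field of a 64-bit value."""
    value &= MASK64
    chunk_mask = (1 << (8 * chunk_bytes)) - 1
    out = 0
    for base in range(0, 8, chunk_bytes):
        chunk = (value >> (8 * base)) & chunk_mask
        swapped = int.from_bytes(chunk.to_bytes(chunk_bytes, "big"), "little")
        out |= swapped << (8 * base)
    return out


def brh(rs):
    return byte_reverse(rs, 2)  # swap the two bytes of each halfword


def brw(rs):
    return byte_reverse(rs, 4)  # reverse the four bytes of each word


def brd(rs):
    return byte_reverse(rs, 8)  # reverse all eight bytes
```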
In order to avoid going over 64 insn_type values, OP_AND and OP_OR
were combined into OP_LOGIC, which is like OP_AND except that the RS
input can be inverted as well as the RB input. The various forms of
OR instruction are then implemented using the identity
    a OR b = NOT (NOT a AND NOT b)
The 'is_signed' field of the instruction decode table is used to
indicate that RS should be inverted.
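
A minimal Python sketch of the idea, for illustration. The function and
parameter names are hypothetical, and the output inversion flag is an
assumption implied by the identity above rather than something this
message spells out:

```python
MASK64 = (1 << 64) - 1


def op_logic(rs, rb, invert_rs=False, invert_rb=False, invert_out=False):
    """AND with optional inversion of each input and of the result."""
    a = ~rs & MASK64 if invert_rs else rs & MASK64
    b = ~rb & MASK64 if invert_rb else rb & MASK64
    result = a & b
    return ~result & MASK64 if invert_out else result


def or_(rs, rb):
    # a OR b = NOT (NOT a AND NOT b)
    return op_logic(rs, rb, invert_rs=True, invert_rb=True, invert_out=True)


def orc(rs, rb):
    # a OR NOT b = NOT (NOT a AND b)
    return op_logic(rs, rb, invert_rs=True, invert_out=True)


def nor(rs, rb):
    # NOT (a OR b) = NOT a AND NOT b
    return op_logic(rs, rb, invert_rs=True, invert_rb=True)
```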
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With this, the register file now contains 64 entries, for 32 GPRs and
32 FPRs, rather than the 128 it had previously. Several things get
simplified - decode1 no longer has to work out the ispr{1,2,o} values,
decode_input_reg_{a,b,c} no longer have the t = SPR case, etc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the calculation of the result for popcnt* into the
countbits unit, renamed from countzero, so that we can take two cycles
to get the result. The motivation for this is that the popcnt*
calculation was showing up as a critical path.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
- mcrxrx put the bits in the wrong order
- addpcis was setting CR0 if bit 0 of the instruction was 1, which
  it shouldn't
- bpermd was always producing 0 and additionally had the wrong bit
  numbering
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds an explicit multiplexer feeding v.e.write_data in execute1,
with the select lines determined in the previous cycle based on the
insn_type. Similarly, for multiply and divide instructions, there is
now an explicit multiplexer.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
To avoid adding too much logic, this moves the adder used by OP_ADD
out of the case statement in execute1.vhdl so that the result can
be used by OP_ADDG6S as well.
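
To show why a single adder is enough, here is a Python sketch (not the
execute1.vhdl code; the function names are hypothetical). It assumes the
addg6s definition from the Power ISA: a result nibble is 6 when the
addition produces no carry out of that nibble position, and 0 otherwise.
Both results are derived from the same sum:

```python
MASK64 = (1 << 64) - 1


def shared_adder(a, b):
    """One adder for both ops; returns the 65-bit sum including carry out."""
    return (a & MASK64) + (b & MASK64)


def op_add(ra, rb):
    return shared_adder(ra, rb) & MASK64


def op_addg6s(ra, rb):
    ra &= MASK64
    rb &= MASK64
    total = shared_adder(ra, rb)
    # Bit j of (ra ^ rb ^ total) is the carry into bit j of the addition.
    carries = ra ^ rb ^ total
    result = 0
    for nibble in range(16):
        carry_out = (carries >> (4 * nibble + 4)) & 1
        if not carry_out:
            result |= 0x6 << (4 * nibble)
    return result
```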
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
It's not needed for the other ops (popcnt, parity, etc.) and the
logical unit shows up as a critical path from time to time.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This reduces the number of different things that are assigned to
the result variable.
- The computations for the popcnt, prty, cmpb and exts instruction
  families are moved into the logical unit.
- The result of mfspr from the slow SPRs is computed in 'spr_val'
  before being assigned to 'result'.
- Writes to LR as a result of a blr or bclr instruction are done
  through the exc_write path to writeback.
This eases timing considerably.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This implements logic in the logical entity to calculate the results
of the popcnt* and prty* instructions. We now have one insn_type_t
value for the 3 popcnt variants and one for the two prty variants,
using the length field of the decode_rom_t to distinguish between
them. The implementations in logical.vhdl use recursive
algorithms rather than the simple functions in ppc_fx_insns.vhdl.
This gives a saving of about 140 slice LUTs on the A7-100 and
improves timing slightly.
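
As an illustration of the recursive approach, here is a Python model. The
actual structure in logical.vhdl is not described in this message, so the
pairwise-sum tree and the names below are assumptions. The 'length'
argument plays the role of the decode length field, in bytes:

```python
def popcount_tree(bits):
    """Count set bits by recursively summing the two halves of the list."""
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return popcount_tree(bits[:mid]) + popcount_tree(bits[mid:])


def popcnt(rs, length):
    """length = 1 models popcntb, 4 models popcntw, 8 models popcntd."""
    width = 8 * length
    field_mask = (1 << width) - 1
    result = 0
    for base in range(0, 64, width):
        field = (rs >> base) & field_mask
        bits = [(field >> i) & 1 for i in range(width)]
        # Each field of the result holds the count for the same field of RS.
        result |= popcount_tree(bits) << base
    return result
```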
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Consolidate and/andc/nand, or/orc/nor and xor/eqv, using a common
invert on the input and output. This saves us about 200 LUTs.
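
A sketch of the consolidation in Python, for illustration only. The table
and names are hypothetical, and it assumes the input inversion applies to
the RB operand:

```python
MASK64 = (1 << 64) - 1

BASE_OPS = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}

# mnemonic -> (base op, invert RB input, invert output)
VARIANTS = {
    "and": ("and", False, False),
    "andc": ("and", True, False),
    "nand": ("and", False, True),
    "or": ("or", False, False),
    "orc": ("or", True, False),
    "nor": ("or", False, True),
    "xor": ("xor", False, False),
    "eqv": ("xor", False, True),
}


def logical(mnemonic, rs, rb):
    base, invert_in, invert_out = VARIANTS[mnemonic]
    if invert_in:
        rb = ~rb
    result = BASE_OPS[base](rs & MASK64, rb & MASK64)
    if invert_out:
        result = ~result
    return result & MASK64
```

Eight mnemonics collapse onto three base operations plus two inversion
flags, which is where the sharing comes from.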
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>