This aims to simplify the logic between the instruction image and
the register file read address ports and reduce the size of the decode
tables. With this patch, the input_reg_a column of the decode tables
can only select RA or zeroes, the input_reg_b column can only select
RB or a constant (0, -1, or an immediate value from the instruction),
and the input_reg_c columns can only select RS or zeroes.
That means that the rotate/shift/logical ops now have their first
input coming in via the input_reg_c column. That means we need to
add a read_data3 field to the Decode2ToExecuteType record, but that
will go away again when we split out the rotate/mask/logical ops to
their own unit.
As a related but not tightly connected change, this patch also sets
the read1_enable signal to the register file be 0 when RA=0 and the
input_reg_a for the instruction is RA_OR_ZERO (previously it was 1).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
All of the PPC add and subtract instructions, including carrying
and extended versions, do much the same arithmetic operation:
result = (I xor A) + B + C
where A is the value from RA, I provides a logical inversion of A
(i.e. I is 0 or -1), B is either from RB or is a constant 0 or -1,
and C is 0, 1 or the carry bit from XER (CA).
To consolidate all the add/subtract instructions into a single
OP_ADD, we add a column to decode_rom_t to indicate when A should
be inverted, and change the input_carry field to a 3-state selector
to select C in the equation above.
This also adds a new "CONST_M1" value for input_reg_b_t to indicate
that B is a constant -1. This allows us to implement addme and
subfme.
The addex instruction appears not to exist, so the comments referring
to it are removed.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The register file is currently implemented as a whole pile of individual
1-bit registers instead of LUT memory which is a huge waste of FPGA
space.
This is caused by the output signal exposing the register file to the
outside world for simulation debug.
This removes that output, and moves the dumping of the register file
to the register file module itself. This saves about 8% of fpga on
the little Arty A7-35T.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The const* fields of decode_rom_t drove multiplexers in decode2 that
picked out various instruction fields and put them into the const*
fields of the Decode2ToExecute1Type record, from where they were
used in execute1. However, the code in execute1 can just as easily
use the appropriate fields of the original instruction word, since
that is now available in execute1. This therefore changes the
code to do that, resulting in smaller decode tables.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Instead of doing mfctr, mflr, mftb, mtctr, mtlr as separate ops,
just pass down mfspr and mtspr ops with the spr number and let
execute1 decode which SPR we're addressing. This will help reduce
the number of instruction bits decode1 needs to look at.
In fact we now pass down the whole instruction from decode2 to
execute1. We will need more bits of the instruction in future,
and the tools should just optimize away any that we don't end
up using. Since the 'aa' bit was just a copy of an instruction
bit, we can now remove it from the record.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This moves the negation of negative operands for signed divide and
modulus operations out of the decode2 stage and into the divider.
If either of the operands for a signed divide or modulus operation
is negative, the divider now takes an extra cycle to negate the
operands that are negative.
The interface to the divider now has an 'is_signed' signal rather
than a 'neg_result' signal, and the dividend and divisor can be
negative, so divider_tb had to be updated for the new interface.
The reason for doing this is that one of the worst timing violations
on the Arty A7-100 at 100MHz involved the carry chain in the adders
that did the negation of the dividend and divisor in the decode stage.
Moving the negations to a separate cycle fixes that and also seems to
reduce the total number of slice LUTs used.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a divider unit, connected to the core in much the same way
that the multiplier unit is connected. The division algorithm is
very simple-minded, taking 64 clock cycles for any division (even
32-bit division instructions).
The decoding is simplified by making use of regularities in the
instruction encoding for div* and mod* instructions. Instead of
having PPC_* encodings from the first-stage decoder for each of the
different div* and mod* instructions, we now just have PPC_DIV and
PPC_MOD, and the inputs to the divider that indicate what sort of
division operation to do are derived from instruction word bits.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This module adds some simple core controls:
reset, stop, start, step
along with icache clear and reading the NIA and core
status bits
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org
We only need two write ports for load with update instructions.
Having two write ports just for this instruction is expensive.
For now we will force them to be the only instruction in the
pipeline, and take two cycles of writeback.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
It simulated fine, but didn't synthesize. Fix some obvious issues
to get us going again.
Fixes: 9fbaea6f08 ("Rework CR file and add forwarding")
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Right now we continually print all 3 possible GPRs an instruction
may be using. Add signals so we only print GPRs when they are
actually read. This should hopefully optimise away when synthesized.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Handle the CR as a single field with per nibble enables. Forward any
writes in the same cycle.
If this proves to be an issue for timing, we may want to revisit
this in the future. For now, it keeps things simple.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
The decode2 stage was spaghetti code and needed cleaning up.
Create a series of functions to pull fields from a ppc instruction
and also a series of helpers to extract values for the execution
units.
As suggested by Paul, we should pass all signals to the execution
units and only set the valid signal conditionally, which should
use less resources.
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>