This moves bperm from the logical unit into the bitsort unit, and no
longer tries to do it in a single cycle with eight 64-to-1
multiplexers. Instead, bperm is now a state machine in the bitsort
unit that takes 8 cycles and needs only one 64-to-1 multiplexer. This
improves timing and reduces LUT usage.
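
As a rough software model of the 8-step approach (not the actual VHDL),
the sketch below selects one permuted bit per loop iteration, with each
iteration standing in for one cycle of the state machine. It assumes
bpermd semantics with IBM bit numbering (bit 0 is the most significant
bit). The function name, and the order in which the index bytes are
visited, are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1


def bpermd_model(rs: int, rb: int) -> int:
    """Bit-serial model of bpermd: one selected bit per iteration."""
    rs &= MASK64
    rb &= MASK64
    result = 0
    for i in range(8):  # each iteration models one cycle (assumption)
        # Index byte i of RS; byte 0 is the most significant byte.
        index = (rs >> (56 - 8 * i)) & 0xFF
        if index < 64:
            # The single 64-to-1 selection: IBM bit `index` of RB.
            bit = (rb >> (63 - index)) & 1
        else:
            # Out-of-range indices contribute a zero bit.
            bit = 0
        # Shift in from the right, so byte 0 ends up in result bit 7.
        result = (result << 1) | bit
    return result
```

For example, index bytes 0 through 7 pick out the top byte of RB, so
bpermd_model(0x0001020304050607, 0xA5 << 56) returns 0xA5.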

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

This implements the cfuged, pdepd and pextd instructions in a new unit
called bit_sorter (so called because cfuged and pextd can be viewed as
sorting the bits of the mask).

The cnt* instructions and the popcnt* instructions now use the same
OP_COUNTB insn_type so as to free up an insn_type value to use for the
new instructions.

The new instructions are implemented using a slow and simple algorithm
that takes 64 cycles to compute the result. The ex1 stage stalls while
the result is being computed, as it does for a 64-bit multiply, or for
a divide when there is no FPU.
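
For reference, the behaviour of the three instructions can be sketched
as a bit-serial software model that handles one mask bit per loop
iteration, 64 iterations in total. This is only an illustration of the
instruction semantics and is not claimed to match the algorithm used in
bit_sorter. The function names, and the choice to scan from the least
significant bit, are assumptions.

```python
MASK64 = (1 << 64) - 1


def pextd_model(rs: int, mask: int) -> int:
    """Gather the RS bits selected by mask into the low end of the result."""
    result, k = 0, 0
    for m in range(64):  # one mask bit per step
        if (mask >> m) & 1:
            result |= ((rs >> m) & 1) << k
            k += 1
    return result


def pdepd_model(rs: int, mask: int) -> int:
    """Scatter the low-order RS bits to the positions where mask is 1."""
    result, k = 0, 0
    for m in range(64):
        if (mask >> m) & 1:
            result |= ((rs >> k) & 1) << m
            k += 1
    return result


def cfuged_model(rs: int, mask: int) -> int:
    """Move RS bits under mask 1 to the right and the rest to the left."""
    ones = zeros = 0
    n1 = n0 = 0
    for m in range(64):
        bit = (rs >> m) & 1
        if (mask >> m) & 1:
            ones |= bit << n1
            n1 += 1
        else:
            zeros |= bit << n0
            n0 += 1
    # The mask-0 bits sit above the n1 mask-1 bits; order is preserved.
    return ((zeros << n1) | ones) & MASK64
```

With rs = 0xB2 and mask = 0xF0, pextd_model gives 0xB and cfuged_model
gives 0x2B; pdepd_model(0xB, 0xF0) gives 0xB0.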

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>