This implements the cfuged, pdepd and pextd instructions in a new unit
called bit_sorter (so called because cfuged and pextd can be viewed as
sorting the bits of the mask).

The cnt* and popcnt* instructions now share the OP_COUNTB insn_type,
which frees up an insn_type value for the new instructions.

The new instructions are implemented using a slow but simple algorithm
that takes 64 cycles to compute the result. The ex1 stage is stalled
while this happens, as it is for a 64-bit multiply, or for a divide
when there is no FPU.
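
For reference, the behaviour of the three instructions can be sketched
as a software model that handles one mask bit per loop iteration, 64
iterations in total. This is only an illustration written from the
architectural description of the instructions; it is not a translation
of the bit_sorter logic, and the function names, argument order and
LSB-first bit numbering are assumptions made for the sketch.

```python
MASK64 = (1 << 64) - 1


def pextd(src, mask):
    """Gather the src bits selected by mask into the low end of the result."""
    result = 0
    out_pos = 0
    for i in range(64):  # one mask bit per step, LSB first
        if (mask >> i) & 1:
            result |= ((src >> i) & 1) << out_pos
            out_pos += 1
    return result


def pdepd(src, mask):
    """Scatter the low bits of src to the positions where mask is 1."""
    result = 0
    in_pos = 0
    for i in range(64):
        if (mask >> i) & 1:
            result |= ((src >> in_pos) & 1) << i
            in_pos += 1
    return result


def cfuged(src, mask):
    """Move src bits with mask=1 to the right and mask=0 to the left,
    keeping the relative order within each group."""
    result = 0
    lo = 0   # next free slot, filling upwards from bit 0
    hi = 63  # next free slot, filling downwards from bit 63
    for i in range(64):
        if (mask >> i) & 1:          # scan upwards for mask=1 bits
            result |= ((src >> i) & 1) << lo
            lo += 1
        j = 63 - i
        if not (mask >> j) & 1:      # scan downwards for mask=0 bits
            result |= ((src >> j) & 1) << hi
            hi -= 1
    return result
```

Feeding the mask in as its own source shows the sorting view: both
cfuged(m, m) and pextd(m, m) return the 1 bits of m packed together at
the low end of the result.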

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>