How vector elements change size and type
Most vector built-ins return the same vector type as the (first)
input parameter, but there are exceptions. Examples include conversions
between types, compares, pack, unpack, merge, and integer multiply
operations.
Converting floats to/from integer types will change the type and sometimes
change the element size as well (double ↔ int and float ↔ long). For
VMX the conversions are always the same size (float ↔ [unsigned] int). But
VSX allows conversion of 64-bit (long or double) to/from 32-bit (float or
int) with the inherent size changes. The PowerISA VSX defines a 4-element
vector layout where little endian elements 0, 2 are used for input/output and
elements 1,3 are undefined. The OpenPOWER ABI Appendix A defines
vec_double and vec_float
with even/odd and high/low extensions as program aids. These are not
included in GCC 7 or earlier but are planned for GCC 8.
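The size-changing conversion layout described above can be pictured with a small scalar model. This is only a sketch in portable C, not the VSX instruction or the ABI built-ins: the function name double_to_float_even is hypothetical, the element indices follow the numbering used in the text, and the "undefined" elements are zeroed purely so the model is deterministic.

```c
#include <assert.h>

/* Hypothetical scalar model: two doubles become floats in elements 0 and 2
   of a 4-element float vector. Elements 1 and 3 are undefined in hardware;
   they are set to 0.0f here only so the model has a predictable result. */
static inline void double_to_float_even(const double in[2], float out[4])
{
    out[0] = (float)in[0];
    out[1] = 0.0f;
    out[2] = (float)in[1];
    out[3] = 0.0f;
}
```

Code that consumes such a result should read only elements 0 and 2, or merge two such results into one fully populated vector.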
Compare operations produce either
vector bool <input element type>
(effectively bit masks) or predicates (the condition code for all and
any is represented as an int truth variable). When a predicate compare (e.g.,
vec_all_eq, vec_any_gt)
is used in an if statement, the condition code is
used directly in the conditional branch and the int truth value is not
generated.
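To make the two result styles concrete, here is a scalar sketch in portable C. It models the behavior on four 32-bit elements and does not use the actual built-ins; the names cmpeq_w, all_eq_w and any_gt_w are hypothetical stand-ins for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Element-wise compare: each result element is all ones (true) or all
   zeros (false), which is what makes it usable as a bit mask. */
static inline void cmpeq_w(const int32_t a[4], const int32_t b[4], uint32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        mask[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
}

/* Predicate: a single int truth value, true only if every element is equal. */
static inline int all_eq_w(const int32_t a[4], const int32_t b[4])
{
    for (int i = 0; i < 4; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Predicate: true if at least one element of a is greater than b. */
static inline int any_gt_w(const int32_t a[4], const int32_t b[4])
{
    for (int i = 0; i < 4; i++)
        if (a[i] > b[i])
            return 1;
    return 0;
}
```

The mask form is typically fed to a select or logical operation, while the predicate form is what an if statement would test.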
Pack operations pack integer elements into the next smaller (half)
integer sized elements. Pack operations include signed and unsigned saturate
and unsigned modulo forms. As the packed result will be half the size (in
bits), pack instructions require 2 vectors (256-bits) as input and generate a
single 128-bit vector result.
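The difference between the modulo and saturate forms is easiest to see in a scalar model. The sketch below is portable C, not the pack instructions themselves; pack_modulo_w, pack_sat_sw and sat16 are hypothetical names, and array index order stands in for vector element order.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned modulo pack: keep only the low 16 bits of each 32-bit element.
   Two 4-element inputs yield one 8-element result. */
static inline void pack_modulo_w(const uint32_t a[4], const uint32_t b[4], uint16_t r[8])
{
    for (int i = 0; i < 4; i++) {
        r[i]     = (uint16_t)a[i];
        r[i + 4] = (uint16_t)b[i];
    }
}

/* Clamp a 32-bit signed value to the int16_t range. */
static inline int16_t sat16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Signed saturate pack: out-of-range values clamp instead of wrapping. */
static inline void pack_sat_sw(const int32_t a[4], const int32_t b[4], int16_t r[8])
{
    for (int i = 0; i < 4; i++) {
        r[i]     = sat16(a[i]);
        r[i + 4] = sat16(b[i]);
    }
}
```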
Unpack operations expand integer elements into the next larger size
elements. The integers are always treated as signed values and sign-extended.
The processor design avoids instructions that return multiple register values.
So the PowerISA defines unpack-high and unpack-low forms where the instruction
takes (the high or low) half of the vector elements and extends them to fill the
vector output. Element order is maintained and an unpack high / low sequence
with the same input vector has the effect of unpacking to a 256-bit result in two
vector registers.
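A scalar sketch of the unpack high / low pair, written in portable C rather than with the actual instructions. The names unpack_high_h and unpack_low_h are hypothetical, and the model assumes the high half corresponds to array indices 0 to 3 and the low half to indices 4 to 7.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the high half (first four halfwords) to four words. */
static inline void unpack_high_h(const int16_t v[8], int32_t r[4])
{
    for (int i = 0; i < 4; i++)
        r[i] = v[i];      /* int16_t to int32_t conversion sign-extends */
}

/* Sign-extend the low half (last four halfwords) to four words. */
static inline void unpack_low_h(const int16_t v[8], int32_t r[4])
{
    for (int i = 0; i < 4; i++)
        r[i] = v[i + 4];
}
```

Calling both on the same input yields all eight elements, in order, spread over two four-word results.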
Merge operations resemble shuffling two (vectors) card decks
together, alternating (elements) cards in the result. As we are merging from
2 vectors (256-bits) into 1 vector (128-bits) and the elements do not change
size, we have merge high and merge low instruction forms for each (byte,
halfword and word) integer type. The merge high operations alternate elements
from the (vector register left) high half of the two input vectors. The merge
low operations alternate elements from the (vector register right) low half of
the two input vectors.
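The card-shuffle picture can be written out as a scalar model for word elements. This is a portable C sketch under the assumption that the high half is array indices 0 and 1 and the low half is indices 2 and 3; merge_high_w and merge_low_w are hypothetical names, not the built-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Interleave the high halves of a and b: a0, b0, a1, b1. */
static inline void merge_high_w(const uint32_t a[4], const uint32_t b[4], uint32_t r[4])
{
    r[0] = a[0];
    r[1] = b[0];
    r[2] = a[1];
    r[3] = b[1];
}

/* Interleave the low halves of a and b: a2, b2, a3, b3. */
static inline void merge_low_w(const uint32_t a[4], const uint32_t b[4], uint32_t r[4])
{
    r[0] = a[2];
    r[1] = b[2];
    r[2] = a[3];
    r[3] = b[3];
}
```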
For PowerISA 2.07 we added vector merge word even / odd instructions.
Instead of high or low elements the shuffle is from the even- or odd-numbered
elements of the two input vectors. Passing the same vector to both inputs to
merge produces splat-like results for each doubleword half, which is handy in
some convert operations.
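The even / odd variant and its splat-like behavior can be sketched the same way. Again this is a scalar model in portable C with hypothetical names (merge_even_w, merge_odd_w), assuming elements are numbered 0 to 3 in array order.

```c
#include <assert.h>
#include <stdint.h>

/* Interleave the even-numbered words of a and b: a0, b0, a2, b2. */
static inline void merge_even_w(const uint32_t a[4], const uint32_t b[4], uint32_t r[4])
{
    r[0] = a[0];
    r[1] = b[0];
    r[2] = a[2];
    r[3] = b[2];
}

/* Interleave the odd-numbered words of a and b: a1, b1, a3, b3. */
static inline void merge_odd_w(const uint32_t a[4], const uint32_t b[4], uint32_t r[4])
{
    r[0] = a[1];
    r[1] = b[1];
    r[2] = a[3];
    r[3] = b[3];
}
```

With the same vector as both inputs, merge_even_w gives a0, a0, a2, a2: each doubleword half holds two copies of one word.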
Integer multiply has the potential to generate twice as many bits in
the product as input. A multiply of 2 int (32-bit) values produces a long
(64-bits). Normal C language * operations ignore this and discard the top
32 bits of the result. However in some computations it is useful to preserve the
double product precision for intermediate computation before reducing the final
result back to the original precision.
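The two behaviors look like this in plain scalar C. The sketch uses unsigned 32-bit operands so that the truncating case is well defined; the function names mul_trunc and mul_wide are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ordinary C multiply: the product wraps, so the top 32 bits are lost. */
static inline uint32_t mul_trunc(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Widen one operand first so the full 64-bit product is preserved. */
static inline uint64_t mul_wide(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

For example, 0x10000 times 0x10000 is 0 in the truncating form and 0x100000000 in the widened form.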
The PowerISA VMX instruction set took the latter approach, i.e., keep all
the product bits until the programmer explicitly asks for the truncated result
(via the pack operation).
So the vector integer multiplies are split into even/odd forms across signed and
unsigned byte, halfword and word inputs. This requires two instructions (given
the same inputs) to generate the full vector multiply across 2 vector
registers and 256-bits. Again as POWER processors are super-scalar this pair of
instructions should execute in parallel.
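As a scalar illustration of the even / odd split for unsigned word inputs, the sketch below computes the two doubleword results that the pair of instructions would produce between them. It is portable C, not the multiply instructions; mul_even_uw and mul_odd_uw are hypothetical names, and elements are assumed to be numbered 0 to 3 in array order.

```c
#include <assert.h>
#include <stdint.h>

/* Full 64-bit products of the even-numbered words (elements 0 and 2). */
static inline void mul_even_uw(const uint32_t a[4], const uint32_t b[4], uint64_t r[2])
{
    r[0] = (uint64_t)a[0] * b[0];
    r[1] = (uint64_t)a[2] * b[2];
}

/* Full 64-bit products of the odd-numbered words (elements 1 and 3). */
static inline void mul_odd_uw(const uint32_t a[4], const uint32_t b[4], uint64_t r[2])
{
    r[0] = (uint64_t)a[1] * b[1];
    r[1] = (uint64_t)a[3] * b[3];
}
```

Together the two results hold all four products at full precision, 256 bits in total.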
The set of expanded product values can either be used directly in
further (doubled precision) computation or merged/packed into a single
vector at the smaller bit size. This is what the compiler will generate for C
vector extension multiply of vector integer types.
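One way to picture the reduction back to the original element size is the scalar model below: form the even and odd doubleword products, then keep the low 32 bits of each and put them back in element order. This is a portable C sketch with a hypothetical name (mul_modulo_w); the actual instruction sequence a compiler emits may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a word-sized modulo multiply built from even/odd wide products. */
static inline void mul_modulo_w(const uint32_t a[4], const uint32_t b[4], uint32_t r[4])
{
    uint64_t even[2], odd[2];

    /* Step 1: full-precision products of even and odd elements. */
    for (int i = 0; i < 2; i++) {
        even[i] = (uint64_t)a[2 * i] * b[2 * i];
        odd[i]  = (uint64_t)a[2 * i + 1] * b[2 * i + 1];
    }

    /* Step 2: truncate each product to 32 bits and restore element order. */
    for (int i = 0; i < 2; i++) {
        r[2 * i]     = (uint32_t)even[i];
        r[2 * i + 1] = (uint32_t)odd[i];
    }
}
```

Each result element equals the ordinary wrapping C product a[i] * b[i] for unsigned 32-bit operands.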