Crossing lanes

We have seen that, most of the time, vector SIMD units prefer to keep
computations in the same “lane” (element number) as the input elements. The
only exceptions in the examples so far are the occasional splat (copy one
element to all the other elements of the vector) operations. Splat is an
example of the general category of “permute” operations (Intel would call
this a “shuffle” or “blend”). Permutes select and rearrange the
elements of (usually) a concatenated pair of vectors and deliver those
selected elements, in a specific order, to a result vector. The selection and
order of elements in the result are controlled by a third operand, either a third
input vector or an immediate field of the instruction.

For example, consider the Intel intrinsics for
Horizontal Add / Subtract
added with SSE3. These intrinsics add (subtract) adjacent element pairs across a pair of
input vectors, placing the sum (difference) of the adjacent elements in the result vector.
For example,
_mm_hadd_ps
implements the operation on float:
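A minimal scalar sketch of these semantics in portable C is shown here. The helper name hadd_ps_model and the plain float arrays are illustrative assumptions standing in for the intrinsic and its vector types; this is a model of the behavior described above, not Intel's definition.

```c
/* Scalar model of horizontal add on two vectors of 4 floats.
   hadd_ps_model is a hypothetical name used only for illustration. */
static void hadd_ps_model(const float a[4], const float b[4], float result[4])
{
    result[0] = a[0] + a[1];   /* first adjacent pair of a  */
    result[1] = a[2] + a[3];   /* second adjacent pair of a */
    result[2] = b[0] + b[1];   /* first adjacent pair of b  */
    result[3] = b[2] + b[3];   /* second adjacent pair of b */
}
```

With a = {1, 2, 3, 4} and b = {101, 102, 103, 104}, this model produces {3, 7, 203, 207}.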
Horizontal Add (hadd) provides an incremental vector “sum across”
operation commonly needed in matrix and vector transform math. Horizontal Add
is incremental as you need three hadd instructions to sum across 4 vectors of 4
elements (7 for 8 x 8, 15 for 16 x 16, …).

The PowerISA does not have a sum-across operation for float or
double. We can use the vector float add instruction after we rearrange the
inputs so that element pairs line up for the horizontal add. For example, we
would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104}
into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before the vec_add. This
requires two vector permutes to align the elements into the correct lanes for
the vector add (to implement Horizontal Add).

The PowerISA provides a generalized byte-level vector permute (vperm)
that takes a vector register pair as its source and a control vector. The control
vector provides 16 indexes (0-31) to select bytes from the concatenated input
vector register pair (VRA, VRB). A more specific set of permute operations (pack, unpack,
merge, splat), across element sizes, is encoded as separate
instruction opcodes or instruction immediate fields.

Unfortunately only the general vec_perm can provide the realignment
we need for the _mm_hadd_ps operation or any of the int, short variants of hadd.
For example:
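The following is a portable C sketch of the idea, not the Power implementation itself. It models a 128-bit register as 16 raw bytes and models the byte-level permute in software; vec_model, perm_model, and hadd_ps_perm_model are hypothetical names, the parameters x and y play the roles of __X and __Y, and float is assumed to be 4 bytes.

```c
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector register: 16 raw bytes. */
typedef struct { unsigned char b[16]; } vec_model;

/* Model of a byte-level permute: each control byte (0-31) selects one byte
   from the 32-byte concatenation of a (bytes 0-15) and b (bytes 16-31). */
static vec_model perm_model(vec_model a, vec_model b, vec_model ctl)
{
    unsigned char src[32];
    vec_model r;
    memcpy(src, a.b, 16);
    memcpy(src + 16, b.b, 16);
    for (int i = 0; i < 16; i++)
        r.b[i] = src[ctl.b[i] & 0x1F];
    return r;
}

/* Horizontal add of two 4 x float vectors: two permutes, then one add.
   Assumes sizeof(float) == 4. */
static void hadd_ps_perm_model(const float x[4], const float y[4],
                               float result[4])
{
    /* Control vector selecting the even words: x[0], x[2], y[0], y[2]. */
    static const vec_model even_ctl = {{
        0x00, 0x01, 0x02, 0x03,  0x08, 0x09, 0x0A, 0x0B,
        0x10, 0x11, 0x12, 0x13,  0x18, 0x19, 0x1A, 0x1B }};
    /* Control vector selecting the odd words: x[1], x[3], y[1], y[3]. */
    static const vec_model odd_ctl = {{
        0x04, 0x05, 0x06, 0x07,  0x0C, 0x0D, 0x0E, 0x0F,
        0x14, 0x15, 0x16, 0x17,  0x1C, 0x1D, 0x1E, 0x1F }};
    vec_model vx, vy, ve, vo;
    float even[4], odd[4];

    memcpy(vx.b, x, 16);
    memcpy(vy.b, y, 16);
    ve = perm_model(vx, vy, even_ctl);
    vo = perm_model(vx, vy, odd_ctl);
    memcpy(even, ve.b, 16);
    memcpy(odd, vo.b, 16);

    /* Lane-wise add of the realigned vectors, as a vector add would do. */
    for (int i = 0; i < 4; i++)
        result[i] = even[i] + odd[i];
}
```

With x = {1, 2, 3, 4} and y = {101, 102, 103, 104}, the two permutes yield {1, 3, 101, 103} and {2, 4, 102, 104}, and the lane-wise add gives {3, 7, 203, 207}.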
This requires two permute control vectors: one to select the even
word elements across __X and __Y, and another to select the odd word elements
across __X and __Y. The results of these permutes (vec_perm) are the inputs to
vec_add, which completes the add operation.

Fortunately the permute required for the double (64-bit) case (i.e.,
_mm_hadd_pd) reduces to the equivalent of vec_mergeh / vec_mergel doubleword
(which are variants of VSX Permute Doubleword Immediate). So the
implementation of _mm_hadd_pd can be simplified to this:
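As a rough portable sketch of that simplification, the merge-high and merge-low doubleword operations can be modeled with a two-element struct. The names v2df_model, mergeh_model, mergel_model, add_model, and hadd_pd_model are hypothetical stand-ins for the vector type and for vec_mergeh, vec_mergel, vec_add, and _mm_hadd_pd; this is an illustration of the data movement, not the actual Power code.

```c
/* Hypothetical stand-in for a vector of 2 doubles. */
typedef struct { double d[2]; } v2df_model;

/* Model of merge-high doubleword: element 0 of each input. */
static v2df_model mergeh_model(v2df_model a, v2df_model b)
{
    v2df_model r = {{ a.d[0], b.d[0] }};
    return r;
}

/* Model of merge-low doubleword: element 1 of each input. */
static v2df_model mergel_model(v2df_model a, v2df_model b)
{
    v2df_model r = {{ a.d[1], b.d[1] }};
    return r;
}

/* Lane-wise add of two vectors of 2 doubles. */
static v2df_model add_model(v2df_model a, v2df_model b)
{
    v2df_model r = {{ a.d[0] + b.d[0], a.d[1] + b.d[1] }};
    return r;
}

/* Horizontal add for double: the merges line up the pairs, no control
   vectors are needed. Result is { x[0] + x[1], y[0] + y[1] }. */
static v2df_model hadd_pd_model(v2df_model x, v2df_model y)
{
    return add_model(mergeh_model(x, y), mergel_model(x, y));
}
```

For x = {1, 2} and y = {101, 102}, the merges produce {1, 101} and {2, 102}, and the add gives {3, 203}.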
This eliminates the load of the control vectors required by the
previous example.