Crossing lanes

We have seen that, most of the time, vector SIMD units prefer to keep computations in the same “lane” (element number) as the input elements. The only exception in the examples so far is the occasional splat operation (copy one element to all the other elements of the vector). Splat is an example of the general category of “permute” operations (Intel would call this a “shuffle” or “blend”). Permutes select and rearrange the elements of (usually) a concatenated pair of vectors and deliver those selected elements, in a specific order, to a result vector. The selection and order of elements in the result is controlled by a third operand, either a third input vector or an immediate field of the instruction.

For example, consider the Intel intrinsics for Horizontal Add / Subtract added with SSE3. These intrinsics add (subtract) adjacent element pairs across a pair of input vectors, placing the sum of the adjacent elements in the result vector. For example, _mm_hadd_ps implements the operation on float: for inputs X and Y, the result is {X0+X1, X2+X3, Y0+Y1, Y2+Y3}.

Horizontal Add (hadd) provides an incremental vector “sum across” operation commonly needed in matrix and vector transform math. Horizontal Add is incremental, as you need three hadd instructions to sum across 4 vectors of 4 elements (7 for 8 x 8, 15 for 16 x 16, …).

The PowerISA does not have a sum-across operation for float or double. We can use the vector float add instruction after we rearrange the inputs so that element pairs line up for the horizontal add. For example, we would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104} into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before the vec_add. This requires two vector permutes to align the elements into the correct lanes for the vector add (to implement Horizontal Add).

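The rearrangement described above can be checked with a small scalar model in portable C. This is only a sketch: the `v4f` type and the functions `hadd_ref`, `add_lanes`, and `hadd_by_rearranging` are hypothetical names introduced for illustration, and plain arrays stand in for vector registers rather than real SSE3 or PowerISA intrinsics.

```c
#include <stddef.h>

/* Hypothetical 4-lane float vector, used only for this illustration. */
typedef struct { float e[4]; } v4f;

/* Reference semantics of a horizontal add on float:
   sum the adjacent pairs of x, then the adjacent pairs of y. */
static v4f hadd_ref(v4f x, v4f y)
{
    v4f r = {{ x.e[0] + x.e[1], x.e[2] + x.e[3],
               y.e[0] + y.e[1], y.e[2] + y.e[3] }};
    return r;
}

/* Lane-wise add, standing in for a vector float add such as vec_add. */
static v4f add_lanes(v4f a, v4f b)
{
    v4f r;
    for (size_t i = 0; i < 4; i++)
        r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* Horizontal add built from a lane-wise add: first gather the odd and
   even elements of both inputs (the job of the two permutes), then add. */
static v4f hadd_by_rearranging(v4f x, v4f y)
{
    v4f odd  = {{ x.e[1], x.e[3], y.e[1], y.e[3] }};  /* {2, 4, 102, 104} for the text's inputs */
    v4f even = {{ x.e[0], x.e[2], y.e[0], y.e[2] }};  /* {1, 3, 101, 103} for the text's inputs */
    return add_lanes(odd, even);
}
```

With the inputs from the text, {1, 2, 3, 4} and {101, 102, 103, 104}, both functions produce {3, 7, 203, 207}. This shows that the two rearranged vectors followed by an ordinary lane-wise add reproduce the horizontal add.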

The PowerISA provides a generalized byte-level vector permute (vperm) that takes a vector register pair as source input and a control vector. The control vector provides 16 indexes (0-31) to select bytes from the concatenated input vector register pair (VRA, VRB). A more specific set of permute operations (pack, unpack, merge, splat), across element sizes, is encoded as separate instruction opcodes or instruction immediate fields.

Unfortunately, only the general vec_perm can provide the realignment we need for the _mm_hadd_ps operation or any of the int and short variants of hadd. Implementing _mm_hadd_ps this way requires two permute control vectors: one to select the even word elements across __X and __Y, and another to select the odd word elements across __X and __Y. The results of these permutes (vec_perm) are the inputs to the vec_add, which completes the add operation.

Fortunately, the permute required for the double (64-bit) case (i.e. _mm_hadd_pd) reduces to the equivalent of vec_mergeh / vec_mergel doubleword (which are variants of VSX Permute Doubleword Immediate). So the implementation of _mm_hadd_pd can be simplified to a vec_add of the vec_mergeh and vec_mergel results for __X and __Y. This eliminates the load of the control vectors required by the float case.
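To make the role of the control vectors concrete, here is a scalar model of the byte permute in portable C, together with the float and double horizontal adds built on top of it. It is a sketch under stated assumptions, not the actual intrinsic implementation: the types `v16u8`, `v4f`, `v2d` and the functions `perm_model`, `perm_words`, `hadd_ps_model`, `hadd_pd_model` are hypothetical names, float is assumed to be 4 bytes, and bytes are indexed in memory order, so the endian-specific behavior of the real vec_perm is not modeled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical vector types for this sketch; not the real __vector types. */
typedef struct { uint8_t b[16]; } v16u8;
typedef struct { float   e[4];  } v4f;
typedef struct { double  e[2];  } v2d;

/* Scalar model of a byte permute: each control byte selects one of the
   32 bytes in the concatenation of a and b (only the low 5 bits are used). */
static v16u8 perm_model(v16u8 a, v16u8 b, v16u8 ctl)
{
    uint8_t cat[32];
    v16u8 r;
    memcpy(cat, a.b, 16);
    memcpy(cat + 16, b.b, 16);
    for (int i = 0; i < 16; i++)
        r.b[i] = cat[ctl.b[i] & 0x1F];
    return r;
}

/* Control vectors: the bytes of words 0 and 2 (even) or words 1 and 3 (odd)
   of the first input, followed by the same words of the second input. */
static const v16u8 even_words = {{
    0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0A, 0x0B,
    0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1A, 0x1B }};
static const v16u8 odd_words = {{
    0x04, 0x05, 0x06, 0x07, 0x0C, 0x0D, 0x0E, 0x0F,
    0x14, 0x15, 0x16, 0x17, 0x1C, 0x1D, 0x1E, 0x1F }};

/* Permute float vectors by viewing them as raw bytes (assumes 4-byte float). */
static v4f perm_words(v4f x, v4f y, v16u8 ctl)
{
    v16u8 xb, yb, rb;
    v4f r;
    memcpy(xb.b, x.e, 16);
    memcpy(yb.b, y.e, 16);
    rb = perm_model(xb, yb, ctl);
    memcpy(r.e, rb.b, 16);
    return r;
}

/* Float case: two permutes feed one lane-wise add. */
static v4f hadd_ps_model(v4f x, v4f y)
{
    v4f even = perm_words(x, y, even_words);
    v4f odd  = perm_words(x, y, odd_words);
    v4f r;
    for (int i = 0; i < 4; i++)
        r.e[i] = even.e[i] + odd.e[i];
    return r;
}

/* Double case: merge high and merge low replace the permutes,
   so no control vectors have to be loaded. */
static v2d hadd_pd_model(v2d x, v2d y)
{
    v2d hi = {{ x.e[0], y.e[0] }};   /* models vec_mergeh on doublewords */
    v2d lo = {{ x.e[1], y.e[1] }};   /* models vec_mergel on doublewords */
    v2d r  = {{ hi.e[0] + lo.e[0], hi.e[1] + lo.e[1] }};
    return r;
}
```

In this model the float case needs two 16-byte control constants, while the double case needs none because the element selection is fixed by the merge operations themselves.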