Crossing lanes

Vector SIMD units prefer to keep computations in the same “lane” (element number) as the input elements. The only exceptions in the examples so far are the occasional vector splat operations, which copy one element to all the other elements of the vector. Splat is an example of the general category of “permute” operations (Intel would call this a “shuffle” or “blend”). Permutes select and rearrange the elements of an input vector (or a concatenated pair of vectors) and deliver those selected elements, in a specific order, to a result vector. The selection and order of elements in the result is controlled by a third operand, either a third input vector or an immediate field of the instruction.

For example, consider the Intel intrinsics for Horizontal Add / Subtract added with SSE3. These intrinsics add (subtract) adjacent element pairs across a pair of input vectors, placing the sum (difference) of the adjacent elements in the result vector. For example, _mm_hadd_ps implements the operation on float: for the inputs {1, 2, 3, 4} and {101, 102, 103, 104} it returns {3, 7, 203, 207}.

Horizontal Add (hadd) provides an incremental vector “sum across” operation commonly needed in matrix and vector transform math. Horizontal Add is incremental as you need three hadd instructions to sum across 4 vectors of 4 elements (7 for 8 x 8, 15 for 16 x 16, …).

The PowerISA does not have a sum-across operation for float or double. We can use the vector float add instruction after we rearrange the inputs so that element pairs line up for the horizontal add. For example, we would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104} into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before the vec_add. This requires two vector permutes to align the elements into the correct lanes for the vector add (to implement Horizontal Add).
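The rearrange-then-add idea can be modelled in portable scalar C. This is only a minimal sketch of the data movement, not the Power implementation: plain float arrays stand in for vector registers, and the names LANES, select_stride2, and hadd_ps_model are hypothetical, introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 };

/* Hypothetical helper: gather every second element, starting at `first`
 * (0 = even elements, 1 = odd elements), from x and then from y. */
static void select_stride2(const float x[LANES], const float y[LANES],
                           size_t first, float out[LANES])
{
    out[0] = x[first];
    out[1] = x[first + 2];
    out[2] = y[first];
    out[3] = y[first + 2];
}

/* Scalar model of a horizontal add: two "permutes" line the adjacent
 * pairs up in the same lane, then a lane-wise add finishes the job. */
void hadd_ps_model(const float x[LANES], const float y[LANES],
                   float out[LANES])
{
    float even[LANES], odd[LANES];

    assert(x != NULL && y != NULL && out != NULL);

    select_stride2(x, y, 0, even);  /* {x0, x2, y0, y2} */
    select_stride2(x, y, 1, odd);   /* {x1, x3, y1, y3} */

    for (size_t i = 0; i < LANES; i++)
        out[i] = even[i] + odd[i];  /* same-lane add, as vec_add would do */
}
```

Calling hadd_ps_model with {1, 2, 3, 4} and {101, 102, 103, 104} builds the two intermediate vectors {1, 3, 101, 103} and {2, 4, 102, 104} and adds them lane by lane.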
The PowerISA provides a generalized byte-level vector permute (vperm) that takes a vector register pair (32 bytes) as the source and a 16-byte control vector. The control vector provides 16 indexes (0-31) to select bytes from the concatenated input vector register pair (VRA, VRB). There are also predefined permute operations (splat, pack, unpack, merge), across element sizes, that are encoded as separate instruction opcodes or instruction immediate fields.

Unfortunately, only the general vec_perm can provide the realignment we need for the _mm_hadd_ps operation or any of the int and short variants of hadd. An implementation along these lines requires two permute control vectors: one to select the even word elements across __X and __Y, and another to select the odd word elements across __X and __Y. The results of these permutes (vec_perm) are inputs to the vec_add that completes the horizontal add operation.

Fortunately, the permute required for the double (64-bit) case (_mm_hadd_pd) reduces to the equivalent of vec_mergeh / vec_mergel doubleword (which are variants of VSX Permute Doubleword Immediate). So the implementation of _mm_hadd_pd can be simplified to a vec_add of the vec_mergeh and vec_mergel results. This eliminates the load of the control vectors required by the vec_perm approach.
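To make the control-vector mechanics concrete, here is a hedged, portable C model of a byte-level permute and of the two horizontal-add variants discussed above. It is a sketch under stated assumptions, not the actual intrinsic implementation: the function names (vperm_model, hadd_ps_perm_model, hadd_pd_merge_model) and the table names are hypothetical, byte indexes are taken in memory order (the big- and little-endian numbering details of the real instruction are ignored), and float is assumed to be 4 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of a byte-level permute: each of the 16 control bytes picks one
 * byte (index 0-31) from the 32-byte concatenation of a and b. */
void vperm_model(const void *a, const void *b,
                 const uint8_t ctl[16], void *out)
{
    uint8_t src[32], dst[16];

    memcpy(src, a, 16);
    memcpy(src + 16, b, 16);
    for (int i = 0; i < 16; i++)
        dst[i] = src[ctl[i] & 0x1F];  /* only the low 5 bits are used */
    memcpy(out, dst, 16);
}

/* Control vectors: words 0 and 2 (even), words 1 and 3 (odd), of a then b. */
static const uint8_t EVEN_WORDS[16] = {
    0x00, 0x01, 0x02, 0x03,  0x08, 0x09, 0x0A, 0x0B,
    0x10, 0x11, 0x12, 0x13,  0x18, 0x19, 0x1A, 0x1B
};
static const uint8_t ODD_WORDS[16] = {
    0x04, 0x05, 0x06, 0x07,  0x0C, 0x0D, 0x0E, 0x0F,
    0x14, 0x15, 0x16, 0x17,  0x1C, 0x1D, 0x1E, 0x1F
};

/* float case: two general permutes feed a lane-wise add. */
void hadd_ps_perm_model(const float x[4], const float y[4], float out[4])
{
    float even[4], odd[4];

    assert(sizeof(float) == 4);       /* assumption of this model */
    vperm_model(x, y, EVEN_WORDS, even);
    vperm_model(x, y, ODD_WORDS, odd);
    for (int i = 0; i < 4; i++)
        out[i] = even[i] + odd[i];
}

/* double case: the permutes reduce to merge-high / merge-low, so no
 * control vectors need to be loaded. */
void hadd_pd_merge_model(const double x[2], const double y[2], double out[2])
{
    double hi[2] = { x[0], y[0] };    /* models vec_mergeh */
    double lo[2] = { x[1], y[1] };    /* models vec_mergel */

    out[0] = hi[0] + lo[0];
    out[1] = hi[1] + lo[1];
}
```

The float path needs two 16-byte tables that a real implementation would have to load into registers, while the double path expresses the same rearrangement with fixed merge operations, which is the saving the text describes.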