<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_powerisa_vector_size_type">
  <title>How vector elements change size and type</title>

  <para>Most vector built-ins return the same vector type as their (first)
  input parameter, but there are exceptions. Examples include conversions
  between types, compares, pack, unpack, merge, and integer multiply
  operations.</para>

  <para>Converting between floating-point and integer types changes the element type and sometimes
  the element size as well (double ↔ int and float ↔ long). For
  VMX the conversions always preserve the element size (float ↔ [unsigned] int). But
  VSX allows conversion of 64-bit (long or double) to and from 32-bit (float or
  int) elements, with the inherent size changes. The PowerISA VSX defines a 4-element
  vector layout where little endian elements 0 and 2 are used for input/output and
  elements 1 and 3 are undefined. The OpenPOWER ABI Appendix A defines
  <literal>vec_double</literal> and <literal>vec_float</literal>
  with even/odd and high/low extensions as programming aids. These are not
  included in GCC 7 or earlier but are planned for GCC 8.</para>
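
  <para>The size change can be pictured with a portable scalar model. This is
  only a sketch: the struct types and <literal>model_cvt_double_to_float</literal>
  are hypothetical names invented for illustration, not compiler built-ins, and
  the lane numbering simply follows the description above.
  <programlisting><![CDATA[
```c
typedef struct { double e[2]; } v2df;   /* model of vector double */
typedef struct { float  e[4]; } v4sf;   /* model of vector float  */

/* Model of a VSX double-to-float convert. Each 64-bit input yields one
   32-bit result, so only two of the four float lanes are meaningful.
   Results land in lanes 0 and 2, as described in the text. Lanes 1 and 3
   are undefined in hardware; they are zeroed here only to keep the
   model deterministic. */
static v4sf model_cvt_double_to_float(v2df a)
{
    v4sf r = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    r.e[0] = (float)a.e[0];
    r.e[2] = (float)a.e[1];
    return r;
}
```
]]></programlisting></para>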

  <para>Compare operations produce either
  <literal>vector bool &lt;</literal>input element type<literal>&gt;</literal>
  results (effectively bit masks) or predicates (the condition code for all or
  any is represented as an int truth value). When a predicate compare (for example,
  <literal>vec_all_eq</literal> or <literal>vec_any_gt</literal>)
  is used in an if statement, the condition code is
  used directly in the conditional branch and the int truth value is not
  generated.</para>
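
  <para>The two result styles can be sketched in portable C. The
  <literal>model_</literal> functions and struct types below are assumptions
  made for illustration; they mimic the element-wise mask and the all/any
  predicates with scalar loops and are not the actual built-ins.
  <programlisting><![CDATA[
```c
#include <stdint.h>

typedef struct { int32_t  e[4]; } v4si;   /* model of vector signed int */
typedef struct { uint32_t e[4]; } v4bi;   /* model of vector bool int   */

/* Element-wise compare: each result element is all ones or all zeros. */
static v4bi model_cmpeq(v4si a, v4si b)
{
    v4bi m;
    for (int i = 0; i < 4; i++)
        m.e[i] = (a.e[i] == b.e[i]) ? UINT32_MAX : 0u;
    return m;
}

/* Predicate forms reduce the per-element outcomes to one int. */
static int model_all_eq(v4si a, v4si b)
{
    for (int i = 0; i < 4; i++)
        if (a.e[i] != b.e[i])
            return 0;
    return 1;
}

static int model_any_eq(v4si a, v4si b)
{
    for (int i = 0; i < 4; i++)
        if (a.e[i] == b.e[i])
            return 1;
    return 0;
}
```
]]></programlisting></para>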

  <para>Pack operations pack integer elements into the next smaller (half-size)
  integer elements. Pack operations include signed and unsigned saturate
  and unsigned modulo forms. As each packed element is half the size (in
  bits) of its input, pack instructions require 2 vectors (256 bits) as input and generate a
  single 128-bit vector result.
  <programlisting><![CDATA[vec_vpkudum ({1, 2}, {101, 102}) result={1, 2, 101, 102}]]></programlisting></para>
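
  <para>The difference between the modulo and saturate forms shows up only
  when a value does not fit in the smaller element. A minimal scalar sketch,
  assuming hypothetical <literal>model_pack_modulo</literal> and
  <literal>model_pack_usat</literal> helpers that work in array index order:
  <programlisting><![CDATA[
```c
#include <stdint.h>

typedef struct { uint64_t e[2]; } v2du;   /* model of vector unsigned long */
typedef struct { uint32_t e[4]; } v4su;   /* model of vector unsigned int  */

/* Unsigned modulo pack: keep the low 32 bits of each 64-bit element. */
static v4su model_pack_modulo(v2du a, v2du b)
{
    v4su r;
    r.e[0] = (uint32_t)a.e[0];
    r.e[1] = (uint32_t)a.e[1];
    r.e[2] = (uint32_t)b.e[0];
    r.e[3] = (uint32_t)b.e[1];
    return r;
}

/* Clamp a 64-bit value to the largest 32-bit unsigned value. */
static uint32_t sat_u32(uint64_t x)
{
    return (x > UINT32_MAX) ? UINT32_MAX : (uint32_t)x;
}

/* Unsigned saturate pack: out-of-range elements become UINT32_MAX. */
static v4su model_pack_usat(v2du a, v2du b)
{
    v4su r;
    r.e[0] = sat_u32(a.e[0]);
    r.e[1] = sat_u32(a.e[1]);
    r.e[2] = sat_u32(b.e[0]);
    r.e[3] = sat_u32(b.e[1]);
    return r;
}
```
]]></programlisting></para>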

  <para>Unpack operations expand integer elements into the next larger size
  elements. The integers are always treated as signed values and sign-extended.
  The processor design avoids instructions that return multiple register values,
  so the PowerISA defines unpack-high and unpack-low forms, where each instruction
  takes one half (high or low) of the vector elements and extends them to fill the
  vector output. Element order is maintained, and an unpack high / low sequence
  with the same input vector has the effect of unpacking to a 256-bit result in two
  vector registers.
  <programlisting><![CDATA[vec_vupkhsw ({1, 2, 3, 4}) result={1, 2}
vec_vupkhsw ({-1, 2, -3, 4}) result={-1, 2}
vec_vupklsw ({1, 2, 3, 4}) result={3, 4}
vec_vupklsw ({-1, 2, -3, 4}) result={-3, 4}]]></programlisting></para>
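
  <para>The high / low pair can be modeled in portable C as shown here. The
  helper names are hypothetical, and the model assumes the same element order
  as the listing above, so "high" means the first two array elements.
  <programlisting><![CDATA[
```c
#include <stdint.h>

typedef struct { int32_t e[4]; } v4si;   /* model of vector signed int  */
typedef struct { int64_t e[2]; } v2di;   /* model of vector signed long */

/* Unpack high: sign-extend the first two 32-bit elements to 64 bits. */
static v2di model_unpack_high(v4si a)
{
    v2di r;
    r.e[0] = (int64_t)a.e[0];
    r.e[1] = (int64_t)a.e[1];
    return r;
}

/* Unpack low: sign-extend the last two 32-bit elements to 64 bits. */
static v2di model_unpack_low(v4si a)
{
    v2di r;
    r.e[0] = (int64_t)a.e[2];
    r.e[1] = (int64_t)a.e[3];
    return r;
}
```
]]></programlisting>
  Calling both helpers on the same input yields all four elements, widened,
  across two results.</para>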

  <para>Merge operations resemble shuffling two decks of cards (vectors)
  together, alternating cards (elements) in the result. As we are merging from
  2 vectors (256 bits) into 1 vector (128 bits) and the elements do not change
  size, we have merge high and merge low instruction forms for each (byte,
  halfword and word) integer type. The merge high operations alternate elements
  from the high (vector register left) half of the two input vectors. The merge
  low operations alternate elements from the low (vector register right) half of
  the two input vectors.</para>

  <para>For PowerISA 2.07 we added vector merge word even / odd instructions.
  Instead of the high or low elements, the shuffle takes the even- or odd-numbered
  elements of the two input vectors. Passing the same vector to both inputs of a
  merge produces splat-like results for each doubleword half, which is handy in
  some convert operations.
  <programlisting><![CDATA[vec_mrghd ({1, 2}, {101, 102}) result={1, 101}
vec_mrgld ({1, 2}, {101, 102}) result={2, 102}

vec_vmrghw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 2, 102}
vec_vmrghw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 2, 2}
vec_vmrglw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={3, 103, 4, 104}
vec_vmrglw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={3, 3, 4, 4}


vec_mergee ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 3, 103}
vec_mergee ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 3, 3}
vec_mergeo ({1, 2, 3, 4}, {101, 102, 103, 104}) result={2, 102, 4, 104}
vec_mergeo ({1, 2, 3, 4}, {1, 2, 3, 4}) result={2, 2, 4, 4}]]></programlisting></para>
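
  <para>The four word merge patterns listed above differ only in which source
  elements they pick. A scalar sketch, assuming hypothetical
  <literal>model_merge_*</literal> helpers and array index order for element
  numbering:
  <programlisting><![CDATA[
```c
#include <stdint.h>

typedef struct { int32_t e[4]; } v4si;   /* model of vector signed int */

/* Shared helper: interleave elements i and j of a with the same of b. */
static v4si model_merge_pick(v4si a, v4si b, int i, int j)
{
    v4si r;
    r.e[0] = a.e[i];
    r.e[1] = b.e[i];
    r.e[2] = a.e[j];
    r.e[3] = b.e[j];
    return r;
}

/* High takes elements 0,1; low takes 2,3; even takes 0,2; odd takes 1,3. */
static v4si model_merge_high(v4si a, v4si b) { return model_merge_pick(a, b, 0, 1); }
static v4si model_merge_low (v4si a, v4si b) { return model_merge_pick(a, b, 2, 3); }
static v4si model_merge_even(v4si a, v4si b) { return model_merge_pick(a, b, 0, 2); }
static v4si model_merge_odd (v4si a, v4si b) { return model_merge_pick(a, b, 1, 3); }
```
]]></programlisting></para>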

  <para>Integer multiply has the potential to generate twice as many bits in
  the product as in each input. A multiply of 2 int (32-bit) values produces a long
  (64-bit) product. Normal C language * operations ignore this and discard the top
  32 bits of the result. However, in some computations it is useful to preserve the
  double-precision product for intermediate computation before reducing the final
  result back to the original precision.</para>

  <para>The PowerISA VMX instruction set took the latter approach, i.e., keep all
  the product bits until the programmer explicitly asks for the truncated result
  (via the pack operation).
  So the vector integer multiplies are split into even/odd forms across signed and
  unsigned byte, halfword and word inputs. This requires two instructions (given
  the same inputs) to generate the full vector multiply across 2 vector
  registers and 256 bits. Again, as POWER processors are superscalar, this pair of
  instructions should execute in parallel.</para>

  <para>The set of expanded product values can either be used directly in
  further (doubled precision) computation or merged/packed into a single
  vector at the smaller bit size. The latter is what the compiler will generate for a C
  vector extension multiply of vector integer types.</para>
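
  <para>A portable scalar sketch of the even/odd multiply pair and of the
  recombination into a truncated result follows. The
  <literal>model_mul_*</literal> names are hypothetical, and "even" and "odd"
  refer to array indexes in this model; which register lanes the hardware
  instructions treat as even depends on endianness.
  <programlisting><![CDATA[
```c
#include <stdint.h>

typedef struct { int32_t  e[4]; } v4si;   /* model of vector signed int   */
typedef struct { int64_t  e[2]; } v2di;   /* model of vector signed long  */
typedef struct { uint32_t e[4]; } v4su;   /* model of vector unsigned int */

/* Multiply even: full 64-bit products of elements 0 and 2. */
static v2di model_mul_even(v4si a, v4si b)
{
    v2di r;
    r.e[0] = (int64_t)a.e[0] * b.e[0];
    r.e[1] = (int64_t)a.e[2] * b.e[2];
    return r;
}

/* Multiply odd: full 64-bit products of elements 1 and 3. */
static v2di model_mul_odd(v4si a, v4si b)
{
    v2di r;
    r.e[0] = (int64_t)a.e[1] * b.e[1];
    r.e[1] = (int64_t)a.e[3] * b.e[3];
    return r;
}

/* Recombine: interleave the low 32 bits of the even and odd products
   back into element order, giving the truncated (modulo) product bits. */
static v4su model_mul_modulo(v4si a, v4si b)
{
    v2di ev = model_mul_even(a, b);
    v2di od = model_mul_odd(a, b);
    v4su r;
    r.e[0] = (uint32_t)(uint64_t)ev.e[0];
    r.e[1] = (uint32_t)(uint64_t)od.e[0];
    r.e[2] = (uint32_t)(uint64_t)ev.e[1];
    r.e[3] = (uint32_t)(uint64_t)od.e[1];
    return r;
}
```
]]></programlisting>
  With inputs such as 100000 * 100000, the even product keeps the full value
  10000000000, while the recombined 32-bit element holds only its low 32
  bits.</para>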

</section>