Programming-Guides/Porting_Vector_Intrinsics/sec_gcc_vector_extensions.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_gcc_vector_extensions">
  <title>GCC Vector Extensions</title>

  <para>The GCC vector extensions are common syntax but implemented in a
  target specific way. Using the C vector extensions requires the
  <literal>__gnu_inline__</literal>
  attribute to avoid syntax errors in case the user specified  C standard
  compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>,
  etc) that would normally disallow such
  extensions. </para>

  <para>The GCC implementation for PowerPC64 Little Endian is (mostly)
  functionally compatible with x86_64 vector extension usage. We can use the same
  type definitions (at least for  vector_size (16)), operations, syntax
  <literal>&lt;</literal><emphasis role="bold"><literal>{</literal></emphasis><literal>...</literal><emphasis role="bold"><literal>}</literal></emphasis><literal>&gt;</literal>
  for vector initializers and constants, and array syntax
  <literal>&lt;</literal><emphasis role="bold"><literal>[]</literal></emphasis><literal>&gt;</literal>
  for vector element access. So simple arithmetic / logical operations
  on whole vectors should work as is. </para>

  <para>The caveat is that the interface data type of the Intel Intrinsic may
  not match the data types of the operation, so it may be necessary to cast the
  operands to the specific type for the operation. This also applies to vector
  initializers and accessing vector elements. You need to use the appropriate
  type to get the expected results. Of course this applies to X86_64 as well. For
  example:
  <programlisting><![CDATA[/* Perform the respective operation on the four SPFP values in A and B.  */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) ((__v4sf)__A + (__v4sf)__B);
}

/* Stores the lower SPFP value.  */
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_store_ss (float *__P, __m128 __A)
{
  *__P = ((__v4sf)__A)[0];
}]]></programlisting></para>

  <para>Note the cast from the interface type (<literal>__m128</literal>} to the implementation
  type (<literal>__v4sf</literal>, defined in the intrinsic header) for the vector float add (+)
  operation. This is enough for the compiler to select the appropriate vector add
  instruction for the float type. Then the result (which is
  <literal>__v4sf</literal>) needs to be
  cast back to the expected interface type (<literal>__m128</literal>). </para>

  <para>Note also the use of <emphasis>array syntax</emphasis> (<literal>__A)[0]</literal>)
   to extract the lowest
  (left most<footnote><para>Here we are using logical left and logical right
  which will not match the PowerISA register view in Little endian. Logical left
  is the left most element for initializers {left, … , right}, storage order
  and array  order where the left most element is [0].</para></footnote>)
  element of a vector. The cast (<literal>__v4sf</literal>) insures that the compiler knows we are
  extracting the left most 32-bit float. The compiler insures the code generated
  matches the Intel behavior for PowerPC64 Little Endian. </para>

  <para>The code generation is complicated by the fact that PowerISA vector
  registers are Big Endian (element 0 is the left most word of the vector) and
  scalar loads / stores are also to / from the right most word / dword.
  X86 scalar loads / stores are to / from the right most element for the
  XMM vector register.
  The PowerPC64 ELF V2 ABI mimics the X86 Little Endian behavior by placing
  logical element [0] in the right most element of the vector register. </para>

  <para>This may require the compiler to generate additional instructions
  to place the scalar value in the expected position.
  Application code with extensive use of scalar (vs packed) intrinsic loads /
  stores should be flagged for rewrite to C code using existing scalar
  types (float, double, int, long, etc.). The compiler may be able the
  vectorize this scalar code using the native vector SIMD instruction set.</para>

  <para>Another example is the set reverse order:
  <programlisting><![CDATA[/* Create the vector [Z Y X W].  */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set_ps (const float __Z, const float __Y, const float __X, const float __W)
{
  return __extension__ (__m128)(__v4sf){ __W, __X, __Y, __Z };
}

/* Create the vector [W X Y Z].  */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_setr_ps (float __Z, float __Y, float __X, float __W)
{
  return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
}]]></programlisting></para>

  <para>Note the use of <emphasis>initializer syntax</emphasis> used to collect a set of scalars
  into a vector. Code with constant initializer values will generate a vector
  constant of the appropriate endian. However code with variables in the
  initializer can get complicated as it often requires transfers between register
  sets and perhaps format conversions. We can assume that the compiler will
  generate the correct code, but if this class of intrinsics shows up as a hot spot,
  a rewrite to native PPC vector built-ins may be appropriate. For example
  initializer of a variable replicated to all the vector fields might not be
  recognized as a “load and splat” and making this explicit may help the
  compiler generate better code.</para>

</section>