Programming-Guides/Porting_Vector_Intrinsics/sec_handling_avx.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_handling_avx">
  <title>Dealing with AVX and AVX512</title>

  <para>AVX is somewhat easier to handle on PowerISA with the ELF V2 ABI. First, we have
  plenty (64) of vector registers and a superscalar vector pipeline that can execute
  two or more independent 128-bit vector operations concurrently. Second, the ELF
  V2 ABI was designed to pass and return larger aggregates in vector
  registers:</para>

  <itemizedlist>
    <listitem>
      <para>Up to 12 qualified vector arguments can be passed in
      v2–v13.</para>
    </listitem>
    <listitem>
      <para>A qualified vector argument corresponds to:
        <itemizedlist spacing="compact">
          <listitem>
            <para>A vector data type</para>
          </listitem>

          <listitem>
            <para>A member of a homogeneous aggregate of multiple like data types
            passed in up to eight vector registers.</para>
          </listitem>
        </itemizedlist>
      </para>
    </listitem>
    <listitem>
      <para>Homogeneous floating-point or vector aggregate return values
      that consist of up to eight registers with up to eight elements will
      be returned in floating-point or vector registers that correspond to
      the parameter registers that would be used if the return value type
      were the first input parameter to a function.</para>
    </listitem>
  </itemizedlist>

  <para>So the ABI allows passing up to three structures, each
  representing a 512-bit vector, and returning such a (512-bit) structure, all in VMX
  registers. This can be extended further by spilling parameters (beyond 12 x
  128-bit vectors) to the parameter save area, but we should not need that, as
  most intrinsics use only 2 or 3 operands. Vector registers not needed for
  parameter passing, along with an additional 8 volatile vector registers, are
  available for scratch and local variables. All can be used by the application
  without requiring register spill to the save area. So the operands of most
  intrinsic operations on 256- or 512-bit vectors can be held within existing
  PowerISA vector registers.</para>
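  <para>The three-operand case can be sketched as follows. This is only an
  illustration under stated assumptions: the names
  <literal>v4sf_sketch</literal>, <literal>m512_sketch</literal>,
  <literal>splat512_sketch</literal> and <literal>madd512_sketch</literal> are
  hypothetical, and the GCC/Clang generic <literal>vector_size</literal>
  extension stands in for <literal>__vector float</literal> so that the code
  also builds on non-POWER hosts. Where the values actually travel depends on
  the target ABI; the register comments describe the ELF V2 case.</para>
  <programlisting><![CDATA[
```c
/* Portable stand-in for "__vector float": the GCC/Clang generic vector
   extension (an assumption, so the sketch also builds on non-POWER hosts).  */
typedef float v4sf_sketch __attribute__ ((vector_size (16)));

/* Hypothetical 512-bit type: a homogeneous aggregate of four 128-bit
   vectors, so one value needs four vector registers when passed.  */
typedef struct
{
  v4sf_sketch vf0;
  v4sf_sketch vf1;
  v4sf_sketch vf2;
  v4sf_sketch vf3;
} m512_sketch;

/* Fill all 16 float lanes with the same value.  */
m512_sketch
splat512_sketch (float x)
{
  v4sf_sketch v = { x, x, x, x };
  m512_sketch r;
  r.vf0 = v;
  r.vf1 = v;
  r.vf2 = v;
  r.vf3 = v;
  return r;
}

/* Three 512-bit operands are 3 x 4 = 12 vectors, the most that the
   ELF V2 ABI passes in v2-v13.  The four-vector result is a homogeneous
   aggregate as well, so on such a target it is returned in registers.  */
m512_sketch
madd512_sketch (m512_sketch a, m512_sketch b, m512_sketch c)
{
  m512_sketch r;
  r.vf0 = a.vf0 * b.vf0 + c.vf0;
  r.vf1 = a.vf1 * b.vf1 + c.vf1;
  r.vf2 = a.vf2 * b.vf2 + c.vf2;
  r.vf3 = a.vf3 * b.vf3 + c.vf3;
  return r;
}
```
]]></programlisting>
  <para>A fourth 512-bit operand would exceed the 12 vector parameter
  registers and would have to go through the parameter save area.</para>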

  <para>For larger functions that use several 256- or 512-bit AVX
  intrinsics and, as a result, need more than the 20 volatile vector registers, the
  compiler simply allocates a stack frame and saves the non-volatile vector
  registers it needs to the save area in the function prologue.
  This frees up to 64 vectors (32 x 256-bit or 16 x
  512-bit structs) for code optimization.</para>

  <para>Based on the specifics of our ISA and ABI, we will not use
  <literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation of the
  <literal>__m256</literal> and <literal>__m512</literal>
  types. Instead, we will typedef structs of 2 or 4 vector (<literal>__vector</literal>) fields. This
  allows efficient handling of these larger data types without requiring new GCC
  language extensions or vector builtins. For example:
<programlisting><![CDATA[/* Internal data types for implementing the AVX in PowerISA intrinsics.  */
typedef struct __v4df
{
  __vector double vd0;
  __vector double vd1;
} __v4df;

/* The Intel API is flexible enough that we must allow aliasing with other
   vector types, and their scalar components.  */
typedef struct __m256d
{
  __vector double vd0;
  __vector double vd1;
}__attribute__ ((__may_alias__)) __m256d;]]></programlisting></para>

  <para>Operations on these types therefore need a different syntax,
  in which the 128-bit vector chunks are referenced explicitly.
  For example:
  <programlisting><![CDATA[extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_add_pd (__m256d __A, __m256d __B)
{
  __m256d temp;
  temp.vd0 = __A.vd0 + __B.vd0;
  temp.vd1 = __A.vd1 + __B.vd1;
  return (temp);
}]]></programlisting></para>
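  <para>The same chunk-by-chunk style carries over to loads and stores. The
  sketch below is an assumption-laden illustration rather than the header
  implementation: <literal>v2df_sketch</literal>,
  <literal>m256d_sketch</literal>, <literal>loadu256d_sketch</literal>,
  <literal>storeu256d_sketch</literal> and <literal>add256d_sketch</literal>
  are hypothetical names, and the generic <literal>vector_size</literal>
  extension replaces <literal>__vector double</literal> so that the code
  compiles on any GCC or Clang target.</para>
  <programlisting><![CDATA[
```c
#include <string.h>

/* Portable stand-in for "__vector double" (assumption: GCC/Clang generic
   vector extension).  */
typedef double v2df_sketch __attribute__ ((vector_size (16)));

/* Hypothetical 256-bit type built from two 128-bit chunks.  */
typedef struct
{
  v2df_sketch vd0;   /* first two doubles in memory order */
  v2df_sketch vd1;   /* last two doubles in memory order */
} m256d_sketch;

/* Unaligned load: copy each 128-bit chunk separately.  */
static inline m256d_sketch
loadu256d_sketch (const double *p)
{
  m256d_sketch r;
  memcpy (&r.vd0, p, sizeof r.vd0);
  memcpy (&r.vd1, p + 2, sizeof r.vd1);
  return r;
}

/* Unaligned store: write each 128-bit chunk separately.  */
static inline void
storeu256d_sketch (double *p, m256d_sketch a)
{
  memcpy (p, &a.vd0, sizeof a.vd0);
  memcpy (p + 2, &a.vd1, sizeof a.vd1);
}

/* Element-wise add, one overloaded vector "+" per chunk.  */
static inline m256d_sketch
add256d_sketch (m256d_sketch a, m256d_sketch b)
{
  m256d_sketch r;
  r.vd0 = a.vd0 + b.vd0;
  r.vd1 = a.vd1 + b.vd1;
  return r;
}
```
]]></programlisting>
  <para>Each helper touches the two chunks independently, which leaves the
  compiler free to schedule the two 128-bit operations side by side.</para>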

  <para>But this creates a new issue, because
  the C language does not allow direct casts between struct types.
  This matters where the intrinsic interface type is not the correct type for the operation,
  as with the AVX2 integer operations:
  <programlisting><![CDATA[
/* The Intel API is flexible enough that we must allow aliasing with other
   vector types, and their scalar components.  */
typedef struct __m256i
{
  __vector long long vdi0;
  __vector long long vdi1;
} __attribute__ ((__may_alias__)) __m256i;

/* Internal data types for implementing the AVX in PowerISA intrinsics.  */
typedef struct __v16hi
{
  __vector short vhi0;
  __vector short vhi1;
} __v16hi;
]]></programlisting></para>

  <para>For the AVX2 intrinsic <literal>_mm256_add_epi16</literal>,
  we need to convert the input vectors
  of 64-bit long long (<literal>__m256i</literal>) into vectors of 16-bit short
  (<literal>__v16hi</literal>) before applying the overloaded add operations.
  Because the structs cannot be cast directly, we cast a pointer to the struct and dereference it.
  For example:
  <programlisting><![CDATA[
extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_add_epi16 (__m256i __A, __m256i __B)
{
  __m256i result;
  __v16hi a = *((__v16hi *)&__A);
  __v16hi b = *((__v16hi *)&__B);
  __v16hi c;

  c.vhi0 = a.vhi0 + b.vhi0;
  c.vhi1 = a.vhi1 + b.vhi1;

  result = *((__m256i *)&c);
  return (result);
}]]></programlisting></para>
  <para>Because this and related examples are inlined,
  we expect the compiler to recognize that this
  is a "nop cast" and to avoid generating any additional instructions.</para>
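  <para>One possible alternative, offered here as a sketch and not as the
  approach taken above, is to cast each 128-bit field individually. GCC and
  Clang accept a C-style cast between vector types of the same size and treat
  it as a reinterpretation of the bits, so no struct pointer cast is needed.
  The names <literal>v2di_sketch</literal>, <literal>v8hu_sketch</literal>,
  <literal>m256i_sketch</literal> and <literal>add_epi16_sketch</literal> are
  hypothetical, the generic <literal>vector_size</literal> extension stands in
  for <literal>__vector</literal>, and unsigned 16-bit lanes are an assumption
  made so that lane wraparound is well defined.</para>
  <programlisting><![CDATA[
```c
/* Portable stand-ins for "__vector long long" and "__vector unsigned short"
   (assumption: GCC/Clang generic vector extension).  */
typedef long long v2di_sketch __attribute__ ((vector_size (16)));
typedef unsigned short v8hu_sketch __attribute__ ((vector_size (16)));

/* Hypothetical interface type: two 128-bit chunks of 64-bit integers.  */
typedef struct
{
  v2di_sketch vdi0;
  v2di_sketch vdi1;
} m256i_sketch;

/* 16-bit lane add.  Each field is reinterpreted as eight unsigned shorts,
   added lane by lane, then reinterpreted back to the interface type.  */
static inline m256i_sketch
add_epi16_sketch (m256i_sketch a, m256i_sketch b)
{
  m256i_sketch r;
  r.vdi0 = (v2di_sketch) ((v8hu_sketch) a.vdi0 + (v8hu_sketch) b.vdi0);
  r.vdi1 = (v2di_sketch) ((v8hu_sketch) a.vdi1 + (v8hu_sketch) b.vdi1);
  return r;
}
```
]]></programlisting>
  <para>Adding 1 to a 16-bit lane holding 0xFFFF wraps that lane to 0 without
  carrying into its neighbour, which distinguishes this from a 64-bit add on
  the same data.</para>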

  <para>In the end we should try
  to use the same type names and definitions as the
  GCC X86 intrinsic headers where possible. Where that is not possible we can
  define new typedefs that provide the best mapping to the underlying PowerISA
  hardware.</para>

</section>