You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
164 lines
6.4 KiB
164 lines
6.4 KiB
<?xml version="1.0" encoding="UTF-8"?> |
|
<!-- |
|
Copyright (c) 2017 OpenPOWER Foundation |
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
you may not use this file except in compliance with the License. |
|
You may obtain a copy of the License at |
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
See the License for the specific language governing permissions and |
|
limitations under the License. |
|
|
|
--> |
|
<section xmlns="http://docbook.org/ns/docbook" |
|
xmlns:xi="http://www.w3.org/2001/XInclude" |
|
xmlns:xlink="http://www.w3.org/1999/xlink" |
|
version="5.0" |
|
xml:id="sec_handling_avx"> |
|
<title>Dealing with AVX and AVX512</title> |
|
|
|
<para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have |
|
lots (64) of vector registers and a superscalar vector pipeline (can execute |
|
two or more independent 128-bit vector operations concurrently). Second the ELF |
|
V2 ABI was designed to pass and return larger aggregates in vector |
|
registers:</para> |
|
|
|
<itemizedlist> |
|
<listitem> |
|
<para>Up to 12 qualified vector arguments can be passed in |
|
v2–v13.</para> |
|
</listitem> |
|
<listitem> |
|
<para>A qualified vector argument corresponds to: |
|
<itemizedlist spacing="compact"> |
|
<listitem> |
|
<para>A vector data type</para> |
|
</listitem> |
|
|
|
<listitem> |
|
<para>A member of a homogeneous aggregate of multiple like data types |
|
passed in up to eight vector registers.</para> |
|
</listitem> |
|
|
|
<listitem> |
|
<para>Homogeneous floating-point or vector aggregate return values |
|
that consist of up to eight registers with up to eight elements will |
|
be returned in floating-point or vector registers that correspond to |
|
the parameter registers that would be used if the return value type |
|
were the first input parameter to a function.</para> |
|
</listitem> |
|
</itemizedlist> |
|
</para> |
|
</listitem> |
|
</itemizedlist> |
|
|
|
<para>So the ABI allows for passing up to three structures each |
|
representing 512-bit vectors and returning such (512-bit) structures all in VMX |
|
registers. This can be extended further by spilling parameters (beyond 12 X |
|
128-bit vectors) to the parameter save area, but we should not need that, as |
|
most intrinsics only use 2 or 3 operands.. Vector registers not needed for |
|
parameter passing, along with an additional 8 volatile vector registers, are |
|
available for scratch and local variables. All can be used by the application |
|
without requiring register spill to the save area. So most intrinsic operations |
|
on 256- or 512-bit vectors can be held within existing PowerISA vector |
|
registers. </para> |
|
|
|
<para>For larger functions that might use multiple AVX 256 or 512-bit |
|
intrinsics and, as a result, push beyond the 20 volatile vector registers, the |
|
compiler will just allocate non-volatile vector registers by allocating a stack |
|
frame and spilling non-volatile vector registers to the save area (as needed in |
|
the function prologue). This frees up to 64 vectors (32 x 256-bit or 16 x |
|
512-bit structs) for code optimization. </para> |
|
|
|
<para>Based on the specifics of our ISA and ABI we will not use |
|
<literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation of |
|
<literal>__m256</literal> and <literal>__m512</literal> |
|
types. Instead we will typedef structs of 2 or 4 vector (<literal>__vector</literal>) fields. This |
|
allows efficient handling of these larger data types without requiring new GCC |
|
language extensions or vector builtins. For example: |
|
<programlisting><![CDATA[/* Internal data types for implementing the AVX in PowerISA intrinsics. */ |
|
typedef struct __v4df |
|
{ |
|
__vector double vd0; |
|
__vector double vd1; |
|
} __vx4df; |
|
|
|
/* The Intel API is flexible enough that we must allow aliasing with other |
|
vector types, and their scalar components. */ |
|
typedef struct __m256d |
|
{ |
|
__vector double vd0; |
|
__vector double vd1; |
|
}__attribute__ ((__may_alias__)) __m256d;]]></programlisting></para> |
|
|
|
<para>This requires a different syntax for operations |
|
where the 128-bit vector chunks are explicitly referenced. |
|
For example: |
|
<programlisting><![CDATA[extern __inline __mx256d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) |
|
_mm256_add_pd (__m256d __A, __m256d __B) |
|
{ |
|
__m256d temp; |
|
temp.vd0 = __A.vd0 + __B.vd0; |
|
temp.vd1 = __A.vd1 + __B.vd1; |
|
return (temp); |
|
}]]></programlisting></para> |
|
|
|
<para>But this creates a new issue because |
|
the C language does not allow direct casts between structs. |
|
This can be an issue where the intrinsic interface type is not the correct type for the operation. |
|
For example AVX2 integer operations: |
|
<programlisting><![CDATA[ |
|
/* The Intel API is flexible enough that we must allow aliasing with other |
|
vector types, and their scalar components. */ |
|
typedef struct __m256i |
|
{ |
|
__vector long long vdi0; |
|
__vector long long vdi1; |
|
} __m256i; |
|
|
|
/* Internal data types for implementing the AVX in PowerISA intrinsics. */ |
|
typedef struct __v16hi |
|
{ |
|
__vector short vhi0; |
|
__vector short vhi1; |
|
} __v16hi; |
|
]]></programlisting></para> |
|
|
|
<para>For the AVX2 intrinsic <literal>_mm256_add_epi16</literal> |
|
we need to cast the input vectors |
|
of 64-bit long long (<literal>__m256i</literal>) into vectors of 16-bit short |
|
(<literal>__v16hi</literal>) before the overloaded add operations. |
|
Here we need to use a pointer reference cast. |
|
For example: |
|
<programlisting><![CDATA[ |
|
extern __inline __m256i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) |
|
_mx256_add_epi16 (__m256i __A, __m256i __B) |
|
{ |
|
__m256i result; |
|
__v16hi a = *((__v16hi *)&__A); |
|
__v16hi b = *((__v16hi *)&__B); |
|
__v16hi c; |
|
|
|
c.vhi0 = a.vhi0 + b.vhi0; |
|
c.vhi1 = a.vhi1 + b.vhi1; |
|
|
|
result = *((__m256i *)&c); |
|
return (result); |
|
}]]></programlisting></para> |
|
<para>As this and related examples are inlined, |
|
we expect the compiler to recognize this |
|
is a "nop cast" and avoid generating any additional instructions.</para> |
|
|
|
<para>In the end we should try |
|
to use the same type names and definitions as the |
|
GCC X86 intrinsic headers where possible. Where that is not possible we can |
|
define new typedefs that provide the best mapping to the underlying PowerISA |
|
hardware.</para> |
|
|
|
</section> |
|
|
|
|