|
|
<?xml version="1.0" encoding="UTF-8"?> |
|
|
<!-- |
|
|
Copyright (c) 2017 OpenPOWER Foundation |
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
|
you may not use this file except in compliance with the License. |
|
|
You may obtain a copy of the License at |
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
|
See the License for the specific language governing permissions and |
|
|
limitations under the License. |
|
|
|
|
|
--> |
|
|
<section xmlns="http://docbook.org/ns/docbook" |
|
|
xmlns:xi="http://www.w3.org/2001/XInclude" |
|
|
xmlns:xlink="http://www.w3.org/1999/xlink" |
|
|
version="5.0" |
|
|
xml:id="sec_intel_intrinsic_types"> |
|
|
<title>The types used for intrinsics</title> |
|
|
|
|
|
<para>The type system for Intel intrinsics is a little strange. For example,
from xmmintrin.h:
|
|
<programlisting><![CDATA[/* The Intel API is flexible enough that we must allow aliasing with other |
|
|
vector types, and their scalar components. */ |
|
|
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__)); |
|
|
|
|
|
/* Internal data types for implementing the intrinsics. */ |
|
|
typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting></para> |
|
|
|
|
|
<para>So there is one set of types used in the function prototypes
of the API, and a second set of internal types used in the implementation. Notice
the special attribute <literal>__may_alias__</literal> on the API type. From the GCC documentation:
|
|
|
|
|
<blockquote><para> |
|
|
Accesses through pointers to types with this attribute are not subject |
|
|
to type-based alias analysis, but are instead assumed to be able to alias any |
|
|
other type of objects. ... This extension exists to support some vector APIs, |
|
|
in which pointers to one vector type are permitted to alias pointers to a |
|
|
different vector type.</para></blockquote></para> |
|
|
|
|
|
<para>There are a couple of issues here: |
|
|
<itemizedlist spacing="compact"> |
|
|
<listitem> |
|
|
<para>The use of <literal>__may_alias__</literal> in the API seems to force the compiler to assume
that any parameter passed by reference may alias any other object.</para>
|
|
</listitem> |
|
|
<listitem> |
|
|
<para> |
|
|
The GCC vector builtin type system (example above) uses slightly different
syntax from the original Altivec <literal>__vector</literal> types. Internally the two typedef forms
may represent the same 128-bit vector type,
but they are handled differently during early source parsing and when
resolving overloaded vector builtins.</para>
|
|
</listitem> |
|
|
<listitem> |
|
|
<para>The data type used at the interface may not be |
|
|
the correct type for the implied operation.</para> |
|
|
</listitem> |
|
|
</itemizedlist> |
|
|
Normally the compiler assumes that objects accessed through pointers to
different types do not overlap in storage, which allows more optimization.
However, integer vectors of every element size
(char, short, int, long) are all passed and returned as type <literal>__m128i</literal>
(defined as vector long long).</para>
|
|
|
|
|
<para>This may not matter when using x86 built-ins, but it does matter when
the implementation uses C vector extensions or, in our case, PowerPC
overloaded vector built-ins
(<xref linkend="sec_powerisa_vector_intrinsics"/>).
In those cases the operand type must match the element type
(char, short, int, long) of the intended operation
(<xref linkend="sec_api_implemented"/>) so that the compiler selects the
correct overloaded built-in and generates the correct code.
There is also concern that excessive use of
<literal>__may_alias__</literal>
will limit compiler optimization. We are not sure how important this attribute
is to the correct operation of the API, so at a later stage we should
experiment with removing it from our implementation for PowerPC.</para>
|
|
|
|
|
<para>The good news is that PowerISA has good support for 128-bit vectors
and (with the addition of VSX) all the required vector data types (char, short, int,
long, float, double). However, Intel supports a wider variety of
vector sizes than PowerISA does. This started with the 64-bit MMX vector
support that preceded SSE and extends to the 256-bit and 512-bit vectors of AVX,
AVX2, and AVX512 that followed SSE.</para>
|
|
|
|
|
<para>Within the GCC Intel intrinsic implementation these are all
implemented as vector attribute extensions of the appropriate size
(<literal>__vector_size__</literal> of 8, 16, 32, or 64 bytes). For the PowerPC target, GCC currently
supports only <literal>__vector_size__ (16)</literal> natively. These types we can support directly
in VMX/VSX registers and associated instructions.</para>
|
|
|
|
|
<para>GCC will compile code with
other <literal>__vector_size__</literal> values, but the resulting types are treated as simple
arrays of the element type. This prevents the compiler from using the vector
registers for parameter passing and return values.
For example, consider this intrinsic from immintrin.h:
|
|
<programlisting><![CDATA[typedef double __m256d __attribute__ ((__vector_size__ (32), __may_alias__)); |
|
|
|
|
|
extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) |
|
|
_mm256_add_pd (__m256d __A, __m256d __B) |
|
|
{ |
|
|
return (__m256d) ((__v4df)__A + (__v4df)__B); |
|
|
} |
|
|
]]></programlisting></para> |
|
|
<para>And test case: |
|
|
<programlisting><![CDATA[__m256d |
|
|
test_mm256_add_pd (__m256d __A, __m256d __B) |
|
|
{ |
|
|
return (_mm256_add_pd (__A, __B)); |
|
|
} |
|
|
]]></programlisting></para> |
|
|
<para>Current GCC generates: |
|
|
<programlisting><![CDATA[0000000000000970 <test_mm256_add_pd>: |
|
|
970: 10 00 20 39 li r9,16 |
|
|
974: 98 26 80 7d lxvd2x vs12,0,r4 |
|
|
978: 98 2e 40 7d lxvd2x vs10,0,r5 |
|
|
97c: 20 00 e0 38 li r7,32 |
|
|
980: f8 ff e1 fb std r31,-8(r1) |
|
|
984: b1 ff 21 f8 stdu r1,-80(r1) |
|
|
988: 30 00 00 39 li r8,48 |
|
|
98c: 98 4e 04 7c lxvd2x vs0,r4,r9 |
|
|
990: 98 4e 65 7d lxvd2x vs11,r5,r9 |
|
|
994: 00 53 8c f1 xvadddp vs12,vs12,vs10 |
|
|
998: 00 00 c1 e8 ld r6,0(r1) |
|
|
99c: 78 0b 3f 7c mr r31,r1 |
|
|
9a0: 00 5b 00 f0 xvadddp vs0,vs0,vs11 |
|
|
9a4: c1 ff c1 f8 stdu r6,-64(r1) |
|
|
9a8: 98 3f 9f 7d stxvd2x vs12,r31,r7 |
|
|
9ac: 98 47 1f 7c stxvd2x vs0,r31,r8 |
|
|
9b0: 98 3e 9f 7d lxvd2x vs12,r31,r7 |
|
|
9b4: 98 46 1f 7c lxvd2x vs0,r31,r8 |
|
|
9b8: 50 00 3f 38 addi r1,r31,80 |
|
|
9bc: f8 ff e1 eb ld r31,-8(r1) |
|
|
9c0: 98 1f 80 7d stxvd2x vs12,0,r3 |
|
|
9c4: 98 4f 03 7c stxvd2x vs0,r3,r9 |
|
|
9c8: 20 00 80 4e blr]]></programlisting></para> |
|
|
|
|
|
<para>The compiler treats the parameters and return value |
|
|
as scalar arrays, which are passed by reference. |
|
|
The operation is vectorized in this case, |
|
|
but the 256-bit result is returned through storage.</para> |
|
|
|
|
|
<para>This is not what we want to see for a simple add of four doubles.
It would be better if we could pass and return
MMX (<xref linkend="sec_handling_mmx"/>) and AVX (<xref linkend="sec_handling_avx"/>)
values in PowerPC registers and avoid the storage references.
If we can get the parameters and return value into registers,
this example would reduce to:
|
|
<programlisting><![CDATA[0000000000000970 <test_mm256_add_pd>:
|
|
970: xvadddp vs34,vs34,vs36 |
|
|
974: xvadddp vs35,vs35,vs37 |
|
|
978: blr]]></programlisting></para> |
|
|
|
|
|
<para>So the PowerISA VMX/VSX facilities and GCC compiler support for
128-bit/16-byte vectors and associated vector built-ins
are well matched to implementing equivalent x86 SSE intrinsic functions.
However, implementing the older MMX (64-bit) and the later
AVX (256-bit and 512-bit) extensions requires more thought and some
ingenuity.</para>
|
|
|
|
|
<xi:include href="sec_handling_mmx.xml"/> |
|
|
<xi:include href="sec_handling_avx.xml"/> |
|
|
|
|
|
</section> |
|
|
|
|
|
|