|
|
|
@@ -45,9 +45,17 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
|
|
|
|
|
<para>There are a couple of issues here: |
|
|
|
|
<itemizedlist spacing="compact"> |
|
|
|
|
<listitem> |
|
|
|
|
<para>The API seems to force the compiler to assume |
|
|
|
|
<para>The use of __may_alias__ in the API seems to force the compiler to assume |
|
|
|
|
aliasing of any parameter passed by reference.</para> |
|
|
|
|
</listitem> |
|
|
|
|
<listitem> |
|
|
|
|
<para> |
|
|
|
|
The GCC vector builtin type system (example above) uses a slightly different |
|
|
|
|
syntax from the original Altivec __vector types. Internally the two typedef forms |
|
|
|
|
may represent the same 128-bit vector type, |
|
|
|
|
but for early source parsing and overloaded vector builtins they are |
|
|
|
|
handled differently.</para> |
|
|
|
|
</listitem> |
|
|
|
|
<listitem> |
|
|
|
|
<para>The data type used at the interface may not be |
|
|
|
|
the correct type for the implied operation.</para> |
|
|
|
@@ -60,13 +68,16 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
|
|
|
|
|
(defined as vector long long). </para> |
|
|
|
|
|
|
|
|
|
<para>This may not matter when using x86 built-ins but does matter when |
|
|
|
|
the implementation uses C vector extensions or in our case uses PowerPC generic |
|
|
|
|
the implementation uses C vector extensions or, in our case, uses PowerPC |
|
|
|
|
overloaded |
|
|
|
|
vector built-ins |
|
|
|
|
(<xref linkend="sec_powerisa_vector_intrinsics"/>). |
|
|
|
|
For the latter cases the type must be correct for |
|
|
|
|
the compiler to generate the correct type (char, short, int, long) |
|
|
|
|
(<xref linkend="sec_api_implemented"/>) for the generic |
|
|
|
|
builtin operation. There is also concern that excessive use of |
|
|
|
|
the compiler to generate the correct code for the |
|
|
|
|
type (char, short, int, long) |
|
|
|
|
(<xref linkend="sec_api_implemented"/>) for |
|
|
|
|
overloaded builtin operations. |
|
|
|
|
There is also concern that excessive use of |
|
|
|
|
<literal>__may_alias__</literal> |
|
|
|
|
will limit compiler optimization. We are not sure how important this attribute |
|
|
|
|
is to the correct operation of the API. So at a later stage we should |
|
|
|
@@ -83,16 +94,76 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
|
|
|
|
|
implemented as vector attribute extensions of the appropriate size ( |
|
|
|
|
<literal>__vector_size__</literal> ({8 | 16 | 32 | 64})). For the PowerPC target, GCC currently |
|
|
|
|
only supports the native <literal>__vector_size__</literal> (16). These we can support directly |
|
|
|
|
in VMX/VSX registers and associated instructions. GCC will compile code with |
|
|
|
|
in VMX/VSX registers and associated instructions.</para> |
|
|
|
|
|
|
|
|
|
<para>GCC will compile code with |
|
|
|
|
other <literal>__vector_size__</literal> values, but the resulting types are treated as simple |
|
|
|
|
arrays of the element type. This does not allow the compiler to use the vector |
|
|
|
|
registers and vector instructions for these (nonnative) vectors.</para> |
|
|
|
|
registers for parameter passing and return values. |
|
|
|
|
For example, consider this intrinsic from immintrin.h: |
|
|
|
|
<programlisting><![CDATA[typedef double __m256d __attribute__ ((__vector_size__ (32), __may_alias__)); |
|
|
|
|
|
|
|
|
|
extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) |
|
|
|
|
_mm256_add_pd (__m256d __A, __m256d __B) |
|
|
|
|
{ |
|
|
|
|
return (__m256d) ((__v4df)__A + (__v4df)__B); |
|
|
|
|
} |
|
|
|
|
]]></programlisting></para> |
|
|
|
|
<para>And this test case: |
|
|
|
|
<programlisting><![CDATA[__m256d |
|
|
|
|
test_mm256_add_pd (__m256d __A, __m256d __B) |
|
|
|
|
{ |
|
|
|
|
return (_mm256_add_pd (__A, __B)); |
|
|
|
|
} |
|
|
|
|
]]></programlisting></para> |
|
|
|
|
<para>Current GCC generates: |
|
|
|
|
<programlisting><![CDATA[0000000000000970 <test_mm256_add_pd>: |
|
|
|
|
970: 10 00 20 39 li r9,16 |
|
|
|
|
974: 98 26 80 7d lxvd2x vs12,0,r4 |
|
|
|
|
978: 98 2e 40 7d lxvd2x vs10,0,r5 |
|
|
|
|
97c: 20 00 e0 38 li r7,32 |
|
|
|
|
980: f8 ff e1 fb std r31,-8(r1) |
|
|
|
|
984: b1 ff 21 f8 stdu r1,-80(r1) |
|
|
|
|
988: 30 00 00 39 li r8,48 |
|
|
|
|
98c: 98 4e 04 7c lxvd2x vs0,r4,r9 |
|
|
|
|
990: 98 4e 65 7d lxvd2x vs11,r5,r9 |
|
|
|
|
994: 00 53 8c f1 xvadddp vs12,vs12,vs10 |
|
|
|
|
998: 00 00 c1 e8 ld r6,0(r1) |
|
|
|
|
99c: 78 0b 3f 7c mr r31,r1 |
|
|
|
|
9a0: 00 5b 00 f0 xvadddp vs0,vs0,vs11 |
|
|
|
|
9a4: c1 ff c1 f8 stdu r6,-64(r1) |
|
|
|
|
9a8: 98 3f 9f 7d stxvd2x vs12,r31,r7 |
|
|
|
|
9ac: 98 47 1f 7c stxvd2x vs0,r31,r8 |
|
|
|
|
9b0: 98 3e 9f 7d lxvd2x vs12,r31,r7 |
|
|
|
|
9b4: 98 46 1f 7c lxvd2x vs0,r31,r8 |
|
|
|
|
9b8: 50 00 3f 38 addi r1,r31,80 |
|
|
|
|
9bc: f8 ff e1 eb ld r31,-8(r1) |
|
|
|
|
9c0: 98 1f 80 7d stxvd2x vs12,0,r3 |
|
|
|
|
9c4: 98 4f 03 7c stxvd2x vs0,r3,r9 |
|
|
|
|
9c8: 20 00 80 4e blr]]></programlisting></para> |
|
|
|
|
|
|
|
|
|
<para>The compiler treats the parameters and return value |
|
|
|
|
as scalar arrays, which are passed by reference. |
|
|
|
|
The operation is vectorized in this case, |
|
|
|
|
but the 256-bit result is returned through storage.</para> |
|
|
|
|
|
|
|
|
|
<para>This is not what we want to see for a simple 4 x double add. |
|
|
|
|
It would be better if we could pass and return |
|
|
|
|
MMX (<xref linkend="sec_handling_mmx"/>) and AVX (<xref linkend="sec_handling_avx"/>) |
|
|
|
|
values as PowerPC registers and avoid the storage references. |
|
|
|
|
If we can get the parameter and return values as registers, |
|
|
|
|
this example will reduce to: |
|
|
|
|
<programlisting><![CDATA[0000000000000970 <test_mm256_add_pd>: |
|
|
|
|
970: xvadddp vs34,vs34,vs36 |
|
|
|
|
974: xvadddp vs35,vs35,vs37 |
|
|
|
|
978: blr]]></programlisting></para> |
|
|
|
|
|
|
|
|
|
<para>So the PowerISA VMX/VSX facilities and GCC compiler support for |
|
|
|
|
128-bit/16-byte vectors and associated vector built-ins |
|
|
|
|
are well matched to implementing equivalent x86 SSE intrinsic functions. |
|
|
|
|
However, implementing the older MMX (64-bit) and the latest |
|
|
|
|
AVX (256 / 512-bit) extensions requires more thought and some ingenuity.</para> |
|
|
|
|
AVX (256 / 512-bit) extensions requires more thought and some |
|
|
|
|
ingenuity.</para> |
|
|
|
|
|
|
|
|
|
<xi:include href="sec_handling_mmx.xml"/> |
|
|
|
|
<xi:include href="sec_handling_avx.xml"/> |
|
|
|
|