|
|
|
@ -45,9 +45,17 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
|
|
|
|
|
<para>There are a couple of issues here:
|
|
|
|
|
<itemizedlist spacing="compact">
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>The API seems to force the compiler to assume
|
|
|
|
|
<para>The use of __may_alias__ in the API seems to force the compiler to assume
|
|
|
|
|
aliasing of any parameter passed by reference.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>
|
|
|
|
|
The GCC vector builtin type system (example above) is slightly different
|
|
|
|
|
syntax from the original Altivec __vector types. Internally the two typedef forms
|
|
|
|
|
may represent the same 128-bit vector type,
|
|
|
|
|
but for early source parsing and overloaded vector builtins they are
|
|
|
|
|
handled differently.</para>
|
|
|
|
|
</listitem>
|
|
|
|
|
<listitem>
|
|
|
|
|
<para>The data type used at the interface may not be
|
|
|
|
|
the correct type for the implied operation.</para>
|
|
|
|
@ -60,13 +68,16 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
|
|
|
|
|
(defined as vector long long). </para>
|
|
|
|
|
|
|
|
|
|
<para>This may not matter when using x86 built-ins but does matter when
|
|
|
|
|
the implementation uses C vector extensions or in our case uses PowerPC generic
|
|
|
|
|
the implementation uses C vector extensions or in our case using PowerPC
|
|
|
|
|
overloaded
|
|
|
|
|
vector built-ins
|
|
|
|
|
(<xref linkend="sec_powerisa_vector_intrinsics"/>).
|
|
|
|
|
For the latter cases the type must be correct for
|
|
|
|
|
the compiler to generate the correct type (char, short, int, long)
|
|
|
|
|
(<xref linkend="sec_api_implemented"/>) for the generic
|
|
|
|
|
builtin operation. There is also concern that excessive use of
|
|
|
|
|
the compiler to generate the correct code for the
|
|
|
|
|
type (char, short, int, long)
|
|
|
|
|
(<xref linkend="sec_api_implemented"/>) for
|
|
|
|
|
overloaded builtin operations.
|
|
|
|
|
There is also concern that excessive use of
|
|
|
|
|
<literal>__may_alias__</literal>
|
|
|
|
|
will limit compiler optimization. We are not sure how important this attribute
|
|
|
|
|
is to the correct operation of the API. So at a later stage we should
|
|
|
|
@ -83,16 +94,76 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
|
|
|
|
|
implemented as vector attribute extensions of the appropriate size (
|
|
|
|
|
<literal>__vector_size__</literal> ({8 | 16 | 32, and 64}). For the PowerPC target GCC currently
|
|
|
|
|
only supports the native <literal>__vector_size__</literal> ( 16 ). These we can support directly
|
|
|
|
|
in VMX/VSX registers and associated instructions. GCC will compile code with
|
|
|
|
|
in VMX/VSX registers and associated instructions.</para>
|
|
|
|
|
|
|
|
|
|
<para>GCC will compile code with
|
|
|
|
|
other <literal>__vector_size__</literal> values, but the resulting types are treated as simple
|
|
|
|
|
arrays of the element type. This does not allow the compiler to use the vector
|
|
|
|
|
registers and vector instructions for these (nonnative) vectors.</para>
|
|
|
|
|
registers for parameter passing and return values.
|
|
|
|
|
For example this intrinsic from immintrin.h:
|
|
|
|
|
<programlisting><![CDATA[typedef double __m256d __attribute__ ((__vector_size__ (32), __may_alias__));
|
|
|
|
|
|
|
|
|
|
extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
|
|
|
|
|
_mm256_add_pd (__m256d __A, __m256d __B)
|
|
|
|
|
{
|
|
|
|
|
return (__m256d) ((__v4df)__A + (__v4df)__B);
|
|
|
|
|
}
|
|
|
|
|
]]></programlisting></para>
|
|
|
|
|
<para>And test case:
|
|
|
|
|
<programlisting><![CDATA[__m256d
|
|
|
|
|
test_mm256_add_pd (__m256d __A, __m256d __B)
|
|
|
|
|
{
|
|
|
|
|
return (_mm256_add_pd (__A, __B));
|
|
|
|
|
}
|
|
|
|
|
]]></programlisting></para>
|
|
|
|
|
<para>Current GCC generates:
|
|
|
|
|
<programlisting><![CDATA[0000000000000970 <test_mm256_add_pd>:
|
|
|
|
|
970: 10 00 20 39 li r9,16
|
|
|
|
|
974: 98 26 80 7d lxvd2x vs12,0,r4
|
|
|
|
|
978: 98 2e 40 7d lxvd2x vs10,0,r5
|
|
|
|
|
97c: 20 00 e0 38 li r7,32
|
|
|
|
|
980: f8 ff e1 fb std r31,-8(r1)
|
|
|
|
|
984: b1 ff 21 f8 stdu r1,-80(r1)
|
|
|
|
|
988: 30 00 00 39 li r8,48
|
|
|
|
|
98c: 98 4e 04 7c lxvd2x vs0,r4,r9
|
|
|
|
|
990: 98 4e 65 7d lxvd2x vs11,r5,r9
|
|
|
|
|
994: 00 53 8c f1 xvadddp vs12,vs12,vs10
|
|
|
|
|
998: 00 00 c1 e8 ld r6,0(r1)
|
|
|
|
|
99c: 78 0b 3f 7c mr r31,r1
|
|
|
|
|
9a0: 00 5b 00 f0 xvadddp vs0,vs0,vs11
|
|
|
|
|
9a4: c1 ff c1 f8 stdu r6,-64(r1)
|
|
|
|
|
9a8: 98 3f 9f 7d stxvd2x vs12,r31,r7
|
|
|
|
|
9ac: 98 47 1f 7c stxvd2x vs0,r31,r8
|
|
|
|
|
9b0: 98 3e 9f 7d lxvd2x vs12,r31,r7
|
|
|
|
|
9b4: 98 46 1f 7c lxvd2x vs0,r31,r8
|
|
|
|
|
9b8: 50 00 3f 38 addi r1,r31,80
|
|
|
|
|
9bc: f8 ff e1 eb ld r31,-8(r1)
|
|
|
|
|
9c0: 98 1f 80 7d stxvd2x vs12,0,r3
|
|
|
|
|
9c4: 98 4f 03 7c stxvd2x vs0,r3,r9
|
|
|
|
|
9c8: 20 00 80 4e blr]]></programlisting></para>
|
|
|
|
|
|
|
|
|
|
<para>The compiler treats the parameters and return value
|
|
|
|
|
as scalar arrays, which are passed by reference.
|
|
|
|
|
The operation is vectorized in this case,
|
|
|
|
|
but the 256-bit result is returned through storage.</para>
|
|
|
|
|
|
|
|
|
|
<para>This is not what we want to see for a simple 4 by double add.
|
|
|
|
|
It would be better if we can pass and return
|
|
|
|
|
MMX (<xref linkend="sec_handling_mmx"/>) and AVX (<xref linkend="sec_handling_avx"/>)
|
|
|
|
|
values as PowerPC registers and avoid the storage references.
|
|
|
|
|
If we can get the parameter and return values as registers,
|
|
|
|
|
this example will reduce to:
|
|
|
|
|
<programlisting><![CDATA[0000000000000970 <test_mx256_add_pd>:
|
|
|
|
|
970: xvadddp vs34,vs34,vs36
|
|
|
|
|
974: xvadddp vs35,vs35,vs37
|
|
|
|
|
978: blr]]></programlisting></para>
|
|
|
|
|
|
|
|
|
|
<para>So the PowerISA VMX/VSX facilities and GCC compiler support for
|
|
|
|
|
128-bit/16-byte vectors and associated vector built-ins
|
|
|
|
|
are well matched to implementing equivalent X86 SSE intrinsic functions.
|
|
|
|
|
However implementing the older MMX (64-bit) and the latest
|
|
|
|
|
AVX (256 / 512-bit) extensions requires more thought and some ingenuity.</para>
|
|
|
|
|
AVX (256 / 512-bit) extensions requires more thought and some
|
|
|
|
|
ingenuity.</para>
|
|
|
|
|
|
|
|
|
|
<xi:include href="sec_handling_mmx.xml"/>
|
|
|
|
|
<xi:include href="sec_handling_avx.xml"/>
|
|
|
|
|