<?xml version="1.0" encoding="UTF-8"?>
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
See the License for the specific language governing permissions and
limitations under the License.
<section xmlns=""
<title>GCC Vector Extensions</title>
<para>The GCC vector extensions share a common syntax but are implemented in a
target-specific way. Using the C vector extensions requires the
attribute to avoid syntax errors in case the user specified C standard
compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>,
etc.) that would normally disallow such
extensions. </para>
<para>The GCC implementation for PowerPC64 Little Endian is (mostly)
functionally compatible with x86_64 vector extension usage. We can use the same
type definitions (at least for <literal>vector_size (16)</literal>), operations, syntax
<literal>&lt;</literal><emphasis role="bold"><literal>{</literal></emphasis><literal>...</literal><emphasis role="bold"><literal>}</literal></emphasis><literal>&gt;</literal>
for vector initializers and constants, and array syntax
<literal>&lt;</literal><emphasis role="bold"><literal>[]</literal></emphasis><literal>&gt;</literal>
for vector element access. So simple arithmetic / logical operations
on whole vectors should work as is. </para>
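<para>The points above can be illustrated with a minimal sketch. The type
names <literal>v4f32</literal> and <literal>v4i32</literal> and the helper
functions are hypothetical, chosen only for this example; the code assumes GCC
(or a compatible compiler) and uses nothing beyond the generic vector
extensions, so the same source should build on x86_64 and on PowerPC64 Little
Endian.</para>

```c
#include <assert.h>

/* Hypothetical 16-byte vector types: four 32-bit lanes each. */
typedef float v4f32 __attribute__ ((vector_size (16)));
typedef int   v4i32 __attribute__ ((vector_size (16)));

/* Element-wise add on whole vectors; no target intrinsics involved. */
static v4f32 add4 (v4f32 a, v4f32 b)
{
  return a + b;
}

/* Element-wise AND on whole integer vectors. */
static v4i32 and4 (v4i32 a, v4i32 b)
{
  return a & b;
}

/* Exercises initializer syntax {...} and array syntax [] on vectors. */
static void vector_basics_check (void)
{
  v4f32 a = { 1.0f, 2.0f, 3.0f, 4.0f };
  v4f32 b = { 10.0f, 20.0f, 30.0f, 40.0f };
  v4f32 sum = add4 (a, b);
  v4i32 m = and4 ((v4i32){ 0xff, 0x0f, 3, 0 }, (v4i32){ 0x0f, 0xff, 5, 7 });

  assert (sum[0] == 11.0f);
  assert (sum[3] == 44.0f);
  assert (m[0] == 0x0f && m[1] == 0x0f && m[2] == 1 && m[3] == 0);
}
```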
<para>The caveat is that the interface data type of the Intel intrinsic may
not match the data types of the operation, so it may be necessary to cast the
operands to the specific type for the operation. This also applies to vector
initializers and to accessing vector elements: you need to use the appropriate
type to get the expected results. Of course this applies to x86_64 as well. For
example:
<programlisting><![CDATA[/* Perform the respective operation on the four SPFP values in A and B. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) ((__v4sf)__A + (__v4sf)__B);
}

/* Stores the lower SPFP value. */
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_store_ss (float *__P, __m128 __A)
{
  *__P = ((__v4sf)__A)[0];
}]]></programlisting></para>
<para>Note the cast from the interface type (<literal>__m128</literal>) to the implementation
type (<literal>__v4sf</literal>, defined in the intrinsic header) for the vector float add (+)
operation. This is enough for the compiler to select the appropriate vector add
instruction for the float type. The result (which is
<literal>__v4sf</literal>) then needs to be
cast back to the expected interface type (<literal>__m128</literal>). </para>
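<para>To see why the cast matters, the following sketch applies the same
<literal>+</literal> operator to the same operands under two different
implementation types. The names <literal>iface128</literal>,
<literal>impl_v4sf</literal> and <literal>impl_v4si</literal> are hypothetical
stand-ins for an interface type and two implementation types; they are not
taken from any intrinsic header. A cast between vector types of the same size
reinterprets the bits, so the element type named in the cast decides which add
the compiler generates.</para>

```c
#include <assert.h>

/* Hypothetical stand-in for an interface type such as __m128. */
typedef float iface128  __attribute__ ((vector_size (16), may_alias));
/* Hypothetical implementation types naming the element type of the operation. */
typedef float impl_v4sf __attribute__ ((vector_size (16)));
typedef int   impl_v4si __attribute__ ((vector_size (16)));

/* Float add: operands are treated as four 32-bit floats. */
static iface128 add_as_float (iface128 a, iface128 b)
{
  return (iface128) ((impl_v4sf) a + (impl_v4sf) b);
}

/* Integer add: the same 128 bits are treated as four 32-bit ints. */
static iface128 add_as_int (iface128 a, iface128 b)
{
  return (iface128) ((impl_v4si) a + (impl_v4si) b);
}

static void cast_check (void)
{
  iface128 a = { 1.0f, 1.0f, 1.0f, 1.0f };   /* bit pattern 0x3f800000 */
  iface128 b = { 2.0f, 2.0f, 2.0f, 2.0f };   /* bit pattern 0x40000000 */
  iface128 f = add_as_float (a, b);
  iface128 i = add_as_int (a, b);

  assert (f[0] == 3.0f);
  /* Integer add of the two bit patterns, not a float sum. */
  assert (((impl_v4si) i)[0] == 0x7f800000);
}
```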
<para>Note also the use of <emphasis>array syntax</emphasis> (<literal>((__v4sf)__A)[0]</literal>)
to extract the lowest
(left most<footnote><para>Here we are using logical left and logical right,
which will not match the PowerISA register view in Little Endian. Logical left
is the left most element for initializers {left, … , right}, storage order,
and array order, where the left most element is [0].</para></footnote>)
element of a vector. The cast (<literal>__v4sf</literal>) ensures that the compiler knows we are
extracting the left most 32-bit float. The compiler ensures that the code generated
matches the Intel behavior for PowerPC64 Little Endian. </para>
<para>The code generation is complicated by the fact that PowerISA vector
registers are Big Endian (element 0 is the left most word of the vector) and
scalar loads / stores are also to / from the left most word / dword.
X86 scalar loads / stores are to / from the right most element of the
XMM vector register.
The PowerPC64 ELF V2 ABI mimics the X86 Little Endian behavior by placing
logical element [0] in the right most element of the vector register. </para>
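<para>Whatever the register layout, the logical view seen from C stays the
same: element [0] is the first element in storage order. The sketch below
checks this by copying a vector to memory; the type name
<literal>vec4f</literal> and the helper <literal>first_in_memory</literal> are
hypothetical and exist only for this illustration.</para>

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 16-byte vector of four floats. */
typedef float vec4f __attribute__ ((vector_size (16)));

/* Copy the vector to memory and return the float at the lowest address. */
static float first_in_memory (vec4f v)
{
  float mem[4];
  memcpy (mem, &v, sizeof mem);
  return mem[0];
}

static void element_order_check (void)
{
  vec4f v = { 1.0f, 2.0f, 3.0f, 4.0f };

  /* The left most initializer value is logical element [0] ... */
  assert (v[0] == 1.0f);
  /* ... and it is also the first element in storage order. */
  assert (first_in_memory (v) == v[0]);
}
```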
<para>This may require the compiler to generate additional instructions
to place the scalar value in the expected position.
Application code with extensive use of scalar (vs packed) intrinsic loads /
stores should be flagged for rewrite to C code using existing scalar
types (float, double, int, long, etc.). The compiler may be able to
vectorize this scalar code using the native vector SIMD instruction set.</para>
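<para>As an illustration of such a rewrite, the loop below uses only scalar
<literal>float</literal> types. The function name
<literal>add_arrays</literal> is hypothetical. Whether the compiler actually
vectorizes it depends on the optimization level and target options in use, so
treat this as a sketch of the coding style rather than a guarantee of the
generated code.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Plain scalar C: dst[i] = a[i] + b[i].
   The restrict qualifiers state that the arrays do not overlap, which
   removes one common obstacle to auto-vectorization. */
static void add_arrays (float *restrict dst, const float *restrict a,
                        const float *restrict b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}

static void add_arrays_check (void)
{
  const float a[5] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f };
  const float b[5] = { 10.0f, 20.0f, 30.0f, 40.0f, 50.0f };
  float dst[5];

  add_arrays (dst, a, b, 5);
  assert (dst[0] == 11.0f);
  assert (dst[4] == 55.0f);
}
```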
<para>Another example is the set and set reverse order pair:
<programlisting><![CDATA[/* Create the vector [Z Y X W]. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set_ps (const float __Z, const float __Y, const float __X, const float __W)
{
  return __extension__ (__m128)(__v4sf){ __W, __X, __Y, __Z };
}

/* Create the vector [W X Y Z]. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_setr_ps (float __Z, float __Y, float __X, float __W)
{
  return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
}]]></programlisting></para>
<para>Note the use of <emphasis>initializer syntax</emphasis> to collect a set of scalars
into a vector. Code with constant initializer values will generate a vector
constant of the appropriate endianness. However, code with variables in the
initializer can get complicated, as it often requires transfers between register
sets and perhaps format conversions. We can assume that the compiler will
generate the correct code, but if this class of intrinsics shows up as a hot spot,
a rewrite to native PPC vector built-ins may be appropriate. For example, an
initializer that replicates one variable to all the vector fields might not be
recognized as a “load and splat”, and making this explicit may help the
compiler generate better code.</para>
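<para>One possible way to make the splat explicit is sketched below. The type
name <literal>splat_v4sf</literal> and the helper <literal>splat4</literal>
are hypothetical. The sketch assumes that on a PowerPC target with VMX/VSX
enabled the compiler defines <literal>__ALTIVEC__</literal> and that
<literal>vec_splats</literal> from <literal>altivec.h</literal> is available;
on any other target it falls back to the generic initializer form, so the
PowerPC path shown here should be checked on the actual target.</para>

```c
#include <assert.h>
#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Hypothetical 16-byte vector of four floats. */
typedef float splat_v4sf __attribute__ ((vector_size (16)));

/* Replicate one scalar into all four vector elements. */
static splat_v4sf splat4 (float x)
{
#if defined(__ALTIVEC__)
  /* Assumed PowerPC path: state the splat explicitly with a native built-in. */
  return (splat_v4sf) vec_splats (x);
#else
  /* Generic path: the compiler has to recognize the replicated initializer. */
  return (splat_v4sf){ x, x, x, x };
#endif
}

static void splat_check (void)
{
  splat_v4sf v = splat4 (2.5f);

  assert (v[0] == 2.5f && v[1] == 2.5f && v[2] == 2.5f && v[3] == 2.5f);
}
```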