Miscellaneous correction for spelling, grammar and punctuation.

Signed-off-by: Steven Munroe <sjmunroe@us.ibm.com>
pull/69/head
sjmunroe 7 years ago
parent dfa363ccaa
commit 1a2838c6b0

@ -58,10 +58,10 @@
<link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link> <link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link>
</para> </para>
<para> <para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/">GCC Manual (GCC 6.3)</link> <link xlink:href="https://gcc.gnu.org/gcc.pdf">GCC Manual</link>
</para> </para>
<para> <para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GCC Internals Manual</link> <link xlink:href="https://gcc.gnu.org/gccint.pdf">GCC Internals Manual</link>
</para> </para>
<para/> <para/>
</section> </section>

@ -22,7 +22,7 @@
xml:id="bk_main"> xml:id="bk_main">


<title>Linux on Power Porting Guide</title> <title>Linux on Power Porting Guide</title>
<subtitle>Vector Intrinsic</subtitle> <subtitle>Vector Intrinsics</subtitle>


<info> <info>
<author> <author>
@ -69,11 +69,11 @@
<revhistory> <revhistory>
<!-- TODO: Update as new revisions created --> <!-- TODO: Update as new revisions created -->
<revision> <revision>
<date>2017-07-26</date> <date>2017-09-14</date>
<revdescription> <revdescription>
<itemizedlist spacing="compact"> <itemizedlist spacing="compact">
<listitem> <listitem>
<para>Revision 0.1 - initial draft from Steve Munroe</para> <para>Revision 0.2 - initial draft from Steve Munroe</para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
</revdescription> </revdescription>

@ -23,12 +23,14 @@
<title>How do we work this?</title> <title>How do we work this?</title>
<para>The working assumption is to start with the existing GCC headers from <para>The working assumption is to start with the existing GCC headers from
./gcc/config/i386/, then convert them to PowerISA and add them to <literal>./gcc/config/i386/</literal>, then convert them to PowerISA
./gcc/config/rs6000/. I assume we will replicate the existing header structure and add them to <literal>./gcc/config/rs6000/</literal>.
and retain the existing header file and intrinsic names. This also allows us to I assume we will replicate the existing header structure
reuse existing DejaGNU test cases from ./gcc/testsuite/gcc.target/i386, modify and retain the existing header file and intrinsic names.
them as needed for the POWER target, and them to the This also allows us to reuse existing DejaGNU test cases from
./gcc/testsuite/gcc.target/powerpc.</para> <literal>./gcc/testsuite/gcc.target/i386</literal>, modify
them as needed for the POWER target, and add them to
<literal>./gcc/testsuite/gcc.target/powerpc</literal>.</para>


<para>We can be flexible on the sequence that headers/intrinsics and test <para>We can be flexible on the sequence that headers/intrinsics and test
cases are ported.  This should be based on customer need and resolving cases are ported.  This should be based on customer need and resolving
@ -42,8 +44,8 @@
 ./gcc/config/i386/ and copy the header comment (including FSF copyright) down  ./gcc/config/i386/ and copy the header comment (including FSF copyright) down
to any vector typedefs used in the API or implementation. Skip the Intel to any vector typedefs used in the API or implementation. Skip the Intel
intrinsic implementation code for now, but add the ending #end if matching the intrinsic implementation code for now, but add the ending #end if matching the
headers conditional guard against multiple inclusion. You can add  #include headers conditional guard against multiple inclusion. You can add additional
&lt;alternative&gt; as needed. For examples: #include's as needed. For example:
<programlisting><![CDATA[/* Copyright (C) 2003-2017 Free Software Foundation, Inc <programlisting><![CDATA[/* Copyright (C) 2003-2017 Free Software Foundation, Inc
... ...
/* This header provides a best effort implementation of the Intel X86 /* This header provides a best effort implementation of the Intel X86
@ -85,10 +87,13 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));
intrinsics. </para> intrinsics. </para>


<para>The <para>The
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html#Testsuites">GCC <link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html#Testsuites">
testsuite</link> uses the DejaGNU  test framework as documented in the <emphasis role="italic">GCC testsuite</emphasis></link>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GNU Compiler Collection (GCC) uses the DejaGNU  test framework as documented in the
Internals</link> manual. GCC adds its own DejaGNU directives and extensions, <link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">
<emphasis role="italic">GNU Compiler Collection (GCC)
Internals</emphasis></link>
manual. GCC adds its own DejaGNU directives and extensions,
that are embedded in the testsuite source as comments.  Some are platform that are embedded in the testsuite source as comments.  Some are platform
specific and will need to be adjusted for tests that are ported to our specific and will need to be adjusted for tests that are ported to our
platform. For example platform. For example
@ -102,9 +107,9 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));
/* { dg-require-effective-target lp64 } */ /* { dg-require-effective-target lp64 } */
/* { dg-require-effective-target p8vector_hw { target powerpc*-*-* } } */]]></programlisting></para> /* { dg-require-effective-target p8vector_hw { target powerpc*-*-* } } */]]></programlisting></para>


<para>Repeat this process until you have equivalent implementations for all <para>Repeat this process until you have equivalent DejaGNU test
the intrinsics in that header and associated test cases that execute without implementations for all the intrinsics in that header and associated
error.</para> test cases that execute without error.</para>


<xi:include href="sec_prefered_methods.xml"/> <xi:include href="sec_prefered_methods.xml"/>
<xi:include href="sec_prepare.xml"/> <xi:include href="sec_prepare.xml"/>

@ -27,19 +27,26 @@
applications, and make them (or equivalents) available for the PowerPC64LE applications, and make them (or equivalents) available for the PowerPC64LE
platform. These X86 intrinsics started with the Intel and Microsoft compilers platform. These X86 intrinsics started with the Intel and Microsoft compilers
but were then ported to the GCC compiler. The GCC implementation is a set of but were then ported to the GCC compiler. The GCC implementation is a set of
headers with inline functions. These inline functions provide a implementation headers with inline functions. These inline functions provide an implementation
mapping from the Intel/Microsoft dialect intrinsic names to the corresponding mapping from the Intel/Microsoft dialect intrinsic names to the corresponding
GCC Intel built-in's or directly via C language vector extension syntax.</para> GCC Intel built-ins or directly via C language vector extension syntax.</para>


<para>The current proposal is to start with the existing X86 GCC intrinsic <para>The current proposal is to start with the existing X86 GCC intrinsic
headers and port them (copy and change the source)  to POWER using C language headers and port them (copy and change the source)  to POWER using C language
vector extensions, VMX and VSX built-ins. Another key assumption is that we vector extensions, VMX and VSX built-ins. Another key assumption is that we
will be able to use many of existing Intel DejaGNU test cases on will be able to use many of the existing Intel DejaGNU test cases in
./gcc/testsuite/gcc.target/i386. This document is intended as a guide to ./gcc/testsuite/gcc.target/i386. This document is intended as a guide to
developers participating in this effort. However this document provides developers participating in this effort. However this document provides
guidance and examples that should be useful to developers who may encounter X86 guidance and examples that should be useful to developers who may encounter X86
intrinsics in code that they are porting to another platform.</para> intrinsics in code that they are porting to another platform.</para>


<note><para>(<emphasis>We have started contributions of X86 intrinsic headers
to the GCC project.</emphasis>) The current status of the project is the BMI
(bmiintrin.h), BMI2 (bmi2intrin.h), MMX (mmintrin.h), and SSE (xmmintrin.h)
intrinsic headers are committed to GCC development trunk for GCC 8.
Work on SSE2 (emmintrin.h) is in progress.
</para></note>

<xi:include href="sec_review_source.xml"/> <xi:include href="sec_review_source.xml"/>


</chapter> </chapter>

@ -22,24 +22,26 @@
xml:id="sec_crossing_lanes"> xml:id="sec_crossing_lanes">
<title>Crossing lanes</title> <title>Crossing lanes</title>
<para>We have seen that, most of the time, vector SIMD units prefer to keep <para>Vector SIMD units prefer to keep
computations in the same “lane” (element number) as the input elements. The computations in the same “lane” (element number) as the input elements. The
only exception in the examples so far are the occasional splat (copy one only exception in the examples so far are the occasional vector splat (copy one
element to all the other elements of the vector) operations. Splat is an element to all the other elements of the vector) operations. Splat is an
example of the general category of “permute” operations (Intel would call example of the general category of “permute” operations (Intel would call
this a “shuffle” or “blend”). Permutes selects and rearrange the this a “shuffle” or “blend”). </para>
elements of (usually) a concatenated pair of vectors and delivers those
<para>Permutes select and rearrange the
elements of an input vector (or a concatenated pair of vectors) and deliver those
selected elements, in a specific order, to a result vector. The selection and selected elements, in a specific order, to a result vector. The selection and
order of elements in the result is controlled by a third vector, either as 3rd order of elements in the result is controlled by a third operand, either as a 3rd
input vector or and immediate field of the instruction.</para> input vector or as an immediate field of the instruction.</para>


<para>For example the Intel intrisics for <para>For example, consider the Intel intrisics for
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&amp;expand=2757,4767,409,2757">Horizontal Add / Subtract</link> <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&amp;expand=2757,4767,409,2757">Horizontal Add / Subtract</link>
added with SSE3. These instrinsics add (subtract) adjacent element pairs, across pair of added with SSE3. These instrinsics add (subtract) adjacent element pairs across a pair of
input vectors, placing the sum of the adjacent elements in the result vector. input vectors, placing the sum of the adjacent elements in the result vector.
For example For example
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&amp;expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>   <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&amp;expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>  
which implments the operation on float: which implements the operation on float:
<programlisting><![CDATA[ result[0] = __A[1] + __A[0]; <programlisting><![CDATA[ result[0] = __A[1] + __A[0];
result[1] = __A[3] + __A[2]; result[1] = __A[3] + __A[2];
result[2] = __B[1] + __B[0]; result[2] = __B[1] + __B[0];
@ -60,15 +62,16 @@
the vector add (to implement Horizontal Add).  </para> the vector add (to implement Horizontal Add).  </para>


<para>The PowerISA provides generalized byte-level vector permute (vperm) <para>The PowerISA provides generalized byte-level vector permute (vperm)
based a vector register pair source as input and a control vector. The control based on a vector register pair (32 bytes) source as input and a (16-byte) control vector.
The control
vector provides 16 indexes (0-31) to select bytes from the concatenated input vector provides 16 indexes (0-31) to select bytes from the concatenated input
vector register pair (VRA, VRB). A more specific set of permutes (pack, unpack, vector register pair (VRA, VRB). There are also predefined permutes (splat, pack, unpack,
merge, splat) operations (across element sizes) are encoded as separate merge) operations (across element sizes) that are encoded as separate
 instruction opcodes or instruction immediate fields.</para>  instruction op-codes or instruction immediate fields.</para>


<para>Unfortunately only the general <literal>vec_perm</literal> <para>Unfortunately only the general <literal>vec_perm</literal>
can provide the realignment can provide the realignment
we need the _mm_hadd_ps operation or any of the int, short variants of hadd. we need for the _mm_hadd_ps operation or any of the int, short variants of hadd.
For example: For example:
<programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) <programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_ps (__m128 __X, __m128 __Y) _mm_hadd_ps (__m128 __X, __m128 __Y)
@ -89,12 +92,12 @@ _mm_hadd_ps (__m128 __X, __m128 __Y)
word elements across <literal>__X</literal> and <literal>__Y</literal>, word elements across <literal>__X</literal> and <literal>__Y</literal>,
and another to select the odd word elements and another to select the odd word elements
across <literal>__X</literal> and <literal>__Y</literal>. across <literal>__X</literal> and <literal>__Y</literal>.
The result of these permutes (<literal>vec_perm</literal>) are inputs to the The results of these permutes (<literal>vec_perm</literal>) are inputs to the
<literal>vec_add</literal> and completes the add operation. </para> <literal>vec_add</literal> that completes the horizontal add operation. </para>


<para>Fortunately the permute required for the double (64-bit) case (IE <para>Fortunately the permute required for the double (64-bit) case
_mm_hadd_pd) reduces to the equivalent of <literal>vec_mergeh</literal> / (<literal>_mm_hadd_pd</literal>) reduces to the equivalent of
<literal>vec_mergel</literal>  doubleword <literal>vec_mergeh</literal> / <literal>vec_mergel</literal>  doubleword
(which are variants of  VSX Permute Doubleword Immediate). So the (which are variants of  VSX Permute Doubleword Immediate). So the
implementation of _mm_hadd_pd can be simplified to this: implementation of _mm_hadd_pd can be simplified to this:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))

@ -23,12 +23,12 @@
<title>Profound differences</title> <title>Profound differences</title>
<para>We have already mentioned above a number of architectural differences <para>We have already mentioned above a number of architectural differences
that effect porting of codes containing Intel intrinsics to POWER. The fact that affect porting of codes containing Intel intrinsics to POWER. The fact
that Intel supports multiple vector extensions with different vector widths that Intel supports multiple vector extensions with different vector widths
(64, 128, 256, and 512-bits) while the PowerISA only supports vectors of (64, 128, 256, and 512 bits) while the PowerISA only supports vectors of
128-bits is one issue. Another is the difference in how the respective ISAs 128 bits is one issue. Another is the difference in how the respective ISAs
support scalars in vector registers is another.  In the text above we propose support scalars in vector registers.  In the text above we propose
workable alternatives for the PowerPC port. There also differences in the workable alternatives for the PowerPC port. There are also differences in the
handling of floating point exceptions and rounding modes that may impact the handling of floating point exceptions and rounding modes that may impact the
application's performance or behavior.</para> application's performance or behavior.</para>

@ -22,16 +22,16 @@
xml:id="sec_floatingpoint_exceptions"> xml:id="sec_floatingpoint_exceptions">
<title>Floating Point Exceptions</title> <title>Floating Point Exceptions</title>
<para>Nominally both ISAs support the IEEE754 specifications, but there are <para>Nominally both ISAs support the IEEE-754 specifications, but there are
some subtle differences. Both architecture define a status and control register some subtle differences. Both architectures define a status and control register
to record exceptions and enable / disable floating exceptions for program to record exceptions and enable / disable floating point exceptions for program
interrupt or default action. Intel has a MXCSR and PowerISA has a FPSCR which interrupt or default action. Intel has a MXCSR and PowerISA has a FPSCR which
basically do the same thing but with different bit layout. </para> basically do the same thing but with different bit layout. </para>


<para>Intel provides <literal>_mm_setcsr</literal> / <literal>_mm_getcsr</literal> <para>Intel provides <literal>_mm_setcsr</literal> / <literal>_mm_getcsr</literal>
intrinsics to allow direct intrinsic functions to allow direct access to the MXCSR.
access to the MXCSR. In the early days before the OS POSIX run-times where This might have been useful in the early days before the OS run-times were
updated  to manage the MXCSR, this might have been useful. Today this would be updated to manage the MXCSR via the POSIX APIs. Today this would be
highly discouraged with a strong preference to use the POSIX APIs highly discouraged with a strong preference to use the POSIX APIs
(<literal>feclearexceptflag</literal>, (<literal>feclearexceptflag</literal>,
<literal>fegetexceptflag</literal>, <literal>fegetexceptflag</literal>,
@ -44,29 +44,29 @@
might be simpler just to replace these intrinsics with macros that generate might be simpler just to replace these intrinsics with macros that generate
#error.</para> #error.</para>


<para>The Intel MXCSR does have some none (POSIX/IEEE754) standard quirks; <para>The Intel MXCSR does have some non- (POSIX/IEEE754) standard quirks:
Flush-To-Zero and Denormals-Are-Zeros flags. This simplifies the hardware The Flush-To-Zero and Denormals-Are-Zeros flags. This simplifies the hardware
response to what should be a rare condition (underflows where the result can response to what should be a rare condition (underflows where the result can
not be represented in the exponent range and precision of the format) by simply not be represented in the exponent range and precision of the format) by simply
returning a signed 0.0 value. The intrinsic header implementation does provide returning a signed 0.0 value. The intrinsic header implementation does provide
constant masks for <literal>_MM_DENORMALS_ZERO_ON</literal> constant masks for <literal>_MM_DENORMALS_ZERO_ON</literal>
(<literal>&lt;pmmintrin.h&gt;</literal>) and (<literal>&lt;pmmintrin.h&gt;</literal>) and
<literal>_MM_FLUSH_ZERO_ON</literal> (<literal>&lt;xmmintrin.h&gt;</literal>, <literal>_MM_FLUSH_ZERO_ON</literal> (<literal>&lt;xmmintrin.h&gt;</literal>),
so technically it is available to users so technically it is available to users
of the Intel Intrinsics API.</para> of the Intel Intrinsics API.</para>


<para>The VMX Vector facility provides a separate Vector Status and Control <para>The VMX Vector facility provides a separate Vector Status and Control
register (VSCR) with a Non-Java Mode control bit. This control combines the register (VSCR) with a Non-Java Mode control bit. This control combines the
flush-to-zero semantics for floating Point underflow and denormal values. But flush-to-zero semantics for floating point underflow and denormal values. But
this control only applies to VMX vector float instructions and does not apply this control only applies to VMX vector float instructions and does not apply
to VSX scalar floating Point or vector double instructions. The FPSCR does to VSX scalar floating Point or vector double instructions. The FPSCR does
define a Floating-Point non-IEEE mode which is optional in the architecture. define a Floating-Point non-IEEE mode which is optional in the architecture.
This would apply to Scalar and VSX floating-point operations if it was This would apply to Scalar and VSX floating-point operations if it were
implemented. This was largely intended for embedded processors and is not implemented. This was largely intended for embedded processors and is not
implemented in the POWER processor line.</para> implemented in the POWER processor line.</para>


<para>As the flush-to-zero is primarily a performance enhansement and is <para>As the flush-to-zero is primarily a performance enhancement and is
clearly outside the IEEE754 standard, it may be best to simply ignore this clearly outside the IEEE-754 standard, it may be best to simply ignore this
option for the intrinsic port.</para> option for the intrinsic port.</para>


</section> </section>

@ -23,7 +23,7 @@
<title>Floating-point rounding modes</title> <title>Floating-point rounding modes</title>
<para>The Intel (x86 / x86_64) and PowerISA architectures both support the <para>The Intel (x86 / x86_64) and PowerISA architectures both support the
4 IEEE754 rounding modes. Again while the Intel Intrinsic API allows the 4 IEEE-754 rounding modes. Again while the Intel Intrinsic API allows the
application to change rounding modes via updates to the application to change rounding modes via updates to the
<literal>MXCSR</literal> it is a bad idea <literal>MXCSR</literal> it is a bad idea
and should be replaced with the POSIX APIs (<literal>fegetround</literal> and and should be replaced with the POSIX APIs (<literal>fegetround</literal> and

@ -23,7 +23,7 @@
<title>GCC Vector Extensions</title> <title>GCC Vector Extensions</title>
<para>The GCC vector extensions are common syntax but implemented in a <para>The GCC vector extensions are common syntax but implemented in a
target specific way. Using the C vector extensions require the target specific way. Using the C vector extensions requires the
<literal>__gnu_inline__</literal> <literal>__gnu_inline__</literal>
attribute to avoid syntax errors in case the user specified  C standard attribute to avoid syntax errors in case the user specified  C standard
compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>, compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>,
@ -78,10 +78,18 @@ _mm_store_ss (float *__P, __m128 __A)


<para>The code generation is complicated by the fact that PowerISA vector <para>The code generation is complicated by the fact that PowerISA vector
registers are Big Endian (element 0 is the left most word of the vector) and registers are Big Endian (element 0 is the left most word of the vector) and
X86 scalar stores are from the left most (work/dword) for the vector register. scalar loads / stores are also to / from the right most word / dword.
X86 scalar loads / stores are to / from the right most element for the
XMM vector register.
The PowerPC64 ELF V2 ABI mimics the X86 Little Endian behavior by placing
logical element [0] in the right most element of the vector register. </para>

<para>This may require the compiler to generate additional instructions
to place the scalar value in the expected position.
Application code with extensive use of scalar (vs packed) intrinsic loads / Application code with extensive use of scalar (vs packed) intrinsic loads /
stores should be flagged for rewrite to native PPC code using exisiing scalar stores should be flagged for rewrite to C code using existing scalar
types (float, double, int, long, etc.). </para> types (float, double, int, long, etc.). The compiler may be able the
vectorize this scalar code using the native vector SIMD instruction set.</para>


<para>Another example is the set reverse order: <para>Another example is the set reverse order:
<programlisting><![CDATA[/* Create the vector [Z Y X W]. */ <programlisting><![CDATA[/* Create the vector [Z Y X W]. */
@ -103,7 +111,7 @@ _mm_setr_ps (float __Z, float __Y, float __X, float __W)
constant of the appropriate endian. However code with variables in the constant of the appropriate endian. However code with variables in the
initializer can get complicated as it often requires transfers between register initializer can get complicated as it often requires transfers between register
sets and perhaps format conversions. We can assume that the compiler will sets and perhaps format conversions. We can assume that the compiler will
generate the correct code, but if this class of intrinsics shows up a hot spot, generate the correct code, but if this class of intrinsics shows up as a hot spot,
a rewrite to native PPC vector built-ins may be appropriate. For example a rewrite to native PPC vector built-ins may be appropriate. For example
initializer of a variable replicated to all the vector fields might not be initializer of a variable replicated to all the vector fields might not be
recognized as a “load and splat” and making this explicit may help the recognized as a “load and splat” and making this explicit may help the

@ -23,7 +23,7 @@
<title>Dealing with AVX and AVX512</title> <title>Dealing with AVX and AVX512</title>
<para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have <para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have
lots (64) of vector registers and a super scalar vector pipe-line (can execute lots (64) of vector registers and a superscalar vector pipeline (can execute
two or more independent 128-bit vector operations concurrently). Second the ELF two or more independent 128-bit vector operations concurrently). Second the ELF
V2 ABI was designed to pass and return larger aggregates in vector V2 ABI was designed to pass and return larger aggregates in vector
registers:</para> registers:</para>
@ -35,7 +35,7 @@
</listitem> </listitem>
<listitem> <listitem>
<para>A qualified vector argument corresponds to: <para>A qualified vector argument corresponds to:
<itemizedlist> <itemizedlist spacing="compact">
<listitem> <listitem>
<para>A vector data type</para> <para>A vector data type</para>
</listitem> </listitem>
@ -58,7 +58,7 @@
</itemizedlist> </itemizedlist>


<para>So the ABI allows for passing up to three structures each <para>So the ABI allows for passing up to three structures each
representing 512-bit vectors and returning such (512-bit) structure all in VMX representing 512-bit vectors and returning such (512-bit) structures all in VMX
registers. This can be extended further by spilling parameters (beyond 12 X registers. This can be extended further by spilling parameters (beyond 12 X
128-bit vectors) to the parameter save area, but we should not need that, as 128-bit vectors) to the parameter save area, but we should not need that, as
most intrinsics only use 2 or 3 operands.. Vector registers not needed for most intrinsics only use 2 or 3 operands.. Vector registers not needed for
@ -79,7 +79,7 @@
<literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation of <literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation of
<literal>__m256</literal> and <literal>__m512</literal> <literal>__m256</literal> and <literal>__m512</literal>
types. Instead we will typedef structs of 2 or 4 vector (<literal>__m128</literal>) fields. This types. Instead we will typedef structs of 2 or 4 vector (<literal>__m128</literal>) fields. This
allows efficient handling of these larger data types without require new GCC allows efficient handling of these larger data types without requiring new GCC
language extensions. </para> language extensions. </para>


<para>In the end we should use the same type names and definitions as the <para>In the end we should use the same type names and definitions as the

@ -22,7 +22,7 @@
xml:id="sec_handling_mmx"> xml:id="sec_handling_mmx">
<title>Dealing with MMX</title> <title>Dealing with MMX</title>
<para>MMX is actually the hard case. The <literal>__m64</literal> <para>MMX is actually the harder case. The <literal>__m64</literal>
type supports SIMD vector type supports SIMD vector
int types (char, short, int, long).  The  Intel API defines   int types (char, short, int, long).  The  Intel API defines  
<literal>__m64</literal> as: <literal>__m64</literal> as:
@ -32,23 +32,23 @@
GCC) and we would prefer to use a native PowerISA type that can be passed in a GCC) and we would prefer to use a native PowerISA type that can be passed in a
single register.  The PowerISA Rotate Under Mask instructions can easily single register.  The PowerISA Rotate Under Mask instructions can easily
extract and insert integer fields of a General Purpose Register (GPR). This extract and insert integer fields of a General Purpose Register (GPR). This
implies that MMX integer types can be handled as a internal union of arrays for implies that MMX integer types can be handled as an internal union of arrays for
the supported element types. So an 64-bit unsigned long long is the best type the supported element types. So a 64-bit unsigned long long is the best type
for parameter passing and return values. Especially for the 64-bit (_si64) for parameter passing and return values, especially for the 64-bit (_si64)
operations as these normally generate a single PowerISA instruction.</para> operations as these normally generate a single PowerISA instruction.</para>


<para>The SSE extensions include some convert operations for <para>The SSE extensions include some copy / convert operations for
<literal>_m128</literal> to / <literal>_m128</literal> to /
from <literal>_m64</literal> and this includes some int to / from float conversions. However in from <literal>_m64</literal> and this includes some int to / from float conversions. However in
these cases the float operands always reside in SSE (XMM) registers (which these cases the float operands always reside in SSE (XMM) registers (which
match the PowerISA vector registers) and the MMX registers only contain integer match the PowerISA vector registers) and the MMX registers only contain integer
values. POWER8 (PowerISA-2.07) has direct move instructions between GPRs and values. POWER8 (PowerISA-2.07) has direct move instructions between GPRs and
VSRs. So these transfers are normally a single instruction and any conversions VSRs. So these transfers are normally a single instruction and any conversions
can be handed in the vector unit.</para> can be handled in the vector unit.</para>


<para>When transferring a <literal>__m64</literal> value to a vector register we should also <para>When transferring a <literal>__m64</literal> value to a vector register we should also
execute a xxsplatd instruction to insure there is valid data in all four execute a xxsplatd instruction to insure there is valid data in all four
element lanes before doing floating point operations. This avoids generating float element lanes before doing floating point operations. This avoids causing
extraneous floating point exceptions that might be generated by uninitialized extraneous floating point exceptions that might be generated by uninitialized
parts of the vector. The top two lanes will have the floating point results parts of the vector. The top two lanes will have the floating point results
that are in position for direct transfer to a GPR or stored via Store Float that are in position for direct transfer to a GPR or stored via Store Float
@ -57,7 +57,8 @@
form.</para> form.</para>


<para>Also for the smaller element sizes and higher element counts (MMX <para>Also for the smaller element sizes and higher element counts (MMX
<literal>_pi8</literal> and <literal>_p16</literal> types) the number of  Rotate Under Mask instructions required to <literal>_pi8</literal> and <literal>_p16</literal> types)
the number of  Rotate Under Mask instructions required to
disassemble the 64-bit <literal>__m64</literal> disassemble the 64-bit <literal>__m64</literal>
into elements, perform the element calculations, into elements, perform the element calculations,
and reassemble the elements in a single <literal>__m64</literal> and reassemble the elements in a single <literal>__m64</literal>

@ -52,9 +52,9 @@
document, Chapter 6. Vector Programming Interfaces and Appendix A. Predefined document, Chapter 6. Vector Programming Interfaces and Appendix A. Predefined
Functions for Vector Programming.</para> Functions for Vector Programming.</para>


<para>Another useful document is the original <link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">Altivec Technology Programers Interface Manual</link> <para>Another useful document is the original <link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">Altivec Technology Programmers Interface Manual</link>
with a  user friendly structure and many helpful diagrams. But alas the PIM does does not with a user friendly structure and many helpful diagrams. But alas the PIM does does not
cover the resent PowerISA (power7,  power8, and power9) enhancements.</para> cover the recent PowerISA (power7,  power8, and power9) enhancements.</para>


</section> </section>



@ -53,8 +53,8 @@
are a lot of scalar operations on a single float, double, or long long type. In are a lot of scalar operations on a single float, double, or long long type. In
effect these are scalars that can take advantage of the larger (xmm) register effect these are scalars that can take advantage of the larger (xmm) register
space. Also in the Intel 32-bit architecture they provided IEEE754 float and space. Also in the Intel 32-bit architecture they provided IEEE754 float and
double types, and 64-bit integers that did not exist or where hard to implement double types, and 64-bit integers that did not exist or were hard to implement
in the base i386/387 instruction set. These scalar operation use a suffix in the base i386/387 instruction set. These scalar operations use a suffix
starting with '_s' (<literal>_sd</literal> for scalar double float, starting with '_s' (<literal>_sd</literal> for scalar double float,
<literal>_ss</literal> scalar float, and <literal>_si64</literal> <literal>_ss</literal> scalar float, and <literal>_si64</literal>
for scalar long long).</para> for scalar long long).</para>
@ -72,7 +72,7 @@
of 4 32-bit integers.</para> of 4 32-bit integers.</para>


<para>The GCC  builtins for the <para>The GCC  builtins for the
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386.target</link>, <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386.target</link>
(includes x86 and x86_64) are not (includes x86 and x86_64) are not
the same as the Intel Intrinsics. While they have similar intent and cover most the same as the Intel Intrinsics. While they have similar intent and cover most
of the same functions, they use a different naming (prefixed with of the same functions, they use a different naming (prefixed with
@ -83,10 +83,10 @@ v4hi __builtin_ia32_paddw (v4hi, v4hi)
v2si __builtin_ia32_paddd (v2si, v2si) v2si __builtin_ia32_paddd (v2si, v2si)
v2di __builtin_ia32_paddq (v2di, v2di)]]></programlisting></para> v2di __builtin_ia32_paddq (v2di, v2di)]]></programlisting></para>


<para>Note: A key difference between GCC builtins for i386 and Powerpc is <note><para>A key difference between GCC built-ins for i386 and PowerPC is
that the x86 builtins have different names of each operation and type while the that the x86 built-ins have different names of each operation and type while the
powerpc altivec builtins tend to have a single generatic builtin for  each PowerPC altivec built-ins tend to have a single generic built-in for  each
operation, across a set of compatible operand types. </para> operation, across a set of compatible operand types. </para></note>


<para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented <para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented
as a set of inline functions using the Intel Intrinsic API names and types. as a set of inline functions using the Intel Intrinsic API names and types.
@ -106,7 +106,7 @@ _mm_add_sd (__m128d __A, __m128d __B)
}]]></programlisting></para> }]]></programlisting></para>


<para>Note that the   <para>Note that the  
<emphasis role="bold"><literal>_mm_add_pd</literal></emphasis> is implemented direct as C vector <emphasis role="bold"><literal>_mm_add_pd</literal></emphasis> is implemented direct as GCC C vector
extension code., while extension code., while
<emphasis role="bold"><literal>_mm_add_sd</literal></emphasis> is implemented via the GCC builtin <emphasis role="bold"><literal>_mm_add_sd</literal></emphasis> is implemented via the GCC builtin
<emphasis role="bold"><literal>__builtin_ia32_addsd</literal></emphasis>. From the <emphasis role="bold"><literal>__builtin_ia32_addsd</literal></emphasis>. From the

@ -23,10 +23,10 @@
<title>The structure of the intrinsic includes</title> <title>The structure of the intrinsic includes</title>
<para>The GCC x86 intrinsic functions for vector were initially grouped by <para>The GCC x86 intrinsic functions for vector were initially grouped by
technology (MMX and SSE), which starts with MMX continues with SSE through technology (MMX and SSE), which starts with MMX and continues with SSE through
SSE4.1 stacked like a set of Russian dolls.</para> SSE4.1 stacked like a set of Russian dolls.</para>


<para>Basically each higher layer include, needs typedefs and helper macros <para>Basically each higher layer include needs typedefs and helper macros
defined by the lower level intrinsic includes. mm_malloc.h simply provides defined by the lower level intrinsic includes. mm_malloc.h simply provides
wrappers for posix_memalign and free. Then it gets a little weird, starting wrappers for posix_memalign and free. Then it gets a little weird, starting
with the crypto extensions: with the crypto extensions:
@ -34,8 +34,8 @@
<programlisting><![CDATA[wmmintrin.h (AES) includes emmintrin.h]]></programlisting></para> <programlisting><![CDATA[wmmintrin.h (AES) includes emmintrin.h]]></programlisting></para>
<para>For AVX, AVX2, and AVX512 they must have decided <para>For AVX, AVX2, and AVX512 they must have decided
that the Russian Dolls thing was getting out of hand. AVX et all is split that the Russian Dolls thing was getting out of hand. AVX et al. is split
across 14 files across 14 files:
<programlisting><![CDATA[#include <avxintrin.h> <programlisting><![CDATA[#include <avxintrin.h>
#include <avx2intrin.h> #include <avx2intrin.h>
@ -53,25 +53,25 @@
#include <avx512vbmiintrin.h> #include <avx512vbmiintrin.h>
#include <avx512vbmivlintrin.h>]]></programlisting> #include <avx512vbmivlintrin.h>]]></programlisting>
but they do not want the applications include these but they do not want the applications to include these
individually.</para> individually.</para>
<para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, include all the <para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, including all the
AVX, AES, SSE and MMX flavors. AVX, AES, SSE, and MMX flavors.
<programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED <programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED
# error "Never use <avxintrin.h> directly; include <immintrin.h> instead." # error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
#endif]]></programlisting></para> #endif]]></programlisting></para>


<para>So what is the net? The include structure provides some strong clues <para>So why is this interesting? The include structure provides some strong clues
about the order that we should approach this effort.  For example if you need about the order that we should approach this effort.  For example if you need
to intrinsic from SSE4 (smmintrin.h) we are likely to need to type definitions to use intrinsics from SSE4 (smmintrin.h) you are likely to need to type definitions
from SSE (emmintrin.h). So a bottoms up (MMX, SSE, SSE2, …) approach seems from SSE (emmintrin.h). So a bottoms up (MMX, SSE, SSE2, …) approach seems
like the best plan of attack. Also saving the AVX parts for latter make sense, like the best plan of attack. Also saving the AVX parts for later make sense,
as most are just wider forms of operations that already exists in SSE.</para> as most are just wider forms of operations that already exist in SSE.</para>


<para>We should use the same include structure to implement our PowerISA <para>We should use the same include structure to implement our PowerISA
equivalent API headers. This will make porting easier (drop-in replacement) and equivalent API headers. This will make porting easier (drop-in replacement) and
should get the application running quickly on POWER. Then we are in a position should get the application running quickly on POWER. Then we will be in a position
to profile and analyze the resulting application. This will show any hot spots to profile and analyze the resulting application. This will show any hot spots
where the simple one-to-one transformation results in bottlenecks and where the simple one-to-one transformation results in bottlenecks and
additional tuning is needed. For these cases we should improve our tools (SDK additional tuning is needed. For these cases we should improve our tools (SDK

@ -42,28 +42,35 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
in which pointers to one vector type are permitted to alias pointers to a in which pointers to one vector type are permitted to alias pointers to a
different vector type.</para></blockquote></para> different vector type.</para></blockquote></para>
<para>So there are a <para>There are a couple of issues here:
couple of issues here: 1)  the API seem to force the compiler to assume <itemizedlist spacing="compact">
aliasing of any parameter passed by reference. Normally the compiler assumes <listitem>
that parameters of different size do not overlap in storage, which allows more <para>The API seems to force the compiler to assume
optimization. 2) the data type used at the interface may not be the correct aliasing of any parameter passed by reference.</para>
type for the implied operation. So parameters of type </listitem>
<literal>__m128i</literal> (which is defined <listitem>
as vector long long) is also used for parameters and return values of vector <para>The data type used at the interface may not be
[char | short | int ]. </para> the correct type for the implied operation.</para>

</listitem>
<para>This may not matter when using x86 built-in's but does matter when </itemizedlist>
the implementation uses C vector extensions or in our case use PowerPC generic Normally the compiler assumes that parameters of different size do
not overlap in storage, which allows more optimization.
However parameters for different vector element sizes
[char | short | int | long] are all passed and returned as type <literal>__m128i</literal>
(defined as vector long long). </para>

<para>This may not matter when using x86 built-ins but does matter when
the implementation uses C vector extensions or in our case uses PowerPC generic
vector built-ins vector built-ins
(<xref linkend="sec_powerisa_vector_intrinsics"/>). (<xref linkend="sec_powerisa_vector_intrinsics"/>).
For the later cases the type must be correct for For the latter cases the type must be correct for
the compiler to generate the correct type (char, short, int, long) the compiler to generate the correct type (char, short, int, long)
(<xref linkend="sec_api_implemented"/>) for the generic (<xref linkend="sec_api_implemented"/>) for the generic
builtin operation. There is also concern that excessive use of builtin operation. There is also concern that excessive use of
<literal>__may_alias__</literal> <literal>__may_alias__</literal>
will limit compiler optimization. We are not sure how important this attribute will limit compiler optimization. We are not sure how important this attribute
is to the correct operation of the API.  So at a later stage we should is to the correct operation of the API.  So at a later stage we should
experiment with removing it from our implementation for PowerPC</para> experiment with removing it from our implementation for PowerPC.</para>


<para>The good news is that PowerISA has good support for 128-bit vectors <para>The good news is that PowerISA has good support for 128-bit vectors
and (with the addition of VSX) all the required vector data (char, short, int, and (with the addition of VSX) all the required vector data (char, short, int,
@ -76,11 +83,16 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting>
implemented as vector attribute extensions of the appropriate  size (   implemented as vector attribute extensions of the appropriate  size (  
<literal>__vector_size__</literal> ({8 | 16 | 32, and 64}). For the PowerPC target  GCC currently <literal>__vector_size__</literal> ({8 | 16 | 32, and 64}). For the PowerPC target  GCC currently
only supports the native <literal>__vector_size__</literal> ( 16 ). These we can support directly only supports the native <literal>__vector_size__</literal> ( 16 ). These we can support directly
in VMX/VSX registers and associated instructions. The GCC will compile with in VMX/VSX registers and associated instructions. GCC will compile code with
other   <literal>__vector_size__</literal> values, but the resulting types are treated as simple other   <literal>__vector_size__</literal> values, but the resulting types are treated as simple
arrays of the element type. This does not allow the compiler to use the vector arrays of the element type. This does not allow the compiler to use the vector
registers and vector instructions for these (nonnative) vectors.   So what is registers and vector instructions for these (nonnative) vectors.</para>
a programmer to do?</para>
<para>So the PowerISA VMX/VSX facilities and GCC compiler support for
128-bit/16-byte vectors and associated vector built-ins
are well matched to implementing equivalent X86 SSE intrinsic functions.
However implementing the older MMX (64-bit) and the latest
AVX (256 / 512-bit) extensions requires more thought and some ingenuity.</para>
<xi:include href="sec_handling_mmx.xml"/> <xi:include href="sec_handling_mmx.xml"/>
<xi:include href="sec_handling_avx.xml"/> <xi:include href="sec_handling_avx.xml"/>

@ -27,23 +27,23 @@
converts a packed vector double into converts a packed vector double into
a packed vector single float. Since only 2 doubles fit into a 128-bit vector a packed vector single float. Since only 2 doubles fit into a 128-bit vector
only 2 floats are returned and occupy only half (64-bits) of the XMM register. only 2 floats are returned and occupy only half (64-bits) of the XMM register.
For this intrinsic the 64-bit are packed into the logical left half of the For this intrinsic the 64 bits are packed into the logical left half of the result
registers and the logical right half of the register is set to zero (as per the register and the logical right half of the register is set to zero (as per the
Intel <literal>cvtpd2ps</literal> instruction).</para> Intel <literal>cvtpd2ps</literal> instruction).</para>


<para>The PowerISA provides the VSX Vector round and Convert <para>The PowerISA provides the VSX Vector round and Convert
Double-Precision to Single-Precision format (xvcvdpsp) instruction. In the ABI Double-Precision to Single-Precision format (xvcvdpsp) instruction. In the ABI
this is <literal>vec_floato</literal> (vector double) .   this is <literal>vec_floato</literal> (vector double).  
This instruction convert each double This instruction converts each double
element then transfers converted element 0 to float element 1, and converted element, then transfers converted element 0 to float element 1, and converted
element 1 to float element 3. Float elements 0 and 2 are undefined (the element 1 to float element 3. Float elements 0 and 2 are undefined (the
hardware can do what ever). This does not match the expected results for hardware can do whatever). This does not match the expected results for
<literal>_mm_cvtpd_ps</literal>. <literal>_mm_cvtpd_ps</literal>.
<programlisting><![CDATA[vec_floato ({1.0, 2.0}) result = {<undefined>, 1.0, <undefined>, 2.0} <programlisting><![CDATA[vec_floato ({1.0, 2.0}) result = {<undefined>, 1.0, <undefined>, 2.0}
_mm_cvtpd_ps ({1.0, 2.0}) result = {1.0, 2.0, 0.0, 0.0}]]></programlisting></para> _mm_cvtpd_ps ({1.0, 2.0}) result = {1.0, 2.0, 0.0, 0.0}]]></programlisting></para>


<para>So we need to re-position the results to word elements 0 and 2, which <para>So we need to re-position the results to word elements 0 and 2, which
allows a pack operation to deliver the correct format. Here the merge odd allows a pack operation to deliver the correct format. Here the merge-odd
splats element 1 to 0 and element 3 to 2. The Pack operation combines the low splats element 1 to 0 and element 3 to 2. The Pack operation combines the low
half of each doubleword from the vector result and vector of zeros to generate half of each doubleword from the vector result and vector of zeros to generate
the require format. the require format.
@ -53,8 +53,6 @@ _mm_cvtpd_ps (__m128d __A)
__v4sf result; __v4sf result;
__v4si temp; __v4si temp;
const __v4si vzero = {0,0,0,0}; const __v4si vzero = {0,0,0,0};
// temp = (__v4si)vec_floato (__A); /* GCC 8 */

__asm__( __asm__(
"xvcvdpsp %x0,%x1;\n" "xvcvdpsp %x0,%x1;\n"
: "=wa" (temp) : "=wa" (temp)
@ -66,9 +64,9 @@ _mm_cvtpd_ps (__m128d __A)
return (result); return (result);
}]]></programlisting></para> }]]></programlisting></para>


<para>This  technique is also used to implement   <para>This technique is also used to implement  
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvttpd_epi32&amp;expand=1624,1859">_mm_cvttpd_epi32</link> <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvttpd_epi32&amp;expand=1624,1859">_mm_cvttpd_epi32</link>
which converts a packed vector double in to a packed vector int. The PowerISA instruction which converts a packed vector double into a packed vector int. The PowerISA instruction
<literal>xvcvdpsxws</literal> uses a similar layout for the result as <literal>xvcvdpsxws</literal> uses a similar layout for the result as
<literal>xvcvdpsp</literal> and requires the same fix up.</para> <literal>xvcvdpsp</literal> and requires the same fix up.</para>



@ -54,15 +54,16 @@ _mm_load_sd (double const *__P)
to combine __F and 0.0).  Other examples may generate sub-optimal code and to combine __F and 0.0).  Other examples may generate sub-optimal code and
justify a rewrite to PowerISA scalar or vector code (<link justify a rewrite to PowerISA scalar or vector code (<link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built- xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-
in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">GCC PowerPC in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">
AltiVec Built-in Functions</link> or inline assembler). </para> <emphasis role="italic">GCC PowerPC AltiVec Built-in Functions</emphasis></link>
or inline assembler). </para>


<para>Net: try using the existing C code if you can, but check on what the <note><para>Try using the existing C code if you can, but check on what the
compiler generates.  If the generated code is horrendous, it may be worth the compiler generates.  If the generated code is horrendous, it may be worth the
effort to write a PowerISA specific equivalent. For codes making extensive use effort to write a PowerISA specific equivalent. For codes making extensive use
of MMX or SSE scalar intrinsics you will be better off rewriting to use of MMX or SSE scalar intrinsics you will be better off rewriting to use
standard C scalar types and letting the the GCC compiler handle the details standard C scalar types and letting the GCC compiler handle the details
(see <link linkend="sec_prefered_methods"/>).</para> (see <xref linkend="sec_prefered_methods"/>).</para></note>


</section> </section>



@ -23,8 +23,8 @@
<title>Packed vs scalar intrinsics</title> <title>Packed vs scalar intrinsics</title>
<para>So what is actually going on here? The vector code is clear enough if <para>So what is actually going on here? The vector code is clear enough if
you know that '+' operator is applied to each vector element. The the intent of you know that the '+' operator is applied to each vector element. The intent of
the builtin is a little less clear, as the GCC documentation for the X86 built-in is a little less clear, as the GCC documentation for
<literal>__builtin_ia32_addsd</literal> is not very <literal>__builtin_ia32_addsd</literal> is not very
helpful (nonexistent). So perhaps the helpful (nonexistent). So perhaps the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&amp;expand=97">Intel Intrinsic Guide</link> <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&amp;expand=97">Intel Intrinsic Guide</link>
@ -54,7 +54,7 @@
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<para>The vector bit and field numbering is different (reversed). <para>The vector bit and field numbering is different (reversed).
<itemizedlist> <itemizedlist spacing="compact">
<listitem> <listitem>
<para>For Intel the scalar is always placed in the low order (right most) <para>For Intel the scalar is always placed in the low order (right most)
bits of the XMM register (and the low order address for load and store).</para> bits of the XMM register (and the low order address for load and store).</para>
@ -62,13 +62,13 @@
<listitem> <listitem>
<para>For PowerISA and VSX, scalar floating point operations and Floating <para>For PowerISA and VSX, scalar floating point operations and Floating
Point Registers (FPRs) are on the low numbered bits which is the left hand Point Registers (FPRs) are in the low numbered bits which is the left hand
side of the vector / scalar register (VSR). </para> side of the vector / scalar register (VSR). </para>
</listitem> </listitem>


<listitem> <listitem>
<para>For the PowerPC64 ELF V2 little endian ABI we also make point of <para>For the PowerPC64 ELF V2 little endian ABI we also make a point of
making the GCC vector extensions and vector built ins, appear to be little making the GCC vector extensions and vector built-ins, appear to be little
endian. So vector element 0 corresponds to the low order address and low endian. So vector element 0 corresponds to the low order address and low
order (right hand) bits of the vector register (VSR).</para> order (right hand) bits of the vector register (VSR).</para>
</listitem> </listitem>
@ -77,7 +77,7 @@
<listitem> <listitem>
<para>The handling of the non-scalar part of the register for scalar <para>The handling of the non-scalar part of the register for scalar
operations are different. operations are different.
<itemizedlist> <itemizedlist spacing="compact">
<listitem> <listitem>
<para>For Intel ISA the scalar operations either leaves the high order part <para>For Intel ISA the scalar operations either leaves the high order part
of the XMM vector unchanged or in some cases force it to 0.0.</para> of the XMM vector unchanged or in some cases force it to 0.0.</para>
@ -94,7 +94,7 @@
<para>To minimize confusion and use consistent nomenclature, I will try to <para>To minimize confusion and use consistent nomenclature, I will try to
use the terms logical left and logical right elements based on the order they use the terms logical left and logical right elements based on the order they
apprear in a C vector initializers and element index order. So in the vector apprear in a C vector initializers and element index order. So in the vector
<literal>(__v2df){1.0, 20.}</literal>, The value 1.0 is the in the logical left element [0] and <literal>(__v2df){1.0, 2.0}</literal>, The value 1.0 is the in the logical left element [0] and
the value 2.0 is logical right element [1].</para> the value 2.0 is logical right element [1].</para>


<para>So lets look at how to implement these intrinsics for the PowerISA. <para>So lets look at how to implement these intrinsics for the PowerISA.
@ -119,7 +119,7 @@ _mm_add_sd (__m128d __A, __m128d __B)
compiler generates the following code for PPC64LE target.:</para> compiler generates the following code for PPC64LE target.:</para>


<para>The packed vector double generated the corresponding VSX vector <para>The packed vector double generated the corresponding VSX vector
double add (xvadddp). But the scalar implementation is bit more complicated. double add (xvadddp). But the scalar implementation is a bit more complicated.
<programlisting><![CDATA[0000000000000720 <test_add_pd>: <programlisting><![CDATA[0000000000000720 <test_add_pd>:
720: 07 1b 42 f0 xvadddp vs34,vs34,vs35 720: 07 1b 42 f0 xvadddp vs34,vs34,vs35
... ...
@ -149,7 +149,7 @@ _mm_add_sd (__m128d __A, __m128d __B)
element (copied to itself).<footnote><para>Fun element (copied to itself).<footnote><para>Fun
fact: The vector registers in PowerISA are decidedly Big Endian. But we decided fact: The vector registers in PowerISA are decidedly Big Endian. But we decided
to make the PPC64LE ABI behave like a Little Endian system to make application to make the PPC64LE ABI behave like a Little Endian system to make application
porting easier. This require the compiler to manipulate the PowerISA vector porting easier. This requires the compiler to manipulate the PowerISA vector
instrinsic behind the the scenes to get the correct Little Endian results. For instrinsic behind the the scenes to get the correct Little Endian results. For
example the element selector [0|1] for <literal>vec_splat</literal> and the example the element selector [0|1] for <literal>vec_splat</literal> and the
generation of <literal>vec_mergeh</literal> vs <literal>vec_mergel</literal> generation of <literal>vec_mergeh</literal> vs <literal>vec_mergel</literal>
@ -161,9 +161,9 @@ _mm_add_sd (__m128d __A, __m128d __B)
opportunity to optimize the whole function. </para> opportunity to optimize the whole function. </para>


<para>Now we can look at a slightly more interesting (complicated) case. <para>Now we can look at a slightly more interesting (complicated) case.
Square root (<literal>sqrt</literal>) is not a arithmetic operator in C and is usually handled Square root (<literal>sqrt</literal>) is not an arithmetic operator in C and is usually handled
with a library call or a compiler builtin. We really want to avoid a library with a library call or a compiler builtin. We really want to avoid a library
calls and want to avoid any unexpected side effects. As you see below the call and want to avoid any unexpected side effects. As you see below the
implementation of implementation of
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_pd&amp;expand=4926"><literal>_mm_sqrt_pd</literal></link> and <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_pd&amp;expand=4926"><literal>_mm_sqrt_pd</literal></link> and
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_sd&amp;expand=4926,4956"><literal>_mm_sqrt_sd</literal></link> <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_sd&amp;expand=4926,4956"><literal>_mm_sqrt_sd</literal></link>
@ -197,13 +197,14 @@ _mm_sqrt_sd (__m128d __A, __m128d __B)
external library dependency for what should be only a few inline instructions. external library dependency for what should be only a few inline instructions.
So this is not a good option.</para> So this is not a good option.</para>


<para>Thinking outside the box; we do have an inline intrinsic for a <para>Thinking outside the box: we do have an inline intrinsic for a
(packed) vector double sqrt, that we just implemented. However we need to (packed) vector double sqrt that we just implemented. However we need to
insure the other half of <literal>__B</literal> (<literal>__B[1]</literal>) insure the other half of <literal>__B</literal> (<literal>__B[1]</literal>)
does not cause an harmful side effects does not cause any harmful side effects
(like raising exceptions for NAN or  negative values). The simplest solution (like raising exceptions for NAN or  negative values). The simplest solution
is to splat <literal>__B[0]</literal> to both halves of a temporary value before taking the is to vector splat <literal>__B[0]</literal> to both halves of a temporary
<literal>vec_sqrt</literal>. Then this result can be combined with <literal>__A[1]</literal> to return the final value before taking the <literal>vec_sqrt</literal>.
Then this result can be combined with <literal>__A[1]</literal> to return the final
result. For example: result. For example:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A) _mm_sqrt_pd (__m128d __A)
@ -228,8 +229,8 @@ _mm_sqrt_sd (__m128d __A, __m128d __B)
to combine the final result. You could also use the <literal>{c[0], __A[1]}</literal> to combine the final result. You could also use the <literal>{c[0], __A[1]}</literal>
initializer instead of <literal>_mm_setr_pd</literal>.</para> initializer instead of <literal>_mm_setr_pd</literal>.</para>


<para>Now we can look at vector and scalar compares that add there own <para>Now we can look at vector and scalar compares that add their own
complication: For example, the Intel Intrinsic Guide for complications: For example, the Intel Intrinsic Guide for
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_pd&amp;expand=779,788,779"><literal>_mm_cmpeq_pd</literal></link> <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_pd&amp;expand=779,788,779"><literal>_mm_cmpeq_pd</literal></link>
describes comparing double elements [0|1] and returning describes comparing double elements [0|1] and returning
either 0s for not equal and 1s (<literal>0xFFFFFFFFFFFFFFFF</literal> either 0s for not equal and 1s (<literal>0xFFFFFFFFFFFFFFFF</literal>
@ -242,9 +243,9 @@ _mm_sqrt_sd (__m128d __A, __m128d __B)
the final vector result.</para> the final vector result.</para>


<para>The packed vector implementation for PowerISA is simple as VSX <para>The packed vector implementation for PowerISA is simple as VSX
provides the equivalent instruction and GCC provides the provides the equivalent instruction and GCC provides the builtin
<literal>vec_cmpeq</literal> builtin <literal>vec_cmpeq</literal> supporting the vector double type.
supporting the vector double type. The technique of using scalar comparison However the technique of using scalar comparison
operators on the <literal>__A[0]</literal> and <literal>__B[0]</literal> operators on the <literal>__A[0]</literal> and <literal>__B[0]</literal>
does not work as the C comparison operators does not work as the C comparison operators
return 0 or 1 results while we need the vector select mask (effectively 0 or return 0 or 1 results while we need the vector select mask (effectively 0 or
@ -253,10 +254,10 @@ _mm_sqrt_sd (__m128d __A, __m128d __B)
banks.</para> banks.</para>


<para>In this case we are better off using explicit vector built-ins for <para>In this case we are better off using explicit vector built-ins for
<literal>_mm_add_sd</literal> as and example. We can use <literal>vec_splat</literal> <literal>_mm_add_sd</literal> and <literal>_mm_sqrt_sd</literal> as examples.
from element [0] to temporaries We can use <literal>vec_splat</literal> from element [0] to temporaries
where we can safely use <literal>vec_cmpeq</literal> to generate the expect selector mask. Note where we can safely use <literal>vec_cmpeq</literal> to generate the expected selector mask. Note
that the <literal>vec_cmpeq</literal> returns a bool long type so we need the cast the result back that the <literal>vec_cmpeq</literal> returns a bool long type so we need to cast the result back
to <literal>__v2df</literal>. Then use the to <literal>__v2df</literal>. Then use the
<literal>(__m128d){c[0], __A[1]}</literal> initializer to combine the <literal>(__m128d){c[0], __A[1]}</literal> initializer to combine the
comparison result with the original <literal>__A[1]</literal> input and cast to the require comparison result with the original <literal>__A[1]</literal> input and cast to the require
@ -283,20 +284,6 @@ _mm_cmpeq_sd(__m128d __A, __m128d __B)
return ((__m128d){c[0], __A[1]}); return ((__m128d){c[0], __A[1]});
}]]></programlisting></para> }]]></programlisting></para>


<para>Now lets look at a similar example that adds some surprising
complexity. This is the compare not equal case so we should be able to find the
equivalent vec_cmpne builtin:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_pd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpneqpd ((__v2df)__A, (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_sd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpneqsd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para>


</section> </section>



@ -31,13 +31,13 @@
<xref linkend="sec_packed_vs_scalar_intrinsics"/>). <xref linkend="sec_packed_vs_scalar_intrinsics"/>).
This requires additional PowerISA instructions This requires additional PowerISA instructions
to preserve the non-scalar portion of the vector registers. This may or may not to preserve the non-scalar portion of the vector registers. This may or may not
be important to the logic of the program being ported, but we have handle the be important to the logic of the program being ported, but we have to handle the
case where it is.</para> case where it is.</para>


<para>This is where the context of now the intrinsic is used starts to <para>This is where the context of how the intrinsic is used starts to
matter. If the scalar intrinsics are used within a larger program the compiler matter. If the scalar intrinsics are used within a larger program the compiler
may be able to eliminate the redundant register moves as the results are never may be able to eliminate the redundant register moves as the results are never
used. In the other cases common set up (like permute vectors or bit masks) can used. In other cases common set up (like permute vectors or bit masks) can
be common-ed up and hoisted out of the loop. So it is very important to let the be common-ed up and hoisted out of the loop. So it is very important to let the
compiler do its job with higher optimization levels (<literal>-O3</literal>, compiler do its job with higher optimization levels (<literal>-O3</literal>,
<literal>-funroll-loops</literal>).</para> <literal>-funroll-loops</literal>).</para>

@ -23,17 +23,17 @@
<title>Using MMX intrinsics</title> <title>Using MMX intrinsics</title>
<para>MMX was the first and oldest SIMD extension and initially filled a <para>MMX was the first and oldest SIMD extension and initially filled a
need for wider (64-bit) integer and additional register. This is back when need for wider (64-bit) integer and additional registers. This is back when
processors were 32-bit and 8 x 32-bit registers was starting to cramp our processors were 32-bit and 8 x 32-bit registers was starting to cramp our
programming style. Now 64-bit processors, larger register sets, and 128-bit (or programming style. Now 64-bit processors, larger register sets, and 128-bit (or
larger) vector SIMD extensions are common. There is simply no good reasons larger) vector SIMD extensions are common. There is simply no good reason to
write new code using the (now) very limited MMX capabilities. </para> write new code using the (now) very limited MMX capabilities. </para>


<para>We recommend that existing MMX codes be rewritten to use the newer <para>We recommend that existing MMX codes be rewritten to use the newer
SSE  and VMX/VSX intrinsics or using the more portable GCC  builtin vector SSE and VMX/VSX intrinsics or using the more portable GCC  builtin vector
support or in the case of si64 operations use C scalar code. The MMX si64 support or in the case of si64 operations use C scalar code. The MMX si64
scalars which are just (64-bit) operations on long long int types and any scalars are just (64-bit) operations on long long int types and any
modern C compiler can handle this type. The char short in SIMD operations modern C compiler can handle this type. The char / short in SIMD operations
should all be promoted to 128-bit SIMD operations on GCC builtin vectors. Both should all be promoted to 128-bit SIMD operations on GCC builtin vectors. Both
will improve cross platform portability and performance.</para> will improve cross platform portability and performance.</para>

@ -22,23 +22,23 @@
xml:id="sec_performance_sse"> xml:id="sec_performance_sse">
<title>Using SSE float and double scalars</title> <title>Using SSE float and double scalars</title>
<para>SSE scalar float / double intrinsics  “hand” optimization is no <para>For SSE scalar float / double intrinsics,  “hand” optimization is no
longer necessary. This was important, when SSE was initially introduced, and longer necessary. This was important, when SSE was initially introduced, and
compiler support was limited or nonexistent.  Also SSE scalar float / double compiler support was limited or nonexistent.  Also SSE scalar float / double
provided additional (16) registers and IEEE754 compliance, not available from provided additional (16) registers and IEEE-754 compliance, not available from
the 8087 floating point architecture that preceded it. So application the 8087 floating point architecture that preceded it. So application
developers where motivated to use SSE instruction versus what the compiler was developers where motivated to use SSE instructions versus what the compiler was
generating at the time.</para> generating at the time.</para>


<para>Modern compilers can now to generate and  optimize these (SSE <para>Modern compilers can now generate and optimize these (SSE
scalar) instructions for Intel from C standard scalar code. Of course PowerISA scalar) instructions for Intel from C standard scalar code. Of course PowerISA
supported IEEE754 float and double and had 32 dedicated floating point supported IEEE-754 float and double and had 32 dedicated floating point
registers from the start (and now 64 with VSX). So replacing a Intel specific registers from the start (and now 64 with VSX). So replacing Intel specific
scalar intrinsic implementation with the equivalent C language scalar scalar intrinsic implementation with the equivalent C language scalar
implementation is usually a win; allows the compiler to apply the latest implementation is usually a win; it allows the compiler to apply the latest
optimization and tuning for the latest generation processor, and is portable to optimization and tuning for the latest generation processor, and is portable to
other platforms where the compiler can also apply the latest optimization and other platforms where the compiler can also apply the latest optimization and
tuning for that processors latest generation.</para> tuning for that processor's latest generation.</para>
</section> </section>



@ -23,10 +23,13 @@
<title>Vector permute and formatting instructions</title> <title>Vector permute and formatting instructions</title>


<para>The vector Permute and formatting chapter follows and is an important <para>The vector Permute and formatting chapter follows and is an important
one to study. These operation operation on the byte, halfword, word (and with one to study. These operate on the byte, halfword, word (and with
2.07 doubleword) integer types . Plus special Pixel type. The shifts PowerISA 2.07 doubleword) integer types,
plus special pixel type. </para>

<para>The shift
instructions in this chapter operate on the vector as a whole at either the bit instructions in this chapter operate on the vector as a whole at either the bit
or the byte (octet) level, This is an important chapter to study for moving or the byte (octet) level. This is an important chapter to study for moving
PowerISA vector results into the vector elements that Intel Intrinsics PowerISA vector results into the vector elements that Intel Intrinsics
expect: expect:
@ -41,12 +44,13 @@
<para>The Vector Integer instructions include the add / subtract / Multiply <para>The Vector Integer instructions include the add / subtract / Multiply
/ Multiply Add/Sum / (no divide) operations for the standard integer types. / Multiply Add/Sum / (no divide) operations for the standard integer types.
There are instruction forms that  provide signed, unsigned, modulo, and There are instruction forms that  provide signed, unsigned, modulo, and
saturate results for most operations. The PowerISA 2.07 extension add / saturate results for most operations. PowerISA 2.07 extends vector integer
subtract of 128-bit integers with carry and extend to 256, 512-bit and beyond , operations to add / subtract quadword (128-bit) integers with carry and extend.
is included here. There are signed / unsigned compares across the standard This supports extended binary integer arithmetic to 256, 512-bit and beyond.
integer types (byte, .. doubleword). The usual and bit-wise logical operations. There are signed / unsigned compares across the standard
And the SIMD shift / rotate instructions that operate on the vector elements integer types (byte, .. doubleword); the usual bit-wise logical operations;
for various types. and the SIMD shift / rotate instructions that operate on the vector elements
for various integer types.


<literallayout><literal>6.9 Vector Integer Instructions . . . . . . . . . . . . . . . . . . 264 <literallayout><literal>6.9 Vector Integer Instructions . . . . . . . . . . . . . . . . . . 264
6.9.1 Vector Integer Arithmetic Instructions . . . . . . . . . . . . 264 6.9.1 Vector Integer Arithmetic Instructions . . . . . . . . . . . . 264
@ -55,8 +59,8 @@
6.9.4 Vector Integer Rotate and Shift Instructions . . . . . . . . . 302</literal></literallayout></para> 6.9.4 Vector Integer Rotate and Shift Instructions . . . . . . . . . 302</literal></literallayout></para>


<para>The vector [single] float instructions are grouped into this chapter. <para>The vector [single] float instructions are grouped into this chapter.
This chapter does not include the double float instructions which are described This chapter does not include the double float instructions, which are described
in the VSX chapter. VSX also include additional float instructions that operate in the VSX chapter. VSX also includes additional float instructions that operate
on the whole 64 register vector-scalar set. on the whole 64 register vector-scalar set.


<literallayout><literal>6.10 Vector Floating-Point Instruction Set . . . . . . . . . . . . . 306 <literallayout><literal>6.10 Vector Floating-Point Instruction Set . . . . . . . . . . . . . 306
@ -67,7 +71,7 @@
6.10.5 Vector Floating-Point Estimate Instructions . . . . . . . . . 316</literal></literallayout></para> 6.10.5 Vector Floating-Point Estimate Instructions . . . . . . . . . 316</literal></literallayout></para>


<para>The vector XOR based instructions are new with PowerISA 2.07 (POWER8) <para>The vector XOR based instructions are new with PowerISA 2.07 (POWER8)
and provide vector  crypto and check-sum operations: and provide vector crypto and check-sum operations:


<literallayout><literal>6.11 Vector Exclusive-OR-based Instructions . . . . . . . . . . . . 318 <literallayout><literal>6.11 Vector Exclusive-OR-based Instructions . . . . . . . . . . . . 318
6.11.1 Vector AES Instructions . . . . . . . . . . . . . . . . . . . 318 6.11.1 Vector AES Instructions . . . . . . . . . . . . . . . . . . . 318
@ -75,28 +79,28 @@
6.11.3 Vector Binary Polynomial Multiplication Instructions. . . . . 321 6.11.3 Vector Binary Polynomial Multiplication Instructions. . . . . 321
6.11.4 Vector Permute and Exclusive-OR Instruction . . . . . . . . . 323</literal></literallayout></para> 6.11.4 Vector Permute and Exclusive-OR Instruction . . . . . . . . . 323</literal></literallayout></para>


<para>The vector gather and bit permute support bit level rearrangement of <para>The vector gather and bit permute instructions support bit-level rearrangement of
bits with in the vector. While the vector versions of the count leading zeros bits with in the vector, while the vector versions of the count leading zeros
and population count are useful to accelerate specific algorithms. and population count instructions are useful to accelerate specific algorithms.


<literallayout><literal>6.12 Vector Gather Instruction . . . . . . . . . . . . . . . . . . . 324 <literallayout><literal>6.12 Vector Gather Instruction . . . . . . . . . . . . . . . . . . . 324
6.13 Vector Count Leading Zeros Instructions . . . . . . . . . . . . 325 6.13 Vector Count Leading Zeros Instructions . . . . . . . . . . . . 325
6.14 Vector Population Count Instructions. . . . . . . . . . . . . . 326 6.14 Vector Population Count Instructions. . . . . . . . . . . . . . 326
6.15 Vector Bit Permute Instruction . . . . . . . . . . . . . . . . 327</literal></literallayout></para> 6.15 Vector Bit Permute Instruction . . . . . . . . . . . . . . . . 327</literal></literallayout></para>


<para>The Decimal Integer add / subtract instructions complement the <para>The Decimal Integer add / subtract (fixed point) instructions complement the
Decimal Floating-Point instructions. They can also be used to accelerated some Decimal Floating-Point instructions. They can also be used to accelerate some
binary to/from decimal conversions. The VSCR instruction provides access the binary to/from decimal conversions. The VSCR instructions provide access to
the Non-Java mode floating-point control and the saturation status. These the Non-Java mode floating-point control and the saturation status. These
instruction are not normally of interest in porting Intel intrinsics. instructions are not normally of interest in porting Intel intrinsics.


<literallayout><literal>6.16 Decimal Integer Arithmetic Instructions . . . . . . . . . . . . 328 <literallayout><literal>6.16 Decimal Integer Arithmetic Instructions . . . . . . . . . . . . 328
6.17 Vector Status and Control Register Instructions . . . . . . . . 331</literal></literallayout></para> 6.17 Vector Status and Control Register Instructions . . . . . . . . 331</literal></literallayout></para>


<para>With PowerISA 2.07B (Power8) several major extension where added to <para>With PowerISA 2.07B (Power8) several major extensions were added to
the Vector Facility:</para> the Vector Facility:</para>


<itemizedlist> <itemizedlist spacing="compact">
<listitem> <listitem>
<para>Vector Crypto: Under “Vector Exclusive-OR-based Instructions <para>Vector Crypto: Under “Vector Exclusive-OR-based Instructions
Vector Exclusive-OR-based Instructions”, AES [inverse] Cipher, SHA 256 / 512 Vector Exclusive-OR-based Instructions”, AES [inverse] Cipher, SHA 256 / 512
@ -108,7 +112,7 @@
unsigned max / min, rotate and shift left/right.</para> unsigned max / min, rotate and shift left/right.</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Direct Move between GRPs and the FPRs / Left half of Vector <para>Direct Move between GPRs and the FPRs / Left half of Vector
Registers.</para> Registers.</para>
</listitem> </listitem>
<listitem> <listitem>
@ -116,7 +120,7 @@
support for vector <literal>__int128</literal> and multiple precision arithmetic.</para> support for vector <literal>__int128</literal> and multiple precision arithmetic.</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Decimal Integer add subtract for 31 digit BCD.</para> <para>Decimal Integer add / subtract for 31 digit Binary Coded Decimal (BCD).</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Miscellaneous SIMD extensions: Count leading Zeros, Population <para>Miscellaneous SIMD extensions: Count leading Zeros, Population
@ -124,17 +128,19 @@
</listitem> </listitem>
</itemizedlist> </itemizedlist>


<para>The rational for why these are included in the Vector Facilities <para>The rationale for these being included in the Vector Facilities
(VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with (VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with
how the instruction where encoded then with the type of operations or the ISA how the instructions were encoded than with the type of operations or the ISA
version of introduction. This is primarily a trade-off between the bits version of introduction. This is primarily a trade-off between the bits
required for register selection vs bits for extended op-code space within in a required for register selection versus the bits for extended op-code space within a
fixed 32-bit instruction. Basically accessing 32 vector registers require fixed 32-bit instruction. </para>
5-bits per register, while accessing all 64 vector-scalar registers require
6-bits per register. When you consider the most vector instructions require  3 <para>Basically accessing 32 vector registers requires
 and some (select, fused multiply-add) require 4 register operand forms,  the 5 bits per register, while accessing all 64 vector-scalar registers require
6 bits per register. When you consider that most vector instructions require
3 and some (select, fused multiply-add) require 4 register operand forms,  the
impact on op-code space is significant. The larger register set of VSX was impact on op-code space is significant. The larger register set of VSX was
justified by queuing theory of larger HPC matrix codes using double float, justified by queueing theory of larger HPC matrix codes using double float,
while 32 registers are sufficient for most applications.</para> while 32 registers are sufficient for most applications.</para>


<para>So by definition the VMX instructions are restricted to the original <para>So by definition the VMX instructions are restricted to the original

@ -22,23 +22,25 @@
xml:id="sec_power_vmx"> xml:id="sec_power_vmx">
<title>The Vector Facility (VMX)</title> <title>The Vector Facility (VMX)</title>
<para>The orginal VMX supported SIMD integer byte, halfword, and word, and <para>The original VMX supported SIMD integer byte, halfword, and word, and
single float data types within a separate (from GPR and FPR) bank of 32 x single float data types within a separate (from GPR and FPR) bank of 32 x
128-bit vector registers. These operations like to stay within their (SIMD) 128-bit vector registers. The arithmetic operations like to stay within their (SIMD)
lanes except where the operation changes the element data size (integer lanes except where the operation changes the element data size (integer
multiply, pack, and unpack). </para> multiply) or the generalized permute operations
(splat, permute, pack, unpack merge). </para>


<para>This is complimented by bit logical and shift / rotate / permute / <para>This is complemented by bit logical and shift / rotate instructions
merge instuctions that operate on the vector as a whole.  Some operation that operate on the vector as a whole.  Some operations
(permute, pack, merge, shift double, select) will select 128 bits from a pair (permute, pack, merge, shift double, select) will select 128 bits from a pair
of vectors (256-bits) and deliver 128-bit vector result. These instructions of vectors (256-bits) and delivers a 128-bit vector result. These instructions
will cross lanes or multiple registers to grab fields and assmeble them into will cross lanes or multiple registers to grab fields and assemble them into
the single register result.</para> the single register result.</para>


<para>The PowerISA 2.07B Chapter 6. Vector Facility is organised starting <para>The PowerISA 2.07B Chapter 6. Vector Facility is organised starting
with an overview (chapters 6.1- 6.6): with an overview (chapters 6.1- 6.6):


<literallayout><literal>6.1 Vector Facility Overview . . . . . . . . . . . . . . . . . . . . 227 <literallayout><literal>
6.1 Vector Facility Overview . . . . . . . . . . . . . . . . . . . . 227
6.2 Chapter Conventions. . . . . . . . . . . . . . . . . . . . . . . 227 6.2 Chapter Conventions. . . . . . . . . . . . . . . . . . . . . . . 227
6.2.1 Description of Instruction Operation . . . . . . . . . . . . . 227 6.2.1 Description of Instruction Operation . . . . . . . . . . . . . 227
6.3 Vector Facility Registers . . . . . . . . . . . . . . . . . . . 234 6.3 Vector Facility Registers . . . . . . . . . . . . . . . . . . . 234
@ -52,14 +54,18 @@
6.6 Vector Floating-Point Operations . . . . . . . . . . . . . . . . 240 6.6 Vector Floating-Point Operations . . . . . . . . . . . . . . . . 240
6.6.1 Floating-Point Overview . . . . . . . . . . . . . . . . . . . 240 6.6.1 Floating-Point Overview . . . . . . . . . . . . . . . . . . . 240
6.6.2 Floating-Point Exceptions . . . . . . . . . . . . . . . . . . 240 6.6.2 Floating-Point Exceptions . . . . . . . . . . . . . . . . . . 240
</literal></literallayout></para>
<para>Then a chapter on storage (load/store) access for vector and vector
elements:

<literallayout><literal>
6.7 Vector Storage Access Instructions . . . . . . . . . . . . . . . 242 6.7 Vector Storage Access Instructions . . . . . . . . . . . . . . . 242
6.7.1 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 242 6.7.1 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 242
6.7.2 Vector Load Instructions . . . . . . . . . . . . . . . . . . . 243 6.7.2 Vector Load Instructions . . . . . . . . . . . . . . . . . . . 243
6.7.3 Vector Store Instructions. . . . . . . . . . . . . . . . . . . 246 6.7.3 Vector Store Instructions. . . . . . . . . . . . . . . . . . . 246
6.7.4 Vector Alignment Support Instructions. . . . . . . . . . . . . 248</literal></literallayout></para> 6.7.4 Vector Alignment Support Instructions. . . . . . . . . . . . . 248
</literal></literallayout></para>
<para>Then a chapter on storage (load/store) access for vector and vector
elements:</para>
<xi:include href="sec_power_vector_permute_format.xml"/> <xi:include href="sec_power_vector_permute_format.xml"/>



@ -25,10 +25,11 @@
<para>With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities <para>With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities
of the PowerISA:</para> of the PowerISA:</para>
<itemizedlist> <itemizedlist spacing="compact">
<listitem> <listitem>
<para>Extend the available vector and floating-point scalar register <para>Extend the available vector and floating-point scalar register
sets from 32 registers each to a combined 64 x 64-bit scalar floating-point and sets from 32 registers each to a combined register set of 64 x 64-bit
scalar floating-point and
64 x 128-bit vector registers.</para> 64 x 128-bit vector registers.</para>
</listitem> </listitem>
<listitem> <listitem>
@ -42,27 +43,27 @@
<listitem> <listitem>
<para>Enable super-scalar execution of vector instructions and support <para>Enable super-scalar execution of vector instructions and support
2 independent vector floating point  pipelines for parallel execution of 4 x 2 independent vector floating point  pipelines for parallel execution of 4 x
64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit (FMAs) per 64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit FMAs per
cycle.</para> cycle.</para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>


<para>With PowerISA 2.07 (POWER8) we added single-precision scalar <para>With PowerISA 2.07 (POWER8) we added single-precision scalar
floating-point instruction to VSX. This completes the floating-point floating-point instructions to VSX. This completes the floating-point
computational set for VSX. This ISA release also clarified how these operate in computational set for VSX. This ISA release also clarified how these operate in
the Little Endian storage model.</para> the Little Endian storage model.</para>


<para>While the focus was on enhanced floating-point computation (for High <para>While the focus was on enhanced floating-point computation (for High
Performance Computing),  VSX also extended  the ISA with additional storage Performance Computing), VSX also extended  the ISA with additional storage
access, logical, and permute (merge, splat, shift) instructions. This was access, logical, and permute (merge, splat, shift) instructions. This was
necessary to extend these operations cover 64 VSX registers, and improves necessary to extend these operations to cover 64 VSX registers, and improves
unaligned storage access for vectors  (not available in VMX).</para> unaligned storage access for vectors  (not available in VMX).</para>


<para>The PowerISA 2.07B Chapter 7. Vector-Scalar Floating-Point Operations <para>The PowerISA 2.07B Chapter 7. Vector-Scalar Floating-Point Operations
is organized starting with an introduction and overview (chapters 7.1- 7.5) . is organized starting with an introduction and overview (chapters 7.1- 7.5) .
The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers
and how they relate (overlap and inter-operate) to the existing floating point and how they relate (overlap and inter-operate) to the existing floating point
scalar (FPRs) and (VMX VRs) vector registers. scalar (FPRs) and vector (VMX VRs) registers.


<literallayout><literal>7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 317 <literallayout><literal>7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 317
7.1.1 Overview of the Vector-Scalar Extension . . . . . . . . . . . 317 7.1.1 Overview of the Vector-Scalar Extension . . . . . . . . . . . 317
@ -91,7 +92,7 @@


<note><para>The reference to scalar element 0 above is from the big endian <note><para>The reference to scalar element 0 above is from the big endian
register perspective of the ISA. In the PPC64LE ABI implementation, and for the register perspective of the ISA. In the PPC64LE ABI implementation, and for the
purpose of porting Intel intrinsics, this is logical element 1.  Intel SSE purpose of porting Intel intrinsics, this is logical doubleword element 1.  Intel SSE
scalar intrinsics operated on logical element [0],  which is in the wrong scalar intrinsics operated on logical element [0],  which is in the wrong
position for PowerISA FPU and VSX scalar floating-point  operations. Another position for PowerISA FPU and VSX scalar floating-point  operations. Another
important note is what happens to the other half of the VSR when you execute a important note is what happens to the other half of the VSR when you execute a
@ -112,20 +113,20 @@
operations only exist for VMX (byte level permute and shift) or VSX (Vector operations only exist for VMX (byte level permute and shift) or VSX (Vector
double).</para> double).</para>


<para>So resister selection that; avoids unnecessary vector moves, follows <para>So register selection that avoids unnecessary vector moves and follows
the ABI, while maintaining the correct instruction specific register numbering, the ABI while maintaining the correct instruction specific register numbering,
can be tricky. The can be tricky. The
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints">GCC register constraint</link> <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints">GCC register constraint</link>
annotations for Inline annotations for Inline
assembler using vector instructions  is challenging, even for experts. So only assembler using vector instructions are challenging, even for experts. So only
experts should be writing assembler and then only in extraordinary experts should be writing assembler and then only in extraordinary
circumstances. You should leave these details to the compiler (using vector circumstances. You should leave these details to the compiler (using vector
extensions and vector built-ins) when ever possible.</para> extensions and vector built-ins) when ever possible.</para>


<para>The next sections get is into the details of floating point <para>The next sections gets into the details of floating point
representation, operations, and exceptions. Basically the implementation representation, operations, and exceptions. They describe the implementation
details for the IEEE754R and C/C++ language standards that most developers only details for the IEEE-754R and C/C++ language standards that most developers only
access via higher level APIs. So most programmers will not need this level of access via higher level APIs. Most programmers will not need this level of
detail, but it is there if needed. detail, but it is there if needed.


<literallayout><literal>7.3 VSX Operations . . . . . . . . . . . . . . . . . . . . . . . . . 326 <literallayout><literal>7.3 VSX Operations . . . . . . . . . . . . . . . . . . . . . . . . . 326
@ -138,9 +139,9 @@
7.4.3 Floating-Point Overflow Exception. . . . . . . . . . . . . . . 349 7.4.3 Floating-Point Overflow Exception. . . . . . . . . . . . . . . 349
7.4.4 Floating-Point Underflow Exception . . . . . . . . . . . . . . 351</literal></literallayout></para> 7.4.4 Floating-Point Underflow Exception . . . . . . . . . . . . . . 351</literal></literallayout></para>


<para>Finally an overview the VSX storage access instructions for big and <para>Next comes an overview of the VSX storage access instructions for big and
little endian and for aligned and unaligned data addresses. This included little endian and for aligned and unaligned data addresses. This included
diagrams that illuminate the differences diagrams that illuminate the differences.


<literallayout><literal>7.5 VSX Storage Access Operations . . . . . . . . . . . . . . . . . 356 <literallayout><literal>7.5 VSX Storage Access Operations . . . . . . . . . . . . . . . . . 356
7.5.1 Accessing Aligned Storage Operands . . . . . . . . . . . . . . 356 7.5.1 Accessing Aligned Storage Operands . . . . . . . . . . . . . . 356
@ -148,19 +149,19 @@
7.5.3 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 358</literal></literallayout></para> 7.5.3 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 358</literal></literallayout></para>


<para>Section 7.6 starts with a VSX instruction Set Summary which is the <para>Section 7.6 starts with a VSX instruction Set Summary which is the
place to start to get an feel for the types and operations supported.  The place to start to get a feel for the types and operations supported.  The
emphasis on float-point, both scalar and vector (especially vector double), is emphasis on floating-point, both scalar and vector (especially vector double), is
pronounced. Many of the scalar and single-precision vector instruction look pronounced. Many of the scalar and single-precision vector instructions look
like duplicates of what we have seen in the Chapter 4 Floating-Point and like duplicates of what we have seen in the Chapter 4 Floating-Point and
Chapter 6 Vector facilities. The difference here is, new instruction encodings Chapter 6 Vector facilities. The difference here is new instruction encodings
to access the full 64 VSX register space. </para> to access the full 64 VSX register space. </para>


<para>In addition there are small number of logical instructions are <para>In addition there are a small number of logical instructions
include to support predication (selecting / masking vector elements based on included to support predication (selecting / masking vector elements based on
compare results). And set of permute, merge, shift, and splat instructions that comparison results), and a set of permute, merge, shift, and splat instructions that
operation on VSX word (float) and doubleword (double) elements. As mentioned operate on VSX word (float) and doubleword (double) elements. As mentioned
about VMX section 6.8 these instructions are good to study as they are useful about VMX section 6.8 these instructions are good to study as they are useful
for realigning elements from PowerISA vector results to that required for Intel for realigning elements from PowerISA vector results to the form required for Intel
Intrinsics. Intrinsics.


<literallayout><literal>7.6 VSX Instruction Set . . . . . . . . . . . . . . . . . . . . . . 359 <literallayout><literal>7.6 VSX Instruction Set . . . . . . . . . . . . . . . . . . . . . . 359
@ -179,8 +180,8 @@


<para>The VSX Instruction Descriptions section contains the detail <para>The VSX Instruction Descriptions section contains the detail
description for each VSX category instruction.  The table entries from the description for each VSX category instruction.  The table entries from the
Instruction Set Summary are formatted in the document at hyperlinks to Instruction Set Summary are formatted in the document as hyperlinks to
corresponding instruction description.</para> corresponding instruction descriptions.</para>


</section> </section>



@ -22,7 +22,7 @@
xml:id="sec_powerisa"> xml:id="sec_powerisa">
<title>The PowerISA</title> <title>The PowerISA</title>
<para>The PowerISA is for historical reasons is organized at the top level <para>The PowerISA Vector facilities, for historical reasons are organized at the top level
by the distinction between older Vector Facility (Altivec / VMX) and the newer by the distinction between older Vector Facility (Altivec / VMX) and the newer
Vector-Scalar Floating-Point Operations (VSX). </para> Vector-Scalar Floating-Point Operations (VSX). </para>

@ -22,21 +22,21 @@
xml:id="sec_powerisa_vector_facilities"> xml:id="sec_powerisa_vector_facilities">
<title>PowerISA Vector facilities</title> <title>PowerISA Vector facilities</title>
<para>The PowerISA vector facilities (VMX and VSX) are extensive, but does <para>The PowerISA vector facilities (VMX and VSX) are extensive, but do
not always provide a direct or obvious functional equivalent to the Intel not always provide a direct or obvious functional equivalent to the Intel
Intrinsics. But being not obvious is not the same as imposible. It just Intrinsics. However not being obvious is not the same as impossible. It just
requires some basic programing skills.</para> requires some basic programing skills.</para>


<para>It is a good idea to have an overall understanding of the vector <para>It is a good idea to have an overall understanding of the vector
capabilities the PowerISA. You do not need to memorize every instructions but capabilities of the PowerISA. You do not need to memorize every instruction but
is helps to know where to look. Both the PowerISA and OpenPOWER ABI have a it helps to know where to look. Both the PowerISA and OpenPOWER ABI have a
specific structure and organization that can help you find what you looking specific structure and organization that can help you find what you are looking
for. </para> for. </para>


<para>It also helps to understand the relationship between the PowerISAs <para>It also helps to understand the relationship between the PowerISA's
low level instructions and the higher abstraction of the vector intrinsics as low level instructions and the higher abstraction of the vector intrinsics as
defined by the OpenPOWER ABIs Vector Programming Interfaces and the the defacto defined by the OpenPOWER ABI's Vector Programming Interfaces and the de facto
standard of GCC's PowerPC AltiVec Built-in Functions.</para> standard of GCC's PowerPC AltiVec Builtin Functions.</para>
<xi:include href="sec_powerisa.xml"/> <xi:include href="sec_powerisa.xml"/>
<xi:include href="sec_powerisa_vector_intrinsics.xml"/> <xi:include href="sec_powerisa_vector_intrinsics.xml"/>

@ -30,23 +30,23 @@
C/C++ compilers implement. </para> C/C++ compilers implement. </para>


<para>Some of these operations are endian sensitive and the compiler needs <para>Some of these operations are endian sensitive and the compiler needs
to make corresponding adjustments as  it generate code for endian sensitive to make corresponding adjustments as it generates code for endian sensitive
built-ins. There is a good overview for this in the built-ins. There is a good overview for this in the
<emphasis role="bold">OpenPOWER ABI Section</emphasis> <emphasis role="bold">OpenPOWER ABI Section</emphasis>
<emphasis><emphasis role="bold">6.4. <emphasis><emphasis role="bold">6.4.
Vector Built-in Functions</emphasis></emphasis>.</para> Vector Built-in Functions</emphasis></emphasis>.</para>


<para>Appendix A is organized (sorted) by built-in name, output type, then <para>Appendix A is organized (sorted) by built-in name, output type, then
parameter types. Most built-ins are generic as the named the operation (add, parameter types. Most built-ins are generic as the named operation (add,
sub, mul, cmpeq, ...) applies to multiple types. </para> sub, mul, cmpeq, ...) applies to multiple types. </para>


<para>So the build <literal>vec_add</literal> built-in applies to all the signed and unsigned <para>So the <literal>vec_add</literal> built-in applies to all the signed and unsigned
integer types (char, short, in, and long) plus float and double floating-point integer types (char, short, in, and long) plus float and double floating-point
types. The compiler looks at the parameter type to select the vector types. The compiler looks at the parameter type to select the vector
instruction (or instruction sequence) that implements the (add) operation on instruction (or instruction sequence) that implements the add operation on
that type. The compiler infers the output result type from the operation and that type. The compiler infers the output result type from the operation and
input parameters and will complain if the target variable type is not input parameters and will complain if the target variable type is not
compatible. For example: compatible. Some examples:
<programlisting><![CDATA[vector signed char vec_add (vector signed char, vector signed char); <programlisting><![CDATA[vector signed char vec_add (vector signed char, vector signed char);
vector unsigned char vec_add (vector unsigned char, vector unsigned char); vector unsigned char vec_add (vector unsigned char, vector unsigned char);
vector signed short vec_add (vector signed short, vector signed short); vector signed short vec_add (vector signed short, vector signed short);
@ -63,7 +63,7 @@ vector double vec_add (vector double, vector double);]]></programlisting></para>
the name). This is why it is so important to understand the vector element the name). This is why it is so important to understand the vector element
types and to add the appropriate type casts to get the correct results.</para> types and to add the appropriate type casts to get the correct results.</para>


<para>The defacto standard implementation is GCC as defined in the include <para>The de facto standard implementation in GCC is defined in the include
file <literal>&lt;altivec.h&gt;</literal> and documented in the GCC online documentation in file <literal>&lt;altivec.h&gt;</literal> and documented in the GCC online documentation in
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">6.59.20 PowerPC <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">6.59.20 PowerPC
AltiVec Built-in Functions</link>. The header file name and section title AltiVec Built-in Functions</link>. The header file name and section title

@ -23,17 +23,17 @@
<title>How vector elements change size and type</title> <title>How vector elements change size and type</title>
<para>Most vector built ins return the same vector type as the (first) <para>Most vector built ins return the same vector type as the (first)
input parameters, but there are exceptions. Examples include; conversions input parameters, but there are exceptions. Examples include conversions
between types, compares , pack, unpack,  merge, and integer multiply between types, compares, pack, unpack,  merge, and integer multiply
operations.</para> operations.</para>


<para>Converting floats to from integer will change the type and something <para>Converting floats to / from integer types will change the type and sometimes
change the element size as well (double ↔ int and float ↔ long). For the change the element size as well (double ↔ int and float ↔ long). For
VMX the conversions are always the same size (float ↔ [unsigned] int). But VMX the conversions are always the same size (float ↔ [unsigned] int). But
VSX allows conversion of 64-bit (long or double) to from 32-bit (float or VSX allows conversion of 64-bit (long or double) to from 32-bit (float or
 int)  with the inherent size changes. The PowerISA VSX defines a 4 element  int)  with the inherent size changes. The PowerISA VSX defines a 4-element
vector layout where little endian elements 0, 2 are used for input/output and vector layout where little endian elements 0, 2 are used for input/output and
elements 1,3 are undefined. The OpenPOWER ABI Appendix A define elements 1,3 are undefined. The OpenPOWER ABI Appendix A defines
<literal>vec_double</literal> and <literal>vec_float</literal> <literal>vec_double</literal> and <literal>vec_float</literal>
with even/odd and high/low extensions as program aids. These are not with even/odd and high/low extensions as program aids. These are not
included in GCC 7 or earlier but are planned for GCC 8.</para> included in GCC 7 or earlier but are planned for GCC 8.</para>
@ -42,8 +42,8 @@
<literal>vector bool &lt;</literal>input element type<literal>&gt;</literal> <literal>vector bool &lt;</literal>input element type<literal>&gt;</literal>
(effectively bit masks) or predicates (the condition code for all and (effectively bit masks) or predicates (the condition code for all and
any are represented as an int truth variable). When a predicate compare (i.e. any are represented as an int truth variable). When a predicate compare (i.e.
<literal>vec_all_eq</literal>, <literal>vec_any_gt</literal>), <literal>vec_all_eq</literal>, <literal>vec_any_gt</literal>)
is used in a if statement,  the condition code is is used in an if statement,  the condition code is
used directly in the conditional branch and the int truth value is not used directly in the conditional branch and the int truth value is not
generated.</para> generated.</para>


@ -51,7 +51,7 @@
integer sized elements. Pack operations include signed and unsigned saturate integer sized elements. Pack operations include signed and unsigned saturate
and unsigned modulo forms. As the packed result will be half the size (in and unsigned modulo forms. As the packed result will be half the size (in
bits), pack instructions require 2 vectors (256-bits) as input and generate a bits), pack instructions require 2 vectors (256-bits) as input and generate a
single 128-bit vector results. single 128-bit vector result.
<programlisting><![CDATA[vec_vpkudum ({1, 2}, {101, 102}) result={1, 2, 101, 102}]]></programlisting></para> <programlisting><![CDATA[vec_vpkudum ({1, 2}, {101, 102}) result={1, 2, 101, 102}]]></programlisting></para>


<para>Unpack operations expand integer elements into the next larger size <para>Unpack operations expand integer elements into the next larger size
@ -60,7 +60,7 @@
So the PowerISA defines unpack-high and unpack low forms where instruction So the PowerISA defines unpack-high and unpack low forms where instruction
takes (the high or low) half of vector elements and extends them to fill the takes (the high or low) half of vector elements and extends them to fill the
vector output. Element order is maintained and an unpack high / low sequence vector output. Element order is maintained and an unpack high / low sequence
with same input vector has the effect of unpacking to a 256-bit result in two with the same input vector has the effect of unpacking to a 256-bit result in two
vector registers. vector registers.
<programlisting><![CDATA[vec_vupkhsw ({1, 2, 3, 4}) result={1, 2} <programlisting><![CDATA[vec_vupkhsw ({1, 2, 3, 4}) result={1, 2}
vec_vupkhsw ({-1, 2, -3, 4}) result={-1, 2} vec_vupkhsw ({-1, 2, -3, 4}) result={-1, 2}
@ -79,7 +79,7 @@ vec_vupklsw ({-1, 2, -3, 4}) result={-3, 4}]]></programlisting></para>
<para>For PowerISA 2.07 we added vector merge word even / odd instructions. <para>For PowerISA 2.07 we added vector merge word even / odd instructions.
Instead of high or low elements the shuffle is from the even or odd number Instead of high or low elements the shuffle is from the even or odd number
elements of the two input vectors. Passing the same vector to both inputs to elements of the two input vectors. Passing the same vector to both inputs to
merge produces splat like results for each doubleword half, which is handy in merge produces splat-like results for each doubleword half, which is handy in
some convert operations. some convert operations.
<programlisting><![CDATA[vec_mrghd ({1, 2}, {101, 102}) result={1, 101} <programlisting><![CDATA[vec_mrghd ({1, 2}, {101, 102}) result={1, 101}
vec_mrgld ({1, 2}, {101, 102}) result={2, 102} vec_mrgld ({1, 2}, {101, 102}) result={2, 102}
@ -102,11 +102,12 @@ vec_mergeo ({1, 2, 3, 4}, {1, 2, 3, 4}) result={2, 2, 4, 4}]]></programlisting><
double product precision for intermediate computation before reducing the final double product precision for intermediate computation before reducing the final
result back to the original precision.</para> result back to the original precision.</para>


<para>The PowerISA VMX instruction set took the later approach ie keep all <para>The PowerISA VMX instruction set took the later approach, i.e., keep all
the product bits until the programmer explicitly asks for the truncated result. the product bits until the programmer explicitly asks for the truncated result
(via the pack operation).
So the vector integer multiple are split into even/odd forms across signed and So the vector integer multiple are split into even/odd forms across signed and
unsigned; byte, halfword and word inputs. This requires two instructions (given unsigned byte, halfword and word inputs. This requires two instructions (given
the same inputs) to generated the full vector  multiply across 2 vector the same inputs) to generate the full vector multiply across 2 vector
registers and 256-bits. Again as POWER processors are super-scalar this pair of registers and 256-bits. Again as POWER processors are super-scalar this pair of
instructions should execute in parallel.</para> instructions should execute in parallel.</para>



@ -20,7 +20,7 @@
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0" version="5.0"
xml:id="sec_prefered_methods"> xml:id="sec_prefered_methods">
<title>Prefered methods</title> <title>Preferred methods</title>
<para>As we will see there are multiple ways to implement the logic of <para>As we will see there are multiple ways to implement the logic of
these intrinsics. Some implementation methods are preferred because they allow these intrinsics. Some implementation methods are preferred because they allow
@ -32,7 +32,7 @@
each intrinsic implementation. In general we should use the following list as a each intrinsic implementation. In general we should use the following list as a
guide to these decisions:</para> guide to these decisions:</para>
<orderedlist> <orderedlist spacing="compact">
<listitem> <listitem>
<para>Use C vector arithmetic, logical, dereference, etc., operators in <para>Use C vector arithmetic, logical, dereference, etc., operators in
preference to intrinsics.</para> preference to intrinsics.</para>

@ -26,12 +26,12 @@
with knowledge of PowerISA vector facilities and how to access the associated with knowledge of PowerISA vector facilities and how to access the associated
documentation.</para> documentation.</para>


<itemizedlist> <itemizedlist spacing="compact">
<listitem> <listitem>
<para> <para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Vector-Extensions.html#Vector-Extensions">GCC vector extention</link> <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Vector-Extensions.html#Vector-Extensions"><emphasis role="italic">GCC vector extension</emphasis></link>
syntax and usage. This is one of a set of GCC syntax and usage. This is one of a set of GCC
"<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/C-Extensions.html#C-Extensions">Extentions to the C language Family</link>” "<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/C-Extensions.html#C-Extensions"><emphasis role="italic">Extensions to the C language Family</emphasis></link>”
that the intrinsic header implementation depends that the intrinsic header implementation depends
on.  As many of the GCC intrinsics for x86 are implemented via C vector on.  As many of the GCC intrinsics for x86 are implemented via C vector
extensions, reading and understanding of this code is an important part of the extensions, reading and understanding of this code is an important part of the

@ -25,13 +25,13 @@
<para>So if this is a code porting activity, where is the source? All the <para>So if this is a code porting activity, where is the source? All the
source code we need to look at is in the GCC source trees. You can either git source code we need to look at is in the GCC source trees. You can either git
(<link xlink:href="https://gcc.gnu.org/wiki/GitMirror">https://gcc.gnu.org/wiki/GitMirro</link>) (<link xlink:href="https://gcc.gnu.org/wiki/GitMirror">https://gcc.gnu.org/wiki/GitMirro</link>)
the gcc source  or down load one of the the gcc source  or download one of the
recent AT source tars (for example: recent AT source tars (for example:
<link xlink:href="ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/">ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/</link>). <link xlink:href="ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/">ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/</link>).
 You will find the intrinsic headers in the ./gcc/config/i386/  You will find the intrinsic headers in the ./gcc/config/i386/
sub-directory.</para> sub-directory.</para>


<para>If you have a Intel Linux workstation or laptop with GCC installed, <para>If you have an Intel Linux workstation or laptop with GCC installed,
you already have these headers, if you want to take a look: you already have these headers, if you want to take a look:
<screen><prompt>$ </prompt><userinput>find /usr/lib -name '*mmintrin.h'</userinput> <screen><prompt>$ </prompt><userinput>find /usr/lib -name '*mmintrin.h'</userinput>
/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/wmmintrin.h /usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/wmmintrin.h
@ -44,8 +44,8 @@


<para>But depending on the vintage of the distro, these may not be the <para>But depending on the vintage of the distro, these may not be the
latest versions of the headers. Looking at the header source will tell you a latest versions of the headers. Looking at the header source will tell you a
few things.: The include structure (what other headers are implicitly few things: the include structure (what other headers are implicitly
included). The types that are used at the API. And finally, how the API is included), the types that are used at the API, and finally, how the API is
implemented.</para> implemented.</para>
<para><literallayout>smmintrin.h (SSE4.1) includes tmmintrin,h <para><literallayout>smmintrin.h (SSE4.1) includes tmmintrin,h

@ -48,11 +48,13 @@ _mm_mullo_epi16 (__m128i __A, __m128i __B)
<para>Note this requires a cast for the compiler to generate the correct <para>Note this requires a cast for the compiler to generate the correct
code for the intended operation. The parameters and result are the generic code for the intended operation. The parameters and result are the generic
<literal>__m128i</literal>, which is a vector long long with the type <literal>__m128i</literal>, which is a vector long long with the
<literal>__may_alias__</literal> attribute. But <literal>__may_alias__</literal> attribute. But
operation is a vector multiply low unsigned short (<literal>__v8hu</literal>). So not only do we operation is a vector multiply low on unsigned short elements.
use the cast to drop the <literal>__may_alias__</literal> attribute but we also need to cast to So not only do we use the cast to drop the <literal>__may_alias__</literal>
the correct (vector unsigned short) type for the specified operation.</para> attribute but we also need to cast to
the correct type (<literal>__v8hu</literal> or vector unsigned short)
for the specified operation.</para>


<para>I have successfully copied these (and similar) source snippets over <para>I have successfully copied these (and similar) source snippets over
to the PPC64LE implementation unchanged. This of course assumes the associated to the PPC64LE implementation unchanged. This of course assumes the associated

@ -22,6 +22,25 @@
xml:id="sec_vec_or_not"> xml:id="sec_vec_or_not">
<title>To vec_not or not</title> <title>To vec_not or not</title>
<para>Now lets look at a similar example that adds some surprising
complexity. When we look at the negated compare forms we can not find
exact matches in the PowerISA. But a little knowledge of boolean
algebra can show the way to the equivalent functions.</para>

<para>First the X86 compare not equal case where we might expect to
find the equivalent vec_cmpne builtins for PowerISA:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_pd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpneqpd ((__v2df)__A, (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_sd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpneqsd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para>

<para>Well not exactly. Looking at the OpenPOWER ABI document we see a <para>Well not exactly. Looking at the OpenPOWER ABI document we see a
reference to reference to
<literal>vec_cmpne</literal> for all numeric types. But when we look in the current <literal>vec_cmpne</literal> for all numeric types. But when we look in the current
@ -52,7 +71,7 @@


<para>This is RISC philosophy again. We can always use a logical <para>This is RISC philosophy again. We can always use a logical
instruction (like bit wise <emphasis role="bold">and</emphasis> or instruction (like bit wise <emphasis role="bold">and</emphasis> or
<emphasis role="bold">or</emphasis>) to effect a move given that we also have <emphasis role="bold">or</emphasis>) to effect a move, given that we also have
nondestructive 3 register instruction forms. In the PowerISA most instruction nondestructive 3 register instruction forms. In the PowerISA most instruction
have two input registers and a separate result register. So if the result have two input registers and a separate result register. So if the result
register number is  different from either input register then the inputs are register number is  different from either input register then the inputs are
@ -62,22 +81,21 @@


<para>The statement <literal>B = vec_or (A,A)</literal> is is effectively a vector move/copy <para>The statement <literal>B = vec_or (A,A)</literal> is is effectively a vector move/copy
from <literal>A</literal> to <literal>B</literal>. And <literal>A = vec_or (A,A)</literal> is obviously a from <literal>A</literal> to <literal>B</literal>. And <literal>A = vec_or (A,A)</literal> is obviously a
<emphasis role="bold"><literal>nop</literal></emphasis> (no operation). In the the <emphasis role="bold"><literal>nop</literal></emphasis> (no operation). In fact the
PowerISA defines the preferred <literal>nop</literal> and register move for vector registers in PowerISA defines the preferred <literal>nop</literal> and register move for vector registers
this way.</para> in this way.</para>


<para>It is also useful to have hardware implement the logical operators <para>The PowerISA implements the logical operators
<emphasis role="bold">nor</emphasis> (<emphasis role="bold">not or</emphasis>) <emphasis role="bold">nor</emphasis> (<emphasis role="bold">not or</emphasis>)
and <emphasis role="bold">nand</emphasis> (<emphasis role="bold">not and</emphasis>).   and <emphasis role="bold">nand</emphasis> (<emphasis role="bold">not and</emphasis>).  
The PowerISA provides these instruction for The PowerISA provides these instruction for
fixed point and vector logical operation. So <literal>vec_not(A)</literal> fixed point and vector logical operations. So <literal>vec_not(A)</literal>
can be implemented as <literal>vec_nor(A,A)</literal>. can be implemented as <literal>vec_nor(A,A)</literal>.
So looking at the  implementation of _mm_cmpne we propose the So for the implementation of _mm_cmpne we propose the following:
following:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_pd (__m128d __A, __m128d __B) _mm_cmpneq_pd (__m128d __A, __m128d __B)
{ {
__m128d temp = (__m128d)vec_cmpeq (__A, __B); __v2df temp = (__v2df ) vec_cmpeq ((__v2df) __A, (__v2df)__B);
return ((__m128d)vec_nor (temp, temp)); return ((__m128d)vec_nor (temp, temp));
} }
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
@ -94,15 +112,15 @@ _mm_cmpneq_sd (__m128d __A, __m128d __B)
<para>The Intel Intrinsics also include the not forms of the relational <para>The Intel Intrinsics also include the not forms of the relational
compares: compares:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpnlt_sd (__m128d __A, __m128d __B) _mm_cmpnlt_pd (__m128d __A, __m128d __B)
{ {
return (__m128d)__builtin_ia32_cmpnltsd ((__v2df)__A, (__v2df)__B); return (__m128d)__builtin_ia32_cmpnltpd ((__v2df)__A, (__v2df)__B);
} }


extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpnle_sd (__m128d __A, __m128d __B) _mm_cmpnle_pd (__m128d __A, __m128d __B)
{ {
return (__m128d)__builtin_ia32_cmpnlesd ((__v2df)__A, (__v2df)__B); return (__m128d)__builtin_ia32_cmpnlepd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para> }]]></programlisting></para>


<para>The PowerISA and OpenPOWER ABI, or GCC PowerPC Altivec Built-in <para>The PowerISA and OpenPOWER ABI, or GCC PowerPC Altivec Built-in

Loading…
Cancel
Save