diff --git a/Vector_Intrinsics/app_references.xml b/Vector_Intrinsics/app_references.xml
index 52010af..73707b7 100644
--- a/Vector_Intrinsics/app_references.xml
+++ b/Vector_Intrinsics/app_references.xml
@@ -58,10 +58,10 @@ GCC online documentation
- GCC Manual (GCC 6.3)
+ GCC Manual
- GCC Internals Manual
+ GCC Internals Manual
diff --git a/Vector_Intrinsics/bk_main.xml b/Vector_Intrinsics/bk_main.xml
index ea01818..0542c63 100644
--- a/Vector_Intrinsics/bk_main.xml
+++ b/Vector_Intrinsics/bk_main.xml
@@ -22,7 +22,7 @@ xml:id="bk_main">
Linux on Power Porting Guide
- Vector Intrinsic
+ Vector Intrinsics
@@ -69,11 +69,11 @@
- 2017-07-26
+ 2017-09-14
- Revision 0.1 - initial draft from Steve Munroe
+ Revision 0.2 - updated draft from Steve Munroe
diff --git a/Vector_Intrinsics/ch_howto_start.xml b/Vector_Intrinsics/ch_howto_start.xml
index 27ff744..3c52da1 100644
--- a/Vector_Intrinsics/ch_howto_start.xml
+++ b/Vector_Intrinsics/ch_howto_start.xml
@@ -23,12 +23,14 @@ How do we work this?
The working assumption is to start with the existing GCC headers from
- ./gcc/config/i386/, then convert them to PowerISA and add them to
- ./gcc/config/rs6000/. I assume we will replicate the existing header structure
- and retain the existing header file and intrinsic names. This also allows us to
- reuse existing DejaGNU test cases from ./gcc/testsuite/gcc.target/i386, modify
- them as needed for the POWER target, and them to the
- ./gcc/testsuite/gcc.target/powerpc.
+ ./gcc/config/i386/, then convert them to PowerISA
+ and add them to ./gcc/config/rs6000/.
+ I assume we will replicate the existing header structure
+ and retain the existing header file and intrinsic names.
+ This also allows us to reuse existing DejaGNU test cases from
+ ./gcc/testsuite/gcc.target/i386, modify
+ them as needed for the POWER target, and add them to
+ ./gcc/testsuite/gcc.target/powerpc.
We can be flexible on the sequence that headers/intrinsics and test
cases are ported.  This should be based on customer need and resolving
@@ -42,8 +44,8 @@
 ./gcc/config/i386/ and copy the header comment (including FSF copyright) down
to any vector typedefs used in the API or implementation. Skip the Intel
intrinsic implementation code for now, but add the ending #endif matching the
- headers conditional guard against multiple inclusion. You can add  #include
- <alternative> as needed. For examples:
+ header's conditional guard against multiple inclusion. You can add additional
+ #include's as needed. For example:
The
- GCC
- testsuite uses the DejaGNU  test framework as documented in the
- GNU Compiler Collection (GCC)
- Internals manual. GCC adds its own DejaGNU directives and extensions,
+
+ GCC testsuite
+ uses the DejaGNU  test framework as documented in the
+
+ GNU Compiler Collection (GCC)
+ Internals
+ manual. GCC adds its own DejaGNU directives and extensions,
that are embedded in the testsuite source as comments.  Some are platform
specific and will need to be adjusted for tests that are ported to our
platform. For example
@@ -102,9 +107,9 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));
/* { dg-require-effective-target lp64 } */
/* { dg-require-effective-target p8vector_hw { target powerpc*-*-* } } */]]>
- Repeat this process until you have equivalent implementations for all
- the intrinsics in that header and associated test cases that execute without
- error.
+ Repeat this process until you have equivalent DejaGNU test
+ implementations for all the intrinsics in that header and associated
+ test cases that execute without error.
diff --git a/Vector_Intrinsics/ch_intel_intrinsic_porting.xml b/Vector_Intrinsics/ch_intel_intrinsic_porting.xml
index 7eb1bba..2a461a9 100644
--- a/Vector_Intrinsics/ch_intel_intrinsic_porting.xml
+++ b/Vector_Intrinsics/ch_intel_intrinsic_porting.xml
@@ -27,19 +27,26 @@
applications, and make them (or equivalents) available for the PowerPC64LE
platform. These X86 intrinsics started with the Intel and Microsoft compilers
but were then ported to the GCC compiler. The GCC implementation is a set of
- headers with inline functions. These inline functions provide a implementation
+ headers with inline functions. These inline functions provide an implementation
mapping from the Intel/Microsoft dialect intrinsic names to the corresponding
- GCC Intel built-in's or directly via C language vector extension syntax.
+ GCC Intel built-ins or directly via C language vector extension syntax.
The current proposal is to start with the existing X86 GCC intrinsic
headers and port them (copy and change the source)  to POWER using C language
vector extensions, VMX and VSX built-ins. Another key assumption is that we
- will be able to use many of existing Intel DejaGNU test cases on
+ will be able to use many of the existing Intel DejaGNU test cases in
./gcc/testsuite/gcc.target/i386. This document is intended as a guide to
developers participating in this effort. However this document provides
guidance and examples that should be useful to developers who may encounter X86
intrinsics in code that they are porting to another platform.
+ (We have started contributions of X86 intrinsic headers
+ to the GCC project.) The current status of the project is that the BMI
+ (bmiintrin.h), BMI2 (bmi2intrin.h), MMX (mmintrin.h), and SSE (xmmintrin.h)
+ intrinsic headers are committed to the GCC development trunk for GCC 8.
+ Work on SSE2 (emmintrin.h) is in progress.
+
+
diff --git a/Vector_Intrinsics/sec_crossing_lanes.xml b/Vector_Intrinsics/sec_crossing_lanes.xml
index e73d63f..ff866de 100644
--- a/Vector_Intrinsics/sec_crossing_lanes.xml
+++ b/Vector_Intrinsics/sec_crossing_lanes.xml
@@ -22,24 +22,26 @@ xml:id="sec_crossing_lanes">
Crossing lanes
- We have seen that, most of the time, vector SIMD units prefer to keep
+ Vector SIMD units prefer to keep
computations in the same “lane” (element number) as the input elements. The
- only exception in the examples so far are the occasional splat (copy one
+ only exception in the examples so far is the occasional vector splat (copy one
element to all the other elements of the vector) operation. Splat is an
example of the general category of “permute” operations (Intel would call
- this a “shuffle” or “blend”). Permutes selects and rearrange the
- elements of (usually) a concatenated pair of vectors and delivers those
+ this a “shuffle” or “blend”).
+
+ Permutes select and rearrange the
+ elements of an input vector (or a concatenated pair of vectors) and deliver those
selected elements, in a specific order, to a result vector. The selection and
- order of elements in the result is controlled by a third vector, either as 3rd
- input vector or and immediate field of the instruction.
+ order of elements in the result is controlled by a third operand, either as a 3rd
+ input vector or as an immediate field of the instruction.
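The splat case mentioned above is the simplest permute to show in code. A minimal, hedged sketch using the GCC AltiVec built-ins (the function name here is illustrative, not part of any header):
<![CDATA[
#include <altivec.h>

/* Copy word element 2 of __A to all four word elements of the result.
   vec_splat is the simplest member of the permute category described
   above; the general vec_perm form appears in the _mm_hadd_ps sketch
   later in this chapter.  */
vector float
splat_word2 (vector float __A)
{
  return vec_splat (__A, 2);
}
]]>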
- For example the Intel intrisics for
+ For example, consider the Intel intrinsics for
Horizontal Add / Subtract
- added with SSE3. These instrinsics add (subtract) adjacent element pairs, across pair of
+ added with SSE3. These intrinsics add (subtract) adjacent element pairs across a pair of
input vectors, placing the sum of the adjacent elements in the result vector.
For example _mm_hadd_ps  
- which implments the operation on float:
+ which implements the operation on float:
The PowerISA provides generalized byte-level vector permute (vperm)
- based a vector register pair source as input and a control vector. The control
+ based on a vector register pair (32 bytes) as source input and a (16-byte) control vector.
+ The control
vector provides 16 indexes (0-31) to select bytes from the concatenated input
- vector register pair (VRA, VRB). A more specific set of permutes (pack, unpack,
- merge, splat) operations (across element sizes) are encoded as separate
-  instruction opcodes or instruction immediate fields.
+ vector register pair (VRA, VRB). There are also predefined permute (splat, pack, unpack,
+ merge) operations (across element sizes) that are encoded as separate
+  instruction op-codes or instruction immediate fields.
Unfortunately only the general vec_perm can provide the realignment
- we need the _mm_hadd_ps operation or any of the int, short variants of hadd.
+ we need for the _mm_hadd_ps operation or any of the int, short variants of hadd.
For example:
__X and __Y, and another to
select the odd word elements across __X and __Y.
- The result of these permutes (vec_perm) are inputs to the
- vec_add and completes the add operation.
+ The results of these permutes (vec_perm) are inputs to the
+ vec_add that completes the horizontal add operation.
- Fortunately the permute required for the double (64-bit) case (IE
- _mm_hadd_pd) reduces to the equivalent of vec_mergeh /
- vec_mergel  doubleword
+ Fortunately the permute required for the double (64-bit) case
+ (_mm_hadd_pd) reduces to the equivalent of
+ vec_mergeh / vec_mergel  doubleword
(which are variants of  VSX Permute Doubleword Immediate). So the
implementation of _mm_hadd_pd can be simplified to this:
Profound differences
We have already mentioned above a number of architectural differences
- that effect porting of codes containing Intel intrinsics to POWER. The fact
+ that affect porting of codes containing Intel intrinsics to POWER. The fact
that Intel supports multiple vector extensions with different vector widths
- (64, 128, 256, and 512-bits) while the PowerISA only supports vectors of
- 128-bits is one issue. Another is the difference in how the respective ISAs
- support scalars in vector registers is another.  In the text above we propose
- workable alternatives for the PowerPC port. There also differences in the
+ (64, 128, 256, and 512 bits) while the PowerISA only supports vectors of
+ 128 bits is one issue. Another is the difference in how the respective ISAs
+ support scalars in vector registers.  In the text above we propose
+ workable alternatives for the PowerPC port. There are also differences in the
handling of floating point exceptions and rounding modes that may impact the
application's performance or behavior.
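To make the two-permute-plus-add pattern described above for _mm_hadd_ps concrete, here is a hedged sketch. The my_ prefix and the exact control vectors are illustrative assumptions (little-endian vec_perm byte numbering), not the committed GCC implementation:
<![CDATA[
#include <altivec.h>

typedef vector float my_m128;   /* stand-in for the real __m128 typedef */

/* One permute gathers the even word elements across __X and __Y, a
   second gathers the odd word elements, then vec_add sums the adjacent
   pairs:  {X0+X1, X2+X3, Y0+Y1, Y2+Y3}.  */
extern __inline my_m128
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
my_mm_hadd_ps (my_m128 __X, my_m128 __Y)
{
  const vector unsigned char even = {
      0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0a, 0x0b,
      0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1a, 0x1b };
  const vector unsigned char odd = {
      0x04, 0x05, 0x06, 0x07, 0x0c, 0x0d, 0x0e, 0x0f,
      0x14, 0x15, 0x16, 0x17, 0x1c, 0x1d, 0x1e, 0x1f };
  return vec_add (vec_perm (__X, __Y, even),
                  vec_perm (__X, __Y, odd));
}
]]>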
diff --git a/Vector_Intrinsics/sec_floatingpoint_exceptions.xml b/Vector_Intrinsics/sec_floatingpoint_exceptions.xml
index 2e61aeb..812ae69 100644
--- a/Vector_Intrinsics/sec_floatingpoint_exceptions.xml
+++ b/Vector_Intrinsics/sec_floatingpoint_exceptions.xml
@@ -22,16 +22,16 @@ xml:id="sec_floatingpoint_exceptions">
Floating Point Exceptions
- Nominally both ISAs support the IEEE754 specifications, but there are
- some subtle differences. Both architecture define a status and control register
- to record exceptions and enable / disable floating exceptions for program
+ Nominally both ISAs support the IEEE-754 specifications, but there are
+ some subtle differences. Both architectures define a status and control register
+ to record exceptions and enable / disable floating point exceptions for program
interrupt or default action. Intel has a MXCSR and PowerISA has a FPSCR which
basically do the same thing but with different bit layout.
Intel provides _mm_setcsr / _mm_getcsr
- intrinsics to allow direct
- access to the MXCSR. In the early days before the OS POSIX run-times where
- updated  to manage the MXCSR, this might have been useful. Today this would be
+ intrinsic functions to allow direct access to the MXCSR.
+ This might have been useful in the early days before the OS run-times were
+ updated to manage the MXCSR via the POSIX APIs. Today this would be
highly discouraged with a strong preference to use the POSIX APIs
(feclearexcept,
fegetexceptflag,
@@ -44,29 +44,29 @@
might be simpler just to replace these intrinsics with macros that generate
#error.
- The Intel MXCSR does have some none (POSIX/IEEE754) standard quirks;
- Flush-To-Zero and Denormals-Are-Zeros flags. This simplifies the hardware
+ The Intel MXCSR does have some non-standard (beyond POSIX/IEEE-754) quirks:
+ the Flush-To-Zero and Denormals-Are-Zeros flags. This simplifies the hardware
response to what should be a rare condition (underflows where the result cannot
be represented in the exponent range and precision of the format) by simply
returning a signed 0.0 value. The intrinsic header implementation does provide
constant masks for _MM_DENORMALS_ZERO_ON
(<pmmintrin.h>) and
- _MM_FLUSH_ZERO_ON (<xmmintrin.h>,
+ _MM_FLUSH_ZERO_ON (<xmmintrin.h>),
so technically it is available to users of the Intel Intrinsics API.
The VMX Vector facility provides a separate Vector Status and Control
register (VSCR) with a Non-Java Mode control bit. This control combines the
- flush-to-zero semantics for floating Point underflow and denormal values. But
+ flush-to-zero semantics for floating point underflow and denormal values. But
this control only applies to VMX vector float instructions and does not apply
to VSX scalar floating point or vector double instructions. The FPSCR does
define a Floating-Point non-IEEE mode which is optional in the architecture.
- This would apply to Scalar and VSX floating-point operations if it was
+ This would apply to Scalar and VSX floating-point operations if it were
implemented. This was largely intended for embedded processors and is not
implemented in the POWER processor line.
- As the flush-to-zero is primarily a performance enhansement and is
- clearly outside the IEEE754 standard, it may be best to simply ignore this
+ As the flush-to-zero is primarily a performance enhancement and is
+ clearly outside the IEEE-754 standard, it may be best to simply ignore this
option for the intrinsic port.
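As a concrete illustration of the POSIX preference above, a minimal sketch using the standard C99 <fenv.h> interfaces, which work identically on x86_64 and POWER:
<![CDATA[
#include <fenv.h>

/* Clear any pending overflow exception, run a computation, then test
   whether the computation raised overflow.  No direct MXCSR or FPSCR
   access is needed; the run-time maps these onto the hardware register. */
int
computes_overflow (double x, double y)
{
  feclearexcept (FE_OVERFLOW);
  volatile double r = x * y;   /* volatile keeps the multiply live */
  (void) r;
  return fetestexcept (FE_OVERFLOW) != 0;
}
]]>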
diff --git a/Vector_Intrinsics/sec_floatingpoint_rounding.xml b/Vector_Intrinsics/sec_floatingpoint_rounding.xml
index 19653a7..93591d1 100644
--- a/Vector_Intrinsics/sec_floatingpoint_rounding.xml
+++ b/Vector_Intrinsics/sec_floatingpoint_rounding.xml
@@ -23,7 +23,7 @@
Floating-point rounding modes
The Intel (x86 / x86_64) and PowerISA architectures both support the
- 4 IEEE754 rounding modes. Again while the Intel Intrinsic API allows the
+ 4 IEEE-754 rounding modes. Again while the Intel Intrinsic API allows the
application to change rounding modes via updates to the MXCSR it is a bad idea
and should be replaced with the POSIX APIs
(fegetround and
diff --git a/Vector_Intrinsics/sec_gcc_vector_extensions.xml b/Vector_Intrinsics/sec_gcc_vector_extensions.xml
index d78bca2..697560c 100644
--- a/Vector_Intrinsics/sec_gcc_vector_extensions.xml
+++ b/Vector_Intrinsics/sec_gcc_vector_extensions.xml
@@ -23,7 +23,7 @@
GCC Vector Extensions
The GCC vector extensions are common syntax but implemented in a
- target specific way. Using the C vector extensions require the
+ target specific way. Using the C vector extensions requires the
__gnu_inline__ attribute to avoid
syntax errors in case the user specified  C standard compliance
(-std=c90,
-std=c11,
@@ -78,10 +78,18 @@ _mm_store_ss (float *__P, __m128 __A)
The code generation is complicated by the fact that PowerISA vector
registers are Big Endian (element 0 is the left most word of the vector) and
- X86 scalar stores are from the left most (work/dword) for the vector register.
+ scalar loads / stores are also to / from the left most word / dword.
+ X86 scalar loads / stores are to / from the right most element for the
+ XMM vector register.
+ The PowerPC64 ELF V2 ABI mimics the X86 Little Endian behavior by placing
+ logical element [0] in the right most element of the vector register.
+
+ This may require the compiler to generate additional instructions
+ to place the scalar value in the expected position.
Application code with extensive use of scalar (vs packed) intrinsic loads /
- stores should be flagged for rewrite to native PPC code using exisiing scalar
- types (float, double, int, long, etc.).
+ stores should be flagged for rewrite to C code using existing scalar
+ types (float, double, int, long, etc.). The compiler may be able to
+ vectorize this scalar code using the native vector SIMD instruction set.
Another example is the set
reverse order:
Dealing with AVX and AVX512
AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have
- lots (64) of vector registers and a super scalar vector pipe-line (can execute
+ lots (64) of vector registers and a superscalar vector pipeline (can execute
two or more independent 128-bit vector operations concurrently). Second the ELF
V2 ABI was designed to pass and return larger aggregates in vector
registers:
@@ -35,7 +35,7 @@
A qualified vector argument corresponds to:
-
+
A vector data type
@@ -58,7 +58,7 @@
So the ABI allows for passing up to three structures each
- representing 512-bit vectors and returning such (512-bit) structure all in VMX
+ representing 512-bit vectors and returning such (512-bit) structures all in VMX
registers. This can be extended further by spilling parameters (beyond 12 X
128-bit vectors) to the parameter save area, but we should not need that, as
most intrinsics only use 2 or 3 operands. Vector registers not needed for
@@ -79,7 +79,7 @@
__vector_size__ (32) or (64) in the PowerPC implementation of
__m256 and __m512
types.
Instead we will typedef structs of 2 or 4 vector (__m128) fields. This
- allows efficient handling of these larger data types without require new GCC
+ allows efficient handling of these larger data types without requiring new GCC
language extensions.
In the end we should use the same type names and definitions as the
diff --git a/Vector_Intrinsics/sec_handling_mmx.xml b/Vector_Intrinsics/sec_handling_mmx.xml
index dc21a90..8b31d02 100644
--- a/Vector_Intrinsics/sec_handling_mmx.xml
+++ b/Vector_Intrinsics/sec_handling_mmx.xml
@@ -22,7 +22,7 @@ xml:id="sec_handling_mmx">
Dealing with MMX
- MMX is actually the hard case. The __m64
+ MMX is actually the harder case. The __m64
type supports SIMD vector int types (char, short, int, long).  The  Intel API
defines  
__m64 as:
@@ -32,23 +32,23 @@
GCC) and we would prefer to use a native PowerISA type that can be passed in a
single register.  The PowerISA Rotate Under Mask instructions can easily
extract and insert integer fields of a General Purpose Register (GPR). This
- implies that MMX integer types can be handled as a internal union of arrays for
- the supported element types. So an 64-bit unsigned long long is the best type
- for parameter passing and return values. Especially for the 64-bit (_si64)
+ implies that MMX integer types can be handled as an internal union of arrays for
+ the supported element types. So a 64-bit unsigned long long is the best type
+ for parameter passing and return values, especially for the 64-bit (_si64)
operations as these normally generate a single PowerISA instruction.
- The SSE extensions include some convert operations for
+ The SSE extensions include some copy / convert operations for
_m128 to / from _m64 and this includes some int to / from float conversions.
However in these cases the float operands always reside in SSE (XMM) registers
(which match the PowerISA vector registers) and the MMX registers only contain
integer values. POWER8 (PowerISA-2.07) has direct move instructions between
GPRs and VSRs. So these transfers are normally a single instruction and any conversions
- can be handed in the vector unit.
+ can be handled in the vector unit.
When transferring a __m64 value to a vector register we should also
execute an xxsplatd instruction to ensure there is valid data in all four
- element lanes before doing floating point operations. This avoids generating
+ float element lanes before doing floating point operations. This avoids causing
extraneous floating point exceptions that might be generated by uninitialized
parts of the vector. The top two lanes will have the floating point results
that are in position for direct transfer to a GPR or stored via Store Float
@@ -57,7 +57,8 @@ form.
Also for the smaller element sizes and higher element counts (MMX
- _pi8 and _p16 types) the number of  Rotate Under Mask instructions required to
+ _pi8 and _pi16 types)
+ the number of  Rotate Under Mask instructions required to
disassemble the 64-bit __m64 into elements, perform the element calculations,
and reassemble the elements in a single __m64
diff --git a/Vector_Intrinsics/sec_how_findout.xml b/Vector_Intrinsics/sec_how_findout.xml
index 4c41752..4572a38 100644
--- a/Vector_Intrinsics/sec_how_findout.xml
+++ b/Vector_Intrinsics/sec_how_findout.xml
@@ -52,9 +52,9 @@
document, Chapter 6. Vector Programming Interfaces and Appendix A. Predefined
Functions for Vector Programming.
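Pulling together the __m64 and __m256 / __m512 discussions above, the type mappings might look like the following sketch (the my_ names are hypothetical stand-ins, not the committed GCC typedefs):
<![CDATA[
#include <altivec.h>

typedef unsigned long long my_m64;  /* __m64 carried in a single GPR */

typedef vector float my_m128;       /* __m128 in a single VMX/VSX register */

typedef struct
{
  my_m128 vf[2];                    /* __m256 as a pair of 128-bit vectors */
} my_m256;

typedef struct
{
  my_m128 vf[4];                    /* __m512 as four 128-bit vectors */
} my_m512;
]]>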
- Another useful document is the original Altivec Technology Programers Interface Manual
- with a  user friendly structure and many helpful diagrams. But alas the PIM does does not
- cover the resent PowerISA (power7,  power8, and power9) enhancements.
+ Another useful document is the original Altivec Technology Programmers Interface Manual
+ with a user friendly structure and many helpful diagrams. But alas the PIM does not
+ cover the recent PowerISA (power7,  power8, and power9) enhancements.
diff --git a/Vector_Intrinsics/sec_intel_intrinsic_functions.xml b/Vector_Intrinsics/sec_intel_intrinsic_functions.xml
index e83513b..1539acc 100644
--- a/Vector_Intrinsics/sec_intel_intrinsic_functions.xml
+++ b/Vector_Intrinsics/sec_intel_intrinsic_functions.xml
@@ -53,8 +53,8 @@
are a lot of scalar operations on a single float, double, or long long type. In
effect these are scalars that can take advantage of the larger (xmm) register
space. Also in the Intel 32-bit architecture they provided IEEE754 float and
- double types, and 64-bit integers that did not exist or where hard to implement
- in the base i386/387 instruction set. These scalar operation use a suffix
+ double types, and 64-bit integers that did not exist or were hard to implement
+ in the base i386/387 instruction set. These scalar operations use a suffix
starting with '_s' (_sd for scalar double float, _ss scalar float, and _si64
for scalar long long).
@@ -72,7 +72,7 @@
of 4 32-bit integers.
The GCC  builtins for the
- i386.target,
+ i386.target (includes x86 and x86_64)
are not the same as the Intel Intrinsics. While they have similar intent and
cover most of the same functions, they use a different naming (prefixed with
@@ -83,10 +83,10 @@ v4hi __builtin_ia32_paddw (v4hi, v4hi)
v2si __builtin_ia32_paddd (v2si, v2si)
v2di __builtin_ia32_paddq (v2di, v2di)]]>
- Note: A key difference between GCC builtins for i386 and Powerpc is
- that the x86 builtins have different names of each operation and type while the
- powerpc altivec builtins tend to have a single generatic builtin for  each
- operation, across a set of compatible operand types.
+ A key difference between GCC built-ins for i386 and PowerPC is
+ that the x86 built-ins have different names for each operation and type while the
+ PowerPC altivec built-ins tend to have a single generic built-in for  each
+ operation, across a set of compatible operand types.
In GCC the Intel Intrinsic header (*intrin.h) files are implemented
as a set of inline functions using the Intel Intrinsic API names and types.
@@ -106,7 +106,7 @@ _mm_add_sd (__m128d __A, __m128d __B)
}]]>
Note that the  
- _mm_add_pd is implemented direct as C vector
+ _mm_add_pd is implemented directly as GCC C vector
extension code, while _mm_add_sd is
implemented via the GCC builtin __builtin_ia32_addsd. From the
diff --git a/Vector_Intrinsics/sec_intel_intrinsic_includes.xml b/Vector_Intrinsics/sec_intel_intrinsic_includes.xml
index 275cffb..4bf437c 100644
--- a/Vector_Intrinsics/sec_intel_intrinsic_includes.xml
+++ b/Vector_Intrinsics/sec_intel_intrinsic_includes.xml
@@ -23,10 +23,10 @@
The structure of the intrinsic includes
The GCC x86 intrinsic functions for vector were initially grouped by
- technology (MMX and SSE), which starts with MMX continues with SSE through
+ technology (MMX and SSE), which starts with MMX and continues with SSE through
SSE4.1, stacked like a set of Russian dolls.
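A hedged illustration of that nesting, based on how the GCC x86 headers chain their includes (the exact chain can vary between GCC releases):
<![CDATA[
/* Including the SSE2 header transitively pulls in the older layers: */
#include <emmintrin.h>        /* SSE2, which itself does:            */
/*   #include <xmmintrin.h>      SSE                                 */
/*     #include <mmintrin.h>       MMX types (__m64)                 */
/*     #include <mm_malloc.h>      _mm_malloc / _mm_free wrappers    */
]]>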
- Basically each higher layer include, needs typedefs and helper macros
+ Basically each higher layer include needs typedefs and helper macros
defined by the lower level intrinsic includes.
mm_malloc.h simply provides wrappers for posix_memalign and free.
Then it gets a little weird, starting with the crypto extensions:
@@ -34,8 +34,8 @@
For AVX, AVX2, and AVX512 they must have decided
- that the Russian Dolls thing was getting out of hand. AVX et all is split
- across 14 files
+ that the Russian Dolls thing was getting out of hand. AVX et al. is split
+ across 14 files:
#include
@@ -53,25 +53,25 @@
#include
#include ]]>
- but they do not want the applications include these
+ but they do not want the applications to include these
individually.
- So immintrin.h includes everything Intel vector, include all the
- AVX, AES, SSE and MMX flavors.
+ So immintrin.h includes everything Intel vector, including all the
+ AVX, AES, SSE, and MMX flavors.
directly; include instead." #endif]]>
- So what is the net? The include structure provides some strong clues
+ So why is this interesting? The include structure provides some strong clues
about the order that we should approach this effort.  For example if you need
- to intrinsic from SSE4 (smmintrin.h) we are likely to need to type definitions
+ to use intrinsics from SSE4 (smmintrin.h) you are likely to need type definitions
from SSE (emmintrin.h). So a bottoms up (MMX, SSE, SSE2, …) approach seems
- like the best plan of attack. Also saving the AVX parts for latter make sense,
- as most are just wider forms of operations that already exists in SSE.
+ like the best plan of attack. Also saving the AVX parts for later makes sense,
+ as most are just wider forms of operations that already exist in SSE.
We should use the same include structure to implement our PowerISA
equivalent API headers. This will make porting easier (drop-in replacement) and
- should get the application running quickly on POWER. Then we are in a position
+ should get the application running quickly on POWER. Then we will be in a position
to profile and analyze the resulting application. This will show any hot spots
where the simple one-to-one transformation results in bottlenecks and
additional tuning is needed. For these cases we should improve our tools (SDK
diff --git a/Vector_Intrinsics/sec_intel_intrinsic_types.xml b/Vector_Intrinsics/sec_intel_intrinsic_types.xml
index 23435d1..f2e43f9 100644
--- a/Vector_Intrinsics/sec_intel_intrinsic_types.xml
+++ b/Vector_Intrinsics/sec_intel_intrinsic_types.xml
@@ -42,28 +42,35 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]>
in which pointers to one vector type are permitted to alias pointers to a
different vector type.
- So there are a
- couple of issues here: 1)  the API seem to force the compiler to assume
- aliasing of any parameter passed by reference. Normally the compiler assumes
- that parameters of different size do not overlap in storage, which allows more
- optimization. 2) the data type used at the interface may not be the correct
- type for the implied operation. So parameters of type
- __m128i (which is defined
- as vector long long) is also used for parameters and return values of vector
- [char | short | int ].
+ There are a couple of issues here:
+
+
+ The API seems to force the compiler to assume
+ aliasing of any parameter passed by reference.
+
+
+ The data type used at the interface may not be
+ the correct type for the implied operation.
+ + + Normally the compiler assumes that parameters of different size do + not overlap in storage, which allows more optimization. + However parameters for different vector element sizes + [char | short | int | long] are all passed and returned as type __m128i + (defined as vector long long). - This may not matter when using x86 built-in's but does matter when - the implementation uses C vector extensions or in our case use PowerPC generic + This may not matter when using x86 built-ins but does matter when + the implementation uses C vector extensions or in our case uses PowerPC generic vector built-ins (). - For the later cases the type must be correct for + For the latter cases the type must be correct for the compiler to generate the correct type (char, short, int, long) () for the generic builtin operation. There is also concern that excessive use of __may_alias__ will limit compiler optimization. We are not sure how important this attribute is to the correct operation of the API.  So at a later stage we should - experiment with removing it from our implementation for PowerPC + experiment with removing it from our implementation for PowerPC. The good news is that PowerISA has good support for 128-bit vectors and (with the addition of VSX) all the required vector data (char, short, int, @@ -76,11 +83,16 @@ typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]> implemented as vector attribute extensions of the appropriate  size (   __vector_size__ ({8 | 16 | 32, and 64}). For the PowerPC target  GCC currently only supports the native __vector_size__ ( 16 ). These we can support directly - in VMX/VSX registers and associated instructions. The GCC will compile with + in VMX/VSX registers and associated instructions. GCC will compile code with other   __vector_size__ values, but the resulting types are treated as simple arrays of the element type. This does not allow the compiler to use the vector - registers and vector instructions for these (nonnative) vectors.   So what is - a programmer to do? + registers and vector instructions for these (nonnative) vectors. + + So the PowerISA VMX/VSX facilities and GCC compiler support for + 128-bit/16-byte vectors and associated vector built-ins + are well matched to implementing equivalent X86 SSE intrinsic functions. + However implementing the older MMX (64-bit) and the latest + AVX (256 / 512-bit) extensions requires more thought and some ingenuity. diff --git a/Vector_Intrinsics/sec_more_examples.xml b/Vector_Intrinsics/sec_more_examples.xml index 8e4232e..839ea2c 100644 --- a/Vector_Intrinsics/sec_more_examples.xml +++ b/Vector_Intrinsics/sec_more_examples.xml @@ -27,23 +27,23 @@ converts a packed vector double into a packed vector single float. Since only 2 doubles fit into a 128-bit vector only 2 floats are returned and occupy only half (64-bits) of the XMM register. - For this intrinsic the 64-bit are packed into the logical left half of the - registers and the logical right half of the register is set to zero (as per the + For this intrinsic the 64 bits are packed into the logical left half of the result + register and the logical right half of the register is set to zero (as per the Intel cvtpd2ps instruction). The PowerISA provides the VSX Vector round and Convert Double-Precision to Single-Precision format (xvcvdpsp) instruction. In the ABI - this is vec_floato (vector double) .   
- This instruction convert each double - element then transfers converted element 0 to float element 1, and converted + this is vec_floato (vector double).   + This instruction converts each double + element, then transfers converted element 0 to float element 1, and converted element 1 to float element 3. Float elements 0 and 2 are undefined (the - hardware can do what ever). This does not match the expected results for + hardware can do whatever). This does not match the expected results for _mm_cvtpd_ps. , 1.0, , 2.0} _mm_cvtpd_ps ({1.0, 2.0}) result = {1.0, 2.0, 0.0, 0.0}]]> So we need to re-position the results to word elements 0 and 2, which - allows a pack operation to deliver the correct format. Here the merge odd + allows a pack operation to deliver the correct format. Here the merge-odd splats element 1 to 0 and element 3 to 2. The Pack operation combines the low half of each doubleword from the vector result and vector of zeros to generate the require format. @@ -53,8 +53,6 @@ _mm_cvtpd_ps (__m128d __A) __v4sf result; __v4si temp; const __v4si vzero = {0,0,0,0}; -// temp = (__v4si)vec_floato (__A); /* GCC 8 */ - __asm__( "xvcvdpsp %x0,%x1;\n" : "=wa" (temp) @@ -66,9 +64,9 @@ _mm_cvtpd_ps (__m128d __A) return (result); }]]> - This  technique is also used to implement   + This technique is also used to implement   _mm_cvttpd_epi32 - which converts a packed vector double in to a packed vector int. The PowerISA instruction + which converts a packed vector double into a packed vector int. The PowerISA instruction xvcvdpsxws uses a similar layout for the result as xvcvdpsp and requires the same fix up. diff --git a/Vector_Intrinsics/sec_other_intrinsic_examples.xml b/Vector_Intrinsics/sec_other_intrinsic_examples.xml index 37b2fae..338e11b 100644 --- a/Vector_Intrinsics/sec_other_intrinsic_examples.xml +++ b/Vector_Intrinsics/sec_other_intrinsic_examples.xml @@ -54,15 +54,16 @@ _mm_load_sd (double const *__P) to combine __F and 0.0).  Other examples may generate sub-optimal code and justify a rewrite to PowerISA scalar or vector code (GCC PowerPC - AltiVec Built-in Functions or inline assembler). + in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions"> + GCC PowerPC AltiVec Built-in Functions + or inline assembler). - Net: try using the existing C code if you can, but check on what the + Try using the existing C code if you can, but check on what the compiler generates.  If the generated code is horrendous, it may be worth the effort to write a PowerISA specific equivalent. For codes making extensive use of MMX or SSE scalar intrinsics you will be better off rewriting to use - standard C scalar types and letting the the GCC compiler handle the details - (see ). + standard C scalar types and letting the GCC compiler handle the details + (see ). diff --git a/Vector_Intrinsics/sec_packed_vs_scalar_intrinsics.xml b/Vector_Intrinsics/sec_packed_vs_scalar_intrinsics.xml index e798816..a75aa7e 100644 --- a/Vector_Intrinsics/sec_packed_vs_scalar_intrinsics.xml +++ b/Vector_Intrinsics/sec_packed_vs_scalar_intrinsics.xml @@ -23,8 +23,8 @@ Packed vs scalar intrinsics So what is actually going on here? The vector code is clear enough if - you know that '+' operator is applied to each vector element. The the intent of - the builtin is a little less clear, as the GCC documentation for + you know that the '+' operator is applied to each vector element. The intent of + the X86 built-in is a little less clear, as the GCC documentation for __builtin_ia32_addsd is not very helpful (nonexistent). 
So perhaps the Intel Intrinsic Guide
@@ -54,7 +54,7 @@
The vector bit and field numbering is different (reversed).
-
+
For Intel the scalar is always placed in the low order (right most)
bits of the XMM register (and the low order address for load and store).
@@ -62,13 +62,13 @@
For PowerISA and VSX, scalar floating point operations and Floating
- Point Registers (FPRs) are on the low numbered bits which is the left hand
+ Point Registers (FPRs) are in the low numbered bits which is the left hand
side of the vector / scalar register (VSR).
- For the PowerPC64 ELF V2 little endian ABI we also make point of
- making the GCC vector extensions and vector built ins, appear to be little
+ For the PowerPC64 ELF V2 little endian ABI we also make a point of
+ making the GCC vector extensions and vector built-ins appear to be little
endian. So vector element 0 corresponds to the low order address and low
order (right hand) bits of the vector register (VSR).
@@ -77,7 +77,7 @@
The handling of the non-scalar part of the register for scalar
operations are different.
-
+
For Intel ISA the scalar operations either leaves the high order part
of the XMM vector unchanged or in some cases force it to 0.0.
@@ -94,7 +94,7 @@
To minimize confusion and use consistent nomenclature, I will try to
use the terms logical left and logical right elements based on the order they
appear in a C vector initializer and element index order. So in the vector
- (__v2df){1.0, 20.}, The value 1.0 is the in the logical left element [0] and
+ (__v2df){1.0, 2.0}, the value 1.0 is in the logical left element [0] and
the value 2.0 is logical right element [1].
So let's look at how to implement these intrinsics for the PowerISA.
@@ -119,7 +119,7 @@ _mm_add_sd (__m128d __A, __m128d __B)
compiler generates the following code for the PPC64LE target:
The packed vector double generated the corresponding VSX vector
- double add (xvadddp). But the scalar implementation is bit more complicated.
+ double add (xvadddp). But the scalar implementation is a bit more complicated.
: 720: 07 1b 42 f0 xvadddp vs34,vs34,vs35 ...
@@ -149,7 +149,7 @@ _mm_add_sd (__m128d __A, __m128d __B)
element (copied to itself). Fun fact: The vector registers in PowerISA
are decidedly Big Endian. But we decided to make the PPC64LE ABI behave like a
Little Endian system to make application
- porting easier. This require the compiler to manipulate the PowerISA vector
+ porting easier. This requires the compiler to manipulate the PowerISA vector
intrinsics behind the scenes to get the correct Little Endian results. For
example the element selector [0|1] for vec_splat and the
generation of vec_mergeh vs vec_mergel
@@ -161,9 +161,9 @@ _mm_add_sd (__m128d __A, __m128d __B)
opportunity to optimize the whole function.
Now we can look at a slightly more interesting (complicated) case.
- Square root (sqrt) is not a arithmetic operator in C and is usually handled
+ Square root (sqrt) is not an arithmetic operator in C and is usually handled
with a library call or a compiler builtin. We really want to avoid a library
- calls and want to avoid any unexpected side effects. As you see below the
+ call and want to avoid any unexpected side effects. As you see below the
implementation of _mm_sqrt_pd and
_mm_sqrt_sd
@@ -197,13 +197,14 @@ _mm_sqrt_sd (__m128d __A, __m128d __B)
external library dependency for what should be only a few inline instructions.
So this is not a good option.
- Thinking outside the box; we do have an inline intrinsic for a
- (packed) vector double sqrt, that we just implemented. However we need to
+ Thinking outside the box: we do have an inline intrinsic for a
+ (packed) vector double sqrt that we just implemented. However we need to
ensure the other half of __B (__B[1])
- does not cause an harmful side effects
+ does not cause any harmful side effects
(like raising exceptions for NAN or  negative values). The simplest solution
- is to splat __B[0] to both halves of a temporary value before taking the
- vec_sqrt. Then this result can be combined with __A[1] to return the final
+ is to vector splat __B[0] to both halves of a temporary
+ value before taking the vec_sqrt.
+ Then this result can be combined with __A[1] to return the final
result. For example:
{c[0], __A[1]} initializer instead of _mm_setr_pd.
- Now we can look at vector and scalar compares that add there own
- complication: For example, the Intel Intrinsic Guide for
+ Now we can look at vector and scalar compares that add their own
+ complications. For example, the Intel Intrinsic Guide for
_mm_cmpeq_pd describes comparing double
elements [0|1] and returning either 0s for not equal and 1s (0xFFFFFFFFFFFFFFFF
@@ -242,9 +243,9 @@ _mm_sqrt_sd (__m128d __A, __m128d __B)
the final vector result.
The packed vector implementation for PowerISA is simple as VSX
- provides the equivalent instruction and GCC provides the
- vec_cmpeq builtin
- supporting the vector double type. The technique of using scalar comparison
+ provides the equivalent instruction and GCC provides the builtin
+ vec_cmpeq supporting the vector double type.
+ However the technique of using scalar comparison
operators on the __A[0] and __B[0] does not work as the C comparison operators
return 0 or 1 results while we need the vector select mask (effectively 0 or
@@ -253,10 +254,10 @@ _mm_sqrt_sd (__m128d __A, __m128d __B)
banks.
In this case we are better off using explicit vector built-ins for
- _mm_add_sd as and example. We can use vec_splat
- from element [0] to temporaries
- where we can safely use vec_cmpeq to generate the expect selector mask. Note
- that the vec_cmpeq returns a bool long type so we need the cast the result back
+ _mm_add_sd and _mm_sqrt_sd as examples.
+ We can use vec_splat from element [0] to temporaries
+ where we can safely use vec_cmpeq to generate the expected selector mask. Note
+ that the vec_cmpeq returns a bool long type so we need to cast the result back
to __v2df. Then use the (__m128d){c[0], __A[1]}
initializer to combine the comparison result with the original __A[1] input
and cast to the require
@@ -283,20 +284,6 @@ _mm_cmpeq_sd(__m128d __A, __m128d __B)
return ((__m128d){c[0], __A[1]});
}]]>
- Now lets look at a similar example that adds some surprising
- complexity. This is the compare not equal case so we should be able to find the
- equivalent vec_cmpne builtin:
-
diff --git a/Vector_Intrinsics/sec_performance.xml b/Vector_Intrinsics/sec_performance.xml
index 5230ae2..4a86946 100644
--- a/Vector_Intrinsics/sec_performance.xml
+++ b/Vector_Intrinsics/sec_performance.xml
@@ -31,13 +31,13 @@
). This requires
additional PowerISA instructions to preserve the non-scalar portion of the
vector registers. This may or may not
- be important to the logic of the program being ported, but we have handle the
+ be important to the logic of the program being ported, but we have to handle the
case where it is.
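Returning to the _mm_sqrt_sd discussion above, the splat-and-combine approach might look like this hedged sketch (the my_ names are illustrative; the committed GCC implementation may differ):
<![CDATA[
#include <altivec.h>

typedef vector double my_m128d;   /* stand-in for the real __m128d */

extern __inline my_m128d
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
my_mm_sqrt_sd (my_m128d __A, my_m128d __B)
{
  /* Splat __B[0] to both lanes so vec_sqrt sees only safe values;
     this avoids spurious exceptions from the unused lane __B[1].  */
  my_m128d c = vec_sqrt (vec_splats (__B[0]));
  /* Combine the scalar sqrt result with the untouched __A[1].  */
  return (my_m128d) { c[0], __A[1] };
}
]]>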
- This is where the context of now the intrinsic is used starts to
+ This is where the context of how the intrinsic is used starts to
matter. If the scalar intrinsics are used within a larger program the compiler
may be able to eliminate the redundant register moves as the results are never
- used. In the other cases common set up (like permute vectors or bit masks) can
+ used. In other cases common set up (like permute vectors or bit masks) can
be common-ed up and hoisted out of the loop. So it is very important to let the
compiler do its job with higher optimization levels (-O3,
-funroll-loops).
diff --git a/Vector_Intrinsics/sec_performance_mmx.xml b/Vector_Intrinsics/sec_performance_mmx.xml
index a4e59ad..c52f8a0 100644
--- a/Vector_Intrinsics/sec_performance_mmx.xml
+++ b/Vector_Intrinsics/sec_performance_mmx.xml
@@ -23,17 +23,17 @@
Using MMX intrinsics
MMX was the first and oldest SIMD extension and initially filled a
- need for wider (64-bit) integer and additional register. This is back when
+ need for wider (64-bit) integers and additional registers. This is back when
processors were 32-bit and 8 x 32-bit registers was starting to cramp our
programming style. Now 64-bit processors, larger register sets, and 128-bit (or
- larger) vector SIMD extensions are common. There is simply no good reasons
+ larger) vector SIMD extensions are common. There is simply no good reason
to write new code using the (now) very limited MMX capabilities.
We recommend that existing MMX codes be rewritten to use the newer
- SSE  and VMX/VSX intrinsics or using the more portable GCC  builtin vector
+ SSE and VMX/VSX intrinsics or using the more portable GCC  builtin vector
support or in the case of si64 operations use C scalar code. The MMX si64
- scalars which are just (64-bit) operations on long long int types and any
- modern C compiler can handle this type. The char short in SIMD operations
+ scalars are just (64-bit) operations on long long int types and any
+ modern C compiler can handle this type. The char / short int SIMD operations
should all be promoted to 128-bit SIMD operations on GCC builtin vectors. Both
will improve cross platform portability and performance.
diff --git a/Vector_Intrinsics/sec_performance_sse.xml b/Vector_Intrinsics/sec_performance_sse.xml
index 1b8379f..4ccf042 100644
--- a/Vector_Intrinsics/sec_performance_sse.xml
+++ b/Vector_Intrinsics/sec_performance_sse.xml
@@ -22,23 +22,23 @@ xml:id="sec_performance_sse">
Using SSE float and double scalars
- SSE scalar float / double intrinsics  “hand” optimization is no
+ For SSE scalar float / double intrinsics,  “hand” optimization is no
longer necessary. This was important when SSE was initially introduced, and
compiler support was limited or nonexistent.  Also SSE scalar float / double
- provided additional (16) registers and IEEE754 compliance, not available from
+ provided additional (16) registers and IEEE-754 compliance, not available from
the 8087 floating point architecture that preceded it. So application
- developers where motivated to use SSE instruction versus what the compiler was
+ developers were motivated to use SSE instructions versus what the compiler was
generating at the time.
- Modern compilers can now to generate and  optimize these (SSE
+ Modern compilers can now generate and optimize these (SSE
scalar) instructions for Intel from C standard scalar code. Of course PowerISA
- supported IEEE754 float and double and had 32 dedicated floating point
- registers from the start (and now 64 with VSX).
So replacing a Intel specific
+ supported IEEE-754 float and double and had 32 dedicated floating point
+ registers from the start (and now 64 with VSX). So replacing an Intel specific
scalar intrinsic implementation with the equivalent C language scalar
- implementation is usually a win; allows the compiler to apply the latest
+ implementation is usually a win; it allows the compiler to apply the latest
optimization and tuning for the latest generation processor, and is portable to
other platforms where the compiler can also apply the latest optimization and
- tuning for that processors latest generation.
+ tuning for that processor's latest generation.
diff --git a/Vector_Intrinsics/sec_power_vector_permute_format.xml b/Vector_Intrinsics/sec_power_vector_permute_format.xml
index e626d63..4a9ef82 100644
--- a/Vector_Intrinsics/sec_power_vector_permute_format.xml
+++ b/Vector_Intrinsics/sec_power_vector_permute_format.xml
@@ -23,10 +23,13 @@
Vector permute and formatting instructions
The vector Permute and formatting chapter follows and is an important
- one to study. These operation operation on the byte, halfword, word (and with
- 2.07 doubleword) integer types . Plus special Pixel type. The shifts
+ one to study. These operate on the byte, halfword, word (and with
+ PowerISA 2.07 doubleword) integer types,
+ plus the special pixel type.
+
+ The shift
instructions in this chapter operate on the vector as a whole at either the bit
- or the byte (octet) level, This is an important chapter to study for moving
+ or the byte (octet) level. This is an important chapter to study for moving
PowerISA vector results into the vector elements that Intel Intrinsics
expect:
@@ -41,12 +44,13 @@
The Vector Integer instructions include the add / subtract / Multiply
/ Multiply Add/Sum / (no divide) operations for the standard integer types.
There are instruction forms that  provide signed, unsigned, modulo, and
- saturate results for most operations. The PowerISA 2.07 extension add /
- subtract of 128-bit integers with carry and extend to 256, 512-bit and beyond ,
- is included here. There are signed / unsigned compares across the standard
- integer types (byte, .. doubleword). The usual and bit-wise logical operations.
- And the SIMD shift / rotate instructions that operate on the vector elements
- for various types.
+ saturate results for most operations. PowerISA 2.07 extends vector integer
+ operations to add / subtract quadword (128-bit) integers with carry and extend.
+ This supports extended binary integer arithmetic to 256-bit, 512-bit, and beyond.
+ There are signed / unsigned compares across the standard
+ integer types (byte, .. doubleword); the usual bit-wise logical operations;
+ and the SIMD shift / rotate instructions that operate on the vector elements
+ for various integer types.
6.9 Vector Integer Instructions . . . . . . . . . . . . . . . . . . 264
6.9.1 Vector Integer Arithmetic Instructions . . . . . . . . . . . . 264
@@ -55,8 +59,8 @@
6.9.4 Vector Integer Rotate and Shift Instructions . . . . . . . . . 302
The vector [single] float instructions are grouped into this chapter.
- This chapter does not include the double float instructions which are described
- in the VSX chapter. VSX also include additional float instructions that operate
+ This chapter does not include the double float instructions, which are described
+ in the VSX chapter. VSX also includes additional float instructions that operate
on the whole 64 register vector-scalar set.
6.10 Vector Floating-Point Instruction Set . . . . . .
. . . . . . . 306
@@ -67,7 +71,7 @@
6.10.5 Vector Floating-Point Estimate Instructions . . . . . . . . . 316
The vector XOR based instructions are new with PowerISA 2.07 (POWER8)
- and provide vector  crypto and check-sum operations:
+ and provide vector crypto and check-sum operations:
6.11 Vector Exclusive-OR-based Instructions . . . . . . . . . . . . 318
6.11.1 Vector AES Instructions . . . . . . . . . . . . . . . . . . . 318
@@ -75,28 +79,28 @@
6.11.3 Vector Binary Polynomial Multiplication Instructions. . . . . 321
6.11.4 Vector Permute and Exclusive-OR Instruction . . . . . . . . . 323
- The vector gather and bit permute support bit level rearrangement of
- bits with in the vector. While the vector versions of the count leading zeros
- and population count are useful to accelerate specific algorithms.
+ The vector gather and bit permute instructions support bit-level rearrangement of
+ bits within the vector, while the vector versions of the count leading zeros
+ and population count instructions are useful to accelerate specific algorithms.
6.12 Vector Gather Instruction . . . . . . . . . . . . . . . . . . . 324
6.13 Vector Count Leading Zeros Instructions . . . . . . . . . . . . 325
6.14 Vector Population Count Instructions. . . . . . . . . . . . . . 326
6.15 Vector Bit Permute Instruction . . . . . . . . . . . . . . . . 327
- The Decimal Integer add / subtract instructions complement the
- Decimal Floating-Point instructions. They can also be used to accelerated some
- binary to/from decimal conversions. The VSCR instruction provides access the
+ The Decimal Integer add / subtract (fixed point) instructions complement the
+ Decimal Floating-Point instructions. They can also be used to accelerate some
+ binary to/from decimal conversions. The VSCR instructions provide access to the
Non-Java mode floating-point control and the saturation status. These
- instruction are not normally of interest in porting Intel intrinsics.
+ instructions are not normally of interest in porting Intel intrinsics.
6.16 Decimal Integer Arithmetic Instructions . . . . . . . . . . . . 328
6.17 Vector Status and Control Register Instructions . . . . . . . . 331
- With PowerISA 2.07B (Power8) several major extension where added to
+ With PowerISA 2.07B (Power8) several major extensions were added to
the Vector Facility:
-
+
Vector Crypto: Under “Vector Exclusive-OR-based Instructions Vector
Exclusive-OR-based Instructions”, AES [inverse] Cipher, SHA 256 / 512
@@ -108,7 +112,7 @@
unsigned max / min, rotate and shift left/right.
- Direct Move between GRPs and the FPRs / Left half of Vector
+ Direct Move between GPRs and the FPRs / Left half of Vector
Registers.
@@ -116,7 +120,7 @@
support for vector __int128 and multiple precision arithmetic.
- Decimal Integer add subtract for 31 digit BCD.
+ Decimal Integer add / subtract for 31 digit Binary Coded Decimal (BCD).
Miscellaneous SIMD extensions: Count leading Zeros, Population
@@ -124,17 +128,19 @@
- The rational for why these are included in the Vector Facilities
+ The rationale for these being included in the Vector Facilities
(VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with
- how the instruction where encoded then with the type of operations or the ISA
+ how the instructions were encoded than with the type of operations or the ISA
version of introduction. This is primarily a trade-off between the bits
- required for register selection vs bits for extended op-code space within in a
- fixed 32-bit instruction.
Basically accessing 32 vector registers require
- 5-bits per register, while accessing all 64 vector-scalar registers require
- 6-bits per register. When you consider the most vector instructions require  3
-  and some (select, fused multiply-add) require 4 register operand forms,  the
+ required for register selection versus the bits for extended op-code space within a
+ fixed 32-bit instruction.
+
+ Basically accessing 32 vector registers requires
+ 5 bits per register, while accessing all 64 vector-scalar registers requires
+ 6 bits per register. When you consider that most vector instructions require
+ 3 and some (select, fused multiply-add) require 4 register operand forms,  the
impact on op-code space is significant. The larger register set of VSX was
- justified by queuing theory of larger HPC matrix codes using double float,
+ justified by queueing theory of larger HPC matrix codes using double float,
while 32 registers are sufficient for most applications.
So by definition the VMX instructions are restricted to the original
diff --git a/Vector_Intrinsics/sec_power_vmx.xml b/Vector_Intrinsics/sec_power_vmx.xml
index 638d6ab..c83a0a3 100644
--- a/Vector_Intrinsics/sec_power_vmx.xml
+++ b/Vector_Intrinsics/sec_power_vmx.xml
@@ -22,23 +22,25 @@ xml:id="sec_power_vmx">
The Vector Facility (VMX)
- The orginal VMX supported SIMD integer byte, halfword, and word, and
+ The original VMX supported SIMD integer byte, halfword, and word, and
single float data types within a separate (from GPR and FPR) bank of 32 x
- 128-bit vector registers. These operations like to stay within their (SIMD)
+ 128-bit vector registers. The arithmetic operations like to stay within their (SIMD)
lanes except where the operation changes the element data size (integer
- multiply, pack, and unpack).
+ multiply) or the generalized permute operations
+ (splat, permute, pack, unpack, merge).
- This is complimented by bit logical and shift / rotate / permute /
- merge instuctions that operate on the vector as a whole.  Some operation
+ This is complemented by bit logical and shift / rotate instructions
+ that operate on the vector as a whole.  Some operations
(permute, pack, merge, shift double, select) will select 128 bits from a pair
- of vectors (256-bits) and deliver 128-bit vector result. These instructions
- will cross lanes or multiple registers to grab fields and assmeble them into
+ of vectors (256-bits) and deliver a 128-bit vector result. These instructions
+ will cross lanes or multiple registers to grab fields and assemble them into
the single register result.
The PowerISA 2.07B Chapter 6. Vector Facility is organised starting
with an overview (chapters 6.1- 6.6):
- 6.1 Vector Facility Overview . . . . . . . . . . . . . . . . . . . . 227
+
+6.1 Vector Facility Overview . . . . . . . . . . . . . . . . . . . . 227
6.2 Chapter Conventions. . . . . . . . . . . . . . . . . . . . . . . 227
6.2.1 Description of Instruction Operation . . . . . . . . . . . . . 227
6.3 Vector Facility Registers . . . . . . . . . . . . . . . . . . . 234
@@ -52,14 +54,18 @@
6.6 Vector Floating-Point Operations . . . . . . . . . . . . . . . . 240
6.6.1 Floating-Point Overview . . . . . . . . . . . . . . . . . . . 240
6.6.2 Floating-Point Exceptions . . . . . . . . . . . . . . . . . . 240
+
+
+ Then a chapter on storage (load/store) access for vectors and vector
+ elements:
+
+
6.7 Vector Storage Access Instructions . . . . . . . . . . . . . . . 242
6.7.1 Storage Access Exceptions . . . . . . . . . . . . . . . . . .
242
6.7.2 Vector Load Instructions . . . . . . . . . . . . . . . . . . . 243
6.7.3 Vector Store Instructions. . . . . . . . . . . . . . . . . . . 246
-6.7.4 Vector Alignment Support Instructions. . . . . . . . . . . . . 248
-
- Then a chapter on storage (load/store) access for vector and vector
- elements:
+6.7.4 Vector Alignment Support Instructions. . . . . . . . . . . . . 248
+
diff --git a/Vector_Intrinsics/sec_power_vsx.xml b/Vector_Intrinsics/sec_power_vsx.xml
index 6add740..ca89784 100644
--- a/Vector_Intrinsics/sec_power_vsx.xml
+++ b/Vector_Intrinsics/sec_power_vsx.xml
@@ -25,10 +25,11 @@
With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities
of the PowerISA:
-
+
Extend the available vector and floating-point scalar register
- sets from 32 registers each to a combined 64 x 64-bit scalar floating-point and
+ sets from 32 registers each to a combined register set of 64 x 64-bit
+ scalar floating-point and
64 x 128-bit vector registers.
@@ -42,27 +43,27 @@
Enable super-scalar execution of vector instructions and support 2
independent vector floating point  pipelines for parallel execution of 4 x
- 64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit (FMAs) per
+ 64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit FMAs per
cycle.
With PowerISA 2.07 (POWER8) we added single-precision scalar
- floating-point instruction to VSX. This completes the floating-point
+ floating-point instructions to VSX. This completes the floating-point
computational set for VSX. This ISA release also clarified how these operate in
the Little Endian storage model.
While the focus was on enhanced floating-point computation (for High
- Performance Computing),  VSX also extended  the ISA with additional storage
+ Performance Computing), VSX also extended  the ISA with additional storage
access, logical, and permute (merge, splat, shift) instructions. This was
- necessary to extend these operations cover 64 VSX registers, and improves
+ necessary to extend these operations to cover 64 VSX registers, and to improve
unaligned storage access for vectors  (not available in VMX).
The PowerISA 2.07B Chapter 7. Vector-Scalar Floating-Point Operations
is organized starting with an introduction and overview (chapters 7.1- 7.5) .
The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers
and how they relate (overlap and inter-operate) to the existing floating point
- scalar (FPRs) and (VMX VRs) vector registers.
+ scalar (FPRs) and vector (VMX VRs) registers.
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 317
7.1.1 Overview of the Vector-Scalar Extension . . . . . . . . . . . 317
@@ -91,7 +92,7 @@
The reference to scalar element 0 above is from the big endian
register perspective of the ISA. In the PPC64LE ABI implementation, and for the
- purpose of porting Intel intrinsics, this is logical element 1.  Intel SSE
+ purpose of porting Intel intrinsics, this is logical doubleword element 1.  Intel SSE
scalar intrinsics operated on logical element [0],  which is in the wrong
position for PowerISA FPU and VSX scalar floating-point  operations. Another
important note is what happens to the other half of the VSR when you execute a
@@ -112,20 +113,20 @@
operations only exist for VMX (byte level permute and shift) or VSX (Vector
double).
- So resister selection that; avoids unnecessary vector moves, follows
- the ABI, while maintaining the correct instruction specific register numbering,
+ So register selection that avoids unnecessary vector moves and follows
+ the ABI while maintaining the correct instruction-specific register numbering,
  can be tricky. The GCC register constraint annotations for inline
- assembler using vector instructions  is challenging, even for experts. So only
+ assembler using vector instructions are challenging, even for experts. So only
  experts should be writing assembler and then only in extraordinary
  circumstances. You should leave these details to the compiler (using vector
  extensions and vector built-ins) whenever possible.
- The next sections get is into the details of floating point
- representation, operations, and exceptions. Basically the implementation
- details for the IEEE754R and C/C++ language standards that most developers only
- access via higher level APIs. So most programmers will not need this level of
+ The next sections get into the details of floating point
+ representation, operations, and exceptions. They describe the implementation
+ details for the IEEE-754R and C/C++ language standards that most developers only
+ access via higher level APIs. Most programmers will not need this level of
  detail, but it is there if needed.
 7.3 VSX Operations . . . . . . . . . . . . . . . . . . . . . . . . . 326
@@ -138,9 +139,9 @@
 7.4.3 Floating-Point Overflow Exception. . . . . . . . . . . . . . . 349
 7.4.4 Floating-Point Underflow Exception . . . . . . . . . . . . . . 351
- Finally an overview the VSX storage access instructions for big and
+ Next comes an overview of the VSX storage access instructions for big and
  little endian and for aligned and unaligned data addresses. This includes
- diagrams that illuminate the differences
+ diagrams that illuminate the differences.
 7.5 VSX Storage Access Operations . . . . . . . . . . . . . . . . . 356
 7.5.1 Accessing Aligned Storage Operands . . . . . . . . . . . . . . 356
@@ -148,19 +149,19 @@
 7.5.3 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 358
 Section 7.6 starts with a VSX Instruction Set Summary which is the
- place to start to get an feel for the types and operations supported.  The
- emphasis on float-point, both scalar and vector (especially vector double), is
- pronounced. Many of the scalar and single-precision vector instruction look
+ place to start to get a feel for the types and operations supported. The
+ emphasis on floating-point, both scalar and vector (especially vector double), is
+ pronounced. Many of the scalar and single-precision vector instructions look
  like duplicates of what we have seen in the Chapter 4 Floating-Point and
- Chapter 6 Vector facilities. The difference here is, new instruction encodings
+ Chapter 6 Vector facilities. The difference here is new instruction encodings
  to access the full 64 VSX register space.
- In addition there are small number of logical instructions are
- include to support predication (selecting / masking vector elements based on
- compare results). And set of permute, merge, shift, and splat instructions that
- operation on VSX word (float) and doubleword (double) elements. 
+ In addition there are a small number of logical instructions
+ included to support predication (selecting / masking vector elements based on
+ comparison results), and a set of permute, merge, shift, and splat instructions that
+ operate on VSX word (float) and doubleword (double) elements. As mentioned
 above in VMX section 6.8, these instructions are good to study as they are useful
- for realigning elements from PowerISA vector results to that required for Intel
+ for realigning elements from PowerISA vector results to the form required for Intel
  Intrinsics.
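An editorial sketch of one such realignment, in plain C vector extension code (assuming <altivec.h> and a VSX target; with VSX enabled GCC typically reduces this to a single xxpermdi/xxswapd):

#include <altivec.h>

/* Swap the two doubleword elements, e.g. to move a value between the
   Intel element [0] position and the PowerISA scalar position.  */
vector double
swap_doublewords (vector double v)
{
  vector double r;
  r[0] = v[1];
  r[1] = v[0];
  return r;
}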
 7.6 VSX Instruction Set . . . . . . . . . . . . . . . . . . . . . . 359
@@ -179,8 +180,8 @@
 The VSX Instruction Descriptions section contains the detailed
 description for each VSX category instruction. The table entries from the
- Instruction Set Summary are formatted in the document at hyperlinks to
- corresponding instruction description.
+ Instruction Set Summary are formatted in the document as hyperlinks to
+ corresponding instruction descriptions.
diff --git a/Vector_Intrinsics/sec_powerisa.xml b/Vector_Intrinsics/sec_powerisa.xml
index 1fc45ae..9926d80 100644
--- a/Vector_Intrinsics/sec_powerisa.xml
+++ b/Vector_Intrinsics/sec_powerisa.xml
@@ -22,7 +22,7 @@ xml:id="sec_powerisa">
 The PowerISA
- The PowerISA is for historical reasons is organized at the top level
+ The PowerISA Vector facilities, for historical reasons, are organized at the top level
  by the distinction between older Vector Facility (Altivec / VMX) and the newer
  Vector-Scalar Floating-Point Operations (VSX).
diff --git a/Vector_Intrinsics/sec_powerisa_vector_facilities.xml b/Vector_Intrinsics/sec_powerisa_vector_facilities.xml
index e0c0a75..af3da70 100644
--- a/Vector_Intrinsics/sec_powerisa_vector_facilities.xml
+++ b/Vector_Intrinsics/sec_powerisa_vector_facilities.xml
@@ -22,21 +22,21 @@ xml:id="sec_powerisa_vector_facilities">
 PowerISA Vector facilities
- The PowerISA vector facilities (VMX and VSX) are extensive, but does
+ The PowerISA vector facilities (VMX and VSX) are extensive, but do
  not always provide a direct or obvious functional equivalent to the Intel
- Intrinsics. But being not obvious is not the same as imposible. It just
+ Intrinsics. However, not being obvious is not the same as impossible. It just
  requires some basic programming skills.
 It is a good idea to have an overall understanding of the vector
- capabilities the PowerISA. You do not need to memorize every instructions but
- is helps to know where to look. Both the PowerISA and OpenPOWER ABI have a
- specific structure and organization that can help you find what you looking
+ capabilities of the PowerISA. You do not need to memorize every instruction but
+ it helps to know where to look. Both the PowerISA and OpenPOWER ABI have a
+ specific structure and organization that can help you find what you are looking
  for.
- It also helps to understand the relationship between the PowerISAs
+ It also helps to understand the relationship between the PowerISA's
  low level instructions and the higher abstraction of the vector intrinsics as
- defined by the OpenPOWER ABIs Vector Programming Interfaces and the the defacto
- standard of GCC's PowerPC AltiVec Built-in Functions.
+ defined by the OpenPOWER ABI's Vector Programming Interfaces and the de facto
+ standard of GCC's PowerPC AltiVec Built-in Functions.
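A small editorial sketch of that built-in-to-instruction relationship (assuming GCC with <altivec.h>; the function name is illustrative):

#include <altivec.h>

/* Each built-in maps to one PowerISA instruction or a short sequence.
   vec_sld concatenates its two inputs and extracts 16 bytes starting
   at the given byte offset; it typically maps to the VMX vsldoi
   instruction.  */
vector unsigned char
shift_double_by1 (vector unsigned char a, vector unsigned char b)
{
  return vec_sld (a, b, 1);
}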
diff --git a/Vector_Intrinsics/sec_powerisa_vector_intrinsics.xml b/Vector_Intrinsics/sec_powerisa_vector_intrinsics.xml
index bed2226..79a08ac 100644
--- a/Vector_Intrinsics/sec_powerisa_vector_intrinsics.xml
+++ b/Vector_Intrinsics/sec_powerisa_vector_intrinsics.xml
@@ -30,23 +30,23 @@
 C/C++ compilers implement.
 Some of these operations are endian sensitive and the compiler needs
- to make corresponding adjustments as  it generate code for endian sensitive
+ to make corresponding adjustments as it generates code for endian sensitive
  built-ins. There is a good overview for this in the OpenPOWER ABI Section 6.4.
 Vector Built-in Functions.
 Appendix A is organized (sorted) by built-in name, output type, then
- parameter types. Most built-ins are generic as the named the operation (add,
+ parameter types. Most built-ins are generic as the named operation (add,
  sub, mul, cmpeq, ...) applies to multiple types.
- So the build vec_add built-in applies to all the signed and unsigned
+ So the vec_add built-in applies to all the signed and unsigned
  integer types (char, short, int, and long) plus float and double floating-point
  types. The compiler looks at the parameter type to select the vector
- instruction (or instruction sequence) that implements the (add) operation on
+ instruction (or instruction sequence) that implements the add operation on
  that type. The compiler infers the output result type from the operation and
  input parameters and will complain if the target variable type is not
- compatible. For example:
+ compatible. Some examples:
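For instance, a minimal sketch of this type-generic behavior (assuming GCC with <altivec.h>; the instruction names in the comments are the typical selections, not guarantees):

#include <altivec.h>

vector signed int
add_words (vector signed int a, vector signed int b)
{
  return vec_add (a, b);  /* integer word add, typically vadduwm */
}

vector float
add_floats (vector float a, vector float b)
{
  return vec_add (a, b);  /* same generic built-in; float add (vaddfp/xvaddsp) */
}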
the name). This is why it is so important to understand the vector element
 types and to add the appropriate type casts to get the correct results.
- The defacto standard implementation is GCC as defined in the include
+ The de facto standard implementation in GCC is defined in the include
  file <altivec.h> and documented in the GCC online documentation in 6.59.20
 PowerPC AltiVec Built-in Functions. The header file name and section title
diff --git a/Vector_Intrinsics/sec_powerisa_vector_size_type.xml b/Vector_Intrinsics/sec_powerisa_vector_size_type.xml
index c18f2bb..2c92797 100644
--- a/Vector_Intrinsics/sec_powerisa_vector_size_type.xml
+++ b/Vector_Intrinsics/sec_powerisa_vector_size_type.xml
@@ -23,17 +23,17 @@
 How vector elements change size and type
 Most vector built ins return the same vector type as the (first)
- input parameters, but there are exceptions. Examples include; conversions
- between types, compares , pack, unpack,  merge, and integer multiply
+ input parameters, but there are exceptions. Examples include conversions
+ between types, compares, pack, unpack, merge, and integer multiply
  operations.
- Converting floats to from integer will change the type and something
- change the element size as well (double ↔ int and float ↔ long). For the
+ Converting floats to / from integer types will change the type and sometimes
+ change the element size as well (double ↔ int and float ↔ long). For VMX the
  conversions are always the same size (float ↔ [unsigned] int). But VSX
  allows conversion of 64-bit (long or double) to / from 32-bit (float or
-  int)  with the inherent size changes. The PowerISA VSX defines a 4 element
+  int) with the inherent size changes. The PowerISA VSX defines a 4-element
  vector layout where little endian elements 0, 2 are used for input/output and
- elements 1,3 are undefined. The OpenPOWER ABI Appendix A define
+ elements 1,3 are undefined. The OpenPOWER ABI Appendix A defines
 vec_double and vec_float with even/odd and high/low extensions as program
 aids. These are not included in GCC 7 or earlier but are planned for GCC 8.
@@ -42,8 +42,8 @@
 vector bool <input element type> (effectively bit masks) or
 predicates (the condition code for all and any are
 represented as an int truth variable). When a predicate compare (i.e.
- vec_all_eq, vec_any_gt),
- is used in a if statement,  the condition code is
+ vec_all_eq, vec_any_gt)
+ is used in an if statement, the condition code is
  used directly in the conditional branch and the int truth value is not
  generated.
@@ -51,7 +51,7 @@
 integer sized elements. Pack operations include signed and unsigned saturate
 and unsigned modulo forms. As the packed result will be half the size (in
 bits), pack instructions require 2 vectors (256-bits) as input and generate a
- single 128-bit vector results.
+ single 128-bit vector result.
@@ -60,7 +60,7 @@
 Unpack operations expand integer elements into the next larger size
 So the PowerISA defines unpack-high and unpack-low forms where the instruction
 takes (the high or low) half of vector elements and extends them to fill the
 vector output. Element order is maintained and an unpack high / low sequence
- with same input vector has the effect of unpacking to a 256-bit result in two
+ with the same input vector has the effect of unpacking to a 256-bit result in two
  vector registers.
 For PowerISA 2.07 we added vector merge word even / odd instructions.
 Instead of high or low elements the shuffle is from the even or odd number
 elements of the two input vectors. Passing the same vector to both inputs to
- merge produces splat like results for each doubleword half, which is handy in
+ merge produces splat-like results for each doubleword half, which is handy in
  some convert operations.
 double product precision for intermediate computation before reducing the
 final result back to the original precision.
- The PowerISA VMX instruction set took the later approach ie keep all
- the product bits until the programmer explicitly asks for the truncated result.
+ The PowerISA VMX instruction set took the latter approach, i.e., keep all
+ the product bits until the programmer explicitly asks for the truncated result
+ (via the pack operation).
 So the vector integer multiplies are split into even/odd forms across signed and
- unsigned; byte, halfword and word inputs. This requires two instructions (given
- the same inputs) to generated the full vector  multiply across 2 vector
+ unsigned byte, halfword and word inputs. This requires two instructions (given
+ the same inputs) to generate the full vector multiply across 2 vector
  registers and 256-bits. Again as POWER processors are super-scalar this pair
  of instructions should execute in parallel.
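An editorial sketch of this even/odd multiply pattern (assuming GCC with <altivec.h>; names are illustrative):

#include <altivec.h>

/* Full-precision multiply of two halfword vectors: vec_mule and
   vec_mulo each produce half of the 256-bit product.  The two
   multiplies are independent and can issue in parallel.  */
void
multiply_full (vector signed short a, vector signed short b,
               vector signed int *even, vector signed int *odd)
{
  *even = vec_mule (a, b);  /* products of even-numbered elements */
  *odd  = vec_mulo (a, b);  /* products of odd-numbered elements */
}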
diff --git a/Vector_Intrinsics/sec_prefered_methods.xml b/Vector_Intrinsics/sec_prefered_methods.xml
index 3a8f729..784de8e 100644
--- a/Vector_Intrinsics/sec_prefered_methods.xml
+++ b/Vector_Intrinsics/sec_prefered_methods.xml
@@ -20,7 +20,7 @@ xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
 xml:id="sec_prefered_methods">
- Prefered methods
+ Preferred methods
 As we will see there are multiple ways to implement the logic of
 these intrinsics. Some implementation methods are preferred because they allow
@@ -32,7 +32,7 @@
 each intrinsic implementation. In general we should use the following list as a
 guide to these decisions:
- 
+ 
 Use C vector arithmetic, logical, dereference, etc., operators in
 preference to intrinsics. 
diff --git a/Vector_Intrinsics/sec_prepare.xml b/Vector_Intrinsics/sec_prepare.xml
index fd1f444..398f5d2 100644
--- a/Vector_Intrinsics/sec_prepare.xml
+++ b/Vector_Intrinsics/sec_prepare.xml
@@ -26,12 +26,12 @@
 with knowledge of PowerISA vector facilities and how to access the associated
 documentation.
- 
+ 
- GCC vector extention
+ GCC vector extension
 syntax and usage. This is one of a set of GCC
- "Extentions to the C language Family”
+ “Extensions to the C Language Family”
 that the intrinsic header implementation depends on. As many of
 the GCC intrinsics for x86 are implemented via C vector extensions, reading
 and understanding of this code is an important part of the
diff --git a/Vector_Intrinsics/sec_review_source.xml b/Vector_Intrinsics/sec_review_source.xml
index d92a0e7..d07e6c6 100644
--- a/Vector_Intrinsics/sec_review_source.xml
+++ b/Vector_Intrinsics/sec_review_source.xml
@@ -25,13 +25,13 @@
 So if this is a code porting activity, where is the source? All the
 source code we need to look at is in the GCC source trees. You can either git
 (https://gcc.gnu.org/wiki/GitMirror)
- the gcc source  or down load one of the
+ the gcc source or download one of the
 recent AT source tars (for example:
 ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/).
 You will find the intrinsic headers in the ./gcc/config/i386/
 sub-directory.
- If you have a Intel Linux workstation or laptop with GCC installed,
+ If you have an Intel Linux workstation or laptop with GCC installed,
  you already have these headers, if you want to take a look:
$ find /usr/lib -name '*mmintrin.h'
/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/wmmintrin.h
@@ -44,8 +44,8 @@
 But depending on the vintage of the distro, these may not be the
 latest versions of the headers. Looking at the header source will tell you a
- few things.: The include structure (what other headers are implicitly
- included). The types that are used at the API. And finally, how the API is
+ few things: the include structure (what other headers are implicitly
+ included), the types that are used at the API, and finally, how the API is
  implemented.
 smmintrin.h (SSE4.1) includes tmmintrin.h
diff --git a/Vector_Intrinsics/sec_simple_examples.xml b/Vector_Intrinsics/sec_simple_examples.xml
index c2a2482..c4dc656 100644
--- a/Vector_Intrinsics/sec_simple_examples.xml
+++ b/Vector_Intrinsics/sec_simple_examples.xml
@@ -48,11 +48,13 @@ _mm_mullo_epi16 (__m128i __A, __m128i __B)
 Note this requires a cast for the compiler to generate the correct
 code for the intended operation. The parameters and result are the generic
- __m128i, which is a vector long long with the
+ type __m128i, which is a vector long long with the
 __may_alias__ attribute. But
- operation is a vector multiply low unsigned short (__v8hu). So not only do we
- use the cast to drop the __may_alias__ attribute but we also need to cast to
- the correct (vector unsigned short) type for the specified operation.
+ the operation is a vector multiply low on unsigned short elements.
+ So not only do we use the cast to drop the __may_alias__
+ attribute, but we also need to cast to
+ the correct type (__v8hu or vector unsigned short)
+ for the specified operation.
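For reference, the body under discussion follows this cast idiom (a sketch, assuming the __m128i and __v8hu typedefs from the surrounding header are in scope):

/* Sketch: cast away __may_alias__ and select the element type that
   defines the multiply, then cast the result back to the generic type.  */
extern __inline __m128i
__attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
_mm_mullo_epi16 (__m128i __A, __m128i __B)
{
  return (__m128i) ((__v8hu) __A * (__v8hu) __B);
}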
 I have successfully copied these (and similar) source snippets over
 to the PPC64LE implementation unchanged. This of course assumes the associated
diff --git a/Vector_Intrinsics/sec_vec_or_not.xml b/Vector_Intrinsics/sec_vec_or_not.xml
index e66de7d..9b70ed9 100644
--- a/Vector_Intrinsics/sec_vec_or_not.xml
+++ b/Vector_Intrinsics/sec_vec_or_not.xml
@@ -22,6 +22,25 @@ xml:id="sec_vec_or_not">
 To vec_not or not
+ Now let's look at a similar example that adds some surprising
+ complexity. When we look at the negated compare forms we cannot find
+ exact matches in the PowerISA. But a little knowledge of Boolean
+ algebra can show the way to the equivalent functions.
+ 
+ First, the X86 compare not equal case, where we might expect to
+ find the equivalent vec_cmpne builtins for PowerISA:
+ 
+ 
 Well not exactly. Looking at the
 OpenPOWER ABI document we see a reference to vec_cmpne for all numeric
 types. But when we look in the current
@@ -52,7 +71,7 @@
 This is RISC philosophy again. We can always use a logical instruction
 (like bit wise and or
- or) to effect a move given that we also have
+ or) to effect a move, given that we also have
 nondestructive 3 register instruction forms. In the PowerISA most instructions
 have two input registers and a separate result register. So if the result
 register number is different from either input register then the inputs are
@@ -62,22 +81,21 @@
 The statement B = vec_or (A,A)
 is effectively a vector move/copy from A to B. And A = vec_or (A,A) is obviously a
- nop (no operation). In the the
- PowerISA defines the preferred nop and register move for vector registers in
- this way.
+ nop (no operation). In fact the
+ PowerISA defines the preferred nop and register move for vector registers
+ in this way.
- It is also useful to have hardware implement the logical operators
+ The PowerISA implements the logical operators
 nor (not or) and nand (not and).
 The PowerISA provides these instructions for
- fixed point and vector logical operation. So vec_not(A)
+ fixed point and vector logical operations. So vec_not(A)
 can be implemented as vec_nor(A,A).
- So looking at the  implementation of _mm_cmpne we propose the
- following:
+ So for the implementation of _mm_cmpne we propose the following:
 The Intel Intrinsics also include the not forms of the relational
 compares:
 The PowerISA and OpenPOWER ABI, or GCC PowerPC Altivec Built-in