diff --git a/Intrinsics_Reference/ch_biendian.xml b/Intrinsics_Reference/ch_biendian.xml index 958ba41..5846956 100644 --- a/Intrinsics_Reference/ch_biendian.xml +++ b/Intrinsics_Reference/ch_biendian.xml @@ -769,7 +769,7 @@ register vector double vd = vec_splats(*double_ptr); introduced serious compiler complexity without much utility. Thus this support (previously controlled by switches -maltivec=be and/or -qaltivec=be) is - now deprecated. Current versions of the gcc and clang + now deprecated. Current versions of the GCC and Clang open-source compilers do not implement this support. @@ -1146,8 +1146,8 @@ register vector double vd = vec_splats(*double_ptr); elements using the groups of 4 contiguous bytes, and the values of the integers will be reordered without compromising each integer's contents. The fact that the little-endian - result matches the big-endian result is left as an exercise to - the reader. + result matches the big-endian result is left as an exercise + for the reader. Now, suppose instead that the original PCV does not reorder diff --git a/Intrinsics_Reference/ch_intro.xml b/Intrinsics_Reference/ch_intro.xml index b2bb054..49a1946 100644 --- a/Intrinsics_Reference/ch_intro.xml +++ b/Intrinsics_Reference/ch_intro.xml @@ -54,10 +54,9 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_intro"> provides for overloaded intrinsics that can operate on different data types. However, such function overloading is not normally acceptable in the C programming language, so compilers compliant - with the AltiVec PIM (such as gcc and - clang) were required to add special handling to - their parsers to permit this. The PIM suggested (but did not - mandate) the use of a header file, + with the AltiVec PIM (such as GCC and Clang) were required to + add special handling to their parsers to permit this. The PIM + suggested (but did not mandate) the use of a header file, <altivec.h>, for implementations that provide AltiVec intrinsics. 
This is common practice for all compliant compilers today. @@ -208,6 +207,15 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_intro"> + + + Using the GNU Compiler Collection. + + https://gcc.gnu.org/onlinedocs/gcc.pdf + + + + diff --git a/Intrinsics_Reference/ch_techniques.xml b/Intrinsics_Reference/ch_techniques.xml index 3f8f4c1..892c5f9 100644 --- a/Intrinsics_Reference/ch_techniques.xml +++ b/Intrinsics_Reference/ch_techniques.xml @@ -23,45 +23,181 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
Help the Compiler Help You - Start with scalar code, which is the most portable. Use various - tricks for helping the compiler vectorize scalar code. Make - sure you align your data on 16-byte boundaries wherever - possible, and tell the compiler it's aligned. Use __restrict__ - pointers to promise data does not alias. + The best way to use vector intrinsics is often not to + use them at all. + + This may seem counterintuitive at first. Aren't vector + intrinsics the best way to ensure that the compiler does exactly + what you want? Well, sometimes. But the problem is that the + best instruction sequence today may not be the best instruction + sequence tomorrow. As the PowerISA moves forward, new + instruction capabilities appear, and the old code you wrote can + easily become obsolete. Then you start having to create + different versions of the code for different levels of the + PowerISA, and it can quickly become difficult to maintain. + + + Most often programmers use vector intrinsics to increase the + performance of loop kernels that dominate the performance of an + application or library. However, modern compilers are often + able to optimize such loops to use vector instructions without + having to resort to intrinsics, using an optimization called + autovectorization (or auto-SIMD). Your first focus when writing + loop kernels should be on making the code amenable to + autovectorization by the compiler. Start by writing the code + naturally, using scalar memory accesses and data operations, and + see whether the compiler autovectorizes your code. If not, here + are some steps you can try: + + + + + Check your optimization + level. Different compilers enable + autovectorization at different optimization levels. For + example, at this writing the GCC compiler requires + -O3 to enable autovectorization by default. + + + + + Consider using + -ffast-math. 
This option assumes
+ that certain fussy aspects of IEEE floating-point can be
+ ignored, such as the presence of Not-a-Numbers (NaNs),
+ signed zeros, and so forth. -ffast-math may
+ also affect the precision of results in ways that may not
+ matter to your application. Turning on this option can
+ simplify the control flow of loops generated for your
+ application by removing tests for NaNs and so forth. (Note
+ that -Ofast turns on both -O3 and
+ -ffast-math in GCC.)
+ 
+ 
+ 
+ 
+ Align your data wherever
+ possible. For most effective autovectorization,
+ arrays of data should be aligned on at least a 16-byte
+ boundary, and pointers to that data should be identified as
+ having the appropriate alignment. For example:
+ 
+ float fdata[4096] __attribute__((aligned(16)));
+ 
+ ensures that the compiler can use an efficient, aligned
+ vector load to bring data from fdata into a
+ vector register. Autovectorization will appear more
+ profitable to the compiler when data is known to be
+ aligned.
+ 
+ 
+ You can also declare pointers to point to aligned data,
+ which is particularly useful in function arguments:
+ 
+ void foo (__attribute__((aligned(16))) double * aligned_fptr)
+ 
+ 
+ 
+ Tell the compiler when data can't
+ overlap. In C and C++, use of pointers can cause
+ compilers to pessimistically analyze which memory references
+ can refer to the same memory. This can prevent important
+ optimizations, such as reordering memory references, or
+ keeping previously loaded values in registers rather than
+ reloading them from memory. Inefficiently optimized scalar
+ loops are less likely to be autovectorized. You can annotate
+ your pointers with the restrict or
+ __restrict__ keyword to tell the compiler that
+ your pointers don't "alias" with any other memory
+ references. (restrict can be used only in C
+ when compiling for the C99 standard or later.
+ __restrict__ is a language extension, available
+ in both GCC and Clang, that can be used for both C and C++.)
+ + + Suppose you have a function that takes two pointer + arguments, one that points to data your function writes to, and + one that points to data your function reads from. By + default, the compiler may believe that the data being read + and written could overlap. To disabuse the compiler of this + notion, do the following: + + void foo (double *__restrict__ outp, double *__restrict__ inp) + +
Use Portable Intrinsics - Individual compilers may provide other intrinsic support. Only - the intrinsics in this manual are guaranteed to be portable - across compliant compilers. + If you can't convince the compiler to autovectorize your code, + or you want to access specific processor features not + appropriate for autovectorization, you should use intrinsics. + However, you should go out of your way to use intrinsics that + are as portable as possible, in case you need to change + compilers in the future. + + + This reference provides intrinsics that are guaranteed to be + portable across compliant compilers. In particular, both the + GCC and Clang compilers for POWER implement the intrinsics in + this manual. The compilers may each implement many more + intrinsics, but the ones in this manual are the only ones + guaranteed to be portable. So if you are using an interface not + described here, you should look for an equivalent one in this + manual and change your code to use that. - Some compilers may provide compatibility headers for use with - other architectures. Recent GCC and Clang compilers support - compatibility headers for the lower levels of the x86 vector - architecture. These can be used initially for ease of porting, - but for best performance, it is preferable to rewrite important - sections of code with native Power intrinsics. + There are also other vector APIs that may be of use to you (see + ). In particular, the + POWER Vector Library (see ) provides additional + portability across compiler versions.
Use Assembly Code Sparingly - filler -
- Inline Assembly - filler -
-
- Assembly Files - filler -
+
+ Sometimes the compiler will absolutely not cooperate in giving
+ you the code you need. You might not get the instruction you
+ want, or you might get extra instructions that slow down
+ performance. When that happens, the first thing you should do
+ is report this to the compiler community! This will allow them
+ to get the problem fixed in the next release of the
+ compiler.
+ 
+ 
+ In the meantime, what are your options? As a workaround, your
+ best option may be to use assembly code. There are two ways to
+ go about this. Using inline assembly is generally appropriate
+ only for very small snippets of code (say, 1-5 instructions).
+ If you want to write a whole function in assembly code, though,
+ it is better to create a separate .s or
+ .S file. The only difference between these two
+ file types is that a .S file will be
+ processed by the C preprocessor before being assembled.
+ 
+ 
+ Assembly programming is beyond the scope of this manual.
+ Getting inline assembly correct can be quite tricky, and it is
+ best to look at existing examples to learn how to use it
+ properly. However, there is a good introduction to inline
+ assembly in Using the GNU Compiler
+ Collection (see ),
+ in section 6.47 at the time of this writing.
+ 
+ 
+ If you write a function entirely in assembly, you are
+ responsible for following the calling conventions established by
+ the ABI (see ). Again, it is
+ best to look at examples. One place to find well-written
+ .S files is in the GLIBC project.
+
-
+
Other Vector Programming APIs In addition to the intrinsic functions provided in this reference, programmers should be aware of other vector programming @@ -69,14 +205,13 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques">
x86 Vector Portability Headers - Recent versions of the gcc and clang - open source compilers provide "drop-in" portability headers - for portions of the Intel Architecture Instruction Set - Extensions (see ). These - headers mirror the APIs of Intel headers having the same - names. Support is provided for the MMX and SSE layers, up - through SSE4. At this time, no support for the AVX layers is - envisioned. + Recent versions of the GCC and Clang open source compilers + provide "drop-in" portability headers for portions of the + Intel Architecture Instruction Set Extensions (see ). These headers mirror the APIs + of Intel headers having the same names. Support is provided + for the MMX and SSE layers, up through SSE4. At this time, no + support for the AVX layers is envisioned. The portability headers provide the same semantics as the @@ -95,7 +230,7 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques"> <mmintrin.h>.
-
+
The POWER Vector Library (pveclib) The POWER Vector Library, also known as pveclib, is a separate project available from diff --git a/Intrinsics_Reference/ch_vec_reference.xml b/Intrinsics_Reference/ch_vec_reference.xml index a18fcdf..7117f70 100644 --- a/Intrinsics_Reference/ch_vec_reference.xml +++ b/Intrinsics_Reference/ch_vec_reference.xml @@ -23,8 +23,95 @@ xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="VIPR.vec-ref">
How to Use This Reference - Brief description of the format of the entries, the cross-reference - index, and so forth. + This chapter contains reference material for each supported + vector intrinsic. The information for each intrinsic includes: + + + + + The intrinsic name and extended name; + + + + + A type-free example of the intrinsic's usage; + + + + + A description of the intrinsic's purpose; + + + + + A description of the value(s) returned from the intrinsic, + if any; + + + + + A description of any unusual characteristics of the + intrinsic when different target endiannesses are in force. + If the semantics of the intrinsic in big-endian and + little-endian modes are identical, the description will read + "None."; + + + + + Optionally, additional explanatory notes about the + intrinsic; and + + + + + A table of supported type signatures for the intrinsic. + + + + + Most intrinsics are overloaded, supporting multiple type + signatures. The types of the input arguments always determine + the type of the result argument; that is, it is not possible to + define two intrinsic overloads with the same input argument + types and different result argument types. + + + The type-free example of the intrinsic's usage uses the + convention that r represents + the result of the intrinsic, and a, b, + etc., represent the input arguments. The allowed type + combinations of these variables are shown as rows in the table + of supported type signatures. + + + Each row contains at least one example implementation. This + shows one way that a conforming compiler might achieve the + intended semantics of the intrinsic, but compilers are not + required to generate this code specifically. The letters + r, a, b, + etc., in the examples represent vector registers containing the + values of those variables. The letters t, u, + etc., represent vector registers containing temporary + intermediate results. 
The same register is assumed to be used + for each instance of one of these letters. + + + When implementations differ for big- and little-endian targets, + separate example implementations are shown for each endianness. + + + The implementations show which vector instructions are used in + the implementation of a particular intrinsic. When trying to + determine which intrinsic to use, it can be useful to have a + cross-reference from a specific vector instruction to the + intrinsics whose implementations make use of it. This manual + contains such a cross-reference () for the programmer's + convenience.