Vector Programming Techniques
Help the Compiler Help You

The best way to use vector intrinsics is often not to use them at all. This may seem counterintuitive at first. Aren't vector intrinsics the best way to ensure that the compiler does exactly what you want? Well, sometimes. But the problem is that the best instruction sequence today may not be the best instruction sequence tomorrow. As the Power ISA moves forward, new instruction capabilities appear, and the old code you wrote can easily become obsolete. Then you start having to create different versions of the code for different levels of the Power ISA, and it can quickly become difficult to maintain.

Most often, programmers use vector intrinsics to increase the performance of the loop kernels that dominate the performance of an application or library. However, modern compilers are often able to optimize such loops to use vector instructions without resorting to intrinsics, using an optimization called autovectorization (or auto-SIMD). Your first focus when writing loop kernels should be on making the code amenable to autovectorization by the compiler.

Start by writing the code naturally, using scalar memory accesses and data operations, and see whether the compiler autovectorizes your code. If not, here are some steps you can try:

- Check your optimization level. Different compilers enable autovectorization at different optimization levels. For example, at this writing the GCC compiler requires -O3 to enable autovectorization by default.

- Consider using -ffast-math. This option assumes that certain fussy aspects of IEEE floating-point can be ignored, such as the presence of Not-a-Numbers (NaNs), signed zeros, and so forth. -ffast-math may also affect the precision of results in ways that may not matter to your application. Turning on this option can simplify the control flow of the loops generated for your application by removing tests for NaNs and the like. (Note that -Ofast turns on both -O3 and -ffast-math in GCC.)

- Align your data wherever possible. For the most effective autovectorization, arrays of data should be aligned on at least a 16-byte boundary, and pointers to that data should be identified as having the appropriate alignment. For example:

      float fdata[4096] __attribute__((aligned(16)));

  ensures that the compiler can use an efficient, aligned vector load to bring data from fdata into a vector register. Autovectorization appears more profitable to the compiler when data is known to be aligned. You can also declare pointers to point to aligned data, which is particularly useful in function arguments:

      void foo (__attribute__((aligned(16))) double * aligned_fptr)

- Tell the compiler when data can't overlap. In C and C++, the use of pointers can cause compilers to analyze pessimistically which memory references can refer to the same memory. This can prevent important optimizations, such as reordering memory references or keeping previously loaded values in registers rather than reloading them from memory. Inefficiently optimized scalar loops are less likely to be autovectorized. You can annotate your pointers with the restrict or __restrict__ keyword to tell the compiler that your pointers don't "alias" with any other memory references. (restrict can be used only in C, when compiling for the C99 standard or later. __restrict__ is a language extension, available in GCC, Clang, and the XL and Open XL compilers, that can be used for both C and C++. See your compiler's user manual for details.)

  Suppose you have a function that takes two pointer arguments, one that points to data your function writes to, and one that points to data your function reads from. By default, the compiler must conservatively assume that the data being read and written could overlap. To disabuse the compiler of this notion, do the following:

      void foo (double *__restrict__ outp, double *__restrict__ inp)
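As a concrete illustration, here is a minimal sketch of an autovectorization-friendly loop that combines these techniques: aligned data and __restrict__-qualified pointers. The function and array names are hypothetical, and the exact vector code generated depends on your compiler, options (for example, -O3 with GCC), and target ISA level.

    #include <stddef.h>

    /* Hypothetical example: scale one array into another.  With the
       __restrict__ qualifiers the compiler knows out[] and in[] cannot
       overlap, so at -O3 (GCC) it is free to autovectorize the loop
       into vector loads, multiplies, and stores.  */
    void
    scale_array (float *__restrict__ out, const float *__restrict__ in,
                 float factor, size_t n)
    {
      for (size_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
    }

    /* Aligning static data on a 16-byte boundary lets the compiler use
       aligned vector accesses when it vectorizes loops over these arrays.  */
    float fin[4096]  __attribute__ ((aligned (16)));
    float fout[4096] __attribute__ ((aligned (16)));

At the time of this writing, you can ask GCC to report which loops were vectorized with -fopt-info-vec, and Clang with -Rpass=loop-vectorize.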
Use Portable Intrinsics

If you can't convince the compiler to autovectorize your code, or you want to access specific processor features not appropriate for autovectorization, you should use intrinsics. However, you should go out of your way to use intrinsics that are as portable as possible, in case you need to change compilers in the future.

This reference provides intrinsics that are guaranteed to be portable across compliant compilers. In particular, the GCC, Clang, and Open XL compilers for Power implement the intrinsics in this manual. Each compiler may implement many more intrinsics, but the ones in this manual are the only ones guaranteed to be portable. So if you are using an interface not described here, you should look for an equivalent one in this manual and change your code to use that. Where an intrinsic may not be available from all compilers or at all ISA levels, this is called out in the intrinsic's description.

There are also other vector APIs that may be of use to you; see Other Vector Programming APIs below. In particular, the Power Vector Library (pveclib) provides additional portability across compiler and ISA versions, as well as interfaces that hide cases where assembly language is needed.
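As a small example, here is a minimal sketch using portable intrinsics from <altivec.h>: vec_splats replicates a scalar across a vector, and vec_madd performs a fused multiply-add on vector float. The function name is hypothetical; compile for a target that enables VMX/VSX (for example, -mcpu=power8).

    #include <altivec.h>

    /* Hypothetical example: scale and bias a vector of four floats using
       portable intrinsics.  vec_splats broadcasts a scalar; vec_madd
       computes v * vscale + vbias element by element.  */
    vector float
    scale_and_bias (vector float v, float scale, float bias)
    {
      vector float vscale = vec_splats (scale);
      vector float vbias  = vec_splats (bias);
      return vec_madd (v, vscale, vbias);
    }

Because these interfaces are among those specified in this reference, the same source should compile unchanged with GCC, Clang, and Open XL for Power.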
Use Assembly Code Sparingly

Sometimes the compiler will absolutely not cooperate in giving you the code you need. You might not get the instruction you want, or you might get extra instructions that slow down your ideal performance. When that happens, the first thing you should do is report this to the compiler community! This will allow them to get the problem fixed in the next release of the compiler. See your compiler's bug-reporting process if you need to report an issue. In the meanwhile, though, what are your options?

As a workaround, your best option may be to use assembly code. There are two ways to go about this. Using inline assembly is generally appropriate only for very small snippets of code (say, 1-5 instructions). If you want to write a whole function in assembly code, it is better to create a separate .s or .S file. The only difference between these two file types is that a .S file is processed by the C preprocessor before being assembled.

Assembly programming is beyond the scope of this manual. Getting inline assembly correct can be quite tricky, and it is best to look at existing examples to learn how to use it properly. However, there is a good introduction to inline assembly in Using the GNU Compiler Collection (section 6.47 at the time of this writing). Felix Cloutier has also written a very nice guide to GCC inline assembly.

If you write a function entirely in assembly, you are responsible for following the calling conventions established by the ABI. Again, it is best to look at examples. One place to find well-written .S files is the GNU C Library project. You can also study the assembly output from your favorite compiler, which can be obtained with the -S or similar option, or by using the objdump utility:

    objdump -dr <binary or object file>
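To illustrate the inline-assembly option, here is a minimal sketch of GCC-style extended inline assembly that issues a single Power instruction. The function name is hypothetical, and in real code you would first check whether an intrinsic or a plain C expression already produces the instruction you want.

    /* A sketch of GCC extended inline assembly: count leading zeros of a
       64-bit value with the cntlzd instruction.  The "=r" and "r"
       constraints request general-purpose registers; no memory is read
       or written, so no clobber list is needed.  */
    static inline unsigned long
    my_cntlzd (unsigned long x)
    {
      unsigned long result;
      __asm__ ("cntlzd %0,%1" : "=r" (result) : "r" (x));
      return result;
    }

(In practice, the compiler already generates cntlzd for __builtin_clzl, so a snippet like this is purely illustrative of the mechanics.)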
Other Vector Programming APIs

In addition to the intrinsic functions provided in this reference, programmers should be aware of other vector programming API resources.
x86 Vector Portability Headers

Recent versions of the GCC, Clang, and Open XL compilers for Power provide "drop-in" portability headers for portions of the Intel Architecture Instruction Set Extensions. These headers mirror the APIs of the Intel headers having the same names. As of this writing, support is provided for the MMX and SSE layers, up through SSE3 and portions of SSE4. No support for the AVX layers is envisioned. The portability headers are available starting with GCC 8.1 and Clang 9.0.0.

The portability headers provide the same semantics as the corresponding Intel APIs, but use VMX and VSX instructions to emulate the Intel vector instructions. It should be emphasized that these headers are provided for portability and will not necessarily perform optimally (although in many cases the performance is very good). Using these headers is often a good first step in porting a library that uses Intel intrinsics to Power, after which more detailed rewriting of algorithms is usually desirable for best performance.

Access to the portability APIs occurs automatically when you include one of the corresponding Intel header files, such as <mmintrin.h>. To enable the portability headers, you must compile with -DNO_WARN_X86_INTRINSICS.
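For example, the following hypothetical SSE code compiles unchanged on Power against the portability version of <xmmintrin.h>; the build command shown is an assumption about a typical configuration (a POWER8 or later target), not the only way to enable the headers.

    /* Unmodified SSE code.  On Power, _mm_add_ps is emulated with VSX
       instructions by the portability headers.  Build with something like:
       gcc -O2 -mcpu=power8 -DNO_WARN_X86_INTRINSICS add4.c  */
    #include <xmmintrin.h>

    __m128
    add4 (__m128 a, __m128 b)
    {
      return _mm_add_ps (a, b);
    }

After confirming correct behavior this way, rewriting the hottest kernels directly with the intrinsics in this reference usually yields better performance.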
The Power Vector Library (pveclib)

The Power Vector Library, also known as pveclib, is a separate project available from GitHub. The pveclib project builds on top of the intrinsics described in this manual to provide higher-level vector interfaces that are highly portable. The goals of the project include:

- Providing equivalent functions across versions of the Power ISA. For example, the Vector Multiply-by-10 Unsigned Quadword operation introduced in Power ISA 3.0 (POWER9) can be implemented using a few vector instructions on earlier Power ISA versions.

- Providing equivalent functions across compiler versions. For example, intrinsics provided in later versions of a compiler can be implemented as inline functions with inline asm in earlier compiler versions.

- Providing higher-order functions not provided directly by the Power ISA. One example is a vector SIMD implementation of ASCII __isalpha and similar functions. Another example is full __int128 implementations of Count Leading Zeros, Population Count, and Multiply.
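To give a flavor of the kind of higher-order function pveclib provides, here is a minimal sketch of a full __int128 population count built from the vec_popcnt intrinsic. The function name and code are illustrative only and are not pveclib's actual interface; pveclib's own implementations additionally handle older ISA levels and compiler versions.

    #include <altivec.h>
    #include <string.h>

    /* Hypothetical sketch (not pveclib's API): population count of an
       unsigned __int128, assuming a POWER8 or later target where
       vec_popcnt on vector unsigned long long is available.  */
    static inline unsigned int
    popcount_u128 (unsigned __int128 x)
    {
      vector unsigned long long v;
      memcpy (&v, &x, sizeof (v));                   /* move the value into a vector   */
      vector unsigned long long c = vec_popcnt (v);  /* per-doubleword population count */
      return (unsigned int) (c[0] + c[1]);           /* combine the two halves          */
    }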