You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
307 lines
13 KiB
307 lines
13 KiB
<!-- |
|
Copyright (c) 2019 OpenPOWER Foundation |
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
you may not use this file except in compliance with the License. |
|
You may obtain a copy of the License at |
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
See the License for the specific language governing permissions and |
|
limitations under the License. |
|
|
|
--> |
|
<chapter version="5.0" xml:lang="en" xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude" |
|
xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="section_techniques"> |
|
|
|
<!-- Chapter Title goes here. --> |
|
<title>Vector Programming Techniques</title> |
|
|
|
<section> |
|
<title>Help the Compiler Help You</title> |
|
<para> |
|
The best way to use vector intrinsics is often <emphasis>not to |
|
use them at all</emphasis>. |
|
</para> |
|
<para> |
|
This may seem counterintuitive at first. Aren't vector |
|
intrinsics the best way to ensure that the compiler does exactly |
|
what you want? Well, sometimes. But the problem is that the |
|
best instruction sequence today may not be the best instruction |
|
sequence tomorrow. As the Power ISA moves forward, new |
|
instruction capabilities appear, and the old code you wrote can |
|
easily become obsolete. Then you start having to create |
|
different versions of the code for different levels of the |
|
Power ISA, and it can quickly become difficult to maintain. |
|
</para> |
|
<para> |
|
Most often programmers use vector intrinsics to increase the |
|
performance of loop kernels that dominate the performance of an |
|
application or library. However, modern compilers are often |
|
able to optimize such loops to use vector instructions without |
|
having to resort to intrinsics, using an optimization called |
|
autovectorization (or auto-SIMD). Your first focus when writing |
|
loop kernels should be on making the code amenable to |
|
autovectorization by the compiler. Start by writing the code |
|
naturally, using scalar memory accesses and data operations, and |
|
see whether the compiler autovectorizes your code. If not, here |
|
are some steps you can try: |
|
</para> |
|
<itemizedlist> |
|
<listitem> |
|
<para> |
|
<emphasis role="underline">Check your optimization |
|
level</emphasis>. Different compilers enable |
|
autovectorization at different optimization levels. For |
|
example, at this writing the GCC compiler requires |
|
<code>-O3</code> to enable autovectorization by default. |
|
</para> |
|
</listitem> |
|
<listitem> |
|
<para> |
|
<emphasis role="underline">Consider using |
|
<code>-ffast-math</code></emphasis>. This option assumes |
|
that certain fussy aspects of IEEE floating-point can be |
|
ignored, such as the presence of Not-a-Numbers (NaNs), |
|
signed zeros, and so forth. <code>-ffast-math</code> may |
|
also affect precision of results that may not matter to your |
|
application. Turning on this option can simplify the |
|
control flow of loops generated for your application by |
|
removing tests for NaNs and so forth. (Note that |
|
<code>-Ofast</code> turns on both -O3 and -ffast-math in |
|
GCC.) |
|
</para> |
|
</listitem> |
|
<listitem> |
|
<para> |
|
<emphasis role="underline">Align your data wherever |
|
possible</emphasis>. For most effective auto-vectorization, |
|
arrays of data should be aligned on at least a 16-byte |
|
boundary, and pointers to that data should be identified as |
|
having the appropriate alignment. For example: |
|
</para> |
|
<programlisting> float fdata[4096] __attribute__((aligned(16)));</programlisting> |
|
<para> |
|
ensures that the compiler can use an efficient, aligned |
|
vector load to bring data from <code>fdata</code> into a |
|
vector register. Autovectorization will appear more |
|
profitable to the compiler when data is known to be |
|
aligned. |
|
</para> |
|
<para> |
|
You can also declare pointers to point to aligned data, |
|
which is particularly useful in function arguments: |
|
</para> |
|
<programlisting> void foo (__attribute__((aligned(16))) double * aligned_fptr)</programlisting> |
|
</listitem> |
|
<listitem> |
|
<para> |
|
<emphasis role="underline">Tell the compiler when data can't |
|
overlap</emphasis>. In C and C++, use of pointers can cause |
|
compilers to pessimistically analyze which memory references |
|
can refer to the same memory. This can prevent important |
|
optimizations, such as reordering memory references, or |
|
keeping previously loaded values in memory rather than |
|
reloading them. Inefficiently optimized scalar loops are |
|
less likely to be autovectorized. You can annotate your |
|
pointers with the <code>restrict</code> or |
|
<code>__restrict__</code> keyword to tell the compiler that |
|
your pointers don't "alias" with any other memory |
|
references. (<code>restrict</code> can be used only in C |
|
when compiling for the C99 standard or later. |
|
<code>__restrict__</code> is a language extension, available |
|
in GCC, Clang, and the XL <phrase revisionflag="added">and |
|
Open XL</phrase> compilers, that can be used without |
|
restriction for both C and C++. See your compiler's user |
|
manual for details.) |
|
</para> |
|
<para> |
|
Suppose you have a function that takes two pointer |
|
arguments, one that points to data your function writes to, and |
|
one that points to data your function reads from. By |
|
default, the compiler may believe that the data being read |
|
and written could overlap. To disabuse the compiler of this |
|
notion, do the following: |
|
</para> |
|
<programlisting> void foo (double *__restrict__ outp, double *__restrict__ inp)</programlisting> |
|
</listitem> |
|
</itemizedlist> |
|
</section> |
|
|
|
<section> |
|
<title>Use Portable Intrinsics</title> |
|
<para> |
|
If you can't convince the compiler to autovectorize your code, |
|
or you want to access specific processor features not |
|
appropriate for autovectorization, you should use intrinsics. |
|
However, you should go out of your way to use intrinsics that |
|
are as portable as possible, in case you need to change |
|
compilers in the future. |
|
</para> |
|
<para> |
|
This reference provides intrinsics that are guaranteed to be |
|
portable across compliant compilers. In particular, the <phrase |
|
revisionflag="changed">GCC, Clang, and Open XL</phrase> |
|
compilers for Power implement the intrinsics in |
|
this manual. The compilers may each implement many more |
|
intrinsics, but the ones in this manual are the only ones |
|
guaranteed to be portable. So if you are using an interface not |
|
described here, you should look for an equivalent one in this |
|
manual and change your code to use that. |
|
</para> |
|
<para> |
|
Where an intrinsic may not be available from all compilers or at |
|
all ISA levels, this information is called out in the |
|
description of the intrinsic in <xref |
|
linkend="VIPR.reference.vecfns" />. |
|
</para> |
|
<para> |
|
There are also other vector APIs that may be of use to you (see |
|
<xref linkend="VIPR.techniques.apis" />). In particular, the |
|
Power Vector Library (see <xref |
|
linkend="VIPR.techniques.pveclib" />) provides additional |
|
portability across compiler and ISA versions, as well as |
|
interfaces that hide cases where assembly language is needed. |
|
</para> |
|
</section> |
|
|
|
<section> |
|
<title>Use Assembly Code Sparingly</title> |
|
<para> |
|
Sometimes the compiler will absolutely not cooperate in giving |
|
you the code you need. You might not get the instruction you |
|
want, or you might get extra instructions that are slowing down |
|
your ideal performance. When that happens, the first thing you |
|
should do is report this to the compiler community! This will |
|
allow them to get the problem fixed in the next release of the |
|
compiler. See <xref linkend="VIPR.intro.reporting" /> if you |
|
need to report an issue. |
|
</para> |
|
<para> |
|
In the meanwhile, though, what are your options? As a |
|
workaround, your best option may be to use assembly code. There |
|
are two ways to go about this. Using inline assembly is |
|
generally appropriate only for very small snippets of code (1-5 |
|
instructions, say). If you want to write a whole function in |
|
assembly code, though, it is better to create a separate |
|
<code>.s</code> or <code>.S</code> file. The only difference in |
|
these two file types is that a <code>.S</code> file will be |
|
processed by the C preprocessor before being assembled. |
|
</para> |
|
<para> |
|
Assembly programming is beyond the scope of this manual. |
|
Getting inline assembly correct can be quite tricky, and it is |
|
best to look at existing examples to learn how to use it |
|
properly. However, there is a good introduction to inline |
|
assembly in <emphasis>Using the GNU Compiler |
|
Collection</emphasis>, in section 6.47 at the time of this |
|
writing. Felix Cloutier has also written a very nice guide. |
|
See <xref linkend="VIPR.intro.links" />. |
|
</para> |
|
<para> |
|
If you write a function entirely in assembly, you are |
|
responsible for following the calling conventions established by |
|
the ABI (see <xref linkend="VIPR.intro.links" />). Again, it is |
|
best to look at examples. One place to find well-written |
|
<code>.S</code> files is in the <phrase |
|
revisionflag="changed">GNU C Library project (see <xref |
|
linkend="VIPR.intro.links" />).</phrase> You can also |
|
study the assembly output from your favorite compiler, which can |
|
be obtained with the <code>-S</code> or similar option, or by |
|
using the <emphasis role="bold">objdump</emphasis> utility: |
|
</para> |
|
<para revisionflag="added"> |
|
<programlisting> objdump -dr <binary or object file></programlisting> |
|
</para> |
|
</section> |
|
|
|
<section xml:id="VIPR.techniques.apis"> |
|
<title>Other Vector Programming APIs</title> |
|
<para>In addition to the intrinsic functions provided in this |
|
reference, programmers should be aware of other vector programming |
|
API resources.</para> |
|
<section> |
|
<title>x86 Vector Portability Headers</title> |
|
<para> |
|
Recent versions of the <phrase revisionflag="changed">GCC, |
|
Clang, and Open XL</phrase> compilers |
|
for Power provide "drop-in" portability headers for portions |
|
of the <phrase revisionflag="changed"><trademark |
|
class="registered">Intel</trademark></phrase> Architecture |
|
Instruction Set Extensions (see <xref |
|
linkend="VIPR.intro.links" />). These headers mirror the APIs |
|
of Intel headers having the same names. As of this writing, |
|
support is provided for the MMX and SSE layers, up through |
|
SSE3 and portions of SSE4. No support for the AVX layers is |
|
envisioned. The portability headers are available starting |
|
with GCC 8.1 and Clang 9.0.0. |
|
</para> |
|
<para> |
|
The portability headers provide the same semantics as the |
|
corresponding Intel APIs, but using VMX and VSX instructions |
|
to emulate the Intel vector instructions. It should be |
|
emphasized that these headers are provided for portability, |
|
and will not necessarily perform optimally (although in many |
|
cases the performance is very good). Using these headers is |
|
often a good first step in porting a library using Intel |
|
intrinsics to Power, after which more detailed rewriting of |
|
algorithms is usually desirable for best performance. |
|
</para> |
|
<para> |
|
Access to the portability APIs occurs automatically when |
|
including one of the corresponding Intel header files, such as |
|
<code><mmintrin.h></code>. <phrase |
|
revisionflag="added">To enable the portability headers, you |
|
must compile with |
|
<code>-DNO_WARN_X86_INTRINSICS</code>.</phrase> |
|
</para> |
|
</section> |
|
<section xml:id="VIPR.techniques.pveclib"> |
|
<title>The Power Vector Library (pveclib)</title> |
|
<para>The Power Vector Library, also known as |
|
<code>pveclib</code>, is a separate project available from |
|
<phrase revisionflag="changed">GitHub</phrase> (see <xref |
|
linkend="VIPR.intro.links" />). The |
|
<code>pveclib</code> project builds on top of the intrinsics |
|
described in this manual to provide higher-level vector |
|
interfaces that are highly portable. The goals of the project |
|
include: |
|
</para> |
|
<itemizedlist> |
|
<listitem> |
|
<para> |
|
Providing equivalent functions across versions of the |
|
Power ISA. For example, the <emphasis>Vector |
|
Multiply-by-10 Unsigned Quadword</emphasis> operation |
|
introduced in Power ISA 3.0 (POWER9) can be implemented |
|
using a few vector instructions on earlier Power ISA |
|
versions. |
|
</para> |
|
</listitem> |
|
<listitem> |
|
<para> |
|
Providing equivalent functions across compiler versions. |
|
For example, intrinsics provided in later versions of the |
|
compiler can be implemented as inline functions with |
|
inline asm in earlier compiler versions. |
|
</para> |
|
</listitem> |
|
<listitem> |
|
<para> |
|
Providing higher-order functions not provided directly by |
|
the Power ISA. One example is a vector SIMD implementation |
|
for ASCII <code>__isalpha</code> and similar functions. |
|
Another example is full <code>__int128</code> |
|
implementations of <emphasis>Count Leading |
|
Zeroes</emphasis>, <emphasis>Population Count</emphasis>, |
|
and <emphasis>Multiply</emphasis>. |
|
</para> |
|
</listitem> |
|
</itemizedlist> |
|
</section> |
|
</section> |
|
|
|
</chapter>
|
|
|