<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_power_vector_scalar_floatingpoint">
<title>Vector-Scalar Floating-Point Operations (VSX)</title>
<para>With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities
of the PowerISA:</para>
<itemizedlist spacing="compact">
<listitem>
<para>Extend the available vector and floating-point scalar register
sets from 32 registers each to a combined register set of 64 x 64-bit
scalar floating-point and
64 x 128-bit vector registers.</para>
</listitem>
<listitem>
<para>Enable scalar double float operations on all 64 scalar
registers.</para>
</listitem>
<listitem>
<para>Enable vector double and vector float operations for all 64
vector registers.</para>
</listitem>
<listitem>
<para>Enable super-scalar execution of vector instructions and support
2 independent vector floating-point pipelines for parallel execution of 4 x
64-bit floating-point Fused Multiply Adds (FMAs) and 8 x 32-bit FMAs per
cycle.</para>
</listitem>
</itemizedlist>
<para>With PowerISA 2.07 (POWER8) we added single-precision scalar
floating-point instructions to VSX. This completes the floating-point
computational set for VSX. This ISA release also clarified how these operate in
the Little Endian storage model.</para>
<para>While the focus was on enhanced floating-point computation (for High
Performance Computing), VSX also extended the ISA with additional storage
access, logical, and permute (merge, splat, shift) instructions. This was
necessary to extend these operations to cover all 64 VSX registers, and it
improved unaligned storage access for vectors (not available in VMX).</para>
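<para>As an illustration of the unaligned storage access mentioned above, the
following is a minimal sketch that loads and stores two doubles at an address
with no alignment guarantee. The type name <literal>v2df</literal> and the
helper names are assumptions made for this example and do not come from the
PowerISA or the ABI. The sketch relies only on the GCC/Clang vector extension
and <literal>memcpy</literal>, and leaves the choice of load and store
instructions to the compiler.</para>
<programlisting language="c"><![CDATA[#include <string.h>

/* Hypothetical name: a 128-bit vector of two doubles, declared with the
   GCC/Clang vector extension.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Load two doubles from an address with no alignment guarantee.
   memcpy makes no alignment assumption, so the compiler is free to
   select an unaligned vector load where the target provides one.  */
static inline v2df
load_unaligned_v2df (const void *p)
{
  v2df v;
  memcpy (&v, p, sizeof v);
  return v;
}

/* Store two doubles to an address with no alignment guarantee.  */
static inline void
store_unaligned_v2df (void *p, v2df v)
{
  memcpy (p, &v, sizeof v);
}]]></programlisting>
<para>Dereferencing a <literal>v2df *</literal> directly would instead tell
the compiler that the address is 16-byte aligned, which is the assumption the
VMX load and store instructions were built around.</para>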
<para>PowerISA 2.07B Chapter 7, Vector-Scalar Floating-Point Operations,
starts with an introduction and overview (sections 7.1-7.5).
The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers
and how they relate to (overlap and inter-operate with) the existing
floating-point scalar registers (FPRs) and vector registers (VMX VRs).
<literallayout><literal>7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 317
7.1.1 Overview of the Vector-Scalar Extension . . . . . . . . . . . 317
7.2 VSX Registers . . . . . . . . . . . . . . . . . . . . . . . . . 318
7.2.1 Vector-Scalar Registers . . . . . . . . . . . . . . . . . . . 318
7.2.2 Floating-Point Status and Control Register . . . . . . . . . . 321</literal></literallayout></para>
<para>How the existing facilities map onto the VSX registers is defined in
“7.1.1.1 Compatibility with Category
Floating-Point and Category Decimal Floating-Point Operations” and
“7.1.1.2 Compatibility with Category Vector Operations”:
<blockquote>
<para>The instruction sets defined in Chapter 4.
Floating-Point Facility and Chapter 5. Decimal
Floating-Point retain their definition with one primary
difference. The FPRs are mapped to doubleword
element 0 of VSRs 0-31. The contents of doubleword 1
of the VSR corresponding to a source FPR specified
by an instruction are ignored. The contents of
doubleword 1 of a VSR corresponding to the target
FPR specified by an instruction are undefined.</para>
<para>The instruction set defined in Chapter 6. Vector Facility
[Category: Vector], retains its definition with one
primary difference. The VRs are mapped to VSRs
32-63.</para></blockquote></para>
<note><para>The reference to doubleword element 0 above is from the big endian
register perspective of the ISA. In the PPC64LE ABI implementation, and for the
purpose of porting Intel intrinsics, this is logical doubleword element 1. Intel SSE
scalar intrinsics operate on logical element [0], which is in the wrong
position for PowerISA FPU and VSX scalar floating-point operations. Also note
what happens to the other half of the VSR when you execute a
scalar floating-point instruction (<emphasis>The contents of doubleword 1 of a VSR …
are undefined.</emphasis>)</para></note>
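<para>To make the Intel scalar convention concrete, the following sketch
models it with the GCC/Clang vector extension. The type and function names
are hypothetical and chosen only for this example. The comments name the
Intel intrinsics whose behavior each helper imitates. A PowerISA
implementation has to reproduce exactly this behavior for logical element
[0], and also has to give element [1] a defined value.</para>
<programlisting language="c"><![CDATA[/* Hypothetical name: a 128-bit vector of two doubles.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Models _mm_set_sd: the scalar goes to logical element [0] and
   element [1] is set to zero.  */
static inline v2df
set_scalar_low (double x)
{
  v2df r = { x, 0.0 };
  return r;
}

/* Models _mm_cvtsd_f64: return logical element [0].  */
static inline double
get_scalar_low (v2df v)
{
  return v[0];
}

/* Models _mm_add_sd: add the [0] elements and pass element [1] of the
   first operand through unchanged.  A scalar floating-point instruction
   alone cannot do this, because it leaves the other doubleword of the
   target VSR undefined.  */
static inline v2df
add_scalar_low (v2df a, v2df b)
{
  a[0] = a[0] + b[0];
  return a;
}]]></programlisting>
<para>Writing the operation with element notation, as above, leaves it to the
compiler to move the scalar into and out of the position that the scalar
instructions expect.</para>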
<para>The compiler hides some of this detail when generating code for
little endian vector element [] notation and most vector built-ins. For example,
<literal>vec_splat (A, 0)</literal> is transformed for
PPC64LE to <literal>xxspltd VRT,VRA,1</literal>.
What the compiler <emphasis><emphasis role="bold">cannot</emphasis></emphasis>
hide is the different placement of scalars within vector registers.</para>
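<para>A small sketch of the splat case follows. It assumes a compiler that
defines <literal>__VSX__</literal> when VSX is enabled and that provides
<literal>vec_splat</literal> for <literal>vector double</literal> in
<literal>altivec.h</literal>. On other targets it falls back to the generic
vector extension so that the same source still builds. The names
<literal>v2df</literal> and <literal>splat_elem0</literal> are hypothetical.</para>
<programlisting language="c"><![CDATA[#if defined (__VSX__)
#include <altivec.h>
typedef __vector double v2df;
#else
/* Fallback for non-VSX targets: generic GCC/Clang vector extension.  */
typedef double v2df __attribute__ ((vector_size (16)));
#endif

/* Replicate logical element [0] of A into both elements of the result.
   The source always names logical element 0; mapping that to the
   register-level element number is left to the compiler.  */
static inline v2df
splat_elem0 (v2df a)
{
#if defined (__VSX__)
  return vec_splat (a, 0);
#else
  v2df r = { a[0], a[0] };
  return r;
#endif
}]]></programlisting>
<para>The source code uses the same logical element number in both endian
modes. Only the generated instruction operand differs.</para>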
<para>Vector registers (VRs) 0-31 overlay, and can be accessed as, vector-scalar
registers (VSRs) 32-63. The ABI also specifies that VR2-13 are used to
pass parameters and return values. In some cases the same (or similar) operations
exist in both VMX and VSX instruction forms, while in other cases
operations exist only for VMX (byte level permute and shift) or only for VSX
(vector double).</para>
<para>So register selection that avoids unnecessary vector moves and follows
the ABI, while maintaining the correct instruction-specific register numbering,
can be tricky. The
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints">GCC register constraint</link>
annotations for inline
assembler using vector instructions are challenging, even for experts. So only
experts should be writing assembler, and then only in extraordinary
circumstances. You should leave these details to the compiler (using vector
extensions and vector built-ins) whenever possible.</para>
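<para>As a sketch of what leaving these details to the compiler can look
like, the following multiply-add is written with the GCC/Clang vector
extension only. The names <literal>v2df</literal> and
<literal>mul_add_v2df</literal> are hypothetical. No register numbers or
constraints appear in the source. The compiler chooses between VMX and VSX
register ranges and follows the ABI for parameters and return values.</para>
<programlisting language="c"><![CDATA[/* Hypothetical name: a 128-bit vector of two doubles.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Element-wise r[i] = a[i] * b[i] + c[i].
   Register allocation and instruction selection are left to the
   compiler, which may contract the expression into a fused
   multiply-add depending on its floating-point contraction setting.  */
static inline v2df
mul_add_v2df (v2df a, v2df b, v2df c)
{
  return a * b + c;
}]]></programlisting>
<para>The same function builds unchanged for other targets that support the
vector extension, which is one reason to prefer this form over inline
assembler.</para>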
<para>The next sections get into the details of floating-point
representation, operations, and exceptions. They describe the implementation
details for the IEEE-754R and C/C++ language standards that most developers only
access via higher level APIs. Most programmers will not need this level of
detail, but it is there if needed.
<literallayout><literal>7.3 VSX Operations . . . . . . . . . . . . . . . . . . . . . . . . . 326
7.3.1 VSX Floating-Point Arithmetic Overview . . . . . . . . . . . . 326
7.3.2 VSX Floating-Point Data . . . . . . . . . . . . . . . . . . . 327
7.3.3 VSX Floating-Point Execution Models . . . . . . . . . . . . . 335
7.4 VSX Floating-Point Exceptions . . . . . . . . . . . . . . . . . 338
7.4.1 Floating-Point Invalid Operation Exception . . . . . . . . . . 341
7.4.2 Floating-Point Zero Divide Exception . . . . . . . . . . . . . 347
7.4.3 Floating-Point Overflow Exception. . . . . . . . . . . . . . . 349
7.4.4 Floating-Point Underflow Exception . . . . . . . . . . . . . . 351</literal></literallayout></para>
<para>Next comes an overview of the VSX storage access instructions for big and
little endian and for aligned and unaligned data addresses. This includes
diagrams that illuminate the differences.
<literallayout><literal>7.5 VSX Storage Access Operations . . . . . . . . . . . . . . . . . 356
7.5.1 Accessing Aligned Storage Operands . . . . . . . . . . . . . . 356
7.5.2 Accessing Unaligned Storage Operands . . . . . . . . . . . . . 357
7.5.3 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 358</literal></literallayout></para>
<para>Section 7.6 starts with a VSX Instruction Set Summary, which is the
place to start to get a feel for the types and operations supported. The
emphasis on floating-point, both scalar and vector (especially vector double), is
pronounced. Many of the scalar and single-precision vector instructions look
like duplicates of what we have seen in the Chapter 4 Floating-Point and
Chapter 6 Vector facilities. The difference here is new instruction encodings
to access the full 64 VSX register space.</para>
<para>In addition, there are a small number of logical instructions
included to support predication (selecting / masking vector elements based on
comparison results), and a set of permute, merge, shift, and splat instructions that
operate on VSX word (float) and doubleword (double) elements. As mentioned
for VMX section 6.8, these instructions are good to study as they are useful
for realigning elements from PowerISA vector results to the form required for Intel
intrinsics.
<literallayout><literal>7.6 VSX Instruction Set . . . . . . . . . . . . . . . . . . . . . . 359
7.6.1 VSX Instruction Set Summary . . . . . . . . . . . . . . . . . 359
7.6.1.1 VSX Storage Access Instructions . . . . . . . . . . . . . . 359
7.6.1.2 VSX Move Instructions . . . . . . . . . . . . . . . . . . . 360
7.6.1.3 VSX Floating-Point Arithmetic Instructions . . . . . . . . 360
7.6.1.4 VSX Floating-Point Compare Instructions . . . . . . . . . . 363
7.6.1.5 VSX DP-SP Conversion Instructions . . . . . . . . . . . . . 364
7.6.1.6 VSX Integer Conversion Instructions . . . . . . . . . . . . 364
7.6.1.7 VSX Round to Floating-Point Integer Instructions . . . . . 366
7.6.1.8 VSX Logical Instructions. . . . . . . . . . . . . . . . . . 366
7.6.1.9 VSX Permute Instructions. . . . . . . . . . . . . . . . . . 367
7.6.2 VSX Instruction Description Conventions . . . . . . . . . . . 368
7.6.3 VSX Instruction Descriptions . . . . . . . . . . . . . . . . 392</literal></literallayout></para>
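<para>The predication pattern described above (compare, then select elements
through a mask) can be sketched with the GCC/Clang vector extension as
follows. All type and function names are hypothetical. Which VSX compare and
select instructions the compiler emits for this source is not shown here and
depends on the compiler and target options.</para>
<programlisting language="c"><![CDATA[/* Hypothetical names: two doubles, and two 64-bit integers used as a
   per-element mask.  */
typedef double    v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

/* For each element, return X where the mask element is all ones and
   Y where it is all zeros.  The casts reinterpret the bits and do not
   convert values.  */
static inline v2df
select_v2df (v2di mask, v2df x, v2df y)
{
  v2di r = ((v2di) x & mask) | ((v2di) y & ~mask);
  return (v2df) r;
}

/* Element-wise maximum built from a compare and a select.  The vector
   comparison yields -1 (all ones) where a[i] > b[i] and 0 elsewhere.  */
static inline v2df
max_v2df (v2df a, v2df b)
{
  v2di gt = (v2di) (a > b);
  return select_v2df (gt, a, b);
}]]></programlisting>
<para>The same compare and select structure applies to other conditional
element operations, with only the comparison changed.</para>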
<para>The VSX Instruction Descriptions section contains the detailed
description of each VSX category instruction. The table entries from the
Instruction Set Summary are formatted in the document as hyperlinks to
the corresponding instruction descriptions.</para>
</section>