Vector-Scalar Floating-Point Operations (VSX) With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities of the PowerISA: Extend the available vector and floating-point scalar register sets from 32 registers each to a combined register set of 64 x 64-bit scalar floating-point and 64 x 128-bit vector registers. Enable scalar double float operations on all 64 scalar registers. Enable vector double and vector float operations for all 64 vector registers. Enable super-scalar execution of vector instructions and support 2 independent vector floating point  pipelines for parallel execution of 4 x 64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit FMAs per cycle. With PowerISA 2.07 (POWER8) we added single-precision scalar floating-point instructions to VSX. This completes the floating-point computational set for VSX. This ISA release also clarified how these operate in the Little Endian storage model. While the focus was on enhanced floating-point computation (for High Performance Computing), VSX also extended  the ISA with additional storage access, logical, and permute (merge, splat, shift) instructions. This was necessary to extend these operations to cover 64 VSX registers, and improves unaligned storage access for vectors  (not available in VMX). The PowerISA 2.07B Chapter 7. Vector-Scalar Floating-Point Operations is organized starting with an introduction and overview (chapters 7.1- 7.5) . The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers and how they relate (overlap and inter-operate) to the existing floating point scalar (FPRs) and vector (VMX VRs) registers. 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 317 7.1.1 Overview of the Vector-Scalar Extension . . . . . . . . . . . 317 7.2 VSX Registers . . . . . . . . . . . . . . . . . . . . . . . . . 318 7.2.1 Vector-Scalar Registers . . . . . . . . . . . . . . . . . . . 318 7.2.2 Floating-Point Status and Control Register . . . . . . . . . . 321 The definitions given in “7.1.1.1 Compatibility with Category Floating-Point and Category Decimal Floating-Point Operations”, and “7.1.1.2 Compatibility with Category Vector Operations”
The instruction sets defined in Chapter 4. Floating-Point Facility and Chapter 5. Decimal Floating-Point retain their definition with one primary difference. The FPRs are mapped to doubleword element 0 of VSRs 0-31. The contents of doubleword 1 of the VSR corresponding to a source FPR specified by an instruction are ignored. The contents of doubleword 1 of a VSR corresponding to the target FPR specified by an instruction are undefined. The instruction set defined in Chapter 6. Vector Facility [Category: Vector], retains its definition with one primary difference. The VRs are mapped to VSRs 32-63.
The reference to scalar element 0 above is from the big endian register perspective of the ISA. In the PPC64LE ABI implementation, and for the purpose of porting Intel intrinsics, this is logical doubleword element 1.  Intel SSE scalar intrinsics operated on logical element [0],  which is in the wrong position for PowerISA FPU and VSX scalar floating-point  operations. Another important note is what happens to the other half of the VSR when you execute a scalar floating-point instruction (The contents of doubleword 1 of a VSR … are undefined.) The compiler will hide some of this detail when generating code for little endian vector element [] notation and most vector built-ins. For example vec_splat (A, 0) is transformed for PPC64LE to xxspltd VRT,VRA,1. What the compiler can not hide is the different placement of scalars within vector registers. Vector registers (VRs) 0-31 overlay and can be accessed from vector scalar registers (VSRs) 32-63. The ABI also specifies that VR2-13 are used to pass parameter and return values. In some cases the same (similar) operations exist in both VMX and VSX instruction forms, while in the other cases operations only exist for VMX (byte level permute and shift) or VSX (Vector double). So register selection that avoids unnecessary vector moves and follows the ABI while maintaining the correct instruction specific register numbering, can be tricky. The GCC register constraint annotations for Inline assembler using vector instructions are challenging, even for experts. So only experts should be writing assembler and then only in extraordinary circumstances. You should leave these details to the compiler (using vector extensions and vector built-ins) when ever possible. The next sections gets into the details of floating point representation, operations, and exceptions. They describe the implementation details for the IEEE-754R and C/C++ language standards that most developers only access via higher level APIs. Most programmers will not need this level of detail, but it is there if needed. 7.3 VSX Operations . . . . . . . . . . . . . . . . . . . . . . . . . 326 7.3.1 VSX Floating-Point Arithmetic Overview . . . . . . . . . . . . 326 7.3.2 VSX Floating-Point Data . . . . . . . . . . . . . . . . . . . 327 7.3.3 VSX Floating-Point Execution Models . . . . . . . . . . . . . 335 7.4 VSX Floating-Point Exceptions . . . . . . . . . . . . . . . . . 338 7.4.1 Floating-Point Invalid Operation Exception . . . . . . . . . . 341 7.4.2 Floating-Point Zero Divide Exception . . . . . . . . . . . . . 347 7.4.3 Floating-Point Overflow Exception. . . . . . . . . . . . . . . 349 7.4.4 Floating-Point Underflow Exception . . . . . . . . . . . . . . 351 Next comes an overview of the VSX storage access instructions for big and little endian and for aligned and unaligned data addresses. This included diagrams that illuminate the differences. 7.5 VSX Storage Access Operations . . . . . . . . . . . . . . . . . 356 7.5.1 Accessing Aligned Storage Operands . . . . . . . . . . . . . . 356 7.5.2 Accessing Unaligned Storage Operands . . . . . . . . . . . . . 357 7.5.3 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 358 Section 7.6 starts with a VSX instruction Set Summary which is the place to start to get a feel for the types and operations supported.  The emphasis on floating-point, both scalar and vector (especially vector double), is pronounced. Many of the scalar and single-precision vector instructions look like duplicates of what we have seen in the Chapter 4 Floating-Point and Chapter 6 Vector facilities. The difference here is new instruction encodings to access the full 64 VSX register space. In addition there are a small number of logical instructions included to support predication (selecting / masking vector elements based on comparison results), and a set of permute, merge, shift, and splat instructions that operate on VSX word (float) and doubleword (double) elements. As mentioned about VMX section 6.8 these instructions are good to study as they are useful for realigning elements from PowerISA vector results to the form required for Intel Intrinsics. 7.6 VSX Instruction Set . . . . . . . . . . . . . . . . . . . . . . 359 7.6.1 VSX Instruction Set Summary . . . . . . . . . . . . . . . . . 359 7.6.1.1 VSX Storage Access Instructions . . . . . . . . . . . . . . 359 7.6.1.2 VSX Move Instructions . . . . . . . . . . . . . . . . . . . 360 7.6.1.3 VSX Floating-Point Arithmetic Instructions . . . . . . . . 360 7.6.1.4 VSX Floating-Point Compare Instructions . . . . . . . . . . 363 7.6.1.5 VSX DP-SP Conversion Instructions . . . . . . . . . . . . . 364 7.6.1.6 VSX Integer Conversion Instructions . . . . . . . . . . . . 364 7.6.1.7 VSX Round to Floating-Point Integer Instructions . . . . . 366 7.6.1.8 VSX Logical Instructions. . . . . . . . . . . . . . . . . . 366 7.6.1.9 VSX Permute Instructions. . . . . . . . . . . . . . . . . . 367 7.6.2 VSX Instruction Description Conventions . . . . . . . . . . . 368 7.6.3 VSX Instruction Descriptions . . . . . . . . . . . . . . . . 392 The VSX Instruction Descriptions section contains the detail description for each VSX category instruction.  The table entries from the Instruction Set Summary are formatted in the document as hyperlinks to corresponding instruction descriptions.