Vector permute and formatting instructions

The Vector Permute and Formatting chapter follows. Its instructions operate on the byte, halfword, word (and, with PowerISA 2.07, doubleword) integer types, plus a special pixel type. The shift instructions in this chapter operate on the vector as a whole at either the bit or the byte (octet) level. This is an important chapter to study for moving PowerISA vector results into the vector elements that Intel Intrinsics expect:

6.8 Vector Permute and Formatting Instructions . . . . . . . . . . . 249
6.8.1 Vector Pack and Unpack Instructions . . . . . . . . . . . . . 249
6.8.2 Vector Merge Instructions . . . . . . . . . . . . . . . . . . 256
6.8.3 Vector Splat Instructions . . . . . . . . . . . . . . . . . . 259
6.8.4 Vector Permute Instruction . . . . . . . . . . . . . . . . . . 260
6.8.5 Vector Select Instruction . . . . . . . . . . . . . . . . . . 261
6.8.6 Vector Shift Instructions . . . . . . . . . . . . . . . . . . 262

The Vector Integer instructions include add, subtract, multiply, and multiply-add/sum operations (there is no divide) for the standard integer types. There are instruction forms that provide signed, unsigned, modulo, and saturate results for most operations. PowerISA 2.07 extends vector integer operations to add / subtract quadword (128-bit) integers with carry and extend. This supports extended binary integer arithmetic to 256-bit, 512-bit, and beyond. There are signed / unsigned compares across the standard integer types (byte through doubleword), the usual bit-wise logical operations, and the SIMD shift / rotate instructions that operate on the vector elements for various integer types.

6.9 Vector Integer Instructions . . . . . . . . . . . . . . . . . . 264
6.9.1 Vector Integer Arithmetic Instructions . . . . . . . . . . . . 264
6.9.2 Vector Integer Compare Instructions. . . . . . . . . . . . . . 294
6.9.3 Vector Logical Instructions . . . . . . . . . . . . . . . . . 300
6.9.4 Vector Integer Rotate and Shift Instructions . . . . . . . . . 302

The vector [single] float instructions are grouped into this chapter. This chapter does not include the double float instructions, which are described in the VSX chapter. VSX also includes additional float instructions that operate on the whole 64-register vector-scalar set.

6.10 Vector Floating-Point Instruction Set . . . . . . . . . . . . . 306
6.10.1 Vector Floating-Point Arithmetic Instructions . . . . . . . . 306
6.10.2 Vector Floating-Point Maximum and Minimum Instructions . . . 308
6.10.3 Vector Floating-Point Rounding and Conversion Instructions. . 309
6.10.4 Vector Floating-Point Compare Instructions . . . . . . . . . 313
6.10.5 Vector Floating-Point Estimate Instructions . . . . . . . . . 316

The vector XOR-based instructions are new with PowerISA 2.07 (POWER8) and provide vector crypto and check-sum operations:

6.11 Vector Exclusive-OR-based Instructions . . . . . . . . . . . . 318
6.11.1 Vector AES Instructions . . . . . . . . . . . . . . . . . . . 318
6.11.2 Vector SHA-256 and SHA-512 Sigma Instructions . . . . . . . . 320
6.11.3 Vector Binary Polynomial Multiplication Instructions. . . . . 321
6.11.4 Vector Permute and Exclusive-OR Instruction . . . . . . . . . 323

The vector gather and bit permute instructions support bit-level rearrangement within the vector, while the vector versions of the count leading zeros and population count instructions are useful to accelerate specific algorithms.

6.12 Vector Gather Instruction . . . . . . . . . . . . . . . . . . . 324
6.13 Vector Count Leading Zeros Instructions . . . . . . . . . . . . 325
6.14 Vector Population Count Instructions. . . . . . . . . . . . . . 326
6.15 Vector Bit Permute Instruction . . . . . . . . . . . . . . . . 327

The Decimal Integer add / subtract (fixed point) instructions complement the Decimal Floating-Point instructions. They can also be used to accelerate some binary to/from decimal conversions.
The VSCR instructions provide access to the Non-Java mode floating-point control and the saturation status. These instructions are not normally of interest in porting Intel intrinsics.

6.16 Decimal Integer Arithmetic Instructions . . . . . . . . . . . . 328
6.17 Vector Status and Control Register Instructions . . . . . . . . 331

With PowerISA 2.07B (POWER8), several major extensions were added to the Vector Facility:

- Vector Crypto: under “Vector Exclusive-OR-based Instructions”, AES [inverse] Cipher, SHA 256 / 512 Sigma, Polynomial Multiplication, and Permute and XOR instructions.
- 64-bit Integer: signed and unsigned add / subtract, signed and unsigned compare, Even / Odd 32 x 32 multiply with 64-bit product, signed / unsigned max / min, rotate and shift left/right.
- Direct Move between GPRs and the FPRs / Left half of Vector Registers.
- 128-bit integer add / subtract with carry / extend, direct support for vector __int128 and multiple precision arithmetic.
- Decimal Integer add / subtract for 31 digit Binary Coded Decimal (BCD).
- Miscellaneous SIMD extensions: Count leading Zeros, Population count, bit gather / permute, and vector forms of eqv, nand, orc.

The rationale for these being included in the Vector Facilities (VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with how the instructions were encoded than with the type of operations or the ISA version of introduction. This is primarily a trade-off between the bits required for register selection versus the bits for extended op-code space within a fixed 32-bit instruction. Basically, accessing 32 vector registers requires 5 bits per register, while accessing all 64 vector-scalar registers requires 6 bits per register. When you consider that most vector instructions require 3 register operands, and some (select, fused multiply-add) require 4, the impact on op-code space is significant.
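The register-field arithmetic above can be sketched in a few lines of C. This is only a back-of-the-envelope model under stated assumptions, not a description of any actual PowerISA instruction format: the function name `opcode_bits_left` and the constants are hypothetical, and the bits it reports as left over must still hold the primary opcode, any extended opcode, and any other fields.

```c
#include <assert.h>

/* Assumed figures taken from the text: a fixed 32-bit instruction,
 * 5 bits to select one of 32 VMX registers, 6 bits to select one of
 * 64 VSX registers. The names are illustrative only. */
enum {
    INSN_BITS    = 32,
    VMX_REG_BITS = 5,
    VSX_REG_BITS = 6
};

/* Hypothetical helper: bits remaining in the instruction word after
 * the register-select fields for `operands` registers are reserved. */
static int opcode_bits_left(int operands, int bits_per_reg)
{
    assert(operands >= 0 && bits_per_reg > 0);
    assert(operands * bits_per_reg <= INSN_BITS);
    return INSN_BITS - operands * bits_per_reg;
}
```

Under these assumptions, `opcode_bits_left(3, VMX_REG_BITS)` gives 17 while `opcode_bits_left(3, VSX_REG_BITS)` gives 14; for 4-operand forms the figures are 12 and 8. Each extra register-select bit per operand comes directly out of the space available for op-codes.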
The larger register set of VSX was justified by queueing-theory analysis of larger HPC matrix codes using double float, while 32 registers are sufficient for most applications. So, by definition, the VMX instructions are restricted to the original 32 vector registers, while VSX instructions are encoded to access all 64 floating-point scalar and vector double registers. This distinction can be troublesome when programming at the assembler level, but the compiler and compiler built-ins can hide most of this detail from the programmer.
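As an illustration of the kind of operation that is normally reached through a compiler built-in (for example `vec_perm` from `altivec.h`) rather than hand-written assembler, the following portable C sketch models the byte-select behaviour of the Vector Permute instruction listed under 6.8.4. It is a simplified model under stated assumptions, not the ISA definition: the helper name `model_vec_perm` is hypothetical, the vectors are plain 16-byte arrays, and bytes are indexed in array order, which glosses over the big-endian versus little-endian element numbering that a real port has to consider.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16

/* Hypothetical helper (not part of altivec.h): each result byte i uses
 * the low 5 bits of sel[i] as an index into the 32-byte concatenation
 * of a followed by b. Indexes 0..15 pick from a, 16..31 pick from b. */
static void model_vec_perm(uint8_t dst[VEC_BYTES],
                           const uint8_t a[VEC_BYTES],
                           const uint8_t b[VEC_BYTES],
                           const uint8_t sel[VEC_BYTES])
{
    uint8_t tmp[VEC_BYTES];

    assert(dst && a && b && sel);
    for (int i = 0; i < VEC_BYTES; i++) {
        unsigned idx = sel[i] & 0x1Fu;   /* only 5 bits are significant */
        tmp[i] = (idx < VEC_BYTES) ? a[idx] : b[idx - VEC_BYTES];
    }
    /* Copy through a temporary so dst may alias a, b, or sel. */
    memcpy(dst, tmp, VEC_BYTES);
}
```

With a selector of 31, 30, ..., 16, for instance, this model returns the bytes of `b` in reverse order. The same select-by-index pattern is what makes permute useful for rearranging results into the element layout that Intel Intrinsics expect.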