|
|
<?xml version="1.0" encoding="UTF-8"?> |
|
|
<!-- |
|
|
Copyright (c) 2017 OpenPOWER Foundation |
|
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
|
you may not use this file except in compliance with the License. |
|
|
You may obtain a copy of the License at |
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
|
See the License for the specific language governing permissions and |
|
|
limitations under the License. |
|
|
|
|
|
--> |
|
|
<section xmlns="http://docbook.org/ns/docbook" |
|
|
xmlns:xi="http://www.w3.org/2001/XInclude" |
|
|
xmlns:xlink="http://www.w3.org/1999/xlink" |
|
|
version="5.0" |
|
|
xml:id="sec_powerisa_vector_size_type"> |
|
|
<title>How vector elements change size and type</title> |
|
|
|
|
|
<para>Most vector built ins return the same vector type as the (first) |
|
|
input parameters, but there are exceptions. Examples include conversions |
|
|
between types, compares, pack, unpack, merge, and integer multiply |
|
|
operations.</para> |
|
|
|
|
|
<para>Converting floats to / from integer types will change the type and sometimes |
|
|
change the element size as well (double ↔ int and float ↔ long). For |
|
|
VMX the conversions are always the same size (float ↔ [unsigned] int). But |
|
|
VSX allows conversion of 64-bit (long or double) to from 32-bit (float or |
|
|
int) with the inherent size changes. The PowerISA VSX defines a 4-element |
|
|
vector layout where little endian elements 0, 2 are used for input/output and |
|
|
elements 1,3 are undefined. The OpenPOWER ABI Appendix A defines |
|
|
<literal>vec_double</literal> and <literal>vec_float</literal> |
|
|
with even/odd and high/low extensions as program aids. These are not |
|
|
included in GCC 7 or earlier but are planned for GCC 8.</para> |
|
|
|
|
|
<para>Compare operations produce either |
|
|
<literal>vector bool <</literal>input element type<literal>></literal> |
|
|
(effectively bit masks) or predicates (the condition code for all and |
|
|
any are represented as an int truth variable). When a predicate compare (i.e. |
|
|
<literal>vec_all_eq</literal>, <literal>vec_any_gt</literal>) |
|
|
is used in an if statement, the condition code is |
|
|
used directly in the conditional branch and the int truth value is not |
|
|
generated.</para> |
|
|
|
|
|
<para>Pack operations pack integer elements into the next smaller (half) |
|
|
integer sized elements. Pack operations include signed and unsigned saturate |
|
|
and unsigned modulo forms. As the packed result will be half the size (in |
|
|
bits), pack instructions require 2 vectors (256-bits) as input and generate a |
|
|
single 128-bit vector result. |
|
|
<programlisting><![CDATA[vec_vpkudum ({1, 2}, {101, 102}) result={1, 2, 101, 102}]]></programlisting></para> |
|
|
|
|
|
<para>Unpack operations expand integer elements into the next larger size |
|
|
elements. The integers are always treated as signed values and sign-extended. |
|
|
The processor design avoids instructions that return multiple register values. |
|
|
So the PowerISA defines unpack-high and unpack low forms where instruction |
|
|
takes (the high or low) half of vector elements and extends them to fill the |
|
|
vector output. Element order is maintained and an unpack high / low sequence |
|
|
with the same input vector has the effect of unpacking to a 256-bit result in two |
|
|
vector registers. |
|
|
<programlisting><![CDATA[vec_vupkhsw ({1, 2, 3, 4}) result={1, 2} |
|
|
vec_vupkhsw ({-1, 2, -3, 4}) result={-1, 2} |
|
|
vec_vupklsw ({1, 2, 3, 4}) result={3, 4} |
|
|
vec_vupklsw ({-1, 2, -3, 4}) result={-3, 4}]]></programlisting></para> |
|
|
|
|
|
<para>Merge operations resemble shuffling two (vectors) card decks |
|
|
together, alternating (elements) cards in the result. As we are merging from |
|
|
2 vectors (256-bits) into 1 vector (128-bits) and the elements do not change |
|
|
size, we have merge high and merge low instruction forms for each (byte, |
|
|
halfword and word) integer type. The merge high operations alternate elements |
|
|
from the (vector register left) high half of the two input vectors. The merge |
|
|
low operation alternate elements from the (vector register right) low half of |
|
|
the two input vectors.</para> |
|
|
|
|
|
<para>For PowerISA 2.07 we added vector merge word even / odd instructions. |
|
|
Instead of high or low elements the shuffle is from the even or odd number |
|
|
elements of the two input vectors. Passing the same vector to both inputs to |
|
|
merge produces splat-like results for each doubleword half, which is handy in |
|
|
some convert operations. |
|
|
<programlisting><![CDATA[vec_mrghd ({1, 2}, {101, 102}) result={1, 101} |
|
|
vec_mrgld ({1, 2}, {101, 102}) result={2, 102} |
|
|
|
|
|
vec_vmrghw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 2, 102} |
|
|
vec_vmrghw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 2, 2} |
|
|
vec_vmrglw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={3, 103, 4, 104} |
|
|
vec_vmrglw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={3, 3, 4, 4} |
|
|
|
|
|
|
|
|
vec_mergee ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 3, 103} |
|
|
vec_mergee ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 3, 3} |
|
|
vec_mergeo ({1, 2, 3, 4}, {101, 102, 103, 104}) result={2, 102, 4, 104} |
|
|
vec_mergeo ({1, 2, 3, 4}, {1, 2, 3, 4}) result={2, 2, 4, 4}]]></programlisting></para> |
|
|
|
|
|
<para>Integer multiply has the potential to generate twice as many bits in |
|
|
the product as input. A multiply of 2 int (32-bit) values produces a long |
|
|
(64-bits). Normal C language * operations ignore this and discard the top |
|
|
32-bits of the result. However in some computations it useful to preserve the |
|
|
double product precision for intermediate computation before reducing the final |
|
|
result back to the original precision.</para> |
|
|
|
|
|
<para>The PowerISA VMX instruction set took the later approach, i.e., keep all |
|
|
the product bits until the programmer explicitly asks for the truncated result |
|
|
(via the pack operation). |
|
|
So the vector integer multiple are split into even/odd forms across signed and |
|
|
unsigned byte, halfword and word inputs. This requires two instructions (given |
|
|
the same inputs) to generate the full vector multiply across 2 vector |
|
|
registers and 256-bits. Again as POWER processors are super-scalar this pair of |
|
|
instructions should execute in parallel.</para> |
|
|
|
|
|
<para>The set of expanded product values can either be used directly in |
|
|
further (doubled precision) computation or merged/packed into the single single |
|
|
vector at the smaller bit size. This is what the compiler will generate for C |
|
|
vector extension multiply of vector integer types.</para> |
|
|
|
|
|
</section> |
|
|
|
|
|
|