<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_handling_mmx">
  <title>Dealing with MMX</title>

  <para>MMX is actually the harder case. The <literal>__m64</literal>
  type supports SIMD vectors of the integer types (char, short, int, long).
  The Intel API defines <literal>__m64</literal> as:
  <programlisting><![CDATA[typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));]]></programlisting></para>

  <para>This definition is problematic for the PowerPC target (it is not really
  supported in GCC), and we would prefer to use a native PowerISA type that can
  be passed in a single register. The PowerISA Rotate Under Mask instructions
  can easily extract and insert integer fields of a General Purpose Register
  (GPR). This implies that MMX integer types can be handled as an internal
  union of arrays for the supported element types. So a 64-bit unsigned long
  long is the best type for parameter passing and return values, especially
  for the 64-bit (<literal>_si64</literal>) operations, as these normally
  generate a single PowerISA instruction.
  For the PowerPC implementation we will therefore define
  <literal>__m64</literal> as:
  <programlisting><![CDATA[typedef __attribute__ ((__aligned__ (8))) unsigned long long __m64;]]></programlisting></para>
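
  <para>As a minimal sketch of the "internal union of arrays" idea, the
  following portable C fragment views one 64-bit value as arrays of each
  supported element type. All names prefixed with
  <literal>sketch_</literal> are hypothetical and are not part of any real
  header; the alignment attribute is omitted to keep the fragment portable,
  and the actual implementation may be organized differently.</para>
  <programlisting language="c"><![CDATA[#include <stdint.h>

/* Stand-in for the PowerPC __m64 definition (hypothetical name). */
typedef unsigned long long sketch_m64;

/* Hypothetical internal union: one 64-bit value viewed as element arrays. */
typedef union {
  sketch_m64 as_m64;
  int8_t     as_char[8];
  int16_t    as_short[4];
  int32_t    as_int[2];
  int64_t    as_long[1];
} sketch_m64_union;

/* A 64-bit (_si64 style) operation is plain scalar code on the GPR value. */
static inline sketch_m64 sketch_and_si64(sketch_m64 a, sketch_m64 b)
{
  return a & b;
}

/* Element-wise 32-bit add through the union view. */
static inline sketch_m64 sketch_add_pi32(sketch_m64 a, sketch_m64 b)
{
  sketch_m64_union ua, ub, ur;
  ua.as_m64 = a;
  ub.as_m64 = b;
  /* Add as unsigned so that wraparound is well defined. */
  ur.as_int[0] = (int32_t)((uint32_t)ua.as_int[0] + (uint32_t)ub.as_int[0]);
  ur.as_int[1] = (int32_t)((uint32_t)ua.as_int[1] + (uint32_t)ub.as_int[1]);
  return ur.as_m64;
}]]></programlisting>
  <para>Because each 32-bit half is added independently, the result of
  <literal>sketch_add_pi32</literal> does not depend on which half the
  platform stores first.</para>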

  <para>The SSE extensions include some copy / convert operations for
  <literal>__m128</literal> to / from <literal>__m64</literal>, and these
  include some int to / from float conversions. However, in these cases the
  float operands always reside in SSE (XMM) registers (which match the
  PowerISA vector registers) and the MMX registers only contain integer
  values. POWER8 (PowerISA-2.07) has direct move instructions between GPRs and
  VSRs, so these transfers are normally a single instruction and any
  conversions can be handled in the vector unit.</para>

  <para>When transferring a <literal>__m64</literal> value to a vector
  register we should also execute an <literal>xxspltd</literal> instruction to
  ensure there is valid data in all four float element lanes before doing
  floating point operations. This avoids extraneous floating point exceptions
  that might be generated by uninitialized parts of the vector. The top two
  lanes will then hold the floating point results, in position for direct
  transfer to a GPR or for storing via Store Floating-Point Double
  (<literal>stfd</literal>). These operations are internal to the intrinsic
  implementation, so there is no requirement to keep temporary vectors in
  correct Little Endian form.</para>
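
  <para>The splat step can be illustrated with a small sketch. It assumes the
  GCC/Clang generic vector extension
  (<literal>__attribute__ ((vector_size (16)))</literal>) instead of the
  PowerISA instructions themselves, and every <literal>sketch_</literal> name
  is hypothetical. The point is only that both doublewords of the vector
  receive a copy of the 64-bit value, so all four lanes hold valid integers
  before the conversion to float.</para>
  <programlisting language="c"><![CDATA[#include <stdint.h>

/* GCC/Clang generic vector types; an assumption of this sketch. */
typedef uint64_t sketch_v2du __attribute__ ((vector_size (16)));
typedef int32_t  sketch_v4si __attribute__ ((vector_size (16)));
typedef float    sketch_v4sf __attribute__ ((vector_size (16)));

/* Convert the two 32-bit integers packed in m to float in all four lanes. */
static inline sketch_v4sf sketch_cvt_2xint_to_float(uint64_t m)
{
  /* Splat: both doublewords hold m, so no lane is left uninitialized. */
  sketch_v2du splat = { m, m };
  /* Reinterpret the same 128 bits as four 32-bit integers. */
  sketch_v4si ints = (sketch_v4si) splat;
  sketch_v4sf result = { 0.0f, 0.0f, 0.0f, 0.0f };
  for (int i = 0; i < 4; i++)
    result[i] = (float) ints[i];
  return result;
}]]></programlisting>
  <para>Lanes 0 and 2 hold the same value, as do lanes 1 and 3; which integer
  lands in which lane depends on the endianness of the platform.</para>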

  <para>Also, for the smaller element sizes and higher element counts (MMX
  <literal>_pi8</literal> and <literal>_pi16</literal> types),
  the number of Rotate Under Mask instructions required to
  disassemble the 64-bit <literal>__m64</literal>
  into elements, perform the element calculations,
  and reassemble the elements into a single <literal>__m64</literal>
  value can get larger. In this
  case we can generate shorter instruction sequences by transferring (via direct
  move instruction) the GPR <literal>__m64</literal> value to
  a vector register, performing the
  SIMD operation there, and then transferring the <literal>__m64</literal>
  result back to a GPR.</para>
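
  <para>To make the per-element cost concrete, here is a hedged scalar sketch
  of a 16-bit element add performed entirely on a 64-bit integer. Portable
  shifts and masks stand in for the Rotate Under Mask instructions, and the
  <literal>sketch_</literal> function names are hypothetical. Each of the four
  elements needs two extracts, one add, and one insert, which is the kind of
  sequence that a single vector operation would replace.</para>
  <programlisting language="c"><![CDATA[#include <stdint.h>

/* Extract 16-bit element i (0 = least significant) from a 64-bit value. */
static inline uint16_t sketch_extract_pi16(uint64_t m, unsigned i)
{
  return (uint16_t)(m >> (16u * i));
}

/* Insert 16-bit element v at position i, leaving other elements intact. */
static inline uint64_t sketch_insert_pi16(uint64_t m, uint16_t v, unsigned i)
{
  uint64_t mask = (uint64_t)0xFFFF << (16u * i);
  return (m & ~mask) | ((uint64_t)v << (16u * i));
}

/* Element-wise 16-bit add with wraparound, one element at a time. */
static inline uint64_t sketch_add_pi16(uint64_t a, uint64_t b)
{
  uint64_t r = 0;
  for (unsigned i = 0; i < 4; i++) {
    uint16_t sum = (uint16_t)(sketch_extract_pi16(a, i)
                              + sketch_extract_pi16(b, i));
    r = sketch_insert_pi16(r, sum, i);
  }
  return r;
}]]></programlisting>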

</section>