<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         version="5.0"
         xml:id="sec_intel_intrinsic_includes">
  <title>The structure of the intrinsic includes</title>

  <para>The GCC x86 vector intrinsic functions were initially grouped by
  technology, starting with MMX and continuing with SSE through SSE4.1,
  with the headers stacked like a set of Russian dolls.</para>

  <para>Basically, each higher-layer include needs typedefs and helper macros
  defined by the lower-level intrinsic includes. mm_malloc.h, which xmmintrin.h
  also pulls in, simply provides wrappers (_mm_malloc and _mm_free) for
  posix_memalign and free. Then it gets a little weird, starting
  with the crypto extensions:

  <programlisting><![CDATA[wmmintrin.h (AES) includes emmintrin.h]]></programlisting></para>
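
  <para>The layering can be observed from application code. The following is a
  minimal sketch, assuming a GCC-compatible compiler on an x86 target with SSE2
  enabled (the default on x86-64). The function name doubled_first is
  hypothetical, and the scalar branch is only a stand-in so the example builds
  on other targets. Only emmintrin.h is included, yet the SSE type and
  intrinsics declared in xmmintrin.h are available:

  <programlisting><![CDATA[#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 header; it includes xmmintrin.h (SSE) itself */
#endif

/* Hypothetical helper: doubles the first of four floats and returns the
   result as an integer. */
int doubled_first(const float p[4])
{
#if defined(__SSE2__)
    __m128  v = _mm_loadu_ps(p);     /* type and load declared in xmmintrin.h */
    __m128  d = _mm_add_ps(v, v);    /* SSE add, also from xmmintrin.h */
    __m128i i = _mm_cvtps_epi32(d);  /* SSE2 conversion from emmintrin.h */
    return _mm_cvtsi128_si32(i);     /* SSE2: extract the lowest element */
#else
    /* Assumption: scalar stand-in for non-x86 builds. It agrees with the
       SSE path only when the doubled value is a whole number. */
    return (int)(p[0] + p[0]);
#endif
}]]></programlisting></para>

  <para>Called with the array {1.5f, 2.0f, 3.0f, 4.0f}, both branches return 3.</para>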

  <para>For AVX, AVX2, and AVX512, the header designers must have decided
  that the Russian dolls approach was getting out of hand. AVX et al. are split
  across 15 files:

  <programlisting><![CDATA[#include <avxintrin.h>
#include <avx2intrin.h>
#include <avx512fintrin.h>
#include <avx512erintrin.h>
#include <avx512pfintrin.h>
#include <avx512cdintrin.h>
#include <avx512vlintrin.h>
#include <avx512bwintrin.h>
#include <avx512dqintrin.h>
#include <avx512vlbwintrin.h>
#include <avx512vldqintrin.h>
#include <avx512ifmaintrin.h>
#include <avx512ifmavlintrin.h>
#include <avx512vbmiintrin.h>
#include <avx512vbmivlintrin.h>]]></programlisting>

  but applications are not expected to include these
  individually.</para>

  <para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, including all the
  AVX, AES, SSE, and MMX flavors. Each component header rejects direct
  inclusion; avxintrin.h, for example, begins with:
  <programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED
# error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
#endif]]></programlisting></para>
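
  <para>The guard idiom is easy to reproduce. The sketch below collapses a
  hypothetical umbrella header (pvecintrin.h) and one hypothetical component
  header (pvecaddintrin.h) into a single file so that it compiles as is. None
  of these names come from GCC; they are assumptions made for illustration:

  <programlisting><![CDATA[/* Part 1: what the hypothetical umbrella header pvecintrin.h would define
   before including its component headers. */
#define PVECINTRIN_H_INCLUDED 1

/* Part 2: what the hypothetical component header pvecaddintrin.h would
   start with, so that it cannot be included on its own. */
#ifndef PVECINTRIN_H_INCLUDED
# error "Never use <pvecaddintrin.h> directly; include <pvecintrin.h> instead."
#endif

/* Hypothetical payload of the component header. */
int pvec_add(int a, int b)
{
    return a + b;
}]]></programlisting></para>

  <para>Removing the #define in part 1 makes the build stop at the #error
  line, which is the behavior an application sees when it includes a component
  header directly.</para>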

  <para>So why is this interesting? The include structure provides some strong clues
  about the order in which we should approach this effort. For example, if you need
  to use intrinsics from SSE4 (smmintrin.h), you are likely to need the type definitions
  from SSE2 (emmintrin.h). So a bottom-up (MMX, SSE, SSE2, …) approach seems
  like the best plan of attack. Also, saving the AVX parts for later makes sense,
  as most are just wider forms of operations that already exist in SSE.</para>

  <para>We should use the same include structure to implement our PowerISA
  equivalent API headers. This will make porting easier (drop-in replacement) and
  should get the application running quickly on POWER. Then we will be in a position
  to profile and analyze the resulting application. This will show any hot spots
  where the simple one-to-one transformation results in bottlenecks and
  additional tuning is needed. For these cases we should improve our tools (SDK
  MA/SCA) to identify opportunities for, and perhaps propose, alternative
  sequences that are better tuned to PowerISA and our micro-architecture.</para>
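
  <para>As an illustration of what a simple one-to-one transformation might
  look like, the sketch below builds a 128-bit float vector type and a packed
  add on the generic vector extension of GCC and Clang, which compiles for both
  x86 and POWER targets. The names pv_m128, pv_add_ps, and pv_sum_ps are
  hypothetical stand-ins, not the contents of any actual PowerISA header:

  <programlisting><![CDATA[/* Assumption: GCC or Clang, which provide the vector_size attribute. */
typedef float pv_m128 __attribute__((vector_size(16)));

/* Hypothetical counterpart of an SSE-style packed add: the four float
   elements are added pairwise. */
static inline pv_m128 pv_add_ps(pv_m128 a, pv_m128 b)
{
    return a + b;
}

/* Sums the four elements; vector subscripting is part of the same
   compiler extension. */
float pv_sum_ps(pv_m128 v)
{
    return v[0] + v[1] + v[2] + v[3];
}]]></programlisting></para>

  <para>Adding the vectors {1, 2, 3, 4} and {10, 20, 30, 40} and summing the
  result gives 110. A header written along these lines leaves the application
  source unchanged, and profiling can then show where a sequence tuned for
  PowerISA would be worth the effort.</para>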

</section>