You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Programming-Guides/Porting_Vector_Intrinsics/sec_intel_intrinsic_include...

83 lines
3.6 KiB
XML

This file contains invisible Unicode characters!

This file contains invisible Unicode characters that may be processed differently from what appears below. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to reveal hidden characters.

<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_intel_intrinsic_includes">
<title>The structure of the intrinsic includes</title>
<para>The GCC x86 intrinsic functions for vector were initially grouped by
technology (MMX and SSE), which starts with MMX and continues with SSE through
SSE4.1 stacked like a set of Russian dolls.</para>
<para>Basically each higher layer include needs typedefs and helper macros
defined by the lower level intrinsic includes. mm_malloc.h simply provides
wrappers for posix_memalign and free. Then it gets a little weird, starting
with the crypto extensions:
<programlisting><![CDATA[wmmintrin.h (AES) includes emmintrin.h]]></programlisting></para>
<para>For AVX, AVX2, and AVX512 they must have decided
that the Russian Dolls thing was getting out of hand. AVX et al. is split
across 14 files:
<programlisting><![CDATA[#include <avxintrin.h>
#include <avx2intrin.h>
#include <avx512fintrin.h>
#include <avx512erintrin.h>
#include <avx512pfintrin.h>
#include <avx512cdintrin.h>
#include <avx512vlintrin.h>
#include <avx512bwintrin.h>
#include <avx512dqintrin.h>
#include <avx512vlbwintrin.h>
#include <avx512vldqintrin.h>
#include <avx512ifmaintrin.h>
#include <avx512ifmavlintrin.h>
#include <avx512vbmiintrin.h>
#include <avx512vbmivlintrin.h>]]></programlisting>
but they do not want the applications to include these
individually.</para>
<para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, including all the
AVX, AES, SSE, and MMX flavors.
<programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED
# error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
#endif]]></programlisting></para>
<para>So why is this interesting? The include structure provides some strong clues
about the order that we should approach this effort.  For example if you need
to use intrinsics from SSE4 (smmintrin.h) you are likely to need to type definitions
from SSE (emmintrin.h). So a bottoms up (MMX, SSE, SSE2, …) approach seems
like the best plan of attack. Also saving the AVX parts for later make sense,
as most are just wider forms of operations that already exist in SSE.</para>
<para>We should use the same include structure to implement our PowerISA
equivalent API headers. This will make porting easier (drop-in replacement) and
should get the application running quickly on POWER. Then we will be in a position
to profile and analyze the resulting application. This will show any hot spots
where the simple one-to-one transformation results in bottlenecks and
additional tuning is needed. For these cases we should improve our tools (SDK
MA/SCA) to identify opportunities for, and perhaps propose, alternative
sequences that are better tuned to PowerISA and our micro-architecture.</para>
</section>