<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_intel_intrinsic_includes">
  <title>The structure of the intrinsic includes</title>

  <para>The GCC x86 vector intrinsic headers were initially grouped by
  technology, starting with MMX and continuing with SSE through
  SSE4.1, each nested inside the next like a set of Russian dolls.</para>

  <para>Each higher-layer include needs the typedefs and helper macros
  defined by the lower-level intrinsic includes. mm_malloc.h simply provides
  wrappers for posix_memalign and free. Then it gets a little weird, starting
  with the crypto extensions:

  <programlisting><![CDATA[wmmintrin.h  (AES)	includes emmintrin.h]]></programlisting></para>

  <para>For AVX, AVX2, and AVX512 the header authors must have decided
  that the Russian dolls approach was getting out of hand. AVX et al. are split
  across 15 files:

  <programlisting><![CDATA[#include <avxintrin.h>
#include <avx2intrin.h>
#include <avx512fintrin.h>
#include <avx512erintrin.h>
#include <avx512pfintrin.h>
#include <avx512cdintrin.h>
#include <avx512vlintrin.h>
#include <avx512bwintrin.h>
#include <avx512dqintrin.h>
#include <avx512vlbwintrin.h>
#include <avx512vldqintrin.h>
#include <avx512ifmaintrin.h>
#include <avx512ifmavlintrin.h>
#include <avx512vbmiintrin.h>
#include <avx512vbmivlintrin.h>]]></programlisting>

  but they do not want the applications to include these
  individually.</para>

  <para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, including all the
  AVX, AES, SSE, and MMX flavors. The individual headers enforce this with a
  guard; avxintrin.h, for example, begins with:
  <programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED
# error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
#endif]]></programlisting></para>
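
  <para>The guard mechanism can be sketched in a single file. This is a minimal
  illustration, not the GCC source: the SKETCH_* macro names and
  sketch_part_available are hypothetical, and the sub-header is pasted inline at
  the point where the umbrella header would normally have an #include line:
  <programlisting><![CDATA[
```c
/* ---- sketch_umbrella.h (hypothetical umbrella header) ---- */
#ifndef SKETCH_UMBRELLA_H_INCLUDED
#define SKETCH_UMBRELLA_H_INCLUDED   /* marker the sub-headers test for */

/* ---- sketch_part.h (hypothetical sub-header), shown inline ---- */
/* Refuse to compile unless the umbrella header set its marker first. */
#ifndef SKETCH_UMBRELLA_H_INCLUDED
# error "Never use <sketch_part.h> directly; include <sketch_umbrella.h> instead."
#endif

#ifndef SKETCH_PART_H_INCLUDED
#define SKETCH_PART_H_INCLUDED
/* Stand-in for the intrinsics a real sub-header would define. */
static inline int sketch_part_available(void) { return 1; }
#endif /* SKETCH_PART_H_INCLUDED */
/* ---- end of sketch_part.h ---- */

#endif /* SKETCH_UMBRELLA_H_INCLUDED */
```
]]></programlisting></para>

  <para>Because the umbrella header defines its marker macro before pulling in
  the sub-header, the #error is skipped. Including the sub-header on its own
  would stop compilation with the message shown.</para>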

  <para>So why is this interesting? The include structure provides some strong clues
  about the order in which we should approach this effort. For example, if you need
  to use intrinsics from SSE4.1 (smmintrin.h) you are likely to need the type definitions
  from SSE2 (emmintrin.h). So a bottom-up (MMX, SSE, SSE2, …) approach seems
  like the best plan of attack. Also, saving the AVX parts for later makes sense,
  as most are just wider forms of operations that already exist in SSE.</para>
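
  <para>The dependency between layers can be illustrated with a small sketch.
  It assumes a compiler that supports the GCC vector extensions (GCC or Clang).
  All sketch_* names are hypothetical stand-ins, and the integer type is
  simplified to four int lanes for readability. The higher layer cannot even
  declare its cast function without the vector type from the lower layer:
  <programlisting><![CDATA[
```c
/* ---- lower layer: plays the role of the SSE header ---- */
/* Four packed floats in a 16-byte vector (GCC vector extension). */
typedef float sketch_m128 __attribute__((vector_size(16)));

/* Broadcast one float to all four lanes. */
static inline sketch_m128 sketch_set1_ps(float x)
{
    return (sketch_m128){ x, x, x, x };
}

/* ---- higher layer: plays the role of the SSE2 header ---- */
/* Four packed 32-bit ints; a simplification chosen for this sketch. */
typedef int sketch_m128i __attribute__((vector_size(16)));

/* Reinterpret the bits of a float vector as an integer vector.
   The parameter type comes from the lower layer, so the lower layer
   has to be implemented (and included) first. */
static inline sketch_m128i sketch_castps_si128(sketch_m128 v)
{
    return (sketch_m128i)v;   /* same-size vector cast: bits unchanged */
}
```
]]></programlisting></para>

  <para>This mirrors the way an SSE2-level cast such as _mm_castps_si128 takes an
  __m128 argument, a type that belongs to the SSE layer.</para>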

  <para>We should use the same include structure to implement our PowerISA
  equivalent API headers. This will make porting easier (drop-in replacement) and
  should get the application running quickly on POWER. Then we will be in a position
  to profile and analyze the resulting application. This will show any hot spots
  where the simple one-to-one transformation results in bottlenecks and
  additional tuning is needed. For these cases we should improve our tools (SDK
  MA/SCA) to identify opportunities for, and perhaps propose, alternative
  sequences that are better tuned to PowerISA and our micro-architecture.</para>
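
  <para>As a rough sketch of the drop-in idea, the fragment below implements a
  few SSE-style operations with the portable GCC vector extensions, followed by
  application code that only calls that API. This is an illustration under
  stated assumptions, not the actual header implementation: the names carry a
  sketch_ prefix to avoid colliding with system headers (a real replacement
  header would use the Intel names), and a POWER-specific header could instead
  use built-ins such as vec_add from altivec.h:
  <programlisting><![CDATA[
```c
#include <string.h>

/* Hypothetical stand-in for __m128: four packed floats. */
typedef float sketch_m128 __attribute__((vector_size(16)));

/* Unaligned load, in the spirit of _mm_loadu_ps. */
static inline sketch_m128 sketch_mm_loadu_ps(const float *p)
{
    sketch_m128 v;
    memcpy(&v, p, sizeof v);   /* memcpy avoids alignment assumptions */
    return v;
}

/* Unaligned store, in the spirit of _mm_storeu_ps. */
static inline void sketch_mm_storeu_ps(float *p, sketch_m128 v)
{
    memcpy(p, &v, sizeof v);
}

/* Lane-wise add, in the spirit of _mm_add_ps. */
static inline sketch_m128 sketch_mm_add_ps(sketch_m128 a, sketch_m128 b)
{
    return a + b;              /* the compiler picks the target's vector add */
}

/* Application code: written once against the API above, so it does not
   change when the header underneath is swapped for another platform. */
static void add_arrays(float *dst, const float *a, const float *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)
        sketch_mm_storeu_ps(dst + i,
            sketch_mm_add_ps(sketch_mm_loadu_ps(a + i),
                             sketch_mm_loadu_ps(b + i)));
    for (; i < n; i++)         /* scalar tail for leftover elements */
        dst[i] = a[i] + b[i];
}
```
]]></programlisting></para>

  <para>Once code like add_arrays runs unchanged on POWER, profiling can show
  whether any of the one-to-one mappings underneath deserve a better-tuned
  sequence.</para>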

</section>