<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2017 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="sec_intel_intrinsic_functions">
  <title>Intel Intrinsic functions</title>

  <para>So what is an intrinsic function? From Wikipedia:

    <blockquote><para>In <link xlink:href="https://en.wikipedia.org/wiki/Compiler_theory">compiler theory</link>, an
    <emphasis role="bold">intrinsic function</emphasis> is a function available for use in a given
    <link xlink:href="https://en.wikipedia.org/wiki/Programming_language">programming
    language</link> whose implementation is handled specially by the compiler.
    Typically, it substitutes a sequence of automatically generated instructions
    for the original function call, similar to an
    <link xlink:href="https://en.wikipedia.org/wiki/Inline_function">inline function</link>.
    Unlike an inline function though, the compiler has an intimate knowledge of the
    intrinsic function and can therefore better integrate it and optimize it for
    the situation. This is also called builtin function in many languages.</para></blockquote></para>

  <para>The “Intel Intrinsics” API provides access to the many
  instruction set extensions (Intel Technologies) that Intel has added (and
  continues to add) over the years. The intrinsics provided access to new
  instruction capabilities before the compilers could exploit them directly.
  Initially these intrinsic functions were defined for the Intel and Microsoft
  compilers and were eventually implemented and contributed to GCC.</para>

  <para>The Intel Intrinsics have a specific type and naming structure. In
  this naming structure, functions start with a common prefix (MMX and SSE use
  the '_mm' prefix, while AVX and AVX-512 added the '_mm256' and '_mm512' prefixes), then a short
  functional name ('set', 'load', 'store', 'add', 'mul', 'blend', 'shuffle', '…') and a suffix
  ('_pd', '_sd', '_pi32'...) with type and packing information. See
  <xref linkend="app_intel_suffixes"/> for the list of common intrinsic suffixes.</para>
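
  <para>The naming structure can be read directly off a small piece of code.
  The following is a minimal sketch, assuming an x86 target with SSE2 and a
  GCC-compatible compiler; the helper name <literal>add_pairs</literal> is
  hypothetical and only serves the illustration:
  <programlisting><![CDATA[#include <emmintrin.h>  /* SSE2 intrinsics and the __m128d type */

/* Hypothetical helper, not part of any API: adds two pairs of doubles.
   Each intrinsic name reads as prefix + functional name + suffix:
     _mm + _loadu  + _pd   load two packed doubles (no alignment required)
     _mm + _add    + _pd   add two packed doubles element by element
     _mm + _storeu + _pd   store two packed doubles (no alignment required) */
void
add_pairs (const double *a, const double *b, double *out)
{
  __m128d va = _mm_loadu_pd (a);
  __m128d vb = _mm_loadu_pd (b);

  _mm_storeu_pd (out, _mm_add_pd (va, vb));
}]]></programlisting></para>

  <para>Calling <literal>add_pairs</literal> with the inputs {1.0, 2.0} and
  {10.0, 20.0} should leave {11.0, 22.0} in <literal>out</literal>.</para>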

  <para>Oddly, many of the MMX/SSE operations are not vector operations at all. There
  are a lot of scalar operations on a single float, double, or long long type. In
  effect these are scalars that can take advantage of the larger (XMM) register
  space. Also, in the Intel 32-bit architecture they provided IEEE 754 float and
  double types, and 64-bit integers, that did not exist or were hard to implement
  in the base i386/387 instruction set. These scalar operations use a suffix
  starting with '_s' (<literal>_sd</literal> for scalar double float,
  <literal>_ss</literal> for scalar float, and <literal>_si64</literal>
  for scalar long long).</para>
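
  <para>To make the scalar behavior concrete, here is a minimal sketch, again
  assuming an x86 target with SSE2 and a GCC-compatible compiler. The helper
  name <literal>add_low_double</literal> is hypothetical. It applies the
  scalar <literal>_mm_add_sd</literal> to two full 128-bit values so that the
  untouched upper element becomes visible:
  <programlisting><![CDATA[#include <emmintrin.h>  /* SSE2 intrinsics and the __m128d type */

/* Hypothetical helper, not part of any API.
   _mm_add_sd adds only element 0 of the two operands; element 1 of the
   result is copied unchanged from the first operand. */
void
add_low_double (const double *a, const double *b, double *out)
{
  __m128d va = _mm_loadu_pd (a);
  __m128d vb = _mm_loadu_pd (b);

  _mm_storeu_pd (out, _mm_add_sd (va, vb));
}]]></programlisting></para>

  <para>With the inputs {1.0, 2.0} and {10.0, 20.0} the result should be
  {11.0, 2.0}: only the low element is summed.</para>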

  <para>True vector operations use the packed or extended packed suffixes,
  starting with '_p' or '_ep' (<literal>_pd</literal> for vector double,
  <literal>_ps</literal> for vector float, and
  <literal>_epi32</literal> for vector int). The use of '_ep'
  seems to be reserved to disambiguate
  intrinsics that existed in the (64-bit vector) MMX extension from the extended
  (128-bit vector) SSE equivalent. For example,
  <emphasis role="bold"><literal>_mm_add_pi32</literal></emphasis> is an MMX operation on
  a pair of 32-bit integers, while
  <emphasis role="bold"><literal>_mm_add_epi32</literal></emphasis> is an SSE2 operation on a vector
  of four 32-bit integers.</para>
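
  <para>As an illustration of the extended packed form, the following minimal
  sketch uses <literal>_mm_add_epi32</literal> on four 32-bit integers. It
  assumes an x86 target with SSE2 and a GCC-compatible compiler, and the
  helper name <literal>add_four_ints</literal> is hypothetical. The MMX
  counterpart <literal>_mm_add_pi32</literal>, which works on the 64-bit
  <literal>__m64</literal> type, is deliberately not shown:
  <programlisting><![CDATA[#include <emmintrin.h>  /* SSE2 intrinsics and the __m128i type */

/* Hypothetical helper, not part of any API: adds four 32-bit integers
   at once. The '_epi32' suffix marks packed 32-bit integers in a 128-bit
   vector; the '_si128' suffix treats the vector as one 128-bit value. */
void
add_four_ints (const int *a, const int *b, int *out)
{
  __m128i va = _mm_loadu_si128 ((const __m128i *) a);
  __m128i vb = _mm_loadu_si128 ((const __m128i *) b);

  _mm_storeu_si128 ((__m128i *) out, _mm_add_epi32 (va, vb));
}]]></programlisting></para>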

  <para>The GCC builtins for the
  <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386 target</link>
  (which includes x86 and x86_64) are not
  the same as the Intel Intrinsics. While they have similar intent and cover most
  of the same functions, they use a different naming scheme (prefixed with
  <literal>__builtin_ia32_</literal>, then function name with type suffix) and use GCC vector type
  modes for operand types. For example:
  <programlisting><![CDATA[v8qi __builtin_ia32_paddb (v8qi, v8qi)
v4hi __builtin_ia32_paddw (v4hi, v4hi)
v2si __builtin_ia32_paddd (v2si, v2si)
v2di __builtin_ia32_paddq (v2di, v2di)]]></programlisting></para>

  <note><para>A key difference between GCC built-ins for i386 and PowerPC is
  that the x86 built-ins have a different name for each operation and type, while the
  PowerPC AltiVec built-ins tend to have a single overloaded
  built-in for each operation,
  across a set of compatible operand types.</para></note>

  <para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented
  as a set of inline functions using the Intel Intrinsic API names and types.
  These functions are implemented as either GCC C vector extension code or via
  one or more GCC builtins for the i386 target. So let's take a look at some
  examples from GCC's SSE2 intrinsic header emmintrin.h:
  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
  return (__m128d) ((__v2df)__A + (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para>

  <para>Note that
  <emphasis role="bold"><literal>_mm_add_pd</literal></emphasis> is implemented directly as GCC C vector
  extension code, while
  <emphasis role="bold"><literal>_mm_add_sd</literal></emphasis> is implemented via the GCC builtin
  <emphasis role="bold"><literal>__builtin_ia32_addsd</literal></emphasis>. From the
  discussion above we know the <literal>_pd</literal> suffix
  indicates a packed vector double while the <literal>_sd</literal> suffix indicates a scalar double
  in an XMM register.</para>
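
  <para>The GCC C vector extension style used by
  <literal>_mm_add_pd</literal> does not depend on any x86 header, so it can
  be tried on other targets as well. The following is a minimal sketch,
  assuming a GCC-compatible compiler; the type name
  <literal>v2df_t</literal> and both function names are hypothetical and are
  not taken from the GCC headers:
  <programlisting><![CDATA[/* Hypothetical type name: a 16-byte vector holding two doubles, declared
   with the GCC vector_size attribute. */
typedef double v2df_t __attribute__ ((__vector_size__ (16)));

/* Packed add written as plain C on a GCC vector type, in the same style
   as the _mm_add_pd implementation quoted above. */
static inline v2df_t
add_v2df (v2df_t a, v2df_t b)
{
  return a + b;
}

/* Hypothetical wrapper that moves array data in and out of the vector
   type by element, so no target-specific load or store is needed. */
void
add_pairs_generic (const double *a, const double *b, double *out)
{
  v2df_t va = { a[0], a[1] };
  v2df_t vb = { b[0], b[1] };
  v2df_t vr = add_v2df (va, vb);

  out[0] = vr[0];
  out[1] = vr[1];
}]]></programlisting></para>

  <para>Because the addition is expressed on a generic vector type, the
  compiler is free to select whatever vector instruction the target provides
  for it.</para>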

  <xi:include href="sec_packed_vs_scalar_intrinsics.xml"/>
  <xi:include href="sec_vec_or_not.xml"/>
  <xi:include href="sec_crossing_lanes.xml"/>

</section>