<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_extra_attributes">
<title>Those extra attributes</title>
<para>You may have noticed that the intrinsic definitions carry some special attributes. The GCC documentation describes them as follows:
<literallayout>__gnu_inline__
This attribute should be used with a function that is also declared with the
inline keyword. It directs GCC to treat the function as if it were defined in
gnu90 mode even when compiling in C99 or gnu99 mode.
If the function is declared extern, then this definition of the function is used
only for inlining. In no case is the function compiled as a standalone function,
not even if you take its address explicitly. Such an address becomes an external
reference, as if you had only declared the function, and had not defined it. This
has almost the effect of a macro. The way to use this is to put a function
definition in a header file with this attribute, and put another copy of the
function, without extern, in a library file. The definition in the header file
causes most calls to the function to be inlined.
__always_inline__
Generally, functions are not inlined unless optimization is specified. For
functions declared inline, this attribute inlines the function independent of any
restrictions that otherwise apply to inlining. Failure to inline such a function
is diagnosed as an error.
__artificial__
This attribute is useful for small inline wrappers that if possible should appear
during debugging as a unit. Depending on the debug info format it either means
marking the function as artificial or using the caller location for all instructions
within the inlined body.
__extension__
... -pedantic and other options cause warnings for many GNU C extensions.
You can prevent such warnings within one expression by writing __extension__</literallayout></para>
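<para>As a minimal sketch of how these attributes combine, consider the following
small wrappers. The names <literal>sketch_add_one</literal> and
<literal>sketch_twice</literal> are hypothetical and do not come from any intrinsic
header, and the example assumes GCC or a compatible compiler:
<programlisting><![CDATA[/* Hypothetical wrapper, for illustration only.
   extern + __gnu_inline__ : this definition is used only for inlining;
                             no standalone copy of the function is emitted.
   __always_inline__       : inline even when optimization is disabled.
   __artificial__          : debuggers treat the inlined body as a unit.  */
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
sketch_add_one (int x)
{
  return x + 1;
}

/* __extension__ suppresses -pedantic warnings within one expression; here
   that expression is a GNU C statement expression.  */
static int
sketch_twice (int x)
{
  return __extension__ ({ int t = x; t + t; });
}
]]></programlisting>
Because the header definition is never compiled as a standalone function, taking
the address of <literal>sketch_add_one</literal> would require a separate
non-extern definition in a library, as the attribute description above explains.</para>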
<para>So far I have been using these attributes unchanged.</para>
<para>However, most Intel intrinsics are implemented by mapping to one or more
target-specific GCC builtins. For example:
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
return *(__m128d *)__P;
}
/* Load two DPFP values from P. The address need not be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
return __builtin_ia32_loadupd (__P);
}]]></programlisting></para>
<para>The first intrinsic (_mm_load_pd) is implemented as a C vector pointer
dereference, but its comment indicates that the compiler is expected to use a
<emphasis role="bold">movapd</emphasis>
instruction, which requires 16-byte alignment (and raises a general-protection
exception if the address is not aligned). This implies that, on at least some
Intel processors, there is a performance advantage to keeping vectors aligned. The second
intrinsic uses the explicit GCC builtin
<emphasis role="bold"><literal>__builtin_ia32_loadupd</literal></emphasis> to generate the
<emphasis role="bold"><literal>movupd</literal></emphasis> instruction, which handles unaligned references.</para>
<para>The opposite assumption applies to POWER and PPC64LE, where GCC
generates the VSX <emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
<emphasis role="bold"><literal>xxswapd</literal></emphasis>
instruction sequence by default, which
allows unaligned references. The PowerISA equivalent for aligned vector access
is the VMX <emphasis role="bold"><literal>lvx</literal></emphasis> instruction and the
<emphasis role="bold"><literal>vec_ld</literal></emphasis> builtin, which force
quadword-aligned access (by ignoring the low-order 4 bits of the effective address). The
<emphasis role="bold"><literal>lvx</literal></emphasis> instruction does not raise
alignment exceptions, but perhaps an exception should be raised as part
of our implementation of the Intel intrinsic. This requires that we use
PowerISA VMX/VSX built-ins to ensure we get the expected results.</para>
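<para>The address arithmetic behind this difference can be sketched in portable C.
The helper names below are hypothetical, and the sketch models only the
effective-address behavior described above, not the load itself:
<programlisting><![CDATA[#include <stdint.h>

/* Nonzero if the address is a multiple of 16, the condition that the
   Intel aligned-load intrinsic requires.  */
static inline int
sketch_is_aligned16 (const void *p)
{
  return ((uintptr_t) p & (uintptr_t) 0xf) == 0;
}

/* Model of the lvx effective address: the low-order 4 bits are ignored,
   so a misaligned address silently rounds down to the enclosing quadword.  */
static inline uintptr_t
sketch_lvx_effective_address (uintptr_t ea)
{
  return ea & ~(uintptr_t) 0xf;
}
]]></programlisting>
Under this model, an address such as 0x1008 yields an effective address of 0x1000,
so a misaligned pointer loads the enclosing aligned quadword rather than faulting.</para>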
<para>The current prototype defines the following:
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
assert(((unsigned long)__P & 0xfUL) == 0UL);
return ((__m128d)vec_ld(0, (__v16qu*)__P));
}
/* Load two DPFP values from P. The address need not be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
return (vec_vsx_ld(0, __P));
}]]></programlisting></para>
<para>The aligned load intrinsic adds an assert that checks alignment
(to match the Intel semantics) and uses the GCC builtin
<emphasis role="bold"><literal>vec_ld</literal></emphasis> (which generates an
<emphasis role="bold"><literal>lvx</literal></emphasis>). The assert
generates extra code, but this can be eliminated by defining
<emphasis role="bold"><literal>NDEBUG</literal></emphasis> at compile time.
The unaligned load intrinsic uses the GCC builtin
<literal>vec_vsx_ld</literal> (which on PPC64LE generates
<emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
<emphasis role="bold"><literal>xxswapd</literal></emphasis> for POWER8 and
simplifies to <emphasis role="bold"><literal>lxv</literal></emphasis>
or <emphasis role="bold"><literal>lxvx</literal></emphasis>
for POWER9). The same approach applies to <emphasis role="bold"><literal>_mm_store_pd</literal></emphasis> /
<emphasis role="bold"><literal>_mm_storeu_pd</literal></emphasis>, using
<emphasis role="bold"><literal>vec_st</literal></emphasis>
and <emphasis role="bold"><literal>vec_vsx_st</literal></emphasis>. These concepts extend to the
load/store intrinsics for vector float and vector int.</para>
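<para>A store counterpart might look like the following sketch. It is an illustration
under stated assumptions, not the prototype's actual code: the
<literal>sketch_</literal> names and typedefs are hypothetical, the PowerISA branch
simply mirrors the shape of the load intrinsics above, and the fallback branch
(GNU C generic vectors) is included only so that the example also compiles on
other targets:
<programlisting><![CDATA[#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for __m128d: two doubles in a 16-byte vector.  */
typedef double sketch_m128d __attribute__((__vector_size__(16), __may_alias__));

#if defined(__powerpc64__) && defined(__VSX__)
#include <altivec.h>
typedef __vector unsigned char sketch_v16qu;

/* Aligned store: the assert matches the Intel requirement, and vec_st
   generates stvx, which ignores the low-order 4 address bits.  */
static inline void
sketch_store_pd (double *p, sketch_m128d a)
{
  assert (((uintptr_t) p & 0xfUL) == 0UL);
  vec_st ((sketch_v16qu) a, 0, (sketch_v16qu *) p);
}

/* Unaligned store through the VSX builtin.  */
static inline void
sketch_storeu_pd (double *p, sketch_m128d a)
{
  vec_vsx_st ((__vector double) a, 0, p);
}
#else
/* Fallback assumption for non-POWER hosts: plain GNU C vector stores.  */
typedef double sketch_m128d_u
  __attribute__((__vector_size__(16), __may_alias__, __aligned__(1)));

static inline void
sketch_store_pd (double *p, sketch_m128d a)
{
  assert (((uintptr_t) p & 0xfUL) == 0UL);
  *(sketch_m128d *) p = a;
}

static inline void
sketch_storeu_pd (double *p, sketch_m128d a)
{
  *(sketch_m128d_u *) p = a;
}
#endif
]]></programlisting>
As with the load intrinsics, defining <literal>NDEBUG</literal> removes the
alignment check from the aligned store.</para>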
</section>