The structure of the intrinsic includes
The GCC x86 vector intrinsic functions were initially grouped by
technology (MMX and SSE). The headers start with MMX and continue with SSE
through SSE4.1, stacked like a set of Russian dolls.
Each higher-layer include needs the typedefs and helper macros
defined by the lower-level intrinsic includes. mm_malloc.h simply provides
wrappers for posix_memalign and free. Then it gets a little weird, starting
with the crypto extensions.
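As a rough illustration of the mm_malloc.h wrappers mentioned above, the sketch below layers an aligned allocator on top of posix_memalign and free. The names sketch_mm_malloc, sketch_mm_free, and sketch_alloc_demo are hypothetical stand-ins and this is not the GCC implementation. The alignment handling is an assumption, chosen to satisfy the posix_memalign requirement that the alignment be a power of two and a multiple of sizeof(void *).

```c
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for _mm_malloc: returns size bytes aligned to
   alignment (which must be a power of two), or NULL on failure. */
static void *sketch_mm_malloc(size_t size, size_t alignment)
{
    void *ptr = NULL;

    /* Assumption: reject anything that is not a power of two, and round
       small alignments up to sizeof(void *) as posix_memalign requires. */
    if (alignment == 0 || (alignment & (alignment - 1)) != 0)
        return NULL;
    if (alignment < sizeof(void *))
        alignment = sizeof(void *);

    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
}

/* Hypothetical stand-in for _mm_free: memory obtained from
   posix_memalign is released with plain free. */
static void sketch_mm_free(void *ptr)
{
    free(ptr);
}

/* Usage: allocate a 16-byte aligned buffer for four floats.
   Returns 1 on success, -1 if the allocation failed. */
static int sketch_alloc_demo(void)
{
    float *buf = sketch_mm_malloc(4 * sizeof(float), 16);
    if (buf == NULL)
        return -1;
    assert(((uintptr_t)buf % 16) == 0);
    sketch_mm_free(buf);
    return 1;
}
```

Because the wrapper pair depends only on the C library, it ports without any vector-specific work, which is why it sits outside the stacked intrinsic headers.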
For AVX, AVX2, and AVX512 they must have decided
that the Russian dolls thing was getting out of hand. AVX et al. are split
across 14 files:
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
but they do not want the applications to include these
individually.
So immintrin.h includes everything Intel vector, including all the
AVX, AES, SSE, and MMX flavors.
#ifndef _IMMINTRIN_H_INCLUDED
# error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
#endif
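The guard above only makes sense together with the umbrella header that defines the macro it tests. The following single-file sketch shows how the two pieces could fit together. The header names sketchvec.h and sketchvec_wide.h and all SKETCHVEC_ macros are hypothetical; in a real tree the two commented regions would be separate files.

```c
#include <assert.h>

/* ---- contents of a hypothetical umbrella header, sketchvec.h ---- */
#ifndef SKETCHVEC_H_INCLUDED
#define SKETCHVEC_H_INCLUDED

/* ---- contents of a hypothetical sub-header, sketchvec_wide.h,
        which the umbrella header would pull in at this point ---- */
#ifndef SKETCHVEC_H_INCLUDED
# error "Never use <sketchvec_wide.h> directly; include <sketchvec.h> instead."
#endif

#ifndef SKETCHVEC_WIDE_H_INCLUDED
#define SKETCHVEC_WIDE_H_INCLUDED
#define SKETCHVEC_WIDE_LANES 8   /* something only the sub-header provides */
#endif /* SKETCHVEC_WIDE_H_INCLUDED */

#endif /* SKETCHVEC_H_INCLUDED */

/* Application code reaches the sub-header's definitions only through
   the umbrella header. */
static int sketch_wide_lanes(void)
{
    assert(SKETCHVEC_WIDE_LANES > 0);
    return SKETCHVEC_WIDE_LANES;
}
```

If the sub-header region were compiled before SKETCHVEC_H_INCLUDED is defined, the #error would stop the build, which is the behavior the guard is meant to enforce.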
So why is this interesting? The include structure provides some strong clues
about the order in which we should approach this effort. For example, if you need
to use intrinsics from SSE4 (smmintrin.h), you are likely to need the type definitions
from SSE2 (emmintrin.h). So a bottom-up (MMX, SSE, SSE2, …) approach seems
like the best plan of attack. Also, saving the AVX parts for later makes sense,
as most are just wider forms of operations that already exist in SSE.
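To make the bottom-up argument concrete, here is a minimal sketch of a 256-bit integer add built from two 128-bit adds. It assumes the GCC/Clang vector_size extension. Every sketch_ and sketch256_ name is hypothetical and is not one of the real intrinsic names or types. The point is only that the wider layer cannot even be declared until the 128-bit layer exists.

```c
#include <assert.h>
#include <string.h>

/* Lower layer (SSE2-like): hypothetical 128-bit type holding four
   32-bit ints, built on the GCC/Clang vector extension. */
typedef int sketch_v4si __attribute__((vector_size(16)));

static inline sketch_v4si sketch_add_epi32(sketch_v4si a, sketch_v4si b)
{
    return a + b;   /* element-wise add */
}

/* Higher layer (AVX2-like): a 256-bit value held as two 128-bit halves.
   It depends on the type and helper defined by the lower layer. */
typedef struct {
    sketch_v4si lo;
    sketch_v4si hi;
} sketch_v8si;

static inline sketch_v8si sketch256_add_epi32(sketch_v8si a, sketch_v8si b)
{
    sketch_v8si r;
    r.lo = sketch_add_epi32(a.lo, b.lo);
    r.hi = sketch_add_epi32(a.hi, b.hi);
    return r;
}

/* Usage: add eight ints with two 128-bit operations and copy the
   result out in element order. */
static void sketch256_demo(int out[8])
{
    sketch_v8si a = { { 1, 2, 3, 4 }, { 5, 6, 7, 8 } };
    sketch_v8si b = { { 10, 20, 30, 40 }, { 50, 60, 70, 80 } };
    sketch_v8si r = sketch256_add_epi32(a, b);

    assert(sizeof r == 8 * sizeof(int));
    memcpy(out, &r, sizeof r);
}
```

Splitting the wide value into halves is one possible design for a target with 128-bit vector registers; whether it performs well is a separate question for profiling.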
We should use the same include structure to implement our PowerISA
equivalent API headers. This will make porting easier (drop-in replacement) and
should get the application running quickly on POWER. Then we will be in a position
to profile and analyze the resulting application. This will show any hot spots
where the simple one-to-one transformation results in bottlenecks and
additional tuning is needed. For these cases we should improve our tools (SDK
MA/SCA) to identify opportunities for, and perhaps propose, alternative
sequences that are better tuned to PowerISA and our micro-architecture.
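The drop-in idea can be sketched as follows: the application source stays the same, and only the header behind the familiar names changes. On an x86 build with SSE2 the real emmintrin.h is used. Otherwise a hypothetical stand-in supplies the same three names on top of generic GCC/Clang vector extensions. The stand-in, the sketch_v4su type, and app_add4 are illustrative assumptions, not the actual PowerISA headers, and the fallback is untuned.

```c
#include <assert.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* the real x86 header */
#else
/* Hypothetical stand-in for emmintrin.h on a non-x86 target: same
   names, generic vector extensions underneath (assumption). */
typedef long long __m128i __attribute__((vector_size(16), may_alias));
typedef unsigned int sketch_v4su __attribute__((vector_size(16)));

static inline __m128i _mm_set_epi32(int e3, int e2, int e1, int e0)
{
    /* e0 is the lowest-numbered element, as in the x86 definition. */
    sketch_v4su v = { (unsigned)e0, (unsigned)e1, (unsigned)e2, (unsigned)e3 };
    return (__m128i)v;
}

static inline __m128i _mm_add_epi32(__m128i a, __m128i b)
{
    /* Unsigned lanes give wrap-around addition without signed overflow. */
    return (__m128i)((sketch_v4su)a + (sketch_v4su)b);
}

static inline void _mm_storeu_si128(__m128i *p, __m128i v)
{
    memcpy(p, &v, sizeof v);   /* safe for unaligned destinations */
}
#endif

/* Application code: identical source on either path. */
static void app_add4(int out[4])
{
    __m128i a = _mm_set_epi32(4, 3, 2, 1);
    __m128i b = _mm_set_epi32(40, 30, 20, 10);

    assert(sizeof(__m128i) == 4 * sizeof(int));
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(a, b));
}
```

A one-to-one mapping like this is what gets the application running first; the profiling step described above then shows which of these simple translations deserve a sequence tuned for the target.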