The structure of the intrinsic includes

The GCC x86 vector intrinsic functions were initially grouped by technology, starting with MMX and continuing with SSE through SSE4.1, stacked like a set of Russian dolls. Basically, each higher-layer include needs typedefs and helper macros defined by the lower-level intrinsic includes. mm_malloc.h simply provides wrappers for posix_memalign and free.

Then it gets a little weird, starting with the crypto extensions. For AVX, AVX2, and AVX512 they must have decided that the Russian dolls thing was getting out of hand. AVX et al. is split across 14 files, but applications are not meant to include these individually. The headers guard against this with a preprocessor #error check that tells the programmer not to use the file directly and to include immintrin.h instead. So immintrin.h includes everything Intel vector, including all the AVX, AES, SSE, and MMX flavors.

So why is this interesting? The include structure provides some strong clues about the order in which we should approach this effort. For example, if you need to use intrinsics from SSE4.1 (smmintrin.h), you are likely to need type definitions from SSE2 (emmintrin.h). So a bottom-up (MMX, SSE, SSE2, …) approach seems like the best plan of attack. Saving the AVX parts for later also makes sense, as most are just wider forms of operations that already exist in SSE.

We should use the same include structure to implement our PowerISA equivalent API headers. This will make porting easier (drop-in replacement) and should get the application running quickly on POWER. Then we will be in a position to profile and analyze the resulting application. This will show any hot spots where the simple one-to-one transformation results in bottlenecks and additional tuning is needed.
For these cases we should improve our tools (SDK MA/SCA) to identify opportunities for, and perhaps propose, alternative sequences that are better tuned to PowerISA and our micro-architecture.
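
As a small illustration of the earlier remark that mm_malloc.h simply wraps posix_memalign and free, the following is a minimal sketch of what such wrappers might look like. The sketch_mm_malloc, sketch_mm_free, sketch_is_aligned, and sketch_selfcheck names are hypothetical and are not the names GCC uses. Rounding small alignments up to sizeof(void *) is our assumption about how to satisfy the posix_memalign requirements, not a statement of what the GCC header does.

```c
/* Sketch only: every sketch_* name here is hypothetical, not GCC's. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign in strict -std= modes */
#endif

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `size` bytes aligned to `alignment` bytes.
   posix_memalign requires the alignment to be a power of two and a
   multiple of sizeof(void *), so (assumption) smaller requests are
   rounded up to sizeof(void *). Returns NULL on failure. */
static inline void *sketch_mm_malloc(size_t size, size_t alignment)
{
    void *ptr = NULL;

    if (alignment < sizeof(void *))
        alignment = sizeof(void *);
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL; /* invalid alignment or out of memory */
    return ptr;
}

/* Memory obtained from posix_memalign is released with plain free. */
static inline void sketch_mm_free(void *ptr)
{
    free(ptr);
}

/* Helper: nonzero when `ptr` is a multiple of `alignment`. */
static inline int sketch_is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Allocate a 16-byte-aligned buffer, as a 128-bit vector type would
   need, check it, and release it. Returns 0 when the checks pass. */
static inline int sketch_selfcheck(void)
{
    void *buf = sketch_mm_malloc(64, 16);

    assert(buf != NULL);
    assert(sketch_is_aligned(buf, 16));
    sketch_mm_free(buf);
    return 0;
}
```

Because the wrappers depend only on POSIX calls, the same source should build unchanged on POWER, which is the property we want from the lowest layers of the include stack.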