Browse Source

Inital port from SJM R4 document

pull/69/head
Jeff Scheel 5 years ago
parent
commit
dfa363ccaa
  1. BIN
      Intrinsic-porting-guide-draft-R4.odt
  2. BIN
      Intrinsic-porting-guide-draft-R4_hack.odt
  3. 55
      Intrinsic-porting-guide-draft-R4_hack.xml
  4. 176
      LICENSE
  5. 93
      README.md
  6. 318
      Vector_Intrinsics/app_intel_suffixes.xml
  7. 70
      Vector_Intrinsics/app_references.xml
  8. 103
      Vector_Intrinsics/bk_main.xml
  9. 115
      Vector_Intrinsics/ch_howto_start.xml
  10. 46
      Vector_Intrinsics/ch_intel_intrinsic_porting.xml
  11. 148
      Vector_Intrinsics/pom.xml
  12. 35
      Vector_Intrinsics/sec_api_implemented.xml
  13. 111
      Vector_Intrinsics/sec_crossing_lanes.xml
  14. 40
      Vector_Intrinsics/sec_differences.xml
  15. 137
      Vector_Intrinsics/sec_extra_attributes.xml
  16. 73
      Vector_Intrinsics/sec_floatingpoint_exceptions.xml
  17. 33
      Vector_Intrinsics/sec_floatingpoint_rounding.xml
  18. 113
      Vector_Intrinsics/sec_gcc_vector_extensions.xml
  19. 91
      Vector_Intrinsics/sec_handling_avx.xml
  20. 72
      Vector_Intrinsics/sec_handling_mmx.xml
  21. 60
      Vector_Intrinsics/sec_how_findout.xml
  22. 122
      Vector_Intrinsics/sec_intel_intrinsic_functions.xml
  23. 82
      Vector_Intrinsics/sec_intel_intrinsic_includes.xml
  24. 89
      Vector_Intrinsics/sec_intel_intrinsic_types.xml
  25. 76
      Vector_Intrinsics/sec_more_examples.xml
  26. 68
      Vector_Intrinsics/sec_other_intrinsic_examples.xml
  27. 302
      Vector_Intrinsics/sec_packed_vs_scalar_intrinsics.xml
  28. 49
      Vector_Intrinsics/sec_performance.xml
  29. 41
      Vector_Intrinsics/sec_performance_mmx.xml
  30. 44
      Vector_Intrinsics/sec_performance_sse.xml
  31. 147
      Vector_Intrinsics/sec_power_vector_permute_format.xml
  32. 67
      Vector_Intrinsics/sec_power_vmx.xml
  33. 186
      Vector_Intrinsics/sec_power_vsx.xml
  34. 33
      Vector_Intrinsics/sec_powerisa.xml
  35. 46
      Vector_Intrinsics/sec_powerisa_vector_facilities.xml
  36. 79
      Vector_Intrinsics/sec_powerisa_vector_intrinsics.xml
  37. 119
      Vector_Intrinsics/sec_powerisa_vector_size_type.xml
  38. 57
      Vector_Intrinsics/sec_prefered_methods.xml
  39. 66
      Vector_Intrinsics/sec_prepare.xml
  40. 64
      Vector_Intrinsics/sec_review_source.xml
  41. 62
      Vector_Intrinsics/sec_simple_examples.xml
  42. 134
      Vector_Intrinsics/sec_vec_or_not.xml
  43. 1518
      intrinsic.xml
  44. 22
      pom.xml

BIN
Intrinsic-porting-guide-draft-R4.odt

Binary file not shown.

BIN
Intrinsic-porting-guide-draft-R4_hack.odt

Binary file not shown.

55
Intrinsic-porting-guide-draft-R4_hack.xml

File diff suppressed because one or more lines are too long

176
LICENSE

@ -0,0 +1,176 @@ @@ -0,0 +1,176 @@

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

93
README.md

@ -0,0 +1,93 @@ @@ -0,0 +1,93 @@
# Porting Guide for Linux on Power
TBD...

To build this project, one must ensure that the Docs-Master project has
also been cloned at the same directory level as the Docs-Template project.
This can be accomplished with the following steps:

1. Clone the master documentation project (Docs-Master) using the following command:

```
$ git clone https://github.com/OpenPOWERFoundation/Docs-Master.git
```
2. Clone this project (Docs-Template) using the following command:

```
$ git clone https://ibm.github.com/scheel/SJM-Porting-Guide.git
```
3. Build the project with these commands:
```
$ cd SJM-Porting-Guide
$ mvn clean generate-sources
```

The online version of the document can be found in the OpenPOWER Foundation
Document library at [TBD](http://openpowerfoundation.org/?resource_lib=TBD).

The project which controls the look and feel of the document is the
[Docs-Maven-Plugin project](https://github.com/OpenPOWERFoundation/Docs-Maven-Plugin), an
OpenPOWER Foundation private project on GitHub. To obtain access to the Maven Plugin project,
contact Jeff Scheel \([scheel@us.ibm.com](mailto://scheel@us.ibm.com)\) or
Jeff Brown \([jeffdb@us.ibm.com](mailto://jeffdb@us.ibm.com)\).

## License
This project is licensed under the Apache V2 license. More information
can be found in the LICENSE file or online at

http://www.apache.org/licenses/LICENSE-2.0

## Community
TBD...

## Contributions
TBD...

Contributions to this project should conform to the `Developer Certificate
of Origin` as defined at http://elinux.org/Developer_Certificate_Of_Origin.
Commits to this project need to contain the following line to indicate
the submitter accepts the DCO:
```
Signed-off-by: Your Name <your_email@domain.com>
```
By contributing in this way, you agree to the terms as follows:
```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
660 York Street, Suite 102,
San Francisco, CA 94110 USA

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
```

318
Vector_Intrinsics/app_intel_suffixes.xml

@ -0,0 +1,318 @@ @@ -0,0 +1,318 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<appendix xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="app_intel_suffixes">
<?dbhtml stop-chunking?>
<title>Intel Intrinsic suffixes</title>
<section>
<title>MMX</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_pi16</literal></emphasis></term>
<listitem><para>4 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pi32</literal></emphasis></term>
<listitem><para>2 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pi8</literal></emphasis></term>
<listitem><para>8 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pu16</literal></emphasis></term>
<listitem><para>4 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pu8</literal></emphasis></term>
<listitem><para>8 x packed unsigned char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si64</literal></emphasis></term>
<listitem><para>single 64-bit binary (logical)</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>SSE</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
<listitem><para>4 x packed float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
<listitem><para>single scalar float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si32</literal></emphasis></term>
<listitem><para>single 32-bit int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si64</literal></emphasis></term>
<listitem><para>single 64-bit long int</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>SSE2</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
<listitem><para>8 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
<listitem><para>4 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
<listitem><para>2 x packed long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
<listitem><para>16 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
<listitem><para>8 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
<listitem><para>4 x packed unsigned int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
<listitem><para>16 x packed unsigned char</para></listitem>
</varlistentry>
</variablelist>

<!-- Is this break really desired? -->
<para/>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
<listitem><para>2 x packed double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
<listitem><para>single scalar double</para></listitem>
</varlistentry>
<varlistentry>
<term><emphasis role="bold"><literal>_pi64</literal></emphasis></term>
<listitem><para>single long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si128</literal></emphasis></term>
<listitem><para>single 128-bit binary (logical)</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>AVX/AVX2 __m256_*</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
<listitem><para>8 x packed float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
<listitem><para>4 x packed double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
<listitem><para>16 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
<listitem><para>8 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
<listitem><para>4 x packed long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
<listitem><para>32 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
<listitem><para>16 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
<listitem><para>8 x packed unsigned int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
<listitem><para>32 x packed unsigned char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
<listitem><para>single scalar float (broadcast/splat)</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
<listitem><para>single scalar double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si256</literal></emphasis></term>
<listitem><para>single 256-bit binary (logical)</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd256</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ps256</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd128</literal></emphasis></term>
<listitem><para>cast</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ps128</literal></emphasis></term>
<listitem><para>cast</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>AVX512 __m512_*</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
<listitem><para>16 x packed float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
<listitem><para>8 x packed double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
<listitem><para>32 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
<listitem><para>16 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
<listitem><para>8 x packed long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
<listitem><para>64 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
<listitem><para>32 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
<listitem><para>16 x packed unsigned int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu64</literal></emphasis></term>
<listitem><para>8 x packed unsigned long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
<listitem><para>64 x packed unsigned char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
<listitem><para>single scalar float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
<listitem><para>single scalar double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si512</literal></emphasis></term>
<listitem><para>single 512-bit binary (logical)</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd512</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ps512</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>
</variablelist>
</section>

</appendix>

70
Vector_Intrinsics/app_references.xml

@ -0,0 +1,70 @@ @@ -0,0 +1,70 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<appendix xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="app_references">
<?dbhtml stop-chunking?>
<title>Document references</title>
<section>
<title>OpenPOWER and Power documents</title>
<para>
<link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER™ Technical Specifications</link>
</para>
<para>
<link xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b">Power ISA™ Version 2.07 B</link>
</para>
<para>
<link xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">Power ISA™ Version 3.0</link>
</para>
<para>
<link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">Power Architecture 64-bit ELF ABI Specification (AKA OpenPower ABI for Linux Supplement)</link>
</para>
<para>
<link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">AltiVec™ Technology Programming Environments Manual</link>
</para>

</section>
<section>
<title>A.2 Intel documents</title>
<para>
<link xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel® 64 and IA-32 Architectures Software Developer’s Manual</link>
</para>
<para>
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel™ Intrinsics Guide</link>
</para>
<para/>
</section>
<section>
<title>A.3 GNU Compiler Collection (GCC) documents</title>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link>
</para>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/">GCC Manual (GCC 6.3)</link>
</para>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GCC Internals Manual</link>
</para>
<para/>
</section>

</appendix>

103
Vector_Intrinsics/bk_main.xml

@ -0,0 +1,103 @@ @@ -0,0 +1,103 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<book xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="bk_main">

<title>Linux on Power Porting Guide</title>
<subtitle>Vector Intrinsic</subtitle>

<info>
<author>
<personname>
<surname>System Software Work Group</surname>
</personname>
<email>syssw-chair@openpowerfoundation.org</email>
<affiliation>
<orgname>OpenPOWER Foundation</orgname>
</affiliation>
</author>
<copyright>
<year>2017</year>
<holder>OpenPOWER Foundation</holder>
</copyright>
<!-- TODO: Set the correct document releaseinfo -->
<releaseinfo>Revision 0.1</releaseinfo>
<productname>OpenPOWER</productname>
<pubdate/>

<legalnotice role="apache2">

<annotation>
<remark>Copyright details are filled in by the template.</remark>
</annotation>
</legalnotice>
<!-- TODO: Update the following text with the correct document description (first paragraph),
Work Group name, and Work Product track (both in second paragraph). -->
<abstract>
<para>The goal of this project is to provide functional equivalents of the
Intel MMX, SSE, and AVX intrinsic functions, that are commonly used in Linux
applications, and make them (or equivalents) available for the PowerPC64LE
platform.</para>
<para>This document is a Standard Track, Work Group Note work product owned by the
System Software Workgroup and handled in compliance with the requirements outlined in the
<citetitle>OpenPOWER Foundation Work Group (WG) Process</citetitle> document. It was
created using the <citetitle>Master Template Guide</citetitle> version 0.9.5. Comments,
questions, etc. can be submitted to the public mailing list for this document at
<link xlink:href="http://tbd.openpowerfoundation.org">TBD</link>.</para>
</abstract>

<revhistory>
<!-- TODO: Update as new revisions created -->
<revision>
<date>2017-07-26</date>
<revdescription>
<itemizedlist spacing="compact">
<listitem>
<para>Revision 0.1 - initial draft from Steve Munroe</para>
</listitem>
</itemizedlist>
</revdescription>
</revision>
</revhistory>
</info>

<!-- The ch_preface.xml file is required by all documents -->
<xi:include href="../../Docs-Master/common/ch_preface.xml"/>

<!-- Chapter heading files -->
<xi:include href="ch_intel_intrinsic_porting.xml"/>
<xi:include href="ch_howto_start.xml"/>
<!-- Placeholder files ATM -->
<!--chapter><title>Placeholders</title>
</chapter-->

<!-- Document specific appendices -->
<xi:include href="app_references.xml"/>
<xi:include href="app_intel_suffixes.xml"/>

<!-- The app_foundation.xml appendix file is required by all documents. -->
<xi:include href="../../Docs-Master/common/app_foundation.xml"/>

</book>

115
Vector_Intrinsics/ch_howto_start.xml

@ -0,0 +1,115 @@ @@ -0,0 +1,115 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="ch_howto_start">
<title>How do we work this?</title>
<para>The working assumption is to start with the existing GCC headers from
./gcc/config/i386/, then convert them to PowerISA and add them to
./gcc/config/rs6000/. I assume we will replicate the existing header structure
and retain the existing header file and intrinsic names. This also allows us to
reuse existing DejaGNU test cases from ./gcc/testsuite/gcc.target/i386, modify
them as needed for the POWER target, and them to the
./gcc/testsuite/gcc.target/powerpc.</para>

<para>We can be flexible on the sequence that headers/intrinsics and test
cases are ported.  This should be based on customer need and resolving
internal dependencies.  This implies an oldest-to-newest / bottoms-up (MMX,
SSE, SSE2, …) strategy. The assumption is, existing community and user
application codes, are more likely to have optimized code for previous
generation ubiquitous (SSE, SSE2, ...) processors than the latest (and rare)
SkyLake AVX512.</para>

<para>I would start with an existing header from the current GCC
 ./gcc/config/i386/ and copy the header comment (including FSF copyright) down
to any vector typedefs used in the API or implementation. Skip the Intel
intrinsic implementation code for now, but add the ending #end if matching the
headers conditional guard against multiple inclusion. You can add  #include
&lt;alternative&gt; as needed. For examples:
<programlisting><![CDATA[/* Copyright (C) 2003-2017 Free Software Foundation, Inc
...
/* This header provides a best effort implementation of the Intel X86
* SSE2 intrinsics for the PowerPC target. This implementation is a
* combination of compiled C vector codes or equivalent sequences of
* GCC vector builtins from the GCC PowerPC Altivec target.
*
* However some details of this implementation will differ from
* the X86 due to differences in the underlying hardware or GCC
* implementation. For example the PowerPC target only uses unordered
* floating point compares. */

#ifndef EMMINTRIN_H_
#define EMMINTRIN_H_

#include <altivec.h>
#include <assert.h>

/* We need definitions from the SSE header files. */
#include <xmmintrin.h>

/* The Intel API is flexible enough that we must allow aliasing with other
vector types, and their scalar components. */
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

/* Internal data types for implementing the intrinsics. */
typedef float __v4sf __attribute__ ((__vector_size__ (16)));
/* more typedefs. */

/* The intrinsic implmentations go here. */

#endif /* EMMINTRIN_H_ */]]></programlisting></para>

<para>Then you can start adding small groups of related intrinsic
implementations to the header to be compiled and  examine the generated code.
Once you have what looks like reasonable code you can grep through
 ./gcc/testsuite/gcc.target/i386 for examples using the intrinsic names you
just added. You should be able to find functional tests for most X86
intrinsics. </para>

<para>The
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html#Testsuites">GCC
testsuite</link> uses the DejaGNU  test framework as documented in the
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GNU Compiler Collection (GCC)
Internals</link> manual. GCC adds its own DejaGNU directives and extensions,
that are embedded in the testsuite source as comments.  Some are platform
specific and will need to be adjusted for tests that are ported to our
platform. For example
<programlisting><![CDATA[/* { dg-do run } */
/* { dg-options "-O2 -msse2" } */
/* { dg-require-effective-target sse2 } */]]></programlisting></para>

<para>should become something like
<programlisting><![CDATA[/* { dg-do run } */
/* { dg-options "-O3 -mpower8-vector" } */
/* { dg-require-effective-target lp64 } */
/* { dg-require-effective-target p8vector_hw { target powerpc*-*-* } } */]]></programlisting></para>

<para>Repeat this process until you have equivalent implementations for all
the intrinsics in that header and associated test cases that execute without
error.</para>

<xi:include href="sec_prefered_methods.xml"/>
<xi:include href="sec_prepare.xml"/>
<xi:include href="sec_differences.xml"/>


</chapter>

46
Vector_Intrinsics/ch_intel_intrinsic_porting.xml

@ -0,0 +1,46 @@ @@ -0,0 +1,46 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="ch_intel_intrinsic_porting">
<title>Intel Intrinsic porting guide for Power64LE</title>
<para>The goal of this project is to provide functional equivalents of the
Intel MMX, SSE, and AVX intrinsic functions, that are commonly used in Linux
applications, and make them (or equivalents) available for the PowerPC64LE
platform. These X86 intrinsics started with the Intel and Microsoft compilers
but were then ported to the GCC compiler. The GCC implementation is a set of
headers with inline functions. These inline functions provide a implementation
mapping from the Intel/Microsoft dialect intrinsic names to the corresponding
GCC Intel built-in's or directly via C language vector extension syntax.</para>

<para>The current proposal is to start with the existing X86 GCC intrinsic
headers and port them (copy and change the source)  to POWER using C language
vector extensions, VMX and VSX built-ins. Another key assumption is that we
will be able to use many of existing Intel DejaGNU test cases on
./gcc/testsuite/gcc.target/i386. This document is intended as a guide to
developers participating in this effort. However this document provides
guidance and examples that should be useful to developers who may encounter X86
intrinsics in code that they are porting to another platform.</para>

<xi:include href="sec_review_source.xml"/>

</chapter>

148
Vector_Intrinsics/pom.xml

@ -0,0 +1,148 @@ @@ -0,0 +1,148 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<parent>

<groupId>org.openpowerfoundation.docs</groupId>
<artifactId>workgroup-pom</artifactId>
<version>1.0.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>
<modelVersion>4.0.0</modelVersion>

<!-- TODO: Rename the artifactID field to some appropriate for your new document -->
<artifactId>Porting-Guide-Vector-Intrinsics</artifactId>

<packaging>jar</packaging>
<!-- TODO: Rename the name field to some appropriate for your new document -->
<name>Porting-Guide-Vector-Intrinsics</name>

<properties>
<!-- This is set by Jenkins according to the branch. -->
<release.path.name></release.path.name>
<comments.enabled>0</comments.enabled>
</properties>
<!-- ################################################ -->
<!-- USE "mvn clean generate-sources" to run this POM -->
<!-- ################################################ -->
<build>
<plugins>
<plugin>

<groupId>org.openpowerfoundation.docs</groupId>

<artifactId>openpowerdocs-maven-plugin</artifactId>
<!-- version set in ../pom.xml -->
<executions>
<execution>
<id>generate-webhelp</id>
<goals>
<goal>generate-webhelp</goal>
</goals>
<phase>generate-sources</phase>
<configuration>
<!-- These parameters only apply to webhelp -->
<enableDisqus>${comments.enabled}</enableDisqus>
<disqusShortname>LoPAR-Virtualization</disqusShortname>
<enableGoogleAnalytics>1</enableGoogleAnalytics>
<googleAnalyticsId>UA-17511903-1</googleAnalyticsId>
<generateToc>
appendix toc,title
article/appendix nop
article toc,title
book toc,title,figure,table,example,equation
book/appendix nop
book/chapter nop
chapter toc,title
chapter/section nop
section toc
part toc,title
qandadiv toc
qandaset toc
reference toc,title
set toc,title
</generateToc>
<!-- The following elements sets the autonumbering of sections in output for chapter numbers but no numbered sections-->
<sectionAutolabel>1</sectionAutolabel>
<tocSectionDepth>3</tocSectionDepth>
<sectionLabelIncludesComponentLabel>1</sectionLabelIncludesComponentLabel>

<!-- TODO: Rename the webhelpDirname field to the new directory for new document -->
<webhelpDirname>Vector-Intrinsics</webhelpDirname>

<!-- TODO: Rename the pdfFilenameBase field to the PDF name for new document -->
<pdfFilenameBase>Vector-Intrinsics</pdfFilenameBase>

<!-- TODO: Define the appropriate work product type. These values are defined by the IPR Policy.
Consult with the Work Group Chair or a Technical Steering Committee member if you have
questions about which value to select.
If no value is provided below, the document will default to "Work Group Notes".-->
<workProduct>workgroupNotes</workProduct>
<!--workProduct>workgroupSpecification</workProduct-->
<!-- workProduct>candidateStandard</workProduct -->
<!-- workProduct>openpowerStandard</workProduct -->

<!-- TODO: Set the appropriate security policy for the document. For documents
which are not "public" this will affect the document title page and
create a vertical running ribbon on the internal margin of the
security status in all CAPS. Values and definitions are formally
defined by the IPR policy. A layman's definition follows:

public = this document may be shared outside the
foundation and thus this setting must be
used only when completely sure it allowed
foundationConfidential = this document may be shared freely with
OpenPOWER Foundation members but may not be
shared publicly
workgroupConfidential = this document may only be shared within the
work group and should not be shared with
other Foundation members or the public

The appropriate starting security for a new document is "workgroupConfidential". -->
<!--security>workgroupConfidential</security -->
<!-- security>foundationConfidential</security -->
<security>public</security>

<!-- TODO: Set the appropriate work flow status for the document. For documents
which are not "published" this will affect the document title page
and create a vertical running ribbon on the internal margin of the
security status in all CAPS. Values and definitions are formally
defined by the IPR policy. A layman's definition follows:

published = this document has completed all reviews and has
been published
draft = this document is actively being updated and has
not yet been reviewed
review = this document is presently being reviewed

The appropriate starting security for a new document is "draft". -->
<documentStatus>draft</documentStatus>
<!-- documentStatus>review</documentStatus -->
<!-- documentStatus>publish</documentStatus -->

</configuration>
</execution>
</executions>
<configuration>
<!-- These parameters apply to pdf and webhelp -->
<xincludeSupported>true</xincludeSupported>
<sourceDirectory>.</sourceDirectory>
<includes>
<!-- TODO: If you desire, you may change the following filename to something more appropriate for the new document -->
bk_main.xml
</includes>

<!-- **TODO: Set to the correct project URL. This likely needs input from the TSC. -->
<!-- canonicalUrlBase>http://openpowerfoundation.org/docs/template-guide/content</canonicalUrlBase -->
<glossaryCollection>${basedir}/../glossary/glossary-terms.xml</glossaryCollection>
<includeCoverLogo>1</includeCoverLogo>
<coverUrl>www.openpowerfoundation.org</coverUrl>
</configuration>
</plugin>
</plugins>
</build>
</project>

35
Vector_Intrinsics/sec_api_implemented.xml

@ -0,0 +1,35 @@ @@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_api_implemented">
<title>How the API is implemented</title>
<para>One pleasant surprise is that many (at least for the older Intel)
Intrinsics are implemented directly in C vector extension code and/or a simple
mapping to GCC target specific builtins. </para>
<xi:include href="sec_simple_examples.xml"/>
<xi:include href="sec_extra_attributes.xml"/>
<xi:include href="sec_how_findout.xml"/>
<xi:include href="sec_other_intrinsic_examples.xml"/>

</section>

111
Vector_Intrinsics/sec_crossing_lanes.xml

@ -0,0 +1,111 @@ @@ -0,0 +1,111 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_crossing_lanes">
<title>Crossing lanes</title>
<para>We have seen that, most of the time, vector SIMD units prefer to keep
computations in the same “lane” (element number) as the input elements. The
only exception in the examples so far are the occasional splat (copy one
element to all the other elements of the vector) operations. Splat is an
example of the general category of “permute” operations (Intel would call
this a “shuffle” or “blend”). Permutes selects and rearrange the
elements of (usually) a concatenated pair of vectors and delivers those
selected elements, in a specific order, to a result vector. The selection and
order of elements in the result is controlled by a third vector, either as 3rd
input vector or and immediate field of the instruction.</para>

<para>For example the Intel intrisics for
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&amp;expand=2757,4767,409,2757">Horizontal Add / Subtract</link>
added with SSE3. These instrinsics add (subtract) adjacent element pairs, across pair of
input vectors, placing the sum of the adjacent elements in the result vector.
For example
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&amp;expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>  
which implments the operation on float:
<programlisting><![CDATA[ result[0] = __A[1] + __A[0];
result[1] = __A[3] + __A[2];
result[2] = __B[1] + __B[0];
result[3] = __B[3] + __B[2];]]></programlisting></para>

<para>Horizontal Add (hadd) provides an incremental vector “sum across”
operation commonly needed in matrix and vector transform math. Horizontal Add
is incremental as you need three hadd instructions to sum across 4 vectors of 4
elements ( 7 for 8 x 8, 15 for 16 x 16, …).</para>
<para>The PowerISA does not have a sum-across operation for float or
double. We can user the vector float add instruction after we rearrange the
inputs so that element pairs line up for the horizontal add. For example we
would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104}
into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before
the  <literal>vec_add</literal>. This
requires two vector permutes to align the elements into the correct lanes for
the vector add (to implement Horizontal Add).  </para>

<para>The PowerISA provides generalized byte-level vector permute (vperm)
based a vector register pair source as input and a control vector. The control
vector provides 16 indexes (0-31) to select bytes from the concatenated input
vector register pair (VRA, VRB). A more specific set of permutes (pack, unpack,
merge, splat) operations (across element sizes) are encoded as separate
 instruction opcodes or instruction immediate fields.</para>

<para>Unfortunately only the general <literal>vec_perm</literal>
can provide the realignment
we need the _mm_hadd_ps operation or any of the int, short variants of hadd.
For example:
<programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_ps (__m128 __X, __m128 __Y)
{
__vector unsigned char xform2 = {
0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0A, 0x0B,
0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1A, 0x1B
};
__vector unsigned char xform1 = {
0x04, 0x05, 0x06, 0x07, 0x0C, 0x0D, 0x0E, 0x0F,
0x14, 0x15, 0x16, 0x17, 0x1C, 0x1D, 0x1E, 0x1F
};
return (__m128) vec_add (vec_perm ((__v4sf) __X, (__v4sf) __Y, xform1),
vec_perm ((__v4sf) __X, (__v4sf) __Y, xform2));
}]]></programlisting></para>

<para>This requires two permute control vectors; one to select the even
word elements across <literal>__X</literal> and <literal>__Y</literal>,
and another to select the odd word elements
across <literal>__X</literal> and <literal>__Y</literal>.
The result of these permutes (<literal>vec_perm</literal>) are inputs to the
<literal>vec_add</literal> and completes the add operation. </para>

<para>Fortunately the permute required for the double (64-bit) case (IE
_mm_hadd_pd) reduces to the equivalent of <literal>vec_mergeh</literal> /
<literal>vec_mergel</literal>  doubleword
(which are variants of  VSX Permute Doubleword Immediate). So the
implementation of _mm_hadd_pd can be simplified to this:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_pd (__m128d __X, __m128d __Y)
{
return (__m128d) vec_add (vec_mergeh ((__v2df) __X, (__v2df)__Y),
vec_mergel ((__v2df) __X, (__v2df)__Y));
}]]></programlisting></para>

<para>This eliminates the load of the control vectors required by the
previous example.</para>

</section>

40
Vector_Intrinsics/sec_differences.xml

@ -0,0 +1,40 @@ @@ -0,0 +1,40 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_differences">
<title>Profound differences</title>
<para>We have already mentioned above a number of architectural differences
that effect porting of codes containing Intel intrinsics to POWER. The fact
that Intel supports multiple vector extensions with different vector widths
(64, 128, 256, and 512-bits) while the PowerISA only supports vectors of
128-bits is one issue. Another is the difference in how the respective ISAs
support scalars in vector registers is another.  In the text above we propose
workable alternatives for the PowerPC port. There also differences in the
handling of floating point exceptions and rounding modes that may impact the
application's performance or behavior.</para>
<xi:include href="sec_floatingpoint_exceptions.xml"/>
<xi:include href="sec_floatingpoint_rounding.xml"/>
<xi:include href="sec_performance.xml"/>

</section>

137
Vector_Intrinsics/sec_extra_attributes.xml

@ -0,0 +1,137 @@ @@ -0,0 +1,137 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_extra_attributes">
<title>Those extra attributes</title>
<para>You may have noticed there are some special attributes:
<literallayout>__gnu_inline__

This attribute should be used with a function that is also declared with the
inline keyword. It directs GCC to treat the function as if it were defined in
gnu90 mode even when compiling in C99 or gnu99 mode.

If the function is declared extern, then this definition of the function is used
only for inlining. In no case is the function compiled as a standalone function,
not even if you take its address explicitly. Such an address becomes an external
reference, as if you had only declared the function, and had not defined it. This
has almost the effect of a macro. The way to use this is to put a function
definition in a header file with this attribute, and put another copy of the
function, without extern, in a library file. The definition in the header file
causes most calls to the function to be inlined.

__always_inline__

Generally, functions are not inlined unless optimization is specified. For func-
tions declared inline, this attribute inlines the function independent of any
restrictions that otherwise apply to inlining. Failure to inline such a function
is diagnosed as an error.

__artificial__

This attribute is useful for small inline wrappers that if possible should appear
during debugging as a unit. Depending on the debug info format it either means
marking the function as artificial or using the caller location for all instructions
within the inlined body.

__extension__

... -pedantic’ and other options cause warnings for many GNU C extensions.
You can prevent such warnings within one expression by writing __extension__</literallayout></para>

<para>So far I have been using these attributes unchanged.</para>

<para>But most intrinsics map the Intel intrinsic to one or more target
specific GCC builtins. For example:
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
return *(__m128d *)__P;
}

/* Load two DPFP values from P. The address need not be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
return __builtin_ia32_loadupd (__P);
}]]></programlisting></para>

<para>The first intrinsic (_mm_load_pd ) is implement as a C vector pointer
reference, but from the comment assumes the compiler will use a
<emphasis role="bold">movapd</emphasis>
instruction that requires 16-byte alignment (will raise a general-protection
exception if not aligned). This  implies that there is a performance advantage
for at least some Intel processors to keep the vector aligned. The second
intrinsic uses the explicit GCC builtin
<emphasis role="bold"><literal>__builtin_ia32_loadupd</literal></emphasis> to generate the
<emphasis role="bold"><literal>movupd</literal></emphasis> instruction which handles unaligned references.</para>

<para>The opposite assumption applies to POWER and PPC64LE, where GCC
generates the VSX <emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
<emphasis role="bold"><literal>xxswapd</literal></emphasis>
instruction sequence by default, which
allows unaligned references. The PowerISA equivalent for aligned vector access
is the VMX <emphasis role="bold"><literal>lvx</literal></emphasis> instruction and the
<emphasis role="bold"><literal>vec_ld</literal></emphasis> builtin, which forces quadword
aligned access (by ignoring the low order 4 bits of the effective address). The
<emphasis role="bold"><literal>lvx</literal></emphasis> instruction does not raise
alignment exceptions, but perhaps should as part
of our implementation of the Intel intrinsic. This requires that we use
PowerISA VMX/VSX built-ins to insure we get the expected results.</para>

<para>The current prototype defines the following:
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
assert(((unsigned long)__P & 0xfUL) == 0UL);
return ((__m128d)vec_ld(0, (__v16qu*)__P));
}

/* Load two DPFP values from P. The address need not be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
return (vec_vsx_ld(0, __P));
}]]></programlisting></para>

<para>The aligned  load intrinsic adds an assert which checks alignment
(to match the Intel semantic) and uses  the GCC builtin
<emphasis role="bold"><literal>vec_ld</literal></emphasis> (generates an
<emphasis role="bold"><literal>lvx</literal></emphasis>).  The assert
generates extra code but this can be eliminated by defining
<emphasis role="bold"><literal>NDEBUG</literal></emphasis> at compile time.
The unaligned load intrinsic uses the GCC builtin
<literal>vec_vsx_ld</literal>  (for PPC64LE generates
<emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
<emphasis role="bold"><literal>xxswapd</literal></emphasis> for POWER8  and will
simplify to <emphasis role="bold"><literal>lxv</literal></emphasis>
or <emphasis role="bold"><literal>lxvx</literal></emphasis>
for POWER9).  And similarly for <emphasis role="bold"><literal>__mm_store_pd</literal></emphasis> /
<emphasis role="bold"><literal>__mm_storeu_pd</literal></emphasis>, using
<emphasis role="bold"><literal>vec_st</literal></emphasis>
and <emphasis role="bold"><literal>vec_vsx_st</literal></emphasis>. These concepts extent to the
load/store intrinsics for vector float and vector int.</para>

</section>

73
Vector_Intrinsics/sec_floatingpoint_exceptions.xml

@ -0,0 +1,73 @@ @@ -0,0 +1,73 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_floatingpoint_exceptions">
<title>Floating Point Exceptions</title>
<para>Nominally both ISAs support the IEEE754 specifications, but there are
some subtle differences. Both architecture define a status and control register
to record exceptions and enable / disable floating exceptions for program
interrupt or default action. Intel has a MXCSR and PowerISA has a FPSCR which
basically do the same thing but with different bit layout. </para>

<para>Intel provides <literal>_mm_setcsr</literal> / <literal>_mm_getcsr</literal>
intrinsics to allow direct
access to the MXCSR. In the early days before the OS POSIX run-times where
updated  to manage the MXCSR, this might have been useful. Today this would be
highly discouraged with a strong preference to use the POSIX APIs
(<literal>feclearexceptflag</literal>,
<literal>fegetexceptflag</literal>,
<literal>fesetexceptflag</literal>, ...) instead.</para>

<para>If we implement <literal>_mm_setcsr</literal> /
<literal>_mm_getcs</literal> at all, we should simply
redirect the implementation to use the POSIX APIs from
<literal>&lt;fenv.h&gt;</literal>. But it
might be simpler just to replace these intrinsics with macros that generate
#error.</para>

<para>The Intel MXCSR does have some none (POSIX/IEEE754) standard quirks;
Flush-To-Zero and Denormals-Are-Zeros flags. This simplifies the hardware
response to what should be a rare condition (underflows where the result can
not be represented in the exponent range and precision of the format) by simply
returning a signed 0.0 value. The intrinsic header implementation does provide
constant masks for <literal>_MM_DENORMALS_ZERO_ON</literal>
(<literal>&lt;pmmintrin.h&gt;</literal>) and
<literal>_MM_FLUSH_ZERO_ON</literal> (<literal>&lt;xmmintrin.h&gt;</literal>,
so technically it is available to users
of the Intel Intrinsics API.</para>

<para>The VMX Vector facility provides a separate Vector Status and Control
register (VSCR) with a Non-Java Mode control bit. This control combines the
flush-to-zero semantics for floating Point underflow and denormal values. But
this control only applies to VMX vector float instructions and does not apply
to VSX scalar floating Point or vector double instructions. The FPSCR does
define a Floating-Point non-IEEE mode which is optional in the architecture.
This would apply to Scalar and VSX floating-point operations if it was
implemented. This was largely intended for embedded processors and is not
implemented in the POWER processor line.</para>

<para>As the flush-to-zero is primarily a performance enhansement and is
clearly outside the IEEE754 standard, it may be best to simply ignore this
option for the intrinsic port.</para>

</section>

33
Vector_Intrinsics/sec_floatingpoint_rounding.xml

@ -0,0 +1,33 @@ @@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_floatingpoint_rounding">
<title>Floating-point rounding modes</title>
<para>The Intel (x86 / x86_64) and PowerISA architectures both support the
4 IEEE754 rounding modes. Again while the Intel Intrinsic API allows the
application to change rounding modes via updates to the
<literal>MXCSR</literal> it is a bad idea
and should be replaced with the POSIX APIs (<literal>fegetround</literal> and
<literal>fesetround</literal>). </para>
</section>

113
Vector_Intrinsics/sec_gcc_vector_extensions.xml

@ -0,0 +1,113 @@ @@ -0,0 +1,113 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_gcc_vector_extensions">
<title>GCC Vector Extensions</title>
<para>The GCC vector extensions are common syntax but implemented in a
target specific way. Using the C vector extensions require the
<literal>__gnu_inline__</literal>
attribute to avoid syntax errors in case the user specified  C standard
compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>,
etc) that would normally disallow such
extensions. </para>

<para>The GCC implementation for PowerPC64 Little Endian is (mostly)
functionally compatible with x86_64 vector extension usage. We can use the same
type definitions (at least for  vector_size (16)), operations, syntax
<literal>&lt;</literal><emphasis role="bold"><literal>{</literal></emphasis><literal>...</literal><emphasis role="bold"><literal>}</literal></emphasis><literal>&gt;</literal>
for vector initializers and constants, and array syntax
<literal>&lt;</literal><emphasis role="bold"><literal>[]</literal></emphasis><literal>&gt;</literal>
for vector element access. So simple arithmetic / logical operations
on whole vectors should work as is. </para>

<para>The caveat is that the interface data type of the Intel Intrinsic may
not match the data types of the operation, so it may be necessary to cast the
operands to the specific type for the operation. This also applies to vector
initializers and accessing vector elements. You need to use the appropriate
type to get the expected results. Of course this applies to X86_64 as well. For
example:
<programlisting><![CDATA[/* Perform the respective operation on the four SPFP values in A and B. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
return (__m128) ((__v4sf)__A + (__v4sf)__B);
}

/* Stores the lower SPFP value. */
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_store_ss (float *__P, __m128 __A)
{
*__P = ((__v4sf)__A)[0];
}]]></programlisting></para>

<para>Note the cast from the interface type (<literal>__m128</literal>} to the implementation
type (<literal>__v4sf</literal>, defined in the intrinsic header) for the vector float add (+)
operation. This is enough for the compiler to select the appropriate vector add
instruction for the float type. Then the result (which is
<literal>__v4sf</literal>) needs to be
cast back to the expected interface type (<literal>__m128</literal>). </para>

<para>Note also the use of <emphasis>array syntax</emphasis> (<literal>__A)[0]</literal>)
to extract the lowest
(left most<footnote><para>Here we are using logical left and logical right
which will not match the PowerISA register view in Little endian. Logical left
is the left most element for initializers {left, … , right}, storage order
and array  order where the left most element is [0].</para></footnote>)
element of a vector. The cast (<literal>__v4sf</literal>) insures that the compiler knows we are
extracting the left most 32-bit float. The compiler insures the code generated
matches the Intel behavior for PowerPC64 Little Endian. </para>

<para>The code generation is complicated by the fact that PowerISA vector
registers are Big Endian (element 0 is the left most word of the vector) and
X86 scalar stores are from the left most (work/dword) for the vector register.
Application code with extensive use of scalar (vs packed) intrinsic loads /
stores should be flagged for rewrite to native PPC code using exisiing scalar
types (float, double, int, long, etc.). </para>

<para>Another example is the set reverse order:
<programlisting><![CDATA[/* Create the vector [Z Y X W]. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set_ps (const float __Z, const float __Y, const float __X, const float __W)
{
return __extension__ (__m128)(__v4sf){ __W, __X, __Y, __Z };
}

/* Create the vector [W X Y Z]. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_setr_ps (float __Z, float __Y, float __X, float __W)
{
return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
}]]></programlisting></para>

<para>Note the use of <emphasis>initializer syntax</emphasis> used to collect a set of scalars
into a vector. Code with constant initializer values will generate a vector
constant of the appropriate endian. However code with variables in the
initializer can get complicated as it often requires transfers between register
sets and perhaps format conversions. We can assume that the compiler will
generate the correct code, but if this class of intrinsics shows up a hot spot,
a rewrite to native PPC vector built-ins may be appropriate. For example
initializer of a variable replicated to all the vector fields might not be
recognized as a “load and splat” and making this explicit may help the
compiler generate better code.</para>

</section>

91
Vector_Intrinsics/sec_handling_avx.xml

@ -0,0 +1,91 @@ @@ -0,0 +1,91 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_handling_avx">
<title>Dealing with AVX and AVX512</title>
<para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have
lots (64) of vector registers and a super scalar vector pipe-line (can execute
two or more independent 128-bit vector operations concurrently). Second the ELF
V2 ABI was designed to pass and return larger aggregates in vector
registers:</para>

<itemizedlist>
<listitem>
<para>Up to 12 qualified vector arguments can be passed in
v2–v13.</para>
</listitem>
<listitem>
<para>A qualified vector argument corresponds to:
<itemizedlist>
<listitem>
<para>A vector data type</para>
</listitem>

<listitem>
<para>A member of a homogeneous aggregate of multiple like data types
passed in up to eight vector registers.</para>
</listitem>

<listitem>
<para>Homogeneous floating-point or vector aggregate return values
that consist of up to eight registers with up to eight elements will
be returned in floating-point or vector registers that correspond to
the parameter registers that would be used if the return value type
were the first input parameter to a function.</para>
</listitem>
</itemizedlist>
</para>
</listitem>
</itemizedlist>

<para>So the ABI allows for passing up to three structures each
representing 512-bit vectors and returning such (512-bit) structure all in VMX
registers. This can be extended further by spilling parameters (beyond 12 X
128-bit vectors) to the parameter save area, but we should not need that, as
most intrinsics only use 2 or 3 operands.. Vector registers not needed for
parameter passing, along with an additional 8 volatile vector registers, are
available for scratch and local variables. All can be used by the application
without requiring register spill to the save area. So most intrinsic operations
on 256- or 512-bit vectors can be held within existing PowerISA vector
registers. </para>

<para>For larger functions that might use multiple AVX 256 or 512-bit
intrinsics and, as a result, push beyond the 20 volatile vector registers, the
compiler will just allocate non-volatile vector registers by allocating a stack
frame and spilling non-volatile vector registers to the save area (as needed in
the function prologue). This frees up to 64 vectors (32 x 256-bit or 16 x
512-bit structs) for code optimization. </para>