Inital port from SJM R4 document

pull/30/head
Jeff Scheel 8 years ago
parent fb227f66cb
commit af0f65de25

File diff suppressed because one or more lines are too long

@ -0,0 +1,176 @@

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

@ -0,0 +1,93 @@
# Porting Guide for Linux on Power
TBD...

To build this project, one must ensure that the Docs-Master project has
also been cloned at the same directory level as the Docs-Template project.
This can be accomplished with the following steps:

1. Clone the master documentation project (Docs-Master) using the following command:

```
$ git clone https://github.com/OpenPOWERFoundation/Docs-Master.git
```
2. Clone this project (Docs-Template) using the following command:

```
$ git clone https://ibm.github.com/scheel/SJM-Porting-Guide.git
```
3. Build the project with these commands:
```
$ cd SJM-Porting-Guide
$ mvn clean generate-sources
```

The online version of the document can be found in the OpenPOWER Foundation
Document library at [TBD](http://openpowerfoundation.org/?resource_lib=TBD).

The project which controls the look and feel of the document is the
[Docs-Maven-Plugin project](https://github.com/OpenPOWERFoundation/Docs-Maven-Plugin), an
OpenPOWER Foundation private project on GitHub. To obtain access to the Maven Plugin project,
contact Jeff Scheel \([scheel@us.ibm.com](mailto://scheel@us.ibm.com)\) or
Jeff Brown \([jeffdb@us.ibm.com](mailto://jeffdb@us.ibm.com)\).

## License
This project is licensed under the Apache V2 license. More information
can be found in the LICENSE file or online at

http://www.apache.org/licenses/LICENSE-2.0

## Community
TBD...

## Contributions
TBD...

Contributions to this project should conform to the `Developer Certificate
of Origin` as defined at http://elinux.org/Developer_Certificate_Of_Origin.
Commits to this project need to contain the following line to indicate
the submitter accepts the DCO:
```
Signed-off-by: Your Name <your_email@domain.com>
```
By contributing in this way, you agree to the terms as follows:
```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
660 York Street, Suite 102,
San Francisco, CA 94110 USA

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
```

@ -0,0 +1,318 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<appendix xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="app_intel_suffixes">
<?dbhtml stop-chunking?>
<title>Intel Intrinsic suffixes</title>
<section>
<title>MMX</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_pi16</literal></emphasis></term>
<listitem><para>4 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pi32</literal></emphasis></term>
<listitem><para>2 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pi8</literal></emphasis></term>
<listitem><para>8 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pu16</literal></emphasis></term>
<listitem><para>4 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pu8</literal></emphasis></term>
<listitem><para>8 x packed unsigned char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si64</literal></emphasis></term>
<listitem><para>single 64-bit binary (logical)</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>SSE</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
<listitem><para>4 x packed float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
<listitem><para>single scalar float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si32</literal></emphasis></term>
<listitem><para>single 32-bit int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si64</literal></emphasis></term>
<listitem><para>single 64-bit long int</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>SSE2</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
<listitem><para>8 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
<listitem><para>4 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
<listitem><para>2 x packed long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
<listitem><para>16 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
<listitem><para>8 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
<listitem><para>4 x packed unsigned int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
<listitem><para>16 x packed unsigned char</para></listitem>
</varlistentry>
</variablelist>

<!-- Is this break really desired? -->
<para/>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
<listitem><para>2 x packed double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
<listitem><para>single scalar double</para></listitem>
</varlistentry>
<varlistentry>
<term><emphasis role="bold"><literal>_pi64</literal></emphasis></term>
<listitem><para>single long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si128</literal></emphasis></term>
<listitem><para>single 128-bit binary (logical)</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>AVX/AVX2 __m256_*</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
<listitem><para>8 x packed float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
<listitem><para>4 x packed double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
<listitem><para>16 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
<listitem><para>8 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
<listitem><para>4 x packed long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
<listitem><para>32 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
<listitem><para>16 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
<listitem><para>8 x packed unsigned int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
<listitem><para>32 x packed unsigned char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
<listitem><para>single scalar float (broadcast/splat)</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
<listitem><para>single scalar double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si256</literal></emphasis></term>
<listitem><para>single 256-bit binary (logical)</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd256</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ps256</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd128</literal></emphasis></term>
<listitem><para>cast</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ps128</literal></emphasis></term>
<listitem><para>cast</para></listitem>
</varlistentry>
</variablelist>
</section>
<section>
<title>AVX512 __m512_*</title>
<variablelist>
<varlistentry>
<term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
<listitem><para>16 x packed float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
<listitem><para>8 x packed double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
<listitem><para>32 x packed short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
<listitem><para>16 x packed int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
<listitem><para>8 x packed long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
<listitem><para>64 x packed signed char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
<listitem><para>32 x packed unsigned short int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
<listitem><para>16 x packed unsigned int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu64</literal></emphasis></term>
<listitem><para>8 x packed unsigned long int</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
<listitem><para>64 x packed unsigned char</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
<listitem><para>single scalar float</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
<listitem><para>single scalar double</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_si512</literal></emphasis></term>
<listitem><para>single 512-bit binary (logical)</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_pd512</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>

<varlistentry>
<term><emphasis role="bold"><literal>_ps512</literal></emphasis></term>
<listitem><para>cast / zero extend</para></listitem>
</varlistentry>
</variablelist>
</section>

</appendix>

@ -0,0 +1,70 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<appendix xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="app_references">
<?dbhtml stop-chunking?>
<title>Document references</title>
<section>
<title>OpenPOWER and Power documents</title>
<para>
<link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER™ Technical Specifications</link>
</para>
<para>
<link xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b">Power ISA™ Version 2.07 B</link>
</para>
<para>
<link xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">Power ISA™ Version 3.0</link>
</para>
<para>
<link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">Power Architecture 64-bit ELF ABI Specification (AKA OpenPower ABI for Linux Supplement)</link>
</para>
<para>
<link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">AltiVec™ Technology Programming Environments Manual</link>
</para>

</section>
<section>
<title>A.2 Intel documents</title>
<para>
<link xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel® 64 and IA-32 Architectures Software Developers Manual</link>
</para>
<para>
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel™ Intrinsics Guide</link>
</para>
<para/>
</section>
<section>
<title>A.3 GNU Compiler Collection (GCC) documents</title>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link>
</para>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/">GCC Manual (GCC 6.3)</link>
</para>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GCC Internals Manual</link>
</para>
<para/>
</section>

</appendix>

@ -0,0 +1,103 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<book xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="bk_main">

<title>Linux on Power Porting Guide</title>
<subtitle>Vector Intrinsic</subtitle>

<info>
<author>
<personname>
<surname>System Software Work Group</surname>
</personname>
<email>syssw-chair@openpowerfoundation.org</email>
<affiliation>
<orgname>OpenPOWER Foundation</orgname>
</affiliation>
</author>
<copyright>
<year>2017</year>
<holder>OpenPOWER Foundation</holder>
</copyright>
<!-- TODO: Set the correct document releaseinfo -->
<releaseinfo>Revision 0.1</releaseinfo>
<productname>OpenPOWER</productname>
<pubdate/>

<legalnotice role="apache2">

<annotation>
<remark>Copyright details are filled in by the template.</remark>
</annotation>
</legalnotice>
<!-- TODO: Update the following text with the correct document description (first paragraph),
Work Group name, and Work Product track (both in second paragraph). -->
<abstract>
<para>The goal of this project is to provide functional equivalents of the
Intel MMX, SSE, and AVX intrinsic functions, that are commonly used in Linux
applications, and make them (or equivalents) available for the PowerPC64LE
platform.</para>
<para>This document is a Standard Track, Work Group Note work product owned by the
System Software Workgroup and handled in compliance with the requirements outlined in the
<citetitle>OpenPOWER Foundation Work Group (WG) Process</citetitle> document. It was
created using the <citetitle>Master Template Guide</citetitle> version 0.9.5. Comments,
questions, etc. can be submitted to the public mailing list for this document at
<link xlink:href="http://tbd.openpowerfoundation.org">TBD</link>.</para>
</abstract>

<revhistory>
<!-- TODO: Update as new revisions created -->
<revision>
<date>2017-07-26</date>
<revdescription>
<itemizedlist spacing="compact">
<listitem>
<para>Revision 0.1 - initial draft from Steve Munroe</para>
</listitem>
</itemizedlist>
</revdescription>
</revision>
</revhistory>
</info>

<!-- The ch_preface.xml file is required by all documents -->
<xi:include href="../../Docs-Master/common/ch_preface.xml"/>

<!-- Chapter heading files -->
<xi:include href="ch_intel_intrinsic_porting.xml"/>
<xi:include href="ch_howto_start.xml"/>
<!-- Placeholder files ATM -->
<!--chapter><title>Placeholders</title>
</chapter-->

<!-- Document specific appendices -->
<xi:include href="app_references.xml"/>
<xi:include href="app_intel_suffixes.xml"/>

<!-- The app_foundation.xml appendix file is required by all documents. -->
<xi:include href="../../Docs-Master/common/app_foundation.xml"/>

</book>

@ -0,0 +1,115 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="ch_howto_start">
<title>How do we work this?</title>
<para>The working assumption is to start with the existing GCC headers from
./gcc/config/i386/, then convert them to PowerISA and add them to
./gcc/config/rs6000/. I assume we will replicate the existing header structure
and retain the existing header file and intrinsic names. This also allows us to
reuse existing DejaGNU test cases from ./gcc/testsuite/gcc.target/i386, modify
them as needed for the POWER target, and them to the
./gcc/testsuite/gcc.target/powerpc.</para>

<para>We can be flexible on the sequence that headers/intrinsics and test
cases are ported.  This should be based on customer need and resolving
internal dependencies.  This implies an oldest-to-newest / bottoms-up (MMX,
SSE, SSE2, …) strategy. The assumption is, existing community and user
application codes, are more likely to have optimized code for previous
generation ubiquitous (SSE, SSE2, ...) processors than the latest (and rare)
SkyLake AVX512.</para>

<para>I would start with an existing header from the current GCC
 ./gcc/config/i386/ and copy the header comment (including FSF copyright) down
to any vector typedefs used in the API or implementation. Skip the Intel
intrinsic implementation code for now, but add the ending #end if matching the
headers conditional guard against multiple inclusion. You can add  #include
&lt;alternative&gt; as needed. For examples:
<programlisting><![CDATA[/* Copyright (C) 2003-2017 Free Software Foundation, Inc
...
/* This header provides a best effort implementation of the Intel X86
* SSE2 intrinsics for the PowerPC target. This implementation is a
* combination of compiled C vector codes or equivalent sequences of
* GCC vector builtins from the GCC PowerPC Altivec target.
*
* However some details of this implementation will differ from
* the X86 due to differences in the underlying hardware or GCC
* implementation. For example the PowerPC target only uses unordered
* floating point compares. */

#ifndef EMMINTRIN_H_
#define EMMINTRIN_H_

#include <altivec.h>
#include <assert.h>

/* We need definitions from the SSE header files. */
#include <xmmintrin.h>

/* The Intel API is flexible enough that we must allow aliasing with other
vector types, and their scalar components. */
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

/* Internal data types for implementing the intrinsics. */
typedef float __v4sf __attribute__ ((__vector_size__ (16)));
/* more typedefs. */

/* The intrinsic implmentations go here. */

#endif /* EMMINTRIN_H_ */]]></programlisting></para>

<para>Then you can start adding small groups of related intrinsic
implementations to the header to be compiled and  examine the generated code.
Once you have what looks like reasonable code you can grep through
 ./gcc/testsuite/gcc.target/i386 for examples using the intrinsic names you
just added. You should be able to find functional tests for most X86
intrinsics. </para>

<para>The
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html#Testsuites">GCC
testsuite</link> uses the DejaGNU  test framework as documented in the
<link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GNU Compiler Collection (GCC)
Internals</link> manual. GCC adds its own DejaGNU directives and extensions,
that are embedded in the testsuite source as comments.  Some are platform
specific and will need to be adjusted for tests that are ported to our
platform. For example
<programlisting><![CDATA[/* { dg-do run } */
/* { dg-options "-O2 -msse2" } */
/* { dg-require-effective-target sse2 } */]]></programlisting></para>

<para>should become something like
<programlisting><![CDATA[/* { dg-do run } */
/* { dg-options "-O3 -mpower8-vector" } */
/* { dg-require-effective-target lp64 } */
/* { dg-require-effective-target p8vector_hw { target powerpc*-*-* } } */]]></programlisting></para>

<para>Repeat this process until you have equivalent implementations for all
the intrinsics in that header and associated test cases that execute without
error.</para>

<xi:include href="sec_prefered_methods.xml"/>
<xi:include href="sec_prepare.xml"/>
<xi:include href="sec_differences.xml"/>


</chapter>

@ -0,0 +1,46 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="ch_intel_intrinsic_porting">
<title>Intel Intrinsic porting guide for Power64LE</title>
<para>The goal of this project is to provide functional equivalents of the
Intel MMX, SSE, and AVX intrinsic functions, that are commonly used in Linux
applications, and make them (or equivalents) available for the PowerPC64LE
platform. These X86 intrinsics started with the Intel and Microsoft compilers
but were then ported to the GCC compiler. The GCC implementation is a set of
headers with inline functions. These inline functions provide a implementation
mapping from the Intel/Microsoft dialect intrinsic names to the corresponding
GCC Intel built-in's or directly via C language vector extension syntax.</para>

<para>The current proposal is to start with the existing X86 GCC intrinsic
headers and port them (copy and change the source)  to POWER using C language
vector extensions, VMX and VSX built-ins. Another key assumption is that we
will be able to use many of existing Intel DejaGNU test cases on
./gcc/testsuite/gcc.target/i386. This document is intended as a guide to
developers participating in this effort. However this document provides
guidance and examples that should be useful to developers who may encounter X86
intrinsics in code that they are porting to another platform.</para>

<xi:include href="sec_review_source.xml"/>

</chapter>

@ -0,0 +1,148 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<parent>

<groupId>org.openpowerfoundation.docs</groupId>
<artifactId>workgroup-pom</artifactId>
<version>1.0.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>
<modelVersion>4.0.0</modelVersion>

<!-- TODO: Rename the artifactID field to some appropriate for your new document -->
<artifactId>Porting-Guide-Vector-Intrinsics</artifactId>

<packaging>jar</packaging>
<!-- TODO: Rename the name field to some appropriate for your new document -->
<name>Porting-Guide-Vector-Intrinsics</name>

<properties>
<!-- This is set by Jenkins according to the branch. -->
<release.path.name></release.path.name>
<comments.enabled>0</comments.enabled>
</properties>
<!-- ################################################ -->
<!-- USE "mvn clean generate-sources" to run this POM -->
<!-- ################################################ -->
<build>
<plugins>
<plugin>

<groupId>org.openpowerfoundation.docs</groupId>

<artifactId>openpowerdocs-maven-plugin</artifactId>
<!-- version set in ../pom.xml -->
<executions>
<execution>
<id>generate-webhelp</id>
<goals>
<goal>generate-webhelp</goal>
</goals>
<phase>generate-sources</phase>
<configuration>
<!-- These parameters only apply to webhelp -->
<enableDisqus>${comments.enabled}</enableDisqus>
<disqusShortname>LoPAR-Virtualization</disqusShortname>
<enableGoogleAnalytics>1</enableGoogleAnalytics>
<googleAnalyticsId>UA-17511903-1</googleAnalyticsId>
<generateToc>
appendix toc,title
article/appendix nop
article toc,title
book toc,title,figure,table,example,equation
book/appendix nop
book/chapter nop
chapter toc,title
chapter/section nop
section toc
part toc,title
qandadiv toc
qandaset toc
reference toc,title
set toc,title
</generateToc>
<!-- The following elements sets the autonumbering of sections in output for chapter numbers but no numbered sections-->
<sectionAutolabel>1</sectionAutolabel>
<tocSectionDepth>3</tocSectionDepth>
<sectionLabelIncludesComponentLabel>1</sectionLabelIncludesComponentLabel>

<!-- TODO: Rename the webhelpDirname field to the new directory for new document -->
<webhelpDirname>Vector-Intrinsics</webhelpDirname>

<!-- TODO: Rename the pdfFilenameBase field to the PDF name for new document -->
<pdfFilenameBase>Vector-Intrinsics</pdfFilenameBase>

<!-- TODO: Define the appropriate work product type. These values are defined by the IPR Policy.
Consult with the Work Group Chair or a Technical Steering Committee member if you have
questions about which value to select.
If no value is provided below, the document will default to "Work Group Notes".-->
<workProduct>workgroupNotes</workProduct>
<!--workProduct>workgroupSpecification</workProduct-->
<!-- workProduct>candidateStandard</workProduct -->
<!-- workProduct>openpowerStandard</workProduct -->

<!-- TODO: Set the appropriate security policy for the document. For documents
which are not "public" this will affect the document title page and
create a vertical running ribbon on the internal margin of the
security status in all CAPS. Values and definitions are formally
defined by the IPR policy. A layman's definition follows:

public = this document may be shared outside the
foundation and thus this setting must be
used only when completely sure it allowed
foundationConfidential = this document may be shared freely with
OpenPOWER Foundation members but may not be
shared publicly
workgroupConfidential = this document may only be shared within the
work group and should not be shared with
other Foundation members or the public

The appropriate starting security for a new document is "workgroupConfidential". -->
<!--security>workgroupConfidential</security -->
<!-- security>foundationConfidential</security -->
<security>public</security>

<!-- TODO: Set the appropriate work flow status for the document. For documents
which are not "published" this will affect the document title page
and create a vertical running ribbon on the internal margin of the
security status in all CAPS. Values and definitions are formally
defined by the IPR policy. A layman's definition follows:

published = this document has completed all reviews and has
been published
draft = this document is actively being updated and has
not yet been reviewed
review = this document is presently being reviewed

The appropriate starting security for a new document is "draft". -->
<documentStatus>draft</documentStatus>
<!-- documentStatus>review</documentStatus -->
<!-- documentStatus>publish</documentStatus -->

</configuration>
</execution>
</executions>
<configuration>
<!-- These parameters apply to pdf and webhelp -->
<xincludeSupported>true</xincludeSupported>
<sourceDirectory>.</sourceDirectory>
<includes>
<!-- TODO: If you desire, you may change the following filename to something more appropriate for the new document -->
bk_main.xml
</includes>

<!-- **TODO: Set to the correct project URL. This likely needs input from the TSC. -->
<!-- canonicalUrlBase>http://openpowerfoundation.org/docs/template-guide/content</canonicalUrlBase -->
<glossaryCollection>${basedir}/../glossary/glossary-terms.xml</glossaryCollection>
<includeCoverLogo>1</includeCoverLogo>
<coverUrl>www.openpowerfoundation.org</coverUrl>
</configuration>
</plugin>
</plugins>
</build>
</project>

@ -0,0 +1,35 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_api_implemented">
<title>How the API is implemented</title>
<para>One pleasant surprise is that many (at least for the older Intel)
Intrinsics are implemented directly in C vector extension code and/or a simple
mapping to GCC target specific builtins. </para>
<xi:include href="sec_simple_examples.xml"/>
<xi:include href="sec_extra_attributes.xml"/>
<xi:include href="sec_how_findout.xml"/>
<xi:include href="sec_other_intrinsic_examples.xml"/>

</section>

@ -0,0 +1,111 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_crossing_lanes">
<title>Crossing lanes</title>
<para>We have seen that, most of the time, vector SIMD units prefer to keep
computations in the same “lane” (element number) as the input elements. The
only exception in the examples so far are the occasional splat (copy one
element to all the other elements of the vector) operations. Splat is an
example of the general category of “permute” operations (Intel would call
this a “shuffle” or “blend”). Permutes selects and rearrange the
elements of (usually) a concatenated pair of vectors and delivers those
selected elements, in a specific order, to a result vector. The selection and
order of elements in the result is controlled by a third vector, either as 3rd
input vector or and immediate field of the instruction.</para>

<para>For example the Intel intrisics for
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&amp;expand=2757,4767,409,2757">Horizontal Add / Subtract</link>
added with SSE3. These instrinsics add (subtract) adjacent element pairs, across pair of
input vectors, placing the sum of the adjacent elements in the result vector.
For example
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&amp;expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>  
which implments the operation on float:
<programlisting><![CDATA[ result[0] = __A[1] + __A[0];
result[1] = __A[3] + __A[2];
result[2] = __B[1] + __B[0];
result[3] = __B[3] + __B[2];]]></programlisting></para>

<para>Horizontal Add (hadd) provides an incremental vector “sum across”
operation commonly needed in matrix and vector transform math. Horizontal Add
is incremental as you need three hadd instructions to sum across 4 vectors of 4
elements ( 7 for 8 x 8, 15 for 16 x 16, …).</para>
<para>The PowerISA does not have a sum-across operation for float or
double. We can user the vector float add instruction after we rearrange the
inputs so that element pairs line up for the horizontal add. For example we
would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104}
into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before
the  <literal>vec_add</literal>. This
requires two vector permutes to align the elements into the correct lanes for
the vector add (to implement Horizontal Add).  </para>

<para>The PowerISA provides generalized byte-level vector permute (vperm)
based a vector register pair source as input and a control vector. The control
vector provides 16 indexes (0-31) to select bytes from the concatenated input
vector register pair (VRA, VRB). A more specific set of permutes (pack, unpack,
merge, splat) operations (across element sizes) are encoded as separate
 instruction opcodes or instruction immediate fields.</para>

<para>Unfortunately only the general <literal>vec_perm</literal>
can provide the realignment
we need the _mm_hadd_ps operation or any of the int, short variants of hadd.
For example:
<programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_ps (__m128 __X, __m128 __Y)
{
__vector unsigned char xform2 = {
0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0A, 0x0B,
0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1A, 0x1B
};
__vector unsigned char xform1 = {
0x04, 0x05, 0x06, 0x07, 0x0C, 0x0D, 0x0E, 0x0F,
0x14, 0x15, 0x16, 0x17, 0x1C, 0x1D, 0x1E, 0x1F
};
return (__m128) vec_add (vec_perm ((__v4sf) __X, (__v4sf) __Y, xform1),
vec_perm ((__v4sf) __X, (__v4sf) __Y, xform2));
}]]></programlisting></para>

<para>This requires two permute control vectors; one to select the even
word elements across <literal>__X</literal> and <literal>__Y</literal>,
and another to select the odd word elements
across <literal>__X</literal> and <literal>__Y</literal>.
The result of these permutes (<literal>vec_perm</literal>) are inputs to the
<literal>vec_add</literal> and completes the add operation. </para>

<para>Fortunately the permute required for the double (64-bit) case (IE
_mm_hadd_pd) reduces to the equivalent of <literal>vec_mergeh</literal> /
<literal>vec_mergel</literal>  doubleword
(which are variants of  VSX Permute Doubleword Immediate). So the
implementation of _mm_hadd_pd can be simplified to this:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_hadd_pd (__m128d __X, __m128d __Y)
{
return (__m128d) vec_add (vec_mergeh ((__v2df) __X, (__v2df)__Y),
vec_mergel ((__v2df) __X, (__v2df)__Y));
}]]></programlisting></para>

<para>This eliminates the load of the control vectors required by the
previous example.</para>

</section>

@ -0,0 +1,40 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_differences">
<title>Profound differences</title>
<para>We have already mentioned above a number of architectural differences
that effect porting of codes containing Intel intrinsics to POWER. The fact
that Intel supports multiple vector extensions with different vector widths
(64, 128, 256, and 512-bits) while the PowerISA only supports vectors of
128-bits is one issue. Another is the difference in how the respective ISAs
support scalars in vector registers is another.  In the text above we propose
workable alternatives for the PowerPC port. There also differences in the
handling of floating point exceptions and rounding modes that may impact the
application's performance or behavior.</para>
<xi:include href="sec_floatingpoint_exceptions.xml"/>
<xi:include href="sec_floatingpoint_rounding.xml"/>
<xi:include href="sec_performance.xml"/>

</section>

@ -0,0 +1,137 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_extra_attributes">
<title>Those extra attributes</title>
<para>You may have noticed there are some special attributes:
<literallayout>__gnu_inline__

This attribute should be used with a function that is also declared with the
inline keyword. It directs GCC to treat the function as if it were defined in
gnu90 mode even when compiling in C99 or gnu99 mode.

If the function is declared extern, then this definition of the function is used
only for inlining. In no case is the function compiled as a standalone function,
not even if you take its address explicitly. Such an address becomes an external
reference, as if you had only declared the function, and had not defined it. This
has almost the effect of a macro. The way to use this is to put a function
definition in a header file with this attribute, and put another copy of the
function, without extern, in a library file. The definition in the header file
causes most calls to the function to be inlined.

__always_inline__

Generally, functions are not inlined unless optimization is specified. For func-
tions declared inline, this attribute inlines the function independent of any
restrictions that otherwise apply to inlining. Failure to inline such a function
is diagnosed as an error.

__artificial__

This attribute is useful for small inline wrappers that if possible should appear
during debugging as a unit. Depending on the debug info format it either means
marking the function as artificial or using the caller location for all instructions
within the inlined body.

__extension__

... -pedantic and other options cause warnings for many GNU C extensions.
You can prevent such warnings within one expression by writing __extension__</literallayout></para>

<para>So far I have been using these attributes unchanged.</para>

<para>But most intrinsics map the Intel intrinsic to one or more target
specific GCC builtins. For example:
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
return *(__m128d *)__P;
}

/* Load two DPFP values from P. The address need not be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
return __builtin_ia32_loadupd (__P);
}]]></programlisting></para>

<para>The first intrinsic (_mm_load_pd ) is implement as a C vector pointer
reference, but from the comment assumes the compiler will use a
<emphasis role="bold">movapd</emphasis>
instruction that requires 16-byte alignment (will raise a general-protection
exception if not aligned). This  implies that there is a performance advantage
for at least some Intel processors to keep the vector aligned. The second
intrinsic uses the explicit GCC builtin
<emphasis role="bold"><literal>__builtin_ia32_loadupd</literal></emphasis> to generate the
<emphasis role="bold"><literal>movupd</literal></emphasis> instruction which handles unaligned references.</para>

<para>The opposite assumption applies to POWER and PPC64LE, where GCC
generates the VSX <emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
<emphasis role="bold"><literal>xxswapd</literal></emphasis>
instruction sequence by default, which
allows unaligned references. The PowerISA equivalent for aligned vector access
is the VMX <emphasis role="bold"><literal>lvx</literal></emphasis> instruction and the
<emphasis role="bold"><literal>vec_ld</literal></emphasis> builtin, which forces quadword
aligned access (by ignoring the low order 4 bits of the effective address). The
<emphasis role="bold"><literal>lvx</literal></emphasis> instruction does not raise
alignment exceptions, but perhaps should as part
of our implementation of the Intel intrinsic. This requires that we use
PowerISA VMX/VSX built-ins to insure we get the expected results.</para>

<para>The current prototype defines the following:
<programlisting><![CDATA[/* Load two DPFP values from P. The address must be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_pd (double const *__P)
{
assert(((unsigned long)__P & 0xfUL) == 0UL);
return ((__m128d)vec_ld(0, (__v16qu*)__P));
}

/* Load two DPFP values from P. The address need not be 16-byte aligned. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadu_pd (double const *__P)
{
return (vec_vsx_ld(0, __P));
}]]></programlisting></para>

<para>The aligned  load intrinsic adds an assert which checks alignment
(to match the Intel semantic) and uses  the GCC builtin
<emphasis role="bold"><literal>vec_ld</literal></emphasis> (generates an
<emphasis role="bold"><literal>lvx</literal></emphasis>).  The assert
generates extra code but this can be eliminated by defining
<emphasis role="bold"><literal>NDEBUG</literal></emphasis> at compile time.
The unaligned load intrinsic uses the GCC builtin
<literal>vec_vsx_ld</literal>  (for PPC64LE generates
<emphasis role="bold"><literal>lxvd2x</literal></emphasis> /
<emphasis role="bold"><literal>xxswapd</literal></emphasis> for POWER8  and will
simplify to <emphasis role="bold"><literal>lxv</literal></emphasis>
or <emphasis role="bold"><literal>lxvx</literal></emphasis>
for POWER9).  And similarly for <emphasis role="bold"><literal>__mm_store_pd</literal></emphasis> /
<emphasis role="bold"><literal>__mm_storeu_pd</literal></emphasis>, using
<emphasis role="bold"><literal>vec_st</literal></emphasis>
and <emphasis role="bold"><literal>vec_vsx_st</literal></emphasis>. These concepts extent to the
load/store intrinsics for vector float and vector int.</para>

</section>

@ -0,0 +1,73 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_floatingpoint_exceptions">
<title>Floating Point Exceptions</title>
<para>Nominally both ISAs support the IEEE754 specifications, but there are
some subtle differences. Both architecture define a status and control register
to record exceptions and enable / disable floating exceptions for program
interrupt or default action. Intel has a MXCSR and PowerISA has a FPSCR which
basically do the same thing but with different bit layout. </para>

<para>Intel provides <literal>_mm_setcsr</literal> / <literal>_mm_getcsr</literal>
intrinsics to allow direct
access to the MXCSR. In the early days before the OS POSIX run-times where
updated  to manage the MXCSR, this might have been useful. Today this would be
highly discouraged with a strong preference to use the POSIX APIs
(<literal>feclearexceptflag</literal>,
<literal>fegetexceptflag</literal>,
<literal>fesetexceptflag</literal>, ...) instead.</para>

<para>If we implement <literal>_mm_setcsr</literal> /
<literal>_mm_getcs</literal> at all, we should simply
redirect the implementation to use the POSIX APIs from
<literal>&lt;fenv.h&gt;</literal>. But it
might be simpler just to replace these intrinsics with macros that generate
#error.</para>

<para>The Intel MXCSR does have some none (POSIX/IEEE754) standard quirks;
Flush-To-Zero and Denormals-Are-Zeros flags. This simplifies the hardware
response to what should be a rare condition (underflows where the result can
not be represented in the exponent range and precision of the format) by simply
returning a signed 0.0 value. The intrinsic header implementation does provide
constant masks for <literal>_MM_DENORMALS_ZERO_ON</literal>
(<literal>&lt;pmmintrin.h&gt;</literal>) and
<literal>_MM_FLUSH_ZERO_ON</literal> (<literal>&lt;xmmintrin.h&gt;</literal>,
so technically it is available to users
of the Intel Intrinsics API.</para>

<para>The VMX Vector facility provides a separate Vector Status and Control
register (VSCR) with a Non-Java Mode control bit. This control combines the
flush-to-zero semantics for floating Point underflow and denormal values. But
this control only applies to VMX vector float instructions and does not apply
to VSX scalar floating Point or vector double instructions. The FPSCR does
define a Floating-Point non-IEEE mode which is optional in the architecture.
This would apply to Scalar and VSX floating-point operations if it was
implemented. This was largely intended for embedded processors and is not
implemented in the POWER processor line.</para>

<para>As the flush-to-zero is primarily a performance enhansement and is
clearly outside the IEEE754 standard, it may be best to simply ignore this
option for the intrinsic port.</para>

</section>

@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_floatingpoint_rounding">
<title>Floating-point rounding modes</title>
<para>The Intel (x86 / x86_64) and PowerISA architectures both support the
4 IEEE754 rounding modes. Again while the Intel Intrinsic API allows the
application to change rounding modes via updates to the
<literal>MXCSR</literal> it is a bad idea
and should be replaced with the POSIX APIs (<literal>fegetround</literal> and
<literal>fesetround</literal>). </para>
</section>

@ -0,0 +1,113 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_gcc_vector_extensions">
<title>GCC Vector Extensions</title>
<para>The GCC vector extensions are common syntax but implemented in a
target specific way. Using the C vector extensions require the
<literal>__gnu_inline__</literal>
attribute to avoid syntax errors in case the user specified  C standard
compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>,
etc) that would normally disallow such
extensions. </para>

<para>The GCC implementation for PowerPC64 Little Endian is (mostly)
functionally compatible with x86_64 vector extension usage. We can use the same
type definitions (at least for  vector_size (16)), operations, syntax
<literal>&lt;</literal><emphasis role="bold"><literal>{</literal></emphasis><literal>...</literal><emphasis role="bold"><literal>}</literal></emphasis><literal>&gt;</literal>
for vector initializers and constants, and array syntax
<literal>&lt;</literal><emphasis role="bold"><literal>[]</literal></emphasis><literal>&gt;</literal>
for vector element access. So simple arithmetic / logical operations
on whole vectors should work as is. </para>

<para>The caveat is that the interface data type of the Intel Intrinsic may
not match the data types of the operation, so it may be necessary to cast the
operands to the specific type for the operation. This also applies to vector
initializers and accessing vector elements. You need to use the appropriate
type to get the expected results. Of course this applies to X86_64 as well. For
example:
<programlisting><![CDATA[/* Perform the respective operation on the four SPFP values in A and B. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
return (__m128) ((__v4sf)__A + (__v4sf)__B);
}

/* Stores the lower SPFP value. */
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_store_ss (float *__P, __m128 __A)
{
*__P = ((__v4sf)__A)[0];
}]]></programlisting></para>

<para>Note the cast from the interface type (<literal>__m128</literal>} to the implementation
type (<literal>__v4sf</literal>, defined in the intrinsic header) for the vector float add (+)
operation. This is enough for the compiler to select the appropriate vector add
instruction for the float type. Then the result (which is
<literal>__v4sf</literal>) needs to be
cast back to the expected interface type (<literal>__m128</literal>). </para>

<para>Note also the use of <emphasis>array syntax</emphasis> (<literal>__A)[0]</literal>)
to extract the lowest
(left most<footnote><para>Here we are using logical left and logical right
which will not match the PowerISA register view in Little endian. Logical left
is the left most element for initializers {left, … , right}, storage order
and array  order where the left most element is [0].</para></footnote>)
element of a vector. The cast (<literal>__v4sf</literal>) insures that the compiler knows we are
extracting the left most 32-bit float. The compiler insures the code generated
matches the Intel behavior for PowerPC64 Little Endian. </para>

<para>The code generation is complicated by the fact that PowerISA vector
registers are Big Endian (element 0 is the left most word of the vector) and
X86 scalar stores are from the left most (work/dword) for the vector register.
Application code with extensive use of scalar (vs packed) intrinsic loads /
stores should be flagged for rewrite to native PPC code using exisiing scalar
types (float, double, int, long, etc.). </para>

<para>Another example is the set reverse order:
<programlisting><![CDATA[/* Create the vector [Z Y X W]. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set_ps (const float __Z, const float __Y, const float __X, const float __W)
{
return __extension__ (__m128)(__v4sf){ __W, __X, __Y, __Z };
}

/* Create the vector [W X Y Z]. */
extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_setr_ps (float __Z, float __Y, float __X, float __W)
{
return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
}]]></programlisting></para>

<para>Note the use of <emphasis>initializer syntax</emphasis> used to collect a set of scalars
into a vector. Code with constant initializer values will generate a vector
constant of the appropriate endian. However code with variables in the
initializer can get complicated as it often requires transfers between register
sets and perhaps format conversions. We can assume that the compiler will
generate the correct code, but if this class of intrinsics shows up a hot spot,
a rewrite to native PPC vector built-ins may be appropriate. For example
initializer of a variable replicated to all the vector fields might not be
recognized as a “load and splat” and making this explicit may help the
compiler generate better code.</para>

</section>

@ -0,0 +1,91 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_handling_avx">
<title>Dealing with AVX and AVX512</title>
<para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have
lots (64) of vector registers and a super scalar vector pipe-line (can execute
two or more independent 128-bit vector operations concurrently). Second the ELF
V2 ABI was designed to pass and return larger aggregates in vector
registers:</para>

<itemizedlist>
<listitem>
<para>Up to 12 qualified vector arguments can be passed in
v2v13.</para>
</listitem>
<listitem>
<para>A qualified vector argument corresponds to:
<itemizedlist>
<listitem>
<para>A vector data type</para>
</listitem>

<listitem>
<para>A member of a homogeneous aggregate of multiple like data types
passed in up to eight vector registers.</para>
</listitem>

<listitem>
<para>Homogeneous floating-point or vector aggregate return values
that consist of up to eight registers with up to eight elements will
be returned in floating-point or vector registers that correspond to
the parameter registers that would be used if the return value type
were the first input parameter to a function.</para>
</listitem>
</itemizedlist>
</para>
</listitem>
</itemizedlist>

<para>So the ABI allows for passing up to three structures each
representing 512-bit vectors and returning such (512-bit) structure all in VMX
registers. This can be extended further by spilling parameters (beyond 12 X
128-bit vectors) to the parameter save area, but we should not need that, as
most intrinsics only use 2 or 3 operands.. Vector registers not needed for
parameter passing, along with an additional 8 volatile vector registers, are
available for scratch and local variables. All can be used by the application
without requiring register spill to the save area. So most intrinsic operations
on 256- or 512-bit vectors can be held within existing PowerISA vector
registers. </para>

<para>For larger functions that might use multiple AVX 256 or 512-bit
intrinsics and, as a result, push beyond the 20 volatile vector registers, the
compiler will just allocate non-volatile vector registers by allocating a stack
frame and spilling non-volatile vector registers to the save area (as needed in
the function prologue). This frees up to 64 vectors (32 x 256-bit or 16 x
512-bit structs) for code optimization. </para>

<para>Based on the specifics of our ISA and ABI we will not not use
<literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation of
<literal>__m256</literal> and <literal>__m512</literal>
types. Instead we will typedef structs of 2 or 4 vector (<literal>__m128</literal>) fields. This
allows efficient handling of these larger data types without require new GCC
language extensions. </para>

<para>In the end we should use the same type names and definitions as the
GCC X86 intrinsic headers where possible. Where that is not possible we can
define new typedefs that provide the best mapping to the underlying PowerISA
hardware.</para>

</section>

@ -0,0 +1,72 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_handling_mmx">
<title>Dealing with MMX</title>
<para>MMX is actually the hard case. The <literal>__m64</literal>
type supports SIMD vector
int types (char, short, int, long).  The  Intel API defines  
<literal>__m64</literal> as:
<programlisting><![CDATA[typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));]]></programlisting></para>

<para>Which is problematic for the PowerPC target (not really supported in
GCC) and we would prefer to use a native PowerISA type that can be passed in a
single register.  The PowerISA Rotate Under Mask instructions can easily
extract and insert integer fields of a General Purpose Register (GPR). This
implies that MMX integer types can be handled as a internal union of arrays for
the supported element types. So an 64-bit unsigned long long is the best type
for parameter passing and return values. Especially for the 64-bit (_si64)
operations as these normally generate a single PowerISA instruction.</para>

<para>The SSE extensions include some convert operations for
<literal>_m128</literal> to /
from <literal>_m64</literal> and this includes some int to / from float conversions. However in
these cases the float operands always reside in SSE (XMM) registers (which
match the PowerISA vector registers) and the MMX registers only contain integer
values. POWER8 (PowerISA-2.07) has direct move instructions between GPRs and
VSRs. So these transfers are normally a single instruction and any conversions
can be handed in the vector unit.</para>

<para>When transferring a <literal>__m64</literal> value to a vector register we should also
execute a xxsplatd instruction to insure there is valid data in all four
element lanes before doing floating point operations. This avoids generating
extraneous floating point exceptions that might be generated by uninitialized
parts of the vector. The top two lanes will have the floating point results
that are in position for direct transfer to a GPR or stored via Store Float
Double (stfd). These operation are internal to the intrinsic implementation and
there is no requirement to keep temporary vectors in correct Little Endian
form.</para>

<para>Also for the smaller element sizes and higher element counts (MMX
<literal>_pi8</literal> and <literal>_p16</literal> types) the number of  Rotate Under Mask instructions required to
disassemble the 64-bit <literal>__m64</literal>
into elements, perform the element calculations,
and reassemble the elements in a single <literal>__m64</literal>
value can get larger. In this
case we can generate shorter instruction sequences by transfering (via direct
move instruction) the GPR <literal>__m64</literal> value to the
a vector register, performance the
SIMD operation there, then transfer the <literal>__m64</literal>
result back to a GPR.</para>
</section>

@ -0,0 +1,60 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_how_findout">
<title>How did I find this out?</title>
<para>The next question is where did I get the details above. The GCC
documentation for <emphasis role="bold"><literal>__builtin_ia32_loadupd</literal></emphasis>
provides minimal information (the
builtin name, parameters and return types). Not very informative. </para>

<para>Looking up the Intel intrinsic description is more informative. You
can Google the intrinsic name or use the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel
Intrinsic guide</link> for this. The Intrinsic Guide is interactive and
includes  Intel (Chip) technology and text based search capabilities. Clicking
on the intrinsic name opens to a synopsis including; the underlying instruction
name, text description, operation pseudo code, and in some cases performance
information (latency and throughput).</para>

<para>The key is to get a description of the intrinsic (operand fields and
types, and which fields are updated for the result) and the underlying Intel
instruction. If the Intrinsic guide is not clear you can look up the
instruction details in the
“<link xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel® 64 and IA-32
Architectures Software Developers Manual</link>”.</para>

<para>Information about the PowerISA vector facilities is found in the
<link xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b">PowerISA Version 2.07B</link> (for POWER8 and
<link xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">3.0 for
POWER9</link>) manual, Book I, Chapter 6. Vector Facility and Chapter 7.
Vector-Scalar Floating-Point Operations. Another good reference is the
<link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER ELF V2 application binary interface</link> (ABI)
document, Chapter 6. Vector Programming Interfaces and Appendix A. Predefined
Functions for Vector Programming.</para>

<para>Another useful document is the original <link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">Altivec Technology Programers Interface Manual</link>
with a  user friendly structure and many helpful diagrams. But alas the PIM does does not
cover the resent PowerISA (power7,  power8, and power9) enhancements.</para>

</section>

@ -0,0 +1,122 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_intel_intrinsic_functions">
<title>Intel Intrinsic functions</title>
<para>So what is an intrinsic function? From Wikipedia:

<blockquote><para>In <link xlink:href="https://en.wikipedia.org/wiki/Compiler_theory">compiler theory</link>, an
<emphasis role="bold">intrinsic function</emphasis> is a function available for use in a given
<link xlink:href="https://en.wikipedia.org/wiki/Programming_language">programming
language</link> whose implementation is handled specially by the compiler.
Typically, it substitutes a sequence of automatically generated instructions
for the original function call, similar to an
<link xlink:href="https://en.wikipedia.org/wiki/Inline_function">inline function</link>.
Unlike an inline function though, the compiler has an intimate knowledge of the
intrinsic function and can therefore better integrate it and optimize it for
the situation. This is also called builtin function in many languages.</para></blockquote></para>

<para>The “Intel Intrinsics” API provides access to the many
instruction set extensions (Intel Technologies) that Intel has added (and
continues to add) over the years. The intrinsics provided access to new
instruction capabilities before the compilers could exploit them directly.
Initially these intrinsic functions where defined for the Intel and Microsoft
compiler and where eventually implemented and contributed to GCC.</para>

<para>The Intel Intrinsics have a specific type and naming structure. In
this naming structure, functions starts with a common prefix (MMX and SSE use
'_mm' prefix, while AVX added the '_mm256' '_mm512' prefixes), then a short
functional name ('set', 'load', 'store', 'add', 'mul', 'blend', 'shuffle', '…') and a suffix
('_pd', '_sd', '_pi32'...) with type and packing information. See
<xref linkend="app_intel_suffixes"/> for the list of common intrisic suffixes.</para>

<para>Oddly many of the MMX/SSE operations are not vectors at all. There
are a lot of scalar operations on a single float, double, or long long type. In
effect these are scalars that can take advantage of the larger (xmm) register
space. Also in the Intel 32-bit architecture they provided IEEE754 float and
double types, and 64-bit integers that did not exist or where hard to implement
in the base i386/387 instruction set. These scalar operation use a suffix
starting with '_s' (<literal>_sd</literal> for scalar double float,
<literal>_ss</literal> scalar float, and <literal>_si64</literal>
for scalar long long).</para>

<para>True vector operations use the packed or extended packed suffixes,
starting with '_p' or '_ep' (<literal>_pd</literal> for vector double,
<literal>_ps</literal> for vector float, and
<literal>_epi32</literal> for vector int). The use of '_ep'  
seems to be reserved to disambiguate
intrinsics that existed in the (64-bit vector) MMX extension from the extended
(128-bit vector) SSE equivalent. For example
<emphasis role="bold"><literal>_mm_add_pi32</literal></emphasis> is a MMX operation on
a pair of 32-bit integers, while
<emphasis role="bold"><literal>_mm_add_epi32</literal></emphasis> is an SSE2 operation on vector
of 4 32-bit integers.</para>

<para>The GCC  builtins for the
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386.target</link>,
(includes x86 and x86_64) are not
the same as the Intel Intrinsics. While they have similar intent and cover most
of the same functions, they use a different naming (prefixed with
<literal>__builtin_ia32_</literal>, then function name with type suffix) and uses GCC vector type
modes for operand types. For example:
<programlisting><![CDATA[v8qi __builtin_ia32_paddb (v8qi, v8qi)
v4hi __builtin_ia32_paddw (v4hi, v4hi)
v2si __builtin_ia32_paddd (v2si, v2si)
v2di __builtin_ia32_paddq (v2di, v2di)]]></programlisting></para>

<para>Note: A key difference between GCC builtins for i386 and Powerpc is
that the x86 builtins have different names of each operation and type while the
powerpc altivec builtins tend to have a single generatic builtin for  each
operation, across a set of compatible operand types. </para>

<para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented
as a set of inline functions using the Intel Intrinsic API names and types.
These functions are implemented as either GCC C vector extension code or via
one or more GCC builtins for the i386 target. So lets take a look at some
examples from GCC's SSE2 intrinsic header emmintrin.h:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
return (__m128d) ((__v2df)__A + (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para>

<para>Note that the  
<emphasis role="bold"><literal>_mm_add_pd</literal></emphasis> is implemented direct as C vector
extension code., while
<emphasis role="bold"><literal>_mm_add_sd</literal></emphasis> is implemented via the GCC builtin
<emphasis role="bold"><literal>__builtin_ia32_addsd</literal></emphasis>. From the
discussion above we know the <literal>_pd</literal> suffix
indicates a packed vector double while the <literal>_sd</literal> suffix indicates a scalar double
in a XMM register. </para>
<xi:include href="sec_packed_vs_scalar_intrinsics.xml"/>
<xi:include href="sec_vec_or_not.xml"/>
<xi:include href="sec_crossing_lanes.xml"/>

</section>

@ -0,0 +1,82 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_intel_intrinsic_includes">
<title>The structure of the intrinsic includes</title>
<para>The GCC x86 intrinsic functions for vector were initially grouped by
technology (MMX and SSE), which starts with MMX continues with SSE through
SSE4.1 stacked like a set of Russian dolls.</para>

<para>Basically each higher layer include, needs typedefs and helper macros
defined by the lower level intrinsic includes. mm_malloc.h simply provides
wrappers for posix_memalign and free. Then it gets a little weird, starting
with the crypto extensions:
<programlisting><![CDATA[wmmintrin.h (AES) includes emmintrin.h]]></programlisting></para>
<para>For AVX, AVX2, and AVX512 they must have decided
that the Russian Dolls thing was getting out of hand. AVX et all is split
across 14 files
<programlisting><![CDATA[#include <avxintrin.h>
#include <avx2intrin.h>
#include <avx512fintrin.h>
#include <avx512erintrin.h>
#include <avx512pfintrin.h>
#include <avx512cdintrin.h>
#include <avx512vlintrin.h>
#include <avx512bwintrin.h>
#include <avx512dqintrin.h>
#include <avx512vlbwintrin.h>
#include <avx512vldqintrin.h>
#include <avx512ifmaintrin.h>
#include <avx512ifmavlintrin.h>
#include <avx512vbmiintrin.h>
#include <avx512vbmivlintrin.h>]]></programlisting>
but they do not want the applications include these
individually.</para>
<para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, include all the
AVX, AES, SSE and MMX flavors.
<programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED
# error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
#endif]]></programlisting></para>

<para>So what is the net? The include structure provides some strong clues
about the order that we should approach this effort.  For example if you need
to intrinsic from SSE4 (smmintrin.h) we are likely to need to type definitions
from SSE (emmintrin.h). So a bottoms up (MMX, SSE, SSE2, …) approach seems
like the best plan of attack. Also saving the AVX parts for latter make sense,
as most are just wider forms of operations that already exists in SSE.</para>

<para>We should use the same include structure to implement our PowerISA
equivalent API headers. This will make porting easier (drop-in replacement) and
should get the application running quickly on POWER. Then we are in a position
to profile and analyze the resulting application. This will show any hot spots
where the simple one-to-one transformation results in bottlenecks and
additional tuning is needed. For these cases we should improve our tools (SDK
MA/SCA) to identify opportunities for, and perhaps propose, alternative
sequences that are better tuned to PowerISA and our micro-architecture.</para>

</section>

@ -0,0 +1,89 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_intel_intrinsic_types">
<title>The types used for intrinsics</title>
<para>The type system for Intel intrinsics is a little strange. For example
from xmmintrin.h:
<programlisting><![CDATA[/* The Intel API is flexible enough that we must allow aliasing with other
vector types, and their scalar components. */
typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));

/* Internal data types for implementing the intrinsics. */
typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting></para>

<para>So there is one set of types that are used in the function prototypes
of the API, and the internal types that are used in the implementation. Notice
the special attribute <literal>__may_alias__</literal>. From the GCC documentation:

<blockquote><para>
Accesses through pointers to types with this attribute are not subject
to type-based alias analysis, but are instead assumed to be able to alias any
other type of objects. ... This extension exists to support some vector APIs,
in which pointers to one vector type are permitted to alias pointers to a
different vector type.</para></blockquote></para>
<para>So there are a
couple of issues here: 1)  the API seem to force the compiler to assume
aliasing of any parameter passed by reference. Normally the compiler assumes
that parameters of different size do not overlap in storage, which allows more
optimization. 2) the data type used at the interface may not be the correct
type for the implied operation. So parameters of type
<literal>__m128i</literal> (which is defined
as vector long long) is also used for parameters and return values of vector
[char | short | int ]. </para>

<para>This may not matter when using x86 built-in's but does matter when
the implementation uses C vector extensions or in our case use PowerPC generic
vector built-ins
(<xref linkend="sec_powerisa_vector_intrinsics"/>).
For the later cases the type must be correct for
the compiler to generate the correct type (char, short, int, long)
(<xref linkend="sec_api_implemented"/>) for the generic
builtin operation. There is also concern that excessive use of
<literal>__may_alias__</literal>
will limit compiler optimization. We are not sure how important this attribute
is to the correct operation of the API.  So at a later stage we should
experiment with removing it from our implementation for PowerPC</para>

<para>The good news is that PowerISA has good support for 128-bit vectors
and (with the addition of VSX) all the required vector data (char, short, int,
long, float, double) types. However Intel supports a wider variety of the
vector sizes  than PowerISA does. This started with the 64-bit MMX vector
support that preceded SSE and extends to 256-bit and 512-bit vectors of AVX,
AVX2, and AVX512 that followed SSE.</para>

<para>Within the GCC Intel intrinsic implementation these are all
implemented as vector attribute extensions of the appropriate  size (  
<literal>__vector_size__</literal> ({8 | 16 | 32, and 64}). For the PowerPC target  GCC currently
only supports the native <literal>__vector_size__</literal> ( 16 ). These we can support directly
in VMX/VSX registers and associated instructions. The GCC will compile with
other   <literal>__vector_size__</literal> values, but the resulting types are treated as simple
arrays of the element type. This does not allow the compiler to use the vector
registers and vector instructions for these (nonnative) vectors.   So what is
a programmer to do?</para>
<xi:include href="sec_handling_mmx.xml"/>
<xi:include href="sec_handling_avx.xml"/>

</section>

@ -0,0 +1,76 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_more_examples">
<title>Some more intrinsic examples</title>
<para>The intrinsic
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvtpd_ps&amp;expand=1624">_mm_cvtpd_ps</link>
converts a packed vector double into
a packed vector single float. Since only 2 doubles fit into a 128-bit vector
only 2 floats are returned and occupy only half (64-bits) of the XMM register.
For this intrinsic the 64-bit are packed into the logical left half of the
registers and the logical right half of the register is set to zero (as per the
Intel <literal>cvtpd2ps</literal> instruction).</para>

<para>The PowerISA provides the VSX Vector round and Convert
Double-Precision to Single-Precision format (xvcvdpsp) instruction. In the ABI
this is <literal>vec_floato</literal> (vector double) .  
This instruction convert each double
element then transfers converted element 0 to float element 1, and converted
element 1 to float element 3. Float elements 0 and 2 are undefined (the
hardware can do what ever). This does not match the expected results for
<literal>_mm_cvtpd_ps</literal>.
<programlisting><![CDATA[vec_floato ({1.0, 2.0}) result = {<undefined>, 1.0, <undefined>, 2.0}
_mm_cvtpd_ps ({1.0, 2.0}) result = {1.0, 2.0, 0.0, 0.0}]]></programlisting></para>

<para>So we need to re-position the results to word elements 0 and 2, which
allows a pack operation to deliver the correct format. Here the merge odd
splats element 1 to 0 and element 3 to 2. The Pack operation combines the low
half of each doubleword from the vector result and vector of zeros to generate
the require format.
<programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cvtpd_ps (__m128d __A)
{
__v4sf result;
__v4si temp;
const __v4si vzero = {0,0,0,0};
// temp = (__v4si)vec_floato (__A); /* GCC 8 */

__asm__(
"xvcvdpsp %x0,%x1;\n"
: "=wa" (temp)
: "wa" (__A)
: );

temp = vec_mergeo (temp, temp);
result = (__v4sf)vec_vpkudum ((vector long)temp, (vector long)vzero);
return (result);
}]]></programlisting></para>

<para>This  technique is also used to implement  
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvttpd_epi32&amp;expand=1624,1859">_mm_cvttpd_epi32</link>
which converts a packed vector double in to a packed vector int. The PowerISA instruction
<literal>xvcvdpsxws</literal> uses a similar layout for the result as
<literal>xvcvdpsp</literal> and requires the same fix up.</para>

</section>

@ -0,0 +1,68 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_other_intrinsic_examples">
<title>Examples implemented using other intrinsics</title>
<para>Some intrinsic implementations are defined in terms of other
intrinsics. For example.
<programlisting><![CDATA[/* Create a vector with element [0] as F and the rest zero. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set_sd (double __F)
{
return __extension__ (__m128d){ __F, 0.0 };
}

/* Create a vector with element [0] as *P and the rest zero. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_load_sd (double const *__P)
{
return _mm_set_sd (*__P);
}]]></programlisting></para>

<para>This notion of using part (one fourth or half) of the SSE XMM
register and leaving the rest unchanged (or forced to zero) is specific to SSE
scalar operations and can generate some complicated (sub-optimal) PowerISA
code.  In this case <emphasis role="bold"><literal>_mm_load_sd</literal></emphasis>
passes the dereferenced double value  to
<emphasis role="bold"><literal>_mm_set_sd</literal></emphasis> which
uses C vector initializer notation to combine (merge) that
double scalar value with a scalar 0.0 constant into a vector double.</para>

<para>While code like this should work as-is for PPC64LE, you should look
at the generated code and assess if it is reasonable.  In this case the code
is not awful (a load double splat, vector xor to generate 0.0s, then a
<literal>xxmrghd</literal>
to combine __F and 0.0).  Other examples may generate sub-optimal code and
justify a rewrite to PowerISA scalar or vector code (<link
xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-
in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">GCC PowerPC
AltiVec Built-in Functions</link> or inline assembler). </para>

<para>Net: try using the existing C code if you can, but check on what the
compiler generates.  If the generated code is horrendous, it may be worth the
effort to write a PowerISA specific equivalent. For codes making extensive use
of MMX or SSE scalar intrinsics you will be better off rewriting to use
standard C scalar types and letting the the GCC compiler handle the details
(see <link linkend="sec_prefered_methods"/>).</para>

</section>

@ -0,0 +1,302 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_packed_vs_scalar_intrinsics">
<title>Packed vs scalar intrinsics</title>
<para>So what is actually going on here? The vector code is clear enough if
you know that '+' operator is applied to each vector element. The the intent of
the builtin is a little less clear, as the GCC documentation for
<literal>__builtin_ia32_addsd</literal> is not very
helpful (nonexistent). So perhaps the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&amp;expand=97">Intel Intrinsic Guide</link>
will be more enlightening. To paraphrase:
<blockquote>

<para>From the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&amp;expand=97"><literal>_mm_add_dp</literal> description</link> ;
for each double float
element ([0] and [1] or bits [63:0] and [128:64]) for operands a and b are
added and resulting vector is returned. </para>

<para>From the
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_sd&amp;expand=97,130"><literal>_mm_add_sd</literal> description</link> ;
Add element 0 of first operand
(a[0]) to element 0 of the second operand (b[0]) and return the packed vector
double {(a[0] + b[0]), a[1]}. Or said differently the sum of the logical left
most half of the the operands are returned in the logical left most half
(element [0]) of the  result, along with the logical right half (element [1])
of the first operand (unchanged) in the logical right half of the result.</para></blockquote></para>

<para>So the packed double is easy enough but the scalar double details are
more complicated. One source of complication is that while both Instruction Set
Architectures (SSE vs VSX) support scalar floating point operations in vector
registers the semantics are different. </para>

<itemizedlist>
<listitem>
<para>The vector bit and field numbering is different (reversed).
<itemizedlist>
<listitem>
<para>For Intel the scalar is always placed in the low order (right most)
bits of the XMM register (and the low order address for load and store).</para>
</listitem>
<listitem>
<para>For PowerISA and VSX, scalar floating point operations and Floating
Point Registers (FPRs) are on the low numbered bits which is the left hand
side of the vector / scalar register (VSR). </para>
</listitem>

<listitem>
<para>For the PowerPC64 ELF V2 little endian ABI we also make point of
making the GCC vector extensions and vector built ins, appear to be little
endian. So vector element 0 corresponds to the low order address and low
order (right hand) bits of the vector register (VSR).</para>
</listitem>
</itemizedlist></para>
</listitem>
<listitem>
<para>The handling of the non-scalar part of the register for scalar
operations are different.
<itemizedlist>
<listitem>
<para>For Intel ISA the scalar operations either leaves the high order part
of the XMM vector unchanged or in some cases force it to 0.0.</para>
</listitem>
<listitem>
<para>For PowerISA scalar operations on the combined FPR/VSR register leaves
the remainder (right half of the VSR) <emphasis role="bold">undefined</emphasis>.</para>
</listitem>
</itemizedlist></para>
</listitem>
</itemizedlist>

<para>To minimize confusion and use consistent nomenclature, I will try to
use the terms logical left and logical right elements based on the order they
apprear in a C vector initializers and element index order. So in the vector
<literal>(__v2df){1.0, 20.}</literal>, The value 1.0 is the in the logical left element [0] and
the value 2.0 is logical right element [1].</para>

<para>So lets look at how to implement these intrinsics for the PowerISA.
For example in this case we can use the GCC vector extension, like so:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
return (__m128d) ((__v2df)__A + (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_sd (__m128d __A, __m128d __B)
{
__A[0] = __A[0] + __B[0];
return (__A);
}]]></programlisting></para>

<para>The packed double implementation operates on the vector as a whole.
The scalar double implementation operates on and updates only [0] element of
the vector and leaves the <literal>__A[1]</literal> element unchanged.  
Form this source the GCC
compiler generates the following code for PPC64LE target.:</para>

<para>The packed vector double generated the corresponding VSX vector
double add (xvadddp). But the scalar implementation is bit more complicated.
<programlisting><![CDATA[0000000000000720 <test_add_pd>:
720: 07 1b 42 f0 xvadddp vs34,vs34,vs35
...

0000000000000740 <test_add_sd>:
740: 56 13 02 f0 xxspltd vs0,vs34,1
744: 57 1b 63 f0 xxspltd vs35,vs35,1
748: 03 19 60 f0 xsadddp vs35,vs0,vs35
74c: 57 18 42 f0 xxmrghd vs34,vs34,vs35
...
]]></programlisting></para>

<para>First the PPC64LE vector format, element [0] is not in the correct
position for  the scalar operations. So the compiler generates vector splat
double (<literal>xxspltd</literal>) instructions to copy elements <literal>__A[0]</literal> and
<literal>__B[0]</literal> into position
for the VSX scalar add double (xsadddp) that follows. However the VSX scalar
operation leaves the other half of the VSR undefined (which does not match the
expected Intel semantics). So the compiler must generates a vector merge high
double (<literal>xxmrghd</literal>) instruction to combine the original
<literal>__A[1]</literal> element (from <literal>vs34</literal>)
with the scalar add result from <literal>vs35</literal>
element [1]. This merge swings the scalar
result from <literal>vs35[1]</literal> element into the
<literal>vs34[0]</literal> position, while preserving the
original <literal>vs34[1]</literal> (from <literal>__A[1]</literal>)
element (copied to itself).<footnote><para>Fun
fact: The vector registers in PowerISA are decidedly Big Endian. But we decided
to make the PPC64LE ABI behave like a Little Endian system to make application
porting easier. This require the compiler to manipulate the PowerISA vector
instrinsic behind the the scenes to get the correct Little Endian results. For
example the element selector [0|1] for <literal>vec_splat</literal> and the
generation of <literal>vec_mergeh</literal> vs <literal>vec_mergel</literal>
are reversed for the Little Endian.</para></footnote></para>

<para>This technique applies to packed and scalar intrinsics for the the
usual arithmetic operators (add, subtract, multiply, divide). Using GCC vector
extensions in these intrinsic implementations provides the compiler more
opportunity to optimize the whole function. </para>

<para>Now we can look at a slightly more interesting (complicated) case.
Square root (<literal>sqrt</literal>) is not a arithmetic operator in C and is usually handled
with a library call or a compiler builtin. We really want to avoid a library
calls and want to avoid any unexpected side effects. As you see below the
implementation of
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_pd&amp;expand=4926"><literal>_mm_sqrt_pd</literal></link> and
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_sd&amp;expand=4926,4956"><literal>_mm_sqrt_sd</literal></link>
intrinsics are based on GCC x86 built ins.
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
return (__m128d)__builtin_ia32_sqrtpd ((__v2df)__A);
}

/* Return pair {sqrt (B[0]), A[1]}. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
__v2df __tmp = __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
return (__m128d)__builtin_ia32_sqrtsd ((__v2df)__tmp);
}]]></programlisting></para>

<para>For the packed vector sqrt, the PowerISA VSX has an equivalent vector
double square root instruction and GCC provides the <literal>vec_sqrt</literal> builtin. But the
scalar implementation involves an additional parameter and an extra move.
 This seems intended to mimick the propagation of the <literal>__A[1]</literal> input to the
logical right half of the XMM result that we saw with <literal>_mm_add_sd above</literal>.</para>

<para>The instinct is to extract the low scalar (<literal>__B[0]</literal>)
from operand <literal>__B</literal>
and pass this to  the GCC <literal>__builtin_sqrt ()</literal> before recombining that scalar
result with <literal>__A[1]</literal> for the vector result. Unfortunately C language standards
force the compiler to call the libm sqrt function unless <literal>-ffast-math</literal> is
specified. The <literal>-ffast-math</literal> option is not commonly used and we want to avoid the
external library dependency for what should be only a few inline instructions.
So this is not a good option.</para>

<para>Thinking outside the box; we do have an inline intrinsic for a
(packed) vector double sqrt, that we just implemented. However we need to
insure the other half of <literal>__B</literal> (<literal>__B[1]</literal>)
does not cause an harmful side effects
(like raising exceptions for NAN or  negative values). The simplest solution
is to splat <literal>__B[0]</literal> to both halves of a temporary value before taking the
<literal>vec_sqrt</literal>. Then this result can be combined with <literal>__A[1]</literal> to return the final
result. For example:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_pd (__m128d __A)
{
return (vec_sqrt (__A));
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_sqrt_sd (__m128d __A, __m128d __B)
{
__m128d c;
c = _mm_sqrt_pd(_mm_set1_pd (__B[0]));
return (_mm_setr_pd (c[0], __A[1]));
}]]></programlisting></para>
<para>In this  example we use
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_set1_pd&amp;expand=4926,4956,4926,4956,4652"><literal>_mm_set1_pd</literal></link>
to splat the scalar <literal>__B[0]</literal>, before passing that vector to our
<literal>_mm_sqrt_pd</literal> implementation,
then pass the sqrt result (<literal>c[0]</literal>)  with <literal>__A[1]</literal> to  
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_setr_pd&amp;expand=4679"><literal>_mm_setr_pd</literal></link>
to combine the final result. You could also use the <literal>{c[0], __A[1]}</literal>
initializer instead of <literal>_mm_setr_pd</literal>.</para>

<para>Now we can look at vector and scalar compares that add there own
complication: For example, the Intel Intrinsic Guide for
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_pd&amp;expand=779,788,779"><literal>_mm_cmpeq_pd</literal></link>
describes comparing double elements [0|1] and returning
either 0s for not equal and 1s (<literal>0xFFFFFFFFFFFFFFFF</literal>
or long long -1) for equal. The comparison result is intended as a select mask
(predicates) for selecting or ignoring specific elements in later operations.
The scalar version
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_sd&amp;expand=779,788"><literal>_mm_cmpeq_sd</literal></link>
is similar except for the quirk
of only comparing element [0] and combining the result with <literal>__A[1]</literal> to return
the final vector result.</para>

<para>The packed vector implementation for PowerISA is simple as VSX
provides the equivalent instruction and GCC provides the
<literal>vec_cmpeq</literal> builtin
supporting the vector double type. The technique of using scalar comparison
operators on the <literal>__A[0]</literal> and <literal>__B[0]</literal>
does not work as the C comparison operators
return 0 or 1 results while we need the vector select mask (effectively 0 or
-1). Also we need to watch for sequences that mix scalar floats and integers,
generating if/then/else logic or requiring expensive transfers across register
banks.</para>

<para>In this case we are better off using explicit vector built-ins for
<literal>_mm_add_sd</literal> as and example. We can use <literal>vec_splat</literal>
from element [0] to temporaries
where we can safely use <literal>vec_cmpeq</literal> to generate the expect selector mask. Note
that the <literal>vec_cmpeq</literal> returns a bool long type so we need the cast the result back
to <literal>__v2df</literal>. Then use the
<literal>(__m128d){c[0], __A[1]}</literal> initializer to combine the
comparison result with the original <literal>__A[1]</literal> input and cast to the require
interface type.  So we have this example:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_pd (__m128d __A, __m128d __B)
{
return ((__m128d)vec_cmpeq (__A, __B));
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_sd(__m128d __A, __m128d __B)
{
__v2df a, b, c;
/* PowerISA VSX does not allow partial (for just left double)
* results. So to insure we don't generate spurious exceptions
* (from the right double values) we splat the left double
* before we to the operation. */
a = vec_splat(__A, 0);
b = vec_splat(__B, 0);
c = (__v2df)vec_cmpeq(a, b);
/* Then we merge the left double result with the original right
* double from __A. */
return ((__m128d){c[0], __A[1]});
}]]></programlisting></para>

<para>Now lets look at a similar example that adds some surprising
complexity. This is the compare not equal case so we should be able to find the
equivalent vec_cmpne builtin:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_pd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpneqpd ((__v2df)__A, (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_sd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpneqsd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para>

</section>

@ -0,0 +1,49 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_performance">
<title>Performance</title>
<para>The performance of a ported intrinsic depends on the specifics of the
intrinsic and the context it is used in. Many of the SIMD operations have
equivalent instructions in both architectures. For example the vector float and
vector double match very closely. However the SSE and VSX scalars have subtle
differences of how the scalar is positioned with the vector registers and what
happens to the rest (non-scalar part) of the register (previously discussed in
<xref linkend="sec_packed_vs_scalar_intrinsics"/>).
This requires additional PowerISA instructions
to preserve the non-scalar portion of the vector registers. This may or may not
be important to the logic of the program being ported, but we have handle the
case where it is.</para>

<para>This is where the context of now the intrinsic is used starts to
matter. If the scalar intrinsics are used within a larger program the compiler
may be able to eliminate the redundant register moves as the results are never
used. In the other cases common set up (like permute vectors or bit masks) can
be common-ed up and hoisted out of the loop. So it is very important to let the
compiler do its job with higher optimization levels (<literal>-O3</literal>,
<literal>-funroll-loops</literal>).</para>
<xi:include href="sec_performance_sse.xml"/>
<xi:include href="sec_performance_mmx.xml"/>

</section>

@ -0,0 +1,41 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_performance_mmx">
<title>Using MMX intrinsics</title>
<para>MMX was the first and oldest SIMD extension and initially filled a
need for wider (64-bit) integer and additional register. This is back when
processors were 32-bit and 8 x 32-bit registers was starting to cramp our
programming style. Now 64-bit processors, larger register sets, and 128-bit (or
larger) vector SIMD extensions are common. There is simply no good reasons
write new code using the (now) very limited MMX capabilities. </para>

<para>We recommend that existing MMX codes be rewritten to use the newer
SSE  and VMX/VSX intrinsics or using the more portable GCC  builtin vector
support or in the case of si64 operations use C scalar code. The MMX si64
scalars which are just (64-bit) operations on long long int types and any
modern C compiler can handle this type. The char short in SIMD operations
should all be promoted to 128-bit SIMD operations on GCC builtin vectors. Both
will improve cross platform portability and performance.</para>
</section>

@ -0,0 +1,44 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_performance_sse">
<title>Using SSE float and double scalars</title>
<para>SSE scalar float / double intrinsics  “hand” optimization is no
longer necessary. This was important, when SSE was initially introduced, and
compiler support was limited or nonexistent.  Also SSE scalar float / double
provided additional (16) registers and IEEE754 compliance, not available from
the 8087 floating point architecture that preceded it. So application
developers where motivated to use SSE instruction versus what the compiler was
generating at the time.</para>

<para>Modern compilers can now to generate and  optimize these (SSE
scalar) instructions for Intel from C standard scalar code. Of course PowerISA
supported IEEE754 float and double and had 32 dedicated floating point
registers from the start (and now 64 with VSX). So replacing a Intel specific
scalar intrinsic implementation with the equivalent C language scalar
implementation is usually a win; allows the compiler to apply the latest
optimization and tuning for the latest generation processor, and is portable to
other platforms where the compiler can also apply the latest optimization and
tuning for that processors latest generation.</para>
</section>

@ -0,0 +1,147 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_power_vector_permute_format">
<title>Vector permute and formatting instructions</title>

<para>The vector Permute and formatting chapter follows and is an important
one to study. These operation operation on the byte, halfword, word (and with
2.07 doubleword) integer types . Plus special Pixel type. The shifts
instructions in this chapter operate on the vector as a whole at either the bit
or the byte (octet) level, This is an important chapter to study for moving
PowerISA vector results into the vector elements that Intel Intrinsics
expect:
<literallayout><literal>6.8 Vector Permute and Formatting Instructions . . . . . . . . . . . 249
6.8.1 Vector Pack and Unpack Instructions . . . . . . . . . . . . . 249
6.8.2 Vector Merge Instructions . . . . . . . . . . . . . . . . . . 256
6.8.3 Vector Splat Instructions . . . . . . . . . . . . . . . . . . 259
6.8.4 Vector Permute Instruction . . . . . . . . . . . . . . . . . . 260
6.8.5 Vector Select Instruction . . . . . . . . . . . . . . . . . . 261
6.8.6 Vector Shift Instructions . . . . . . . . . . . . . . . . . . 262</literal></literallayout></para>

<para>The Vector Integer instructions include the add / subtract / Multiply
/ Multiply Add/Sum / (no divide) operations for the standard integer types.
There are instruction forms that  provide signed, unsigned, modulo, and
saturate results for most operations. The PowerISA 2.07 extension add /
subtract of 128-bit integers with carry and extend to 256, 512-bit and beyond ,
is included here. There are signed / unsigned compares across the standard
integer types (byte, .. doubleword). The usual and bit-wise logical operations.
And the SIMD shift / rotate instructions that operate on the vector elements
for various types.

<literallayout><literal>6.9 Vector Integer Instructions . . . . . . . . . . . . . . . . . . 264
6.9.1 Vector Integer Arithmetic Instructions . . . . . . . . . . . . 264
6.9.2 Vector Integer Compare Instructions. . . . . . . . . . . . . . 294
6.9.3 Vector Logical Instructions . . . . . . . . . . . . . . . . . 300
6.9.4 Vector Integer Rotate and Shift Instructions . . . . . . . . . 302</literal></literallayout></para>

<para>The vector [single] float instructions are grouped into this chapter.
This chapter does not include the double float instructions which are described
in the VSX chapter. VSX also include additional float instructions that operate
on the whole 64 register vector-scalar set.

<literallayout><literal>6.10 Vector Floating-Point Instruction Set . . . . . . . . . . . . . 306
6.10.1 Vector Floating-Point Arithmetic Instructions . . . . . . . . 306
6.10.2 Vector Floating-Point Maximum and Minimum Instructions . . . 308
6.10.3 Vector Floating-Point Rounding and Conversion Instructions. . 309
6.10.4 Vector Floating-Point Compare Instructions . . . . . . . . . 313
6.10.5 Vector Floating-Point Estimate Instructions . . . . . . . . . 316</literal></literallayout></para>

<para>The vector XOR based instructions are new with PowerISA 2.07 (POWER8)
and provide vector  crypto and check-sum operations:

<literallayout><literal>6.11 Vector Exclusive-OR-based Instructions . . . . . . . . . . . . 318
6.11.1 Vector AES Instructions . . . . . . . . . . . . . . . . . . . 318
6.11.2 Vector SHA-256 and SHA-512 Sigma Instructions . . . . . . . . 320
6.11.3 Vector Binary Polynomial Multiplication Instructions. . . . . 321
6.11.4 Vector Permute and Exclusive-OR Instruction . . . . . . . . . 323</literal></literallayout></para>

<para>The vector gather and bit permute support bit level rearrangement of
bits with in the vector. While the vector versions of the count leading zeros
and population count are useful to accelerate specific algorithms.

<literallayout><literal>6.12 Vector Gather Instruction . . . . . . . . . . . . . . . . . . . 324
6.13 Vector Count Leading Zeros Instructions . . . . . . . . . . . . 325
6.14 Vector Population Count Instructions. . . . . . . . . . . . . . 326
6.15 Vector Bit Permute Instruction . . . . . . . . . . . . . . . . 327</literal></literallayout></para>

<para>The Decimal Integer add / subtract instructions complement the
Decimal Floating-Point instructions. They can also be used to accelerated some
binary to/from decimal conversions. The VSCR instruction provides access the
the Non-Java mode floating-point control and the saturation status. These
instruction are not normally of interest in porting Intel intrinsics.

<literallayout><literal>6.16 Decimal Integer Arithmetic Instructions . . . . . . . . . . . . 328
6.17 Vector Status and Control Register Instructions . . . . . . . . 331</literal></literallayout></para>

<para>With PowerISA 2.07B (Power8) several major extension where added to
the Vector Facility:</para>

<itemizedlist>
<listitem>
<para>Vector Crypto: Under “Vector Exclusive-OR-based Instructions
Vector Exclusive-OR-based Instructions”, AES [inverse] Cipher, SHA 256 / 512
Sigma, Polynomial Multiplication, and Permute and XOR instructions.</para>
</listitem>
<listitem>
<para>64-bit Integer; signed and unsigned add / subtract, signed and
unsigned compare, Even / Odd 32 x 32 multiple with 64-bit product, signed /
unsigned max / min, rotate and shift left/right.</para>
</listitem>
<listitem>
<para>Direct Move between GRPs and the FPRs / Left half of Vector
Registers.</para>
</listitem>
<listitem>
<para>128-bit integer add / subtract with carry / extend, direct
support for vector <literal>__int128</literal> and multiple precision arithmetic.</para>
</listitem>
<listitem>
<para>Decimal Integer add subtract for 31 digit BCD.</para>
</listitem>
<listitem>
<para>Miscellaneous SIMD extensions: Count leading Zeros, Population
count, bit gather / permute, and vector forms of eqv, nand, orc.</para>
</listitem>
</itemizedlist>

<para>The rational for why these are included in the Vector Facilities
(VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with
how the instruction where encoded then with the type of operations or the ISA
version of introduction. This is primarily a trade-off between the bits
required for register selection vs bits for extended op-code space within in a
fixed 32-bit instruction. Basically accessing 32 vector registers require
5-bits per register, while accessing all 64 vector-scalar registers require
6-bits per register. When you consider the most vector instructions require  3
 and some (select, fused multiply-add) require 4 register operand forms,  the
impact on op-code space is significant. The larger register set of VSX was
justified by queuing theory of larger HPC matrix codes using double float,
while 32 registers are sufficient for most applications.</para>

<para>So by definition the VMX instructions are restricted to the original
32 vector registers while VSX instructions are encoded to  access all 64
floating-point scalar and vector double registers. This distinction can be
troublesome when programming at the assembler level, but the compiler and
compiler built-ins can hide most of this detail from the programmer. </para>
</section>

@ -0,0 +1,67 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_power_vmx">
<title>The Vector Facility (VMX)</title>
<para>The orginal VMX supported SIMD integer byte, halfword, and word, and
single float data types within a separate (from GPR and FPR) bank of 32 x
128-bit vector registers. These operations like to stay within their (SIMD)
lanes except where the operation changes the element data size (integer
multiply, pack, and unpack). </para>

<para>This is complimented by bit logical and shift / rotate / permute /
merge instuctions that operate on the vector as a whole.  Some operation
(permute, pack, merge, shift double, select) will select 128 bits from a pair
of vectors (256-bits) and deliver 128-bit vector result. These instructions
will cross lanes or multiple registers to grab fields and assmeble them into
the single register result.</para>

<para>The PowerISA 2.07B Chapter 6. Vector Facility is organised starting
with an overview (chapters 6.1- 6.6):

<literallayout><literal>6.1 Vector Facility Overview . . . . . . . . . . . . . . . . . . . . 227
6.2 Chapter Conventions. . . . . . . . . . . . . . . . . . . . . . . 227
6.2.1 Description of Instruction Operation . . . . . . . . . . . . . 227
6.3 Vector Facility Registers . . . . . . . . . . . . . . . . . . . 234
6.3.1 Vector Registers . . . . . . . . . . . . . . . . . . . . . . . 234
6.3.2 Vector Status and Control Register . . . . . . . . . . . . . . 234
6.3.3 VR Save Register . . . . . . . . . . . . . . . . . . . . . . . 235
6.4 Vector Storage Access Operations . . . . . . . . . . . . . . . . 235
6.4.1 Accessing Unaligned Storage Operands . . . . . . . . . . . . . 237
6.5 Vector Integer Operations . . . . . . . . . . . . . . . . . . . 238
6.5.1 Integer Saturation . . . . . . . . . . . . . . . . . . . . . . 238
6.6 Vector Floating-Point Operations . . . . . . . . . . . . . . . . 240
6.6.1 Floating-Point Overview . . . . . . . . . . . . . . . . . . . 240
6.6.2 Floating-Point Exceptions . . . . . . . . . . . . . . . . . . 240
6.7 Vector Storage Access Instructions . . . . . . . . . . . . . . . 242
6.7.1 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 242
6.7.2 Vector Load Instructions . . . . . . . . . . . . . . . . . . . 243
6.7.3 Vector Store Instructions. . . . . . . . . . . . . . . . . . . 246
6.7.4 Vector Alignment Support Instructions. . . . . . . . . . . . . 248</literal></literallayout></para>
<para>Then a chapter on storage (load/store) access for vector and vector
elements:</para>
<xi:include href="sec_power_vector_permute_format.xml"/>

</section>

@ -0,0 +1,186 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_power_vector_scalar_floatingpoint">
<title>Vector-Scalar Floating-Point Operations (VSX)</title>
<para>With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities
of the PowerISA:</para>
<itemizedlist>
<listitem>
<para>Extend the available vector and floating-point scalar register
sets from 32 registers each to a combined 64 x 64-bit scalar floating-point and
64 x 128-bit vector registers.</para>
</listitem>
<listitem>
<para>Enable scalar double float operations on all 64 scalar
registers.</para>
</listitem>
<listitem>
<para>Enable vector double and vector float operations for all 64
vector registers.</para>
</listitem>
<listitem>
<para>Enable super-scalar execution of vector instructions and support
2 independent vector floating point  pipelines for parallel execution of 4 x
64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit (FMAs) per
cycle.</para>
</listitem>
</itemizedlist>

<para>With PowerISA 2.07 (POWER8) we added single-precision scalar
floating-point instruction to VSX. This completes the floating-point
computational set for VSX. This ISA release also clarified how these operate in
the Little Endian storage model.</para>

<para>While the focus was on enhanced floating-point computation (for High
Performance Computing),  VSX also extended  the ISA with additional storage
access, logical, and permute (merge, splat, shift) instructions. This was
necessary to extend these operations cover 64 VSX registers, and improves
unaligned storage access for vectors  (not available in VMX).</para>

<para>The PowerISA 2.07B Chapter 7. Vector-Scalar Floating-Point Operations
is organized starting with an introduction and overview (chapters 7.1- 7.5) .
The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers
and how they relate (overlap and inter-operate) to the existing floating point
scalar (FPRs) and (VMX VRs) vector registers.

<literallayout><literal>7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 317
7.1.1 Overview of the Vector-Scalar Extension . . . . . . . . . . . 317
7.2 VSX Registers . . . . . . . . . . . . . . . . . . . . . . . . . 318
7.2.1 Vector-Scalar Registers . . . . . . . . . . . . . . . . . . . 318
7.2.2 Floating-Point Status and Control Register . . . . . . . . . . 321</literal></literallayout></para>

<para>The definitions given in “7.1.1.1 Compatibility with Category
Floating-Point and Category Decimal Floating-Point Operations”, and
“7.1.1.2 Compatibility with Category Vector Operations”
<blockquote>
<para>The instruction sets defined in Chapter 4.
Floating-Point Facility and Chapter 5. Decimal
Floating-Point retain their definition with one primary
difference. The FPRs are mapped to doubleword
element 0 of VSRs 0-31. The contents of doubleword 1
of the VSR corresponding to a source FPR specified
by an instruction are ignored. The contents of
doubleword 1 of a VSR corresponding to the target
FPR specified by an instruction are undefined.</para>
<para>The instruction set defined in Chapter 6. Vector Facility
[Category: Vector], retains its definition with one
primary difference. The VRs are mapped to VSRs
32-63.</para></blockquote></para>

<note><para>The reference to scalar element 0 above is from the big endian
register perspective of the ISA. In the PPC64LE ABI implementation, and for the
purpose of porting Intel intrinsics, this is logical element 1.  Intel SSE
scalar intrinsics operated on logical element [0],  which is in the wrong
position for PowerISA FPU and VSX scalar floating-point  operations. Another
important note is what happens to the other half of the VSR when you execute a
scalar floating-point instruction (<emphasis>The contents of doubleword 1 of a VSR …
are undefined.</emphasis>)</para></note>

<para>The compiler will hide some of this detail when generating code for
little endian vector element [] notation and most vector built-ins. For example
<literal>vec_splat (A, 0)</literal> is transformed for
PPC64LE to <literal>xxspltd VRT,VRA,1</literal>.
What the compiler <emphasis><emphasis role="bold">can not</emphasis></emphasis>
hide is the different placement of scalars within vector registers.</para>

<para>Vector registers (VRs) 0-31 overlay and can be accessed from vector
scalar registers (VSRs) 32-63. The ABI also specifies that VR2-13 are used to
pass parameter and return values. In some cases the same (similar) operations
exist in both VMX and VSX instruction forms, while in the other cases
operations only exist for VMX (byte level permute and shift) or VSX (Vector
double).</para>

<para>So resister selection that; avoids unnecessary vector moves, follows
the ABI, while maintaining the correct instruction specific register numbering,
can be tricky. The
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints">GCC register constraint</link>
annotations for Inline
assembler using vector instructions  is challenging, even for experts. So only
experts should be writing assembler and then only in extraordinary
circumstances. You should leave these details to the compiler (using vector
extensions and vector built-ins) when ever possible.</para>

<para>The next sections get is into the details of floating point
representation, operations, and exceptions. Basically the implementation
details for the IEEE754R and C/C++ language standards that most developers only
access via higher level APIs. So most programmers will not need this level of
detail, but it is there if needed.

<literallayout><literal>7.3 VSX Operations . . . . . . . . . . . . . . . . . . . . . . . . . 326
7.3.1 VSX Floating-Point Arithmetic Overview . . . . . . . . . . . . 326
7.3.2 VSX Floating-Point Data . . . . . . . . . . . . . . . . . . . 327
7.3.3 VSX Floating-Point Execution Models . . . . . . . . . . . . . 335
7.4 VSX Floating-Point Exceptions . . . . . . . . . . . . . . . . . 338
7.4.1 Floating-Point Invalid Operation Exception . . . . . . . . . . 341
7.4.2 Floating-Point Zero Divide Exception . . . . . . . . . . . . . 347
7.4.3 Floating-Point Overflow Exception. . . . . . . . . . . . . . . 349
7.4.4 Floating-Point Underflow Exception . . . . . . . . . . . . . . 351</literal></literallayout></para>

<para>Finally an overview the VSX storage access instructions for big and
little endian and for aligned and unaligned data addresses. This included
diagrams that illuminate the differences

<literallayout><literal>7.5 VSX Storage Access Operations . . . . . . . . . . . . . . . . . 356
7.5.1 Accessing Aligned Storage Operands . . . . . . . . . . . . . . 356
7.5.2 Accessing Unaligned Storage Operands . . . . . . . . . . . . . 357
7.5.3 Storage Access Exceptions . . . . . . . . . . . . . . . . . . 358</literal></literallayout></para>

<para>Section 7.6 starts with a VSX instruction Set Summary which is the
place to start to get an feel for the types and operations supported.  The
emphasis on float-point, both scalar and vector (especially vector double), is
pronounced. Many of the scalar and single-precision vector instruction look
like duplicates of what we have seen in the Chapter 4 Floating-Point and
Chapter 6 Vector facilities. The difference here is, new instruction encodings
to access the full 64 VSX register space. </para>

<para>In addition there are small number of logical instructions are
include to support predication (selecting / masking vector elements based on
compare results). And set of permute, merge, shift, and splat instructions that
operation on VSX word (float) and doubleword (double) elements. As mentioned
about VMX section 6.8 these instructions are good to study as they are useful
for realigning elements from PowerISA vector results to that required for Intel
Intrinsics.

<literallayout><literal>7.6 VSX Instruction Set . . . . . . . . . . . . . . . . . . . . . . 359
7.6.1 VSX Instruction Set Summary . . . . . . . . . . . . . . . . . 359
7.6.1.1 VSX Storage Access Instructions . . . . . . . . . . . . . . 359
7.6.1.2 VSX Move Instructions . . . . . . . . . . . . . . . . . . . 360
7.6.1.3 VSX Floating-Point Arithmetic Instructions . . . . . . . . 360
7.6.1.4 VSX Floating-Point Compare Instructions . . . . . . . . . . 363
7.6.1.5 VSX DP-SP Conversion Instructions . . . . . . . . . . . . . 364
7.6.1.6 VSX Integer Conversion Instructions . . . . . . . . . . . . 364
7.6.1.7 VSX Round to Floating-Point Integer Instructions . . . . . 366
7.6.1.8 VSX Logical Instructions. . . . . . . . . . . . . . . . . . 366
7.6.1.9 VSX Permute Instructions. . . . . . . . . . . . . . . . . . 367
7.6.2 VSX Instruction Description Conventions . . . . . . . . . . . 368
7.6.3 VSX Instruction Descriptions . . . . . . . . . . . . . . . . 392</literal></literallayout></para>

<para>The VSX Instruction Descriptions section contains the detail
description for each VSX category instruction.  The table entries from the
Instruction Set Summary are formatted in the document at hyperlinks to
corresponding instruction description.</para>

</section>

@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_powerisa">
<title>The PowerISA</title>
<para>The PowerISA is for historical reasons is organized at the top level
by the distinction between older Vector Facility (Altivec / VMX) and the newer
Vector-Scalar Floating-Point Operations (VSX). </para>
<xi:include href="sec_power_vmx.xml"/>
<xi:include href="sec_power_vsx.xml"/>

</section>

@ -0,0 +1,46 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_powerisa_vector_facilities">
<title>PowerISA Vector facilities</title>
<para>The PowerISA vector facilities (VMX and VSX) are extensive, but does
not always provide a direct or obvious functional equivalent to the Intel
Intrinsics. But being not obvious is not the same as imposible. It just
requires some basic programing skills.</para>

<para>It is a good idea to have an overall understanding of the vector
capabilities the PowerISA. You do not need to memorize every instructions but
is helps to know where to look. Both the PowerISA and OpenPOWER ABI have a
specific structure and organization that can help you find what you looking
for. </para>

<para>It also helps to understand the relationship between the PowerISAs
low level instructions and the higher abstraction of the vector intrinsics as
defined by the OpenPOWER ABIs Vector Programming Interfaces and the the defacto
standard of GCC's PowerPC AltiVec Built-in Functions.</para>
<xi:include href="sec_powerisa.xml"/>
<xi:include href="sec_powerisa_vector_intrinsics.xml"/>
<xi:include href="sec_powerisa_vector_size_type.xml"/>

</section>

@ -0,0 +1,79 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_powerisa_vector_intrinsics">
<title>PowerISA Vector Intrinsics</title>
<para>The
<emphasis role="bold">OpenPOWER ELF V2 application binary interface (ABI): Chapter 6.</emphasis>
<emphasis><emphasis role="bold">Vector Programming Interfaces</emphasis></emphasis> and
<emphasis><emphasis role="bold">Appendix A. Predefined Functions for Vector
Programming</emphasis></emphasis> document the current and proposed vector built-ins we expect all
C/C++ compilers implement. </para>

<para>Some of these operations are endian sensitive and the compiler needs
to make corresponding adjustments as  it generate code for endian sensitive
built-ins. There is a good overview for this in the
<emphasis role="bold">OpenPOWER ABI Section</emphasis>
<emphasis><emphasis role="bold">6.4.
Vector Built-in Functions</emphasis></emphasis>.</para>

<para>Appendix A is organized (sorted) by built-in name, output type, then
parameter types. Most built-ins are generic as the named the operation (add,
sub, mul, cmpeq, ...) applies to multiple types. </para>

<para>So the build <literal>vec_add</literal> built-in applies to all the signed and unsigned
integer types (char, short, in, and long) plus float and double floating-point
types. The compiler looks at the parameter type to select the vector
instruction (or instruction sequence) that implements the (add) operation on
that type. The compiler infers the output result type from the operation and
input parameters and will complain if the target variable type is not
compatible. For example:
<programlisting><![CDATA[vector signed char vec_add (vector signed char, vector signed char);
vector unsigned char vec_add (vector unsigned char, vector unsigned char);
vector signed short vec_add (vector signed short, vector signed short);
vector unsigned short vec_add (vector unsigned short, vector unsigned short);
vector signed int vec_add (vector signed int, vector signed int);
vector unsigned int vec_add (vector unsigned int, vector unsigned int);
vector signed int vec_add (vector signed long, vector signed long);
vector unsigned int vec_add (vector unsigned long, vector unsigned long);
vector float vec_add (vector float, vector float);
vector double vec_add (vector double, vector double);]]></programlisting></para>

<para>This is one key difference between PowerISA built-ins and Intel
Intrinsics (Intel Intrinsics are not generic and include type information in
the name). This is why it is so important to understand the vector element
types and to add the appropriate type casts to get the correct results.</para>

<para>The defacto standard implementation is GCC as defined in the include
file <literal>&lt;altivec.h&gt;</literal> and documented in the GCC online documentation in
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">6.59.20 PowerPC
AltiVec Built-in Functions</link>. The header file name and section title
reflect the origin of the Vector Facility, but recent versions of GCC altivec.h
include built-ins for newer PowerISA 2.06 and 2.07 VMX plus VSX extensions.
This is a work in progress where your  (older) distro GCC compiler may not
include built-ins for the latest PowerISA 3.0 or ABI edition. So before you use
a built-in you find in the ABI Appendix A, check the specific
<link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link> for the
GCC version you are using.</para>

</section>

@ -0,0 +1,119 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_powerisa_vector_size_type">
<title>How vector elements change size and type</title>
<para>Most vector built ins return the same vector type as the (first)
input parameters, but there are exceptions. Examples include; conversions
between types, compares , pack, unpack,  merge, and integer multiply
operations.</para>

<para>Converting floats to from integer will change the type and something
change the element size as well (double ↔ int and float ↔ long). For the
VMX the conversions are always the same size (float ↔ [unsigned] int). But
VSX allows conversion of 64-bit (long or double) to from 32-bit (float or
 int)  with the inherent size changes. The PowerISA VSX defines a 4 element
vector layout where little endian elements 0, 2 are used for input/output and
elements 1,3 are undefined. The OpenPOWER ABI Appendix A define
<literal>vec_double</literal> and <literal>vec_float</literal>
with even/odd and high/low extensions as program aids. These are not
included in GCC 7 or earlier but are planned for GCC 8.</para>

<para>Compare operations produce either
<literal>vector bool &lt;</literal>input element type<literal>&gt;</literal>
(effectively bit masks) or predicates (the condition code for all and
any are represented as an int truth variable). When a predicate compare (i.e.
<literal>vec_all_eq</literal>, <literal>vec_any_gt</literal>),
is used in a if statement,  the condition code is
used directly in the conditional branch and the int truth value is not
generated.</para>

<para>Pack operations pack integer elements into the next smaller (half)
integer sized elements. Pack operations include signed and unsigned saturate
and unsigned modulo forms. As the packed result will be half the size (in
bits), pack instructions require 2 vectors (256-bits) as input and generate a
single 128-bit vector results.
<programlisting><![CDATA[vec_vpkudum ({1, 2}, {101, 102}) result={1, 2, 101, 102}]]></programlisting></para>

<para>Unpack operations expand integer elements into the next larger size
elements. The integers are always treated as signed values and sign-extended.
The processor design avoids instructions that return multiple register values.
So the PowerISA defines unpack-high and unpack low forms where instruction
takes (the high or low) half of vector elements and extends them to fill the
vector output. Element order is maintained and an unpack high / low sequence
with same input vector has the effect of unpacking to a 256-bit result in two
vector registers.
<programlisting><![CDATA[vec_vupkhsw ({1, 2, 3, 4}) result={1, 2}
vec_vupkhsw ({-1, 2, -3, 4}) result={-1, 2}
vec_vupklsw ({1, 2, 3, 4}) result={3, 4}
vec_vupklsw ({-1, 2, -3, 4}) result={-3, 4}]]></programlisting></para>

<para>Merge operations resemble shuffling two (vectors) card decks
together, alternating (elements) cards in the result.   As we are merging from
2 vectors (256-bits) into 1 vector (128-bits) and the elements do not change
size, we have merge high and merge low instruction forms for each (byte,
halfword and word) integer type. The merge high operations alternate elements
from the (vector register left) high half of the two input vectors. The merge
low operation alternate elements from the (vector register right) low half of
the two input vectors.</para>

<para>For PowerISA 2.07 we added vector merge word even / odd instructions.
Instead of high or low elements the shuffle is from the even or odd number
elements of the two input vectors. Passing the same vector to both inputs to
merge produces splat like results for each doubleword half, which is handy in
some convert operations.
<programlisting><![CDATA[vec_mrghd ({1, 2}, {101, 102}) result={1, 101}
vec_mrgld ({1, 2}, {101, 102}) result={2, 102}

vec_vmrghw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 2, 102}
vec_vmrghw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 2, 2}
vec_vmrglw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={3, 103, 4, 104}
vec_vmrglw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={3, 3, 4, 4}


vec_mergee ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 3, 103}
vec_mergee ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 3, 3}
vec_mergeo ({1, 2, 3, 4}, {101, 102, 103, 104}) result={2, 102, 4, 104}
vec_mergeo ({1, 2, 3, 4}, {1, 2, 3, 4}) result={2, 2, 4, 4}]]></programlisting></para>

<para>Integer multiply has the potential to generate twice as many bits in
the product as input. A multiply of 2 int (32-bit) values produces a long
(64-bits). Normal C language * operations ignore this and discard the top
32-bits of the result. However  in some computations it useful to preserve the
double product precision for intermediate computation before reducing the final
result back to the original precision.</para>

<para>The PowerISA VMX instruction set took the later approach ie keep all
the product bits until the programmer explicitly asks for the truncated result.
So the vector integer multiple are split into even/odd forms across signed and
unsigned; byte, halfword and word inputs. This requires two instructions (given
the same inputs) to generated the full vector  multiply across 2 vector
registers and 256-bits. Again as POWER processors are super-scalar this pair of
instructions should execute in parallel.</para>

<para>The set of expanded product values can either be used directly in
further (doubled precision) computation or merged/packed into the single single
vector at the smaller bit size. This is what the compiler will generate for C
vector extension multiply of vector integer types.</para>

</section>

@ -0,0 +1,57 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_prefered_methods">
<title>Prefered methods</title>
<para>As we will see there are multiple ways to implement the logic of
these intrinsics. Some implementation methods are preferred because they allow
the compiler to select instructions and provided the most flexibility for
optimization across the whole sequence. Other methods may be required to
deliver a specific semantic or to deliver better optimization than the current
compiler is capable of. Some methods are more portable across multiple
compilers (GCC, LLVM, ...). All of this should be taken into consideration for
each intrinsic implementation. In general we should use the following list as a
guide to these decisions:</para>
<orderedlist>
<listitem>
<para>Use C vector arithmetic, logical, dereference, etc., operators in
preference to intrinsics.</para>
</listitem>
<listitem>
<para>Use the bi-endian interfaces from Appendix A of the ABI in
preference to other intrinsics when available, as these are designed for
portability among compilers.</para>
</listitem>
<listitem>
<para>Use other, less well documented intrinsics (such as
<literal>__builtin_vsx_*</literal>) when no better facility is available, in preference to
assembly.</para>
</listitem>
<listitem>
<para>If necessary, use inline assembly, but know what you're
doing.</para>
</listitem>
</orderedlist>

</section>

@ -0,0 +1,66 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_prepare">
<title>Prepare yourself</title>
<para>To port Intel intrinsics to POWER you will need to prepare yourself
with knowledge of PowerISA vector facilities and how to access the associated
documentation.</para>

<itemizedlist>
<listitem>
<para>
<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Vector-Extensions.html#Vector-Extensions">GCC vector extention</link>
syntax and usage. This is one of a set of GCC
"<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/C-Extensions.html#C-Extensions">Extentions to the C language Family</link>”
that the intrinsic header implementation depends
on.  As many of the GCC intrinsics for x86 are implemented via C vector
extensions, reading and understanding of this code is an important part of the
porting process. </para>
</listitem>
<listitem>
<para>Intel (x86) intrinsic and type naming conventions and how to find
more information. The intrinsic name encodes  some information about the
vector size and type of the data, but the pattern is not always  obvious.
Using the online
<link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#">Intel
Intrinsic Guide</link> to look up the intrinsic by name is a good first
step.</para>
</listitem>
<listitem>
<para>PowerISA Vector facilities. The Vector facilities of POWER8 are
extensive and cover the usual types and usual operations. However it has a
different history and organization from Intel.  Both (Intel and PowerISA) have
their quirks and in some cases the mapping may not be obvious. So familiarizing
yourself with the PowerISA Vector (VMX) and Vector Scalar Extensions (VSX) is
important.</para>
</listitem>
</itemizedlist>
<xi:include href="sec_gcc_vector_extensions.xml"/>
<xi:include href="sec_intel_intrinsic_functions.xml"/>
<xi:include href="sec_powerisa_vector_facilities.xml"/>
<xi:include href="sec_more_examples.xml"/>


</section>

@ -0,0 +1,64 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_review_source">
<title>Look at the source, Luke</title>
<para>So if this is a code porting activity, where is the source? All the
source code we need to look at is in the GCC source trees. You can either git
(<link xlink:href="https://gcc.gnu.org/wiki/GitMirror">https://gcc.gnu.org/wiki/GitMirro</link>)
the gcc source  or down load one of the
recent AT source tars (for example:
<link xlink:href="ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/">ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/</link>).
 You will find the intrinsic headers in the ./gcc/config/i386/
sub-directory.</para>

<para>If you have a Intel Linux workstation or laptop with GCC installed,
you already have these headers, if you want to take a look:
<screen><prompt>$ </prompt><userinput>find /usr/lib -name '*mmintrin.h'</userinput>
/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/wmmintrin.h
/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/mmintrin.h
/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/xmmintrin.h
/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/emmintrin.h
/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/tmmintrin.h
...
<prompt>$ </prompt></screen></para>

<para>But depending on the vintage of the distro, these may not be the
latest versions of the headers. Looking at the header source will tell you a
few things.: The include structure (what other headers are implicitly
included). The types that are used at the API. And finally, how the API is
implemented.</para>
<para><literallayout>smmintrin.h (SSE4.1) includes tmmintrin,h
tmmintrin.h (SSSE3) includes pmmintrin.h
pmmintrin.h (SSE3) includes emmintrin,h
emmintrin.h (SSE2) includes xmmintrin.h
xmmintrin.h (SSE) includes mmintrin.h and mm_malloc.h
mmintrin.h (MMX)</literallayout></para>

<xi:include href="sec_intel_intrinsic_includes.xml"/>
<xi:include href="sec_intel_intrinsic_types.xml"/>
<xi:include href="sec_api_implemented.xml"/>


</section>

@ -0,0 +1,62 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_simple_examples">
<title>Some simple examples</title>
<para>For example; a vector double splat looks like this:
<programlisting><![CDATA[/* Create a vector with both elements equal to F. */
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_set1_pd (double __F)
{
return __extension__ (__m128d){ __F, __F };
}]]></programlisting></para>
<para>Another example:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
return (__m128d) ((__v2df)__A + (__v2df)__B);
}]]></programlisting></para>
<para>Note in the example above the cast to __v2df for the operation. Both
__m128d and __v2df are vector double, but __v2df does no have the <literal>__may_alias__</literal>
attribute. And one more example:
<programlisting><![CDATA[extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_mullo_epi16 (__m128i __A, __m128i __B)
{
return (__m128i) ((__v8hu)__A * (__v8hu)__B);
}]]></programlisting></para>
<para>Note this requires a cast for the compiler to generate the correct
code for the intended operation. The parameters and result are the generic
<literal>__m128i</literal>, which is a vector long long with the
<literal>__may_alias__</literal> attribute. But
operation is a vector multiply low unsigned short (<literal>__v8hu</literal>). So not only do we
use the cast to drop the <literal>__may_alias__</literal> attribute but we also need to cast to
the correct (vector unsigned short) type for the specified operation.</para>

<para>I have successfully copied these (and similar) source snippets over
to the PPC64LE implementation unchanged. This of course assumes the associated
types are defined and with compatible attributes.</para>

</section>

@ -0,0 +1,134 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2017 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="sec_vec_or_not">
<title>To vec_not or not</title>
<para>Well not exactly. Looking at the OpenPOWER ABI document we see a
reference to
<literal>vec_cmpne</literal> for all numeric types. But when we look in the current
GCC 6 documentation we find that
<literal>vec_cmpne</literal> is not on the list. So it is planned
in the ABI, but not implemented yet.</para>

<para>Looking at the PowerISA 2.07B we find a VSX Vector Compare Equal to
Double-Precision but no Not Equal. In fact we see only vector double compare
instructions for greater than and greater than or equal in addition to the
equal compare. Not only can't we find a not equal, there is no less than or
less than or equal compares either.</para>

<para>So what is going on here? Partially this is the Reduced Instruction
Set Computer (RISC) design philosophy. In this case the compiler can generate
all the required compares using the existing vector instructions and simple
transforms based on Boolean algebra. So
<literal>vec_cmpne(A,B)</literal> is simply <literal>vec_not
(vec_cmpeq(A,B))</literal>. And <literal>vec_cmplt(A,B)</literal> is simply
<literal>vec_cmpgt(B,A)</literal> based on the
identity A &lt; B <emphasis><emphasis role="bold">iff</emphasis></emphasis> B &gt; A.
Similarly <literal>vec_cmple(A,B)</literal> is implemented as
<literal>vec_cmpge(B,A)</literal>.</para>

<para>What a minute, there is no <literal>vec_not()</literal> either. Can not find it in the
PowerISA, the OpenPOWER ABI, or the GCC PowerPC Altivec Built-in documentation.
There is no <literal>vec_move()</literal> either! How can this possibly work?</para>

<para>This is RISC philosophy again. We can always use a logical
instruction (like bit wise <emphasis role="bold">and</emphasis> or
<emphasis role="bold">or</emphasis>) to effect a move given that we also have
nondestructive 3 register instruction forms. In the PowerISA most instruction
have two input registers and a separate result register. So if the result
register number is  different from either input register then the inputs are
not clobbered (nondestructive). Of course nothing prevents you from specifying
the same register for both inputs or even all three registers (result and both
inputs).  And some times it is useful.</para>

<para>The statement <literal>B = vec_or (A,A)</literal> is is effectively a vector move/copy
from <literal>A</literal> to <literal>B</literal>. And <literal>A = vec_or (A,A)</literal> is obviously a
<emphasis role="bold"><literal>nop</literal></emphasis> (no operation). In the the
PowerISA defines the preferred <literal>nop</literal> and register move for vector registers in
this way.</para>

<para>It is also useful to have hardware implement the logical operators
<emphasis role="bold">nor</emphasis> (<emphasis role="bold">not or</emphasis>)
and <emphasis role="bold">nand</emphasis> (<emphasis role="bold">not and</emphasis>).  
The PowerISA provides these instruction for
fixed point and vector logical operation. So <literal>vec_not(A)</literal>
can be implemented as <literal>vec_nor(A,A)</literal>.
So looking at the  implementation of _mm_cmpne we propose the
following:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_pd (__m128d __A, __m128d __B)
{
__m128d temp = (__m128d)vec_cmpeq (__A, __B);
return ((__m128d)vec_nor (temp, temp));
}
extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpneq_sd (__m128d __A, __m128d __B)
{
__v2df a, b, c;
a = vec_splat(__A, 0);
b = vec_splat(__B, 0);
c = (__v2df)vec_cmpeq(a, b);
c = (__v2df)vec_nor(c, c);
return ((__m128d){c[0], __A[1]});
}]]></programlisting></para>

<para>The Intel Intrinsics also include the not forms of the relational
compares:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpnlt_sd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpnltsd ((__v2df)__A, (__v2df)__B);
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpnle_sd (__m128d __A, __m128d __B)
{
return (__m128d)__builtin_ia32_cmpnlesd ((__v2df)__A, (__v2df)__B);
}]]></programlisting></para>

<para>The PowerISA and OpenPOWER ABI, or GCC PowerPC Altivec Built-in
documentation do not provide any direct equivalents to the  not greater than
class of compares. Again you don't really need them if you know Boolean
algebra. We can use identities like
{<emphasis role="bold">not</emphasis> (A &lt; B) iff A &gt;= B} and
{<emphasis role="bold">not</emphasis> (A
&lt;= B) iff A &gt; B}. So the PPC64LE implementation follows:
<programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpnlt_pd (__m128d __A, __m128d __B)
{
return ((__m128d)vec_cmpge (__A, __B));
}

extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpnle_pd (__m128d __A, __m128d __B)
{
return ((__m128d)vec_cmpgt (__A, __B));
}]]></programlisting></para>

<para>These patterns repeat for the scalar version of the
<emphasis role="bold">not</emphasis> compares. And
in general the larger pattern described in this chapter applies to the other
float and integer types with similar interfaces.</para>


</section>

File diff suppressed because it is too large Load Diff

@ -0,0 +1,22 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

<parent>
<groupId>org.openpowerfoundation.docs</groupId>
<artifactId>master-pom</artifactId>
<version>1.0.0-SNAPSHOT</version>
<relativePath>../Docs-Master/pom.xml</relativePath>
</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>workgroup-pom</artifactId>
<packaging>pom</packaging>

<modules>
<!-- TODO: Add new documents are build in the project, add their directories to this list to
enable all document builds from the top level -->
<module>Vector_Intrinsics</module>
</modules>
</project>
Loading…
Cancel
Save