Inital port from SJM R4 document

9 years ago · dfa363ccaa
parent cb01eeda67
commit dfa363ccaa
44 changed files with 5362 additions and 0 deletions
--- a/Intrinsic-porting-guide-draft-R4.odt
+++ b/Intrinsic-porting-guide-draft-R4.odt
--- a/Intrinsic-porting-guide-draft-R4_hack.odt
+++ b/Intrinsic-porting-guide-draft-R4_hack.odt
--- a/Intrinsic-porting-guide-draft-R4_hack.xml
+++ b/Intrinsic-porting-guide-draft-R4_hack.xml
--- a/176
+++ b/176
@ -0,0 +1,176 @@
+
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
--- a/README.md
+++ b/README.md
@ -0,0 +1,93 @@
+# Porting Guide for Linux on Power
+TBD...
+
+To build this project, one must ensure that the Docs-Master project has
+also been cloned at the same directory level as the Docs-Template project.
+This can be accomplished with the following steps:
+
+1. Clone the master documentation project (Docs-Master) using the following command:
+
+  ```
+  $ git clone https://github.com/OpenPOWERFoundation/Docs-Master.git
+  ```
+  
+2. Clone this project (Docs-Template) using the following command:
+
+  ```
+  $ git clone https://ibm.github.com/scheel/SJM-Porting-Guide.git
+  ```
+  
+3. Build the project with these commands:
+  ```
+  $ cd SJM-Porting-Guide
+  $ mvn clean generate-sources
+  ```
+
+The online version of the document can be found in the OpenPOWER Foundation
+Document library at [TBD](http://openpowerfoundation.org/?resource_lib=TBD).
+
+The project which controls the look and feel of the document is the 
+[Docs-Maven-Plugin project](https://github.com/OpenPOWERFoundation/Docs-Maven-Plugin), an 
+OpenPOWER Foundation private project on GitHub.  To obtain access to the Maven Plugin project, 
+contact Jeff Scheel \([scheel@us.ibm.com](mailto://scheel@us.ibm.com)\) or 
+Jeff Brown \([jeffdb@us.ibm.com](mailto://jeffdb@us.ibm.com)\).
+
+## License
+This project is licensed under the Apache V2 license.  More information
+can be found in the LICENSE file or online at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+## Community
+TBD...
+
+## Contributions
+TBD...
+
+Contributions to this project should conform to the `Developer Certificate
+of Origin` as defined at http://elinux.org/Developer_Certificate_Of_Origin.
+Commits to this project need to contain the following line to indicate
+the submitter accepts the DCO:
+```
+Signed-off-by: Your Name <your_email@domain.com>
+```
+By contributing in this way, you agree to the terms as follows:
+```
+Developer Certificate of Origin
+Version 1.1
+
+Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
+660 York Street, Suite 102,
+San Francisco, CA 94110 USA
+
+Everyone is permitted to copy and distribute verbatim copies of this
+license document, but changing it is not allowed.
+
+
+Developer's Certificate of Origin 1.1
+
+By making a contribution to this project, I certify that:
+
+(a) The contribution was created in whole or in part by me and I
+    have the right to submit it under the open source license
+    indicated in the file; or
+
+(b) The contribution is based upon previous work that, to the best
+    of my knowledge, is covered under an appropriate open source
+    license and I have the right under that license to submit that
+    work with modifications, whether created in whole or in part
+    by me, under the same open source license (unless I am
+    permitted to submit under a different license), as indicated
+    in the file; or
+
+(c) The contribution was provided directly to me by some other
+    person who certified (a), (b) or (c) and I have not modified
+    it.
+
+(d) I understand and agree that this project and the contribution
+    are public and that a record of the contribution (including all
+    personal information I submit with it, including my sign-off) is
+    maintained indefinitely and may be redistributed consistent with
+    this project or the open source license(s) involved.
+```
+
--- a/Vector_Intrinsics/app_intel_suffixes.xml
+++ b/Vector_Intrinsics/app_intel_suffixes.xml
@ -0,0 +1,318 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<appendix xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="app_intel_suffixes">
+    <?dbhtml stop-chunking?>
+  <title>Intel Intrinsic suffixes</title>
+  
+  <section>
+    <title>MMX</title>
+      <variablelist>
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pi16</literal></emphasis></term>
+            <listitem><para>4 x packed short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pi32</literal></emphasis></term>
+            <listitem><para>2 x packed int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pi8</literal></emphasis></term>
+            <listitem><para>8 x packed signed char</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pu16</literal></emphasis></term>
+            <listitem><para>4 x packed unsigned short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pu8</literal></emphasis></term>
+            <listitem><para>8 x packed unsigned char</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_si64</literal></emphasis></term>
+            <listitem><para>single 64-bit binary (logical)</para></listitem>
+			  </varlistentry>
+      </variablelist>
+  </section>
+  
+  <section>
+    <title>SSE</title>
+      <variablelist>
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
+            <listitem><para>4 x packed float</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
+            <listitem><para>single scalar float</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_si32</literal></emphasis></term>
+            <listitem><para>single 32-bit int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_si64</literal></emphasis></term>
+            <listitem><para>single 64-bit long int</para></listitem>
+			  </varlistentry>
+      </variablelist>
+  </section>
+  
+  <section>
+    <title>SSE2</title>
+      <variablelist>
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
+            <listitem><para>8 x packed short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
+            <listitem><para>4 x packed int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
+            <listitem><para>2 x packed long int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
+            <listitem><para>16 x packed signed char</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
+            <listitem><para>8 x packed unsigned short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
+            <listitem><para>4 x packed unsigned int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
+            <listitem><para>16 x packed unsigned char</para></listitem>
+			  </varlistentry>
+      </variablelist>
+
+      <!-- Is this break really desired? -->
+      <para/>
+      
+      <variablelist>
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
+            <listitem><para>2 x packed double</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
+            <listitem><para>single scalar double</para></listitem>
+			  </varlistentry>
+ 
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pi64</literal></emphasis></term>
+            <listitem><para>single long int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_si128</literal></emphasis></term>
+            <listitem><para>single 128-bit binary (logical)</para></listitem>
+			  </varlistentry>
+     </variablelist>
+		
+  </section>
+  
+  <section>
+    <title>AVX/AVX2 __m256_*</title>
+      <variablelist>
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
+            <listitem><para>8 x packed float</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
+            <listitem><para>4 x packed double</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
+            <listitem><para>16 x packed short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
+            <listitem><para>8 x packed int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
+            <listitem><para>4 x packed long int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
+            <listitem><para>32 x packed signed char</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
+            <listitem><para>16 x packed unsigned short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
+            <listitem><para>8 x packed unsigned int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
+            <listitem><para>32 x packed unsigned char</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
+            <listitem><para>single scalar float (broadcast/splat)</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
+            <listitem><para>single scalar double</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_si256</literal></emphasis></term>
+            <listitem><para>single 256-bit binary (logical)</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pd256</literal></emphasis></term>
+            <listitem><para>cast / zero extend</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ps256</literal></emphasis></term>
+            <listitem><para>cast / zero extend</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pd128</literal></emphasis></term>
+            <listitem><para>cast</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ps128</literal></emphasis></term>
+            <listitem><para>cast</para></listitem>
+			  </varlistentry>
+      </variablelist>
+  </section>
+  
+  <section>
+    <title>AVX512 __m512_*</title>
+      <variablelist>
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ps</literal></emphasis></term>
+            <listitem><para>16 x packed float</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pd</literal></emphasis></term>
+            <listitem><para>8 x packed double</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi16</literal></emphasis></term>
+            <listitem><para>32 x packed short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi32</literal></emphasis></term>
+            <listitem><para>16 x packed int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi64</literal></emphasis></term>
+            <listitem><para>8 x packed long int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epi8</literal></emphasis></term>
+            <listitem><para>64 x packed signed char</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu16</literal></emphasis></term>
+            <listitem><para>32 x packed unsigned short int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu32</literal></emphasis></term>
+            <listitem><para>16 x packed unsigned int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu64</literal></emphasis></term>
+            <listitem><para>8 x packed unsigned long int</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_epu8</literal></emphasis></term>
+            <listitem><para>64 x packed unsigned char</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ss</literal></emphasis></term>
+            <listitem><para>single scalar float</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_sd</literal></emphasis></term>
+            <listitem><para>single scalar double</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_si512</literal></emphasis></term>
+            <listitem><para>single 512-bit binary (logical)</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_pd512</literal></emphasis></term>
+            <listitem><para>cast / zero extend</para></listitem>
+			  </varlistentry>
+
+        <varlistentry>
+          <term><emphasis role="bold"><literal>_ps512</literal></emphasis></term>
+            <listitem><para>cast / zero extend</para></listitem>
+			  </varlistentry>
+      </variablelist>
+ 	  </section>
+
+</appendix>
+
--- a/Vector_Intrinsics/app_references.xml
+++ b/Vector_Intrinsics/app_references.xml
@ -0,0 +1,70 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<appendix xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="app_references">
+    <?dbhtml stop-chunking?>
+  <title>Document references</title>
+  
+  <section>
+    <title>OpenPOWER and Power documents</title>
+    <para>
+      <link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER™ Technical Specifications</link>
+    </para>
+    <para>
+      <link xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b">Power ISA™ Version 2.07 B</link>
+    </para>
+    <para>
+      <link xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">Power ISA™ Version 3.0</link>
+    </para>
+    <para>
+      <link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">Power Architecture 64-bit ELF ABI Specification (AKA OpenPower ABI for Linux Supplement)</link>
+    </para>
+    <para>
+      <link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">AltiVec™ Technology Programming Environments Manual</link>
+    </para>
+
+  </section>
+  <section>
+    <title>A.2 Intel documents</title>
+    <para>
+      <link xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel® 64 and IA-32 Architectures Software Developer’s Manual</link>
+    </para>
+    <para>
+      <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel™ Intrinsics Guide</link>
+    </para>
+    <para/>
+  </section>
+  <section>
+    <title>A.3 GNU Compiler Collection (GCC) documents</title>
+    <para>
+      <link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link>
+    </para>
+    <para>
+      <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/">GCC Manual (GCC 6.3)</link>
+    </para>
+    <para>
+      <link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GCC Internals Manual</link>
+    </para>
+    <para/>
+  </section>
+
+</appendix>
+
--- a/Vector_Intrinsics/bk_main.xml
+++ b/Vector_Intrinsics/bk_main.xml
@ -0,0 +1,103 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<book xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="bk_main">
+
+    <title>Linux on Power Porting Guide</title>
+    <subtitle>Vector Intrinsic</subtitle>
+
+    <info>
+    <author>
+      <personname>
+          <surname>System Software Work Group</surname>
+      </personname>
+      <email>syssw-chair@openpowerfoundation.org</email>
+      <affiliation>
+          <orgname>OpenPOWER Foundation</orgname>
+      </affiliation>
+    </author>
+    <copyright>
+      <year>2017</year>
+      <holder>OpenPOWER Foundation</holder>
+    </copyright>
+    <!-- TODO: Set the correct document releaseinfo -->
+    <releaseinfo>Revision 0.1</releaseinfo>
+    <productname>OpenPOWER</productname>
+    <pubdate/>
+
+    <legalnotice role="apache2">
+
+      <annotation>
+          <remark>Copyright details are filled in by the template.</remark>
+      </annotation>
+    </legalnotice>
+    
+    <!-- TODO: Update the following text with the correct document description (first paragraph),
+               Work Group name, and Work Product track (both in second paragraph). -->
+    <abstract>
+      <para>The goal of this project is to provide functional equivalents of the 
+      Intel MMX, SSE, and AVX intrinsic functions, that are commonly used in Linux 
+      applications, and make them (or equivalents) available for the PowerPC64LE 
+      platform.</para>
+      
+      <para>This document is a Standard Track, Work Group Note work product owned by the
+      System Software Workgroup and handled in compliance with the requirements outlined in the
+      <citetitle>OpenPOWER Foundation Work Group (WG) Process</citetitle> document.  It was
+      created using the <citetitle>Master Template Guide</citetitle> version 0.9.5. Comments,
+      questions, etc. can be submitted to the public mailing list  for this document at 
+      <link xlink:href="http://tbd.openpowerfoundation.org">TBD</link>.</para>
+    </abstract>
+
+    <revhistory>
+      <!-- TODO: Update as new revisions created -->
+      <revision>
+          <date>2017-07-26</date>
+          <revdescription>
+            <itemizedlist spacing="compact">
+              <listitem>
+                <para>Revision 0.1 - initial draft from Steve Munroe</para>
+              </listitem>
+            </itemizedlist>
+          </revdescription>
+      </revision>
+     </revhistory>
+    </info>
+
+  <!-- The ch_preface.xml file is required by all documents -->
+  <xi:include href="../../Docs-Master/common/ch_preface.xml"/>
+
+  <!-- Chapter heading files  -->
+  <xi:include href="ch_intel_intrinsic_porting.xml"/>
+  <xi:include href="ch_howto_start.xml"/>
+  
+  <!-- Placeholder files ATM -->
+  <!--chapter><title>Placeholders</title>
+  </chapter-->
+  
+
+  <!-- Document specific appendices -->
+  <xi:include href="app_references.xml"/>
+  <xi:include href="app_intel_suffixes.xml"/>
+
+  <!-- The app_foundation.xml appendix file is required by all documents. -->
+  <xi:include href="../../Docs-Master/common/app_foundation.xml"/>
+
+</book>
--- a/Vector_Intrinsics/ch_howto_start.xml
+++ b/Vector_Intrinsics/ch_howto_start.xml
@ -0,0 +1,115 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<chapter xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="ch_howto_start">
+  <title>How do we work this?</title>
+  
+  <para>The working assumption is to start with the existing GCC headers from 
+  ./gcc/config/i386/, then convert them to PowerISA and add them to 
+  ./gcc/config/rs6000/. I assume we will replicate the existing header structure 
+  and retain the existing header file and intrinsic names. This also allows us to 
+  reuse existing DejaGNU test cases from ./gcc/testsuite/gcc.target/i386, modify 
+  them as needed for the POWER target, and them to the 
+  ./gcc/testsuite/gcc.target/powerpc.</para>
+
+  <para>We can be flexible on the sequence that headers/intrinsics and test 
+  cases are ported.  This should be based on customer need and resolving 
+  internal dependencies.  This implies an oldest-to-newest / bottoms-up (MMX, 
+  SSE, SSE2, …) strategy. The assumption is, existing community and user 
+  application codes, are more likely to have optimized code for previous 
+  generation ubiquitous (SSE, SSE2, ...) processors than the latest (and rare) 
+  SkyLake AVX512.</para>
+
+  <para>I would start with an existing header from the current GCC 
+   ./gcc/config/i386/ and copy the header comment (including FSF copyright) down 
+  to any vector typedefs used in the API or implementation. Skip the Intel 
+  intrinsic implementation code for now, but add the ending #end if matching the 
+  headers conditional guard against multiple inclusion. You can add  #include 
+  &lt;alternative&gt; as needed. For examples:
+  <programlisting><![CDATA[/* Copyright (C) 2003-2017 Free Software Foundation, Inc
+...
+/* This header provides a best effort implementation of the Intel X86
+ * SSE2 intrinsics for the PowerPC target.  This implementation is a
+ * combination of compiled C vector codes or equivalent sequences of
+ * GCC vector builtins from the GCC PowerPC Altivec target.
+ *
+ * However some details of this implementation will differ from
+ * the X86 due to differences in the underlying hardware or GCC
+ * implementation. For example the PowerPC target only uses unordered
+ * floating point compares. */
+
+#ifndef EMMINTRIN_H_
+#define EMMINTRIN_H_
+
+#include <altivec.h>
+#include <assert.h>
+
+/* We need definitions from the SSE header files.  */
+#include <xmmintrin.h>
+
+/* The Intel API is flexible enough that we must allow aliasing with other
+   vector types, and their scalar components.  */
+typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
+
+/* Internal data types for implementing the intrinsics.  */
+typedef float __v4sf __attribute__ ((__vector_size__ (16)));
+/* more typedefs.  */
+
+/* The intrinsic implmentations go here.  */
+
+#endif /* EMMINTRIN_H_ */]]></programlisting></para>
+
+  <para>Then you can start adding small groups of related intrinsic 
+  implementations to the header to be compiled and  examine the generated code. 
+  Once you have what looks like reasonable code you can grep through 
+   ./gcc/testsuite/gcc.target/i386 for examples using the intrinsic names you 
+  just added. You should be able to find functional tests for most X86 
+  intrinsics. </para>
+
+  <para>The 
+  <link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/Testsuites.html#Testsuites">GCC 
+  testsuite</link> uses the DejaGNU  test framework as documented in the 
+  <link xlink:href="https://gcc.gnu.org/onlinedocs/gccint/">GNU Compiler Collection (GCC) 
+  Internals</link> manual. GCC adds its own DejaGNU directives and extensions, 
+  that are embedded in the testsuite source as comments.  Some are platform 
+  specific and will need to be adjusted for tests that are ported to our 
+  platform. For example
+  <programlisting><![CDATA[/* { dg-do run } */
+/* { dg-options "-O2 -msse2" } */
+/* { dg-require-effective-target sse2 } */]]></programlisting></para>
+
+  <para>should become something like
+  <programlisting><![CDATA[/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw { target powerpc*-*-* } } */]]></programlisting></para>
+
+  <para>Repeat this process until you have equivalent implementations for all 
+  the intrinsics in that header and associated test cases that execute without 
+  error.</para>
+
+  <xi:include href="sec_prefered_methods.xml"/>
+  <xi:include href="sec_prepare.xml"/>  
+  <xi:include href="sec_differences.xml"/>
+
+
+</chapter>
+
--- a/Vector_Intrinsics/ch_intel_intrinsic_porting.xml
+++ b/Vector_Intrinsics/ch_intel_intrinsic_porting.xml
@ -0,0 +1,46 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<chapter xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="ch_intel_intrinsic_porting">
+  <title>Intel Intrinsic porting guide for Power64LE</title>
+  
+  <para>The goal of this project is to provide functional equivalents of the 
+  Intel MMX, SSE, and AVX intrinsic functions, that are commonly used in Linux 
+  applications, and make them (or equivalents) available for the PowerPC64LE 
+  platform. These X86 intrinsics started with the Intel and Microsoft compilers 
+  but were then ported to the GCC compiler. The GCC implementation is a set of 
+  headers with inline functions. These inline functions provide a implementation 
+  mapping from the Intel/Microsoft dialect intrinsic names to the corresponding 
+  GCC Intel built-in's or directly via C language vector extension syntax.</para>
+
+  <para>The current proposal is to start with the existing X86 GCC intrinsic 
+  headers and port them (copy and change the source)  to POWER using C language 
+  vector extensions, VMX and VSX built-ins. Another key assumption is that we 
+  will be able to use many of existing Intel DejaGNU test cases on 
+  ./gcc/testsuite/gcc.target/i386. This document is intended as a guide to 
+  developers participating in this effort. However this document provides 
+  guidance and examples that should be useful to developers who may encounter X86 
+  intrinsics in code that they are porting to another platform.</para>
+
+  <xi:include href="sec_review_source.xml"/>
+
+</chapter>
+
--- a/Vector_Intrinsics/pom.xml
+++ b/Vector_Intrinsics/pom.xml
@ -0,0 +1,148 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+    <parent>
+
+        <groupId>org.openpowerfoundation.docs</groupId>
+        <artifactId>workgroup-pom</artifactId>
+        <version>1.0.0-SNAPSHOT</version>
+        <relativePath>../pom.xml</relativePath>
+    </parent>
+    <modelVersion>4.0.0</modelVersion>
+
+    <!-- TODO: Rename the artifactID field to some appropriate for your new document -->
+    <artifactId>Porting-Guide-Vector-Intrinsics</artifactId>
+
+    <packaging>jar</packaging>
+ 
+    <!-- TODO: Rename the name field to some appropriate for your new document -->
+   <name>Porting-Guide-Vector-Intrinsics</name>
+
+    <properties>
+        <!-- This is set by Jenkins according to the branch. -->
+        <release.path.name></release.path.name>
+        <comments.enabled>0</comments.enabled>
+    </properties>
+    <!-- ################################################ -->
+    <!-- USE "mvn clean generate-sources" to run this POM -->
+    <!-- ################################################ -->
+    <build>
+        <plugins>
+            <plugin>
+
+                <groupId>org.openpowerfoundation.docs</groupId>
+
+                <artifactId>openpowerdocs-maven-plugin</artifactId>
+                <!-- version set in ../pom.xml -->
+                <executions>
+                    <execution>
+                        <id>generate-webhelp</id>
+                        <goals>
+                            <goal>generate-webhelp</goal>
+                        </goals>
+                        <phase>generate-sources</phase>
+                        <configuration>
+                            <!-- These parameters only apply to webhelp -->
+                            <enableDisqus>${comments.enabled}</enableDisqus>
+                            <disqusShortname>LoPAR-Virtualization</disqusShortname>
+                            <enableGoogleAnalytics>1</enableGoogleAnalytics>
+                            <googleAnalyticsId>UA-17511903-1</googleAnalyticsId>
+                            <generateToc>
+                                appendix		toc,title
+                                article/appendix	nop
+                                article   		toc,title
+                                book      		toc,title,figure,table,example,equation
+                        				book/appendix		nop
+                                book/chapter		nop
+                                chapter   		toc,title
+                        				chapter/section 	nop
+                                section   		toc
+                                part      		toc,title
+                                qandadiv  		toc
+                                qandaset  		toc
+                                reference 		toc,title
+                                set       		toc,title
+                            </generateToc>
+                            <!-- The following elements sets the autonumbering of sections in output for chapter numbers but no numbered sections-->
+                            <sectionAutolabel>1</sectionAutolabel>
+                            <tocSectionDepth>3</tocSectionDepth>
+                            <sectionLabelIncludesComponentLabel>1</sectionLabelIncludesComponentLabel>
+
+                            <!-- TODO: Rename the webhelpDirname field to the new directory for new document -->
+                            <webhelpDirname>Vector-Intrinsics</webhelpDirname>
+
+                            <!-- TODO: Rename the pdfFilenameBase field to the PDF name for new document -->
+                            <pdfFilenameBase>Vector-Intrinsics</pdfFilenameBase>
+
+                            <!-- TODO: Define the appropriate work product type.  These values are defined by the IPR Policy.
+                                       Consult with the Work Group Chair or a Technical Steering Committee member if you have
+                                       questions about which value to select.
+                                       
+                                       If no value is provided below, the document will default to "Work Group Notes".-->
+                            <workProduct>workgroupNotes</workProduct>
+                            <!--workProduct>workgroupSpecification</workProduct-->
+                            <!-- workProduct>candidateStandard</workProduct -->
+                            <!-- workProduct>openpowerStandard</workProduct -->
+
+                            <!-- TODO: Set the appropriate security policy for the document.  For documents
+                                       which are not "public" this will affect the document title page and
+                                       create a vertical running ribbon on the internal margin of the
+                                       security status in all CAPS.  Values and definitions are formally
+                                       defined by the IPR policy.  A layman's definition follows:
+
+                                       public =	                this document may be shared outside the
+                                                                foundation and thus this setting must be
+                                                                used only when completely sure it allowed
+                                       foundationConfidential = this document may be shared freely with
+                                                                OpenPOWER Foundation members but may not be
+                                                                shared publicly
+                                       workgroupConfidential =  this document may only be shared within the
+                                                                work group and should not be shared with
+                                                                other Foundation members or the public
+
+                                       The appropriate starting security for a new document is "workgroupConfidential". -->
+                            <!--security>workgroupConfidential</security -->
+                            <!-- security>foundationConfidential</security -->
+                            <security>public</security>
+
+                            <!-- TODO: Set the appropriate work flow status for the document.  For documents
+                                       which are not "published" this will affect the document title page
+                                       and create a vertical running ribbon on the internal margin of the 
+                                       security status in all CAPS.  Values and definitions are formally
+                                       defined by the IPR policy.  A layman's definition follows:
+
+                                           published = this document has completed all reviews and has
+                                                       been published
+                                           draft =     this document is actively being updated and has
+                                                       not yet been reviewed
+                                           review =    this document is presently being reviewed
+
+                                       The appropriate starting security for a new document is "draft". -->
+                            <documentStatus>draft</documentStatus>
+                            <!-- documentStatus>review</documentStatus -->
+                            <!-- documentStatus>publish</documentStatus -->                            
+
+                        </configuration>
+                    </execution>
+                </executions>
+                <configuration>
+                    <!-- These parameters apply to pdf and webhelp -->
+                    <xincludeSupported>true</xincludeSupported>
+                    <sourceDirectory>.</sourceDirectory>
+                    <includes>
+                       <!-- TODO: If you desire, you may change the following filename to something more appropriate for the new document -->
+                       bk_main.xml
+                    </includes>
+
+                    <!-- **TODO: Set to the correct project URL.  This likely needs input from the TSC.  -->
+                    <!-- canonicalUrlBase>http://openpowerfoundation.org/docs/template-guide/content</canonicalUrlBase -->
+                    <glossaryCollection>${basedir}/../glossary/glossary-terms.xml</glossaryCollection>
+                    <includeCoverLogo>1</includeCoverLogo>
+                    <coverUrl>www.openpowerfoundation.org</coverUrl>
+                </configuration>
+            </plugin>
+        </plugins>
+    </build>
+</project>
+
--- a/Vector_Intrinsics/sec_api_implemented.xml
+++ b/Vector_Intrinsics/sec_api_implemented.xml
@ -0,0 +1,35 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_api_implemented">
+  <title>How the API is implemented</title>
+  
+  <para>One pleasant surprise is that many (at least for the older Intel) 
+  Intrinsics are implemented directly in C vector extension code and/or a simple 
+  mapping to GCC target specific builtins. </para>
+  
+  <xi:include href="sec_simple_examples.xml"/>
+  <xi:include href="sec_extra_attributes.xml"/>
+  <xi:include href="sec_how_findout.xml"/>
+  <xi:include href="sec_other_intrinsic_examples.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_crossing_lanes.xml
+++ b/Vector_Intrinsics/sec_crossing_lanes.xml
@ -0,0 +1,111 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_crossing_lanes">
+  <title>Crossing lanes</title>
+  
+  <para>We have seen that, most of the time, vector SIMD units prefer to keep 
+  computations in the same “lane” (element number) as the input elements. The 
+  only exception in the examples so far are the occasional splat (copy one 
+  element to all the other elements of the vector) operations. Splat is an 
+  example of the general category of “permute” operations (Intel would call 
+  this a “shuffle” or “blend”). Permutes selects and rearrange the 
+  elements of (usually) a concatenated pair of vectors and delivers those 
+  selected elements, in a specific order, to a result vector. The selection and 
+  order of elements in the result is controlled by a third vector, either as 3rd 
+  input vector or and immediate field of the instruction.</para>
+
+  <para>For example the Intel intrisics for 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd&amp;expand=2757,4767,409,2757">Horizontal Add / Subtract</link> 
+  added with SSE3. These instrinsics add (subtract) adjacent element pairs, across pair of 
+  input vectors, placing the sum of the adjacent elements in the result vector. 
+  For example 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&amp;expand=2757,4767,409,2757,2757">_mm_hadd_ps</link>  
+  which implments the operation on float:
+  <programlisting><![CDATA[	result[0] = __A[1] + __A[0];
+	result[1] = __A[3] + __A[2];
+	result[2] = __B[1] + __B[0];
+	result[3] = __B[3] + __B[2];]]></programlisting></para>
+
+  <para>Horizontal Add (hadd) provides an incremental vector “sum across” 
+  operation commonly needed in matrix and vector transform math. Horizontal Add 
+  is incremental as you need three hadd instructions to sum across 4 vectors of 4 
+  elements ( 7 for 8 x 8, 15 for 16 x 16, …).</para>
+ 
+  <para>The PowerISA does not have a sum-across operation for float or 
+  double. We can user the vector float add instruction after we rearrange the 
+  inputs so that element pairs line up for the horizontal add. For example we 
+  would need to permute the input vectors {1, 2, 3, 4} and {101, 102, 103, 104} 
+  into vectors {2, 4, 102, 104} and {1, 3, 101, 103} before 
+  the  <literal>vec_add</literal>. This 
+  requires two vector permutes to align the elements into the correct lanes for 
+  the vector add (to implement Horizontal Add).  </para>
+
+  <para>The PowerISA provides generalized byte-level vector permute (vperm) 
+  based a vector register pair source as input and a control vector. The control 
+  vector provides 16 indexes (0-31) to select bytes from the concatenated input 
+  vector register pair (VRA, VRB). A more specific set of permutes (pack, unpack, 
+  merge, splat) operations (across element sizes) are encoded as separate 
+  instruction opcodes or instruction immediate fields.</para>
+
+  <para>Unfortunately only the general <literal>vec_perm</literal> 
+  can provide the realignment 
+  we need the _mm_hadd_ps operation or any of the int, short variants of hadd. 
+  For example:
+  <programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_hadd_ps (__m128 __X, __m128 __Y)
+{
+    __vector unsigned char xform2 = {
+        0x00, 0x01, 0x02, 0x03,  0x08, 0x09, 0x0A, 0x0B,  
+        0x10, 0x11, 0x12, 0x13,  0x18, 0x19, 0x1A, 0x1B
+      };
+    __vector unsigned char xform1 = {
+        0x04, 0x05, 0x06, 0x07,  0x0C, 0x0D, 0x0E, 0x0F,  
+        0x14, 0x15, 0x16, 0x17,  0x1C, 0x1D, 0x1E, 0x1F
+      };
+    return (__m128) vec_add (vec_perm ((__v4sf) __X, (__v4sf) __Y, xform1),
+                             vec_perm ((__v4sf) __X, (__v4sf) __Y, xform2));
+}]]></programlisting></para>
+
+  <para>This requires two permute control vectors; one to select the even 
+  word elements across <literal>__X</literal> and <literal>__Y</literal>, 
+  and another to select the odd word elements 
+  across <literal>__X</literal> and <literal>__Y</literal>. 
+  The result of these permutes (<literal>vec_perm</literal>) are inputs to the 
+  <literal>vec_add</literal> and completes the add operation. </para>
+
+  <para>Fortunately the permute required for the double (64-bit) case (IE 
+  _mm_hadd_pd) reduces to the equivalent of <literal>vec_mergeh</literal> / 
+  <literal>vec_mergel</literal>  doubleword 
+  (which are variants of  VSX Permute Doubleword Immediate). So the 
+  implementation of _mm_hadd_pd can be simplified to this:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_hadd_pd (__m128d __X, __m128d __Y)
+{
+	return (__m128d) vec_add (vec_mergeh ((__v2df) __X, (__v2df)__Y),
+			                  vec_mergel ((__v2df) __X, (__v2df)__Y));
+}]]></programlisting></para>
+
+  <para>This eliminates the load of the control vectors required by the 
+  previous example.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_differences.xml
+++ b/Vector_Intrinsics/sec_differences.xml
@ -0,0 +1,40 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_differences">
+  <title>Profound differences</title>
+  
+  <para>We have already mentioned above a number of architectural differences 
+  that effect porting of codes containing Intel intrinsics to POWER. The fact 
+  that Intel supports multiple vector extensions with different vector widths 
+  (64, 128, 256, and 512-bits) while the PowerISA only supports vectors of 
+  128-bits is one issue. Another is the difference in how the respective ISAs 
+  support scalars in vector registers is another.  In the text above we propose 
+  workable alternatives for the PowerPC port. There also differences in the 
+  handling of floating point exceptions and rounding modes that may impact the 
+  application's performance or behavior.</para>
+  
+  <xi:include href="sec_floatingpoint_exceptions.xml"/>
+  <xi:include href="sec_floatingpoint_rounding.xml"/>
+  <xi:include href="sec_performance.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_extra_attributes.xml
+++ b/Vector_Intrinsics/sec_extra_attributes.xml
@ -0,0 +1,137 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_extra_attributes">
+  <title>Those extra attributes</title>
+  
+  <para>You may have noticed there are some special attributes:
+  
+  <literallayout>__gnu_inline__
+
+This attribute should be used with a function that is also declared with the
+inline keyword. It directs GCC to treat the function as if it were defined in
+gnu90 mode even when compiling in C99 or gnu99 mode.
+
+If the function is declared extern, then this definition of the function is used
+only for inlining. In no case is the function compiled as a standalone function,
+not even if you take its address explicitly. Such an address becomes an external
+reference, as if you had only declared the function, and had not defined it. This
+has almost the effect of a macro. The way to use this is to put a function
+definition in a header file with this attribute, and put another copy of the
+function, without extern, in a library file. The definition in the header file
+causes most calls to the function to be inlined.
+
+__always_inline__
+
+Generally, functions are not inlined unless optimization is specified. For func-
+tions declared inline, this attribute inlines the function independent of any
+restrictions that otherwise apply to inlining. Failure to inline such a function
+is diagnosed as an error.
+
+__artificial__
+
+This attribute is useful for small inline wrappers that if possible should appear
+during debugging as a unit. Depending on the debug info format it either means
+marking the function as artificial or using the caller location for all instructions
+within the inlined body.
+
+__extension__
+
+... -pedantic’ and other options cause warnings for many GNU C extensions.
+You can prevent such warnings within one expression by writing __extension__</literallayout></para>
+
+  <para>So far I have been using these attributes unchanged.</para>
+
+  <para>But most intrinsics map the Intel intrinsic to one or more target 
+  specific GCC builtins. For example:
+  <programlisting><![CDATA[/* Load two DPFP values from P.  The address must be 16-byte aligned.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_load_pd (double const *__P)
+{
+  return *(__m128d *)__P;
+}
+
+/* Load two DPFP values from P.  The address need not be 16-byte aligned.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_loadu_pd (double const *__P)
+{
+  return __builtin_ia32_loadupd (__P);
+}]]></programlisting></para>
+
+  <para>The first intrinsic (_mm_load_pd ) is implement as a C vector pointer 
+  reference, but from the comment assumes the compiler will use a 
+  <emphasis role="bold">movapd</emphasis>
+  instruction that requires 16-byte alignment (will raise a general-protection 
+  exception if not aligned). This  implies that there is a performance advantage 
+  for at least some Intel processors to keep the vector aligned. The second 
+  intrinsic uses the explicit GCC builtin 
+  <emphasis role="bold"><literal>__builtin_ia32_loadupd</literal></emphasis> to generate the 
+  <emphasis role="bold"><literal>movupd</literal></emphasis> instruction which handles unaligned references.</para>
+
+  <para>The opposite assumption applies to POWER and PPC64LE, where GCC 
+  generates the VSX <emphasis role="bold"><literal>lxvd2x</literal></emphasis> / 
+  <emphasis role="bold"><literal>xxswapd</literal></emphasis> 
+  instruction sequence by default, which 
+  allows unaligned references. The PowerISA equivalent for aligned vector access 
+  is the VMX <emphasis role="bold"><literal>lvx</literal></emphasis> instruction and the 
+  <emphasis role="bold"><literal>vec_ld</literal></emphasis> builtin, which forces quadword 
+  aligned access (by ignoring the low order 4 bits of the effective address). The 
+  <emphasis role="bold"><literal>lvx</literal></emphasis> instruction does not raise 
+  alignment exceptions, but perhaps should as part 
+  of our implementation of the Intel intrinsic. This requires that we use 
+  PowerISA VMX/VSX built-ins to insure we get the expected results.</para>
+
+  <para>The current prototype defines the following:
+  <programlisting><![CDATA[/* Load two DPFP values from P.  The address must be 16-byte aligned.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_load_pd (double const *__P)
+{
+  assert(((unsigned long)__P & 0xfUL) == 0UL);
+  return ((__m128d)vec_ld(0, (__v16qu*)__P));
+}
+
+/* Load two DPFP values from P.  The address need not be 16-byte aligned.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_loadu_pd (double const *__P)
+{
+  return (vec_vsx_ld(0, __P));
+}]]></programlisting></para>
+
+  <para>The aligned  load intrinsic adds an assert which checks alignment 
+  (to match the Intel semantic) and uses  the GCC builtin 
+  <emphasis role="bold"><literal>vec_ld</literal></emphasis> (generates an 
+  <emphasis role="bold"><literal>lvx</literal></emphasis>).  The assert 
+  generates extra code but this can be eliminated by defining 
+  <emphasis role="bold"><literal>NDEBUG</literal></emphasis> at compile time. 
+  The unaligned load intrinsic uses the GCC builtin 
+  <literal>vec_vsx_ld</literal>  (for PPC64LE generates 
+  <emphasis role="bold"><literal>lxvd2x</literal></emphasis> / 
+  <emphasis role="bold"><literal>xxswapd</literal></emphasis> for POWER8  and will 
+  simplify to <emphasis role="bold"><literal>lxv</literal></emphasis> 
+  or <emphasis role="bold"><literal>lxvx</literal></emphasis> 
+  for POWER9).  And similarly for <emphasis role="bold"><literal>__mm_store_pd</literal></emphasis> / 
+  <emphasis role="bold"><literal>__mm_storeu_pd</literal></emphasis>, using 
+  <emphasis role="bold"><literal>vec_st</literal></emphasis>
+  and <emphasis role="bold"><literal>vec_vsx_st</literal></emphasis>. These concepts extent to the 
+  load/store intrinsics for vector float and vector int.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_floatingpoint_exceptions.xml
+++ b/Vector_Intrinsics/sec_floatingpoint_exceptions.xml
@ -0,0 +1,73 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_floatingpoint_exceptions">
+  <title>Floating Point Exceptions</title>
+  
+  <para>Nominally both ISAs support the IEEE754 specifications, but there are 
+  some subtle differences. Both architecture define a status and control register 
+  to record exceptions and enable / disable floating exceptions for program 
+  interrupt or default action. Intel has a MXCSR and PowerISA has a FPSCR which 
+  basically do the same thing but with different bit layout. </para>
+
+  <para>Intel provides <literal>_mm_setcsr</literal> / <literal>_mm_getcsr</literal> 
+  intrinsics to allow direct 
+  access to the MXCSR. In the early days before the OS POSIX run-times where 
+  updated  to manage the MXCSR, this might have been useful. Today this would be 
+  highly discouraged with a strong preference to use the POSIX APIs 
+  (<literal>feclearexceptflag</literal>, 
+  <literal>fegetexceptflag</literal>, 
+  <literal>fesetexceptflag</literal>, ...) instead.</para>
+
+  <para>If we implement <literal>_mm_setcsr</literal> / 
+  <literal>_mm_getcs</literal> at all, we should simply 
+  redirect the implementation to use the POSIX APIs from 
+  <literal>&lt;fenv.h&gt;</literal>. But it 
+  might be simpler just to replace these intrinsics with macros that generate 
+  #error.</para>
+
+  <para>The Intel MXCSR does have some none (POSIX/IEEE754) standard quirks; 
+  Flush-To-Zero and Denormals-Are-Zeros flags. This simplifies the hardware 
+  response to what should be a rare condition (underflows where the result can 
+  not be represented in the exponent range and precision of the format) by simply 
+  returning a signed 0.0 value. The intrinsic header implementation does provide 
+  constant masks for <literal>_MM_DENORMALS_ZERO_ON</literal> 
+  (<literal>&lt;pmmintrin.h&gt;</literal>) and 
+  <literal>_MM_FLUSH_ZERO_ON</literal> (<literal>&lt;xmmintrin.h&gt;</literal>, 
+  so technically it is available to users 
+  of the Intel Intrinsics API.</para>
+
+  <para>The VMX Vector facility provides a separate Vector Status and Control 
+  register (VSCR) with a Non-Java Mode control bit. This control combines the 
+  flush-to-zero semantics for floating Point underflow and denormal values. But 
+  this control only applies to VMX vector float instructions and does not apply 
+  to VSX scalar floating Point or vector double instructions. The FPSCR does 
+  define a Floating-Point non-IEEE mode which is optional in the architecture. 
+  This would apply to Scalar and VSX floating-point operations if it was 
+  implemented. This was largely intended for embedded processors and is not 
+  implemented in the POWER processor line.</para>
+
+  <para>As the flush-to-zero is primarily a performance enhansement and is 
+  clearly outside the IEEE754 standard, it may be best to simply ignore this 
+  option for the intrinsic port.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_floatingpoint_rounding.xml
+++ b/Vector_Intrinsics/sec_floatingpoint_rounding.xml
@ -0,0 +1,33 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_floatingpoint_rounding">
+  <title>Floating-point rounding modes</title>
+  
+  <para>The Intel (x86 / x86_64) and PowerISA architectures both support the 
+  4 IEEE754 rounding modes. Again while the Intel Intrinsic API allows the 
+  application to change rounding modes via updates to the 
+  <literal>MXCSR</literal> it is a bad idea 
+  and should be replaced with the POSIX APIs (<literal>fegetround</literal> and 
+  <literal>fesetround</literal>). </para>
+  
+</section>
+
--- a/Vector_Intrinsics/sec_gcc_vector_extensions.xml
+++ b/Vector_Intrinsics/sec_gcc_vector_extensions.xml
@ -0,0 +1,113 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_gcc_vector_extensions">
+  <title>GCC Vector Extensions</title>
+  
+  <para>The GCC vector extensions are common syntax but implemented in a 
+  target specific way. Using the C vector extensions require the 
+  <literal>__gnu_inline__</literal> 
+  attribute to avoid syntax errors in case the user specified  C standard 
+  compliance (<literal>-std=c90</literal>, <literal>-std=c11</literal>, 
+  etc) that would normally disallow such 
+  extensions. </para>
+
+  <para>The GCC implementation for PowerPC64 Little Endian is (mostly) 
+  functionally compatible with x86_64 vector extension usage. We can use the same 
+  type definitions (at least for  vector_size (16)), operations, syntax 
+  <literal>&lt;</literal><emphasis role="bold"><literal>{</literal></emphasis><literal>...</literal><emphasis role="bold"><literal>}</literal></emphasis><literal>&gt;</literal> 
+  for vector initializers and constants, and array syntax 
+  <literal>&lt;</literal><emphasis role="bold"><literal>[]</literal></emphasis><literal>&gt;</literal> 
+  for vector element access. So simple arithmetic / logical operations 
+  on whole vectors should work as is. </para>
+
+  <para>The caveat is that the interface data type of the Intel Intrinsic may 
+  not match the data types of the operation, so it may be necessary to cast the 
+  operands to the specific type for the operation. This also applies to vector 
+  initializers and accessing vector elements. You need to use the appropriate 
+  type to get the expected results. Of course this applies to X86_64 as well. For 
+  example:
+  <programlisting><![CDATA[/* Perform the respective operation on the four SPFP values in A and B.  */
+extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_add_ps (__m128 __A, __m128 __B)
+{
+  return (__m128) ((__v4sf)__A + (__v4sf)__B);
+}
+
+/* Stores the lower SPFP value.  */
+extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_store_ss (float *__P, __m128 __A)
+{
+  *__P = ((__v4sf)__A)[0];
+}]]></programlisting></para>
+
+  <para>Note the cast from the interface type (<literal>__m128</literal>} to the implementation 
+  type (<literal>__v4sf</literal>, defined in the intrinsic header) for the vector float add (+) 
+  operation. This is enough for the compiler to select the appropriate vector add 
+  instruction for the float type. Then the result (which is 
+  <literal>__v4sf</literal>) needs to be 
+  cast back to the expected interface type (<literal>__m128</literal>). </para>
+
+  <para>Note also the use of <emphasis>array syntax</emphasis> (<literal>__A)[0]</literal>)
+   to extract the lowest 
+  (left most<footnote><para>Here we are using logical left and logical right 
+  which will not match the PowerISA register view in Little endian. Logical left 
+  is the left most element for initializers {left, … , right}, storage order 
+  and array  order where the left most element is [0].</para></footnote>) 
+  element of a vector. The cast (<literal>__v4sf</literal>) insures that the compiler knows we are 
+  extracting the left most 32-bit float. The compiler insures the code generated 
+  matches the Intel behavior for PowerPC64 Little Endian. </para>
+
+  <para>The code generation is complicated by the fact that PowerISA vector 
+  registers are Big Endian (element 0 is the left most word of the vector) and 
+  X86 scalar stores are from the left most (work/dword) for the vector register. 
+  Application code with extensive use of scalar (vs packed) intrinsic loads / 
+  stores should be flagged for rewrite to native PPC code using exisiing scalar 
+  types (float, double, int, long, etc.). </para>
+
+  <para>Another example is the set reverse order:
+  <programlisting><![CDATA[/* Create the vector [Z Y X W].  */
+extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_set_ps (const float __Z, const float __Y, const float __X, const float __W)
+{
+  return __extension__ (__m128)(__v4sf){ __W, __X, __Y, __Z };
+}
+
+/* Create the vector [W X Y Z].  */
+extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_setr_ps (float __Z, float __Y, float __X, float __W)
+{
+  return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
+}]]></programlisting></para>
+
+  <para>Note the use of <emphasis>initializer syntax</emphasis> used to collect a set of scalars 
+  into a vector. Code with constant initializer values will generate a vector 
+  constant of the appropriate endian. However code with variables in the 
+  initializer can get complicated as it often requires transfers between register 
+  sets and perhaps format conversions. We can assume that the compiler will 
+  generate the correct code, but if this class of intrinsics shows up a hot spot, 
+  a rewrite to native PPC vector built-ins may be appropriate. For example 
+  initializer of a variable replicated to all the vector fields might not be 
+  recognized as a “load and splat” and making this explicit may help the 
+  compiler generate better code.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_handling_avx.xml
+++ b/Vector_Intrinsics/sec_handling_avx.xml
@ -0,0 +1,91 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_handling_avx">
+  <title>Dealing with AVX and AVX512</title>
+  
+  <para>AVX is a bit easier for PowerISA and the ELF V2 ABI. First we have 
+  lots (64) of vector registers and a super scalar vector pipe-line (can execute 
+  two or more independent 128-bit vector operations concurrently). Second the ELF 
+  V2 ABI was designed to pass and return larger aggregates in vector 
+  registers:</para>
+
+  <itemizedlist>
+    <listitem>
+      <para>Up to 12 qualified vector arguments can be passed in 
+      v2–v13.</para>
+    </listitem>
+    <listitem>
+      <para>A qualified vector argument corresponds to:
+        <itemizedlist>
+          <listitem>
+            <para>A vector data type</para>
+          </listitem>
+
+          <listitem>
+            <para>A member of a homogeneous aggregate of multiple like data types 
+            passed in up to eight vector registers.</para>
+          </listitem>
+
+          <listitem>
+            <para>Homogeneous floating-point or vector aggregate return values 
+            that consist of up to eight registers with up to eight elements will 
+            be returned in floating-point or vector registers that correspond to 
+            the parameter registers that would be used if the return value type 
+            were the first input parameter to a function.</para>
+          </listitem>
+        </itemizedlist>
+      </para>
+    </listitem>
+  </itemizedlist>
+
+  <para>So the ABI allows for passing up to three structures each 
+  representing 512-bit vectors and returning such (512-bit) structure all in VMX 
+  registers. This can be extended further by spilling parameters (beyond 12 X 
+  128-bit vectors) to the parameter save area, but we should not need that, as 
+  most intrinsics only use 2 or 3 operands.. Vector registers not needed for 
+  parameter passing, along with an additional 8 volatile vector registers, are 
+  available for scratch and local variables. All can be used by the application 
+  without requiring register spill to the save area. So most intrinsic operations 
+  on 256- or 512-bit vectors can be held within existing PowerISA vector 
+  registers. </para>
+
+  <para>For larger functions that might use multiple AVX 256 or 512-bit 
+  intrinsics and, as a result, push beyond the 20 volatile vector registers, the 
+  compiler will just allocate non-volatile vector registers by allocating a stack 
+  frame and spilling non-volatile vector registers to the save area (as needed in 
+  the function prologue). This frees up to 64 vectors (32 x 256-bit or 16 x 
+  512-bit structs) for code optimization. </para>
+
+  <para>Based on the specifics of our ISA and ABI we will not not use 
+  <literal>__vector_size__</literal> (32) or (64) in the PowerPC implementation of 
+  <literal>__m256</literal> and <literal>__m512</literal> 
+  types. Instead we will typedef structs of 2 or 4 vector (<literal>__m128</literal>) fields. This 
+  allows efficient handling of these larger data types without require new GCC 
+  language extensions. </para>
+
+  <para>In the end we should use the same type names and definitions as the 
+  GCC X86 intrinsic headers where possible. Where that is not possible we can 
+  define new typedefs that provide the best mapping to the underlying PowerISA 
+  hardware.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_handling_mmx.xml
+++ b/Vector_Intrinsics/sec_handling_mmx.xml
@ -0,0 +1,72 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_handling_mmx">
+  <title>Dealing with MMX</title>
+  
+  <para>MMX is actually the hard case. The <literal>__m64</literal>
+  type supports SIMD vector 
+  int types (char, short, int, long).  The  Intel API defines  
+  <literal>__m64</literal> as:
+  <programlisting><![CDATA[typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));]]></programlisting></para>
+
+  <para>Which is problematic for the PowerPC target (not really supported in 
+  GCC) and we would prefer to use a native PowerISA type that can be passed in a 
+  single register.  The PowerISA Rotate Under Mask instructions can easily 
+  extract and insert integer fields of a General Purpose Register (GPR). This 
+  implies that MMX integer types can be handled as a internal union of arrays for 
+  the supported element types. So an 64-bit unsigned long long is the best type 
+  for parameter passing and return values. Especially for the 64-bit (_si64) 
+  operations as these normally generate a single PowerISA instruction.</para>
+
+  <para>The SSE extensions include some convert operations for 
+  <literal>_m128</literal> to / 
+  from <literal>_m64</literal> and this includes some int to / from float conversions. However in 
+  these cases the float operands always reside in SSE (XMM) registers (which 
+  match the PowerISA vector registers) and the MMX registers only contain integer 
+  values. POWER8 (PowerISA-2.07) has direct move instructions between GPRs and 
+  VSRs. So these transfers are normally a single instruction and any conversions 
+  can be handed in the vector unit.</para>
+
+  <para>When transferring a <literal>__m64</literal> value to a vector register we should also 
+  execute a xxsplatd instruction to insure there is valid data in all four 
+  element lanes before doing floating point operations. This avoids generating 
+  extraneous floating point exceptions that might be generated by uninitialized 
+  parts of the vector. The top two lanes will have the floating point results 
+  that are in position for direct transfer to a GPR or stored via Store Float 
+  Double (stfd). These operation are internal to the intrinsic implementation and 
+  there is no requirement to keep temporary vectors in correct Little Endian 
+  form.</para>
+
+  <para>Also for the smaller element sizes and higher element counts (MMX 
+  <literal>_pi8</literal> and <literal>_p16</literal> types) the number of  Rotate Under Mask instructions required to 
+  disassemble the 64-bit <literal>__m64</literal> 
+  into elements, perform the element calculations, 
+  and reassemble the elements in a single <literal>__m64</literal> 
+  value can get larger. In this 
+  case we can generate shorter instruction sequences by transfering (via direct 
+  move instruction) the GPR <literal>__m64</literal> value to the 
+  a vector register, performance the 
+  SIMD operation there, then transfer the <literal>__m64</literal> 
+  result back to a GPR.</para>
+  
+</section>
+
--- a/Vector_Intrinsics/sec_how_findout.xml
+++ b/Vector_Intrinsics/sec_how_findout.xml
@ -0,0 +1,60 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_how_findout">
+  <title>How did I find this out?</title>
+  
+  <para>The next question is where did I get the details above. The GCC 
+  documentation for <emphasis role="bold"><literal>__builtin_ia32_loadupd</literal></emphasis>
+  provides minimal information (the 
+  builtin name, parameters and return types). Not very informative. </para>
+
+  <para>Looking up the Intel intrinsic description is more informative. You 
+  can Google the intrinsic name or use the 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/">Intel 
+  Intrinsic guide</link> for this. The Intrinsic Guide is interactive and 
+  includes  Intel (Chip) technology and text based search capabilities. Clicking 
+  on the intrinsic name opens to a synopsis including; the underlying instruction 
+  name, text description, operation pseudo code, and in some cases performance 
+  information (latency and throughput).</para>
+
+  <para>The key is to get a description of the intrinsic (operand fields and 
+  types, and which fields are updated for the result) and the underlying Intel 
+  instruction. If the Intrinsic guide is not clear you can look up the 
+  instruction details in the 
+  “<link xlink:href="https://software.intel.com/en-us/articles/intel-sdm">Intel® 64 and IA-32 
+  Architectures Software Developer’s Manual</link>”.</para>
+
+  <para>Information about the PowerISA vector facilities is found in the 
+  <link xlink:href="https://openpowerfoundation.org/?resource_lib=ibm-power-isa-version-2-07-b">PowerISA Version 2.07B</link> (for POWER8 and 
+  <link xlink:href="https://www.docdroid.net/tWT7hjD/powerisa-v30.pdf.html">3.0 for 
+  POWER9</link>) manual, Book I, Chapter 6. Vector Facility and Chapter 7. 
+  Vector-Scalar Floating-Point Operations. Another good reference is the 
+  <link xlink:href="https://openpowerfoundation.org/technical/technical-resources/technical-specifications/">OpenPOWER ELF V2 application binary interface</link> (ABI) 
+  document, Chapter 6. Vector Programming Interfaces and Appendix A. Predefined 
+  Functions for Vector Programming.</para>
+
+  <para>Another useful document is the original <link xlink:href="http://www.nxp.com/assets/documents/data/en/reference-manuals/ALTIVECPEM.pdf">Altivec Technology Programers Interface Manual</link> 
+  with a  user friendly structure and many helpful diagrams. But alas the PIM does does not 
+  cover the resent PowerISA (power7,  power8, and power9) enhancements.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_intel_intrinsic_functions.xml
+++ b/Vector_Intrinsics/sec_intel_intrinsic_functions.xml
@ -0,0 +1,122 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_intel_intrinsic_functions">
+  <title>Intel Intrinsic functions</title>
+  
+  <para>So what is an intrinsic function? From Wikipedia:
+
+    <blockquote><para>In <link xlink:href="https://en.wikipedia.org/wiki/Compiler_theory">compiler theory</link>, an 
+    <emphasis role="bold">intrinsic function</emphasis> is a function available for use in a given 
+    <link xlink:href="https://en.wikipedia.org/wiki/Programming_language">programming 
+    language</link> whose implementation is handled specially by the compiler. 
+    Typically, it substitutes a sequence of automatically generated instructions 
+    for the original function call, similar to an 
+    <link xlink:href="https://en.wikipedia.org/wiki/Inline_function">inline function</link>. 
+    Unlike an inline function though, the compiler has an intimate knowledge of the 
+    intrinsic function and can therefore better integrate it and optimize it for 
+    the situation. This is also called builtin function in many languages.</para></blockquote></para>
+
+  <para>The “Intel Intrinsics” API provides access to the many 
+  instruction set extensions (Intel Technologies) that Intel has added (and 
+  continues to add) over the years. The intrinsics provided access to new 
+  instruction capabilities before the compilers could exploit them directly. 
+  Initially these intrinsic functions where defined for the Intel and Microsoft 
+  compiler and where eventually implemented and contributed to GCC.</para>
+
+  <para>The Intel Intrinsics have a specific type and naming structure. In 
+  this naming structure, functions starts with a common prefix (MMX and SSE use 
+  '_mm' prefix, while AVX added the '_mm256' '_mm512' prefixes), then a short 
+  functional name ('set', 'load', 'store', 'add', 'mul', 'blend', 'shuffle', '…') and a suffix 
+  ('_pd', '_sd', '_pi32'...) with type and packing information. See 
+  <xref linkend="app_intel_suffixes"/> for the list of common intrisic suffixes.</para>
+
+  <para>Oddly many of the MMX/SSE operations are not vectors at all. There 
+  are a lot of scalar operations on a single float, double, or long long type. In 
+  effect these are scalars that can take advantage of the larger (xmm) register 
+  space. Also in the Intel 32-bit architecture they provided IEEE754 float and 
+  double types, and 64-bit integers that did not exist or where hard to implement 
+  in the base i386/387 instruction set. These scalar operation use a suffix 
+  starting with '_s' (<literal>_sd</literal> for scalar double float, 
+  <literal>_ss</literal> scalar float, and <literal>_si64</literal> 
+  for scalar long long).</para>
+
+  <para>True vector operations use the packed or extended packed suffixes, 
+  starting with '_p' or '_ep' (<literal>_pd</literal> for vector double, 
+  <literal>_ps</literal> for vector float, and 
+  <literal>_epi32</literal> for vector int). The use of '_ep'  
+  seems to be reserved to disambiguate 
+  intrinsics that existed in the (64-bit vector) MMX extension from the extended 
+  (128-bit vector) SSE equivalent. For example 
+  <emphasis role="bold"><literal>_mm_add_pi32</literal></emphasis> is a MMX operation on 
+  a pair of 32-bit integers, while 
+  <emphasis role="bold"><literal>_mm_add_epi32</literal></emphasis> is an SSE2 operation on vector 
+  of 4 32-bit integers.</para>
+
+  <para>The GCC  builtins for the 
+  <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions">i386.target</link>, 
+  (includes x86 and x86_64) are not 
+  the same as the Intel Intrinsics. While they have similar intent and cover most 
+  of the same functions, they use a different naming (prefixed with 
+  <literal>__builtin_ia32_</literal>, then function name with type suffix) and uses GCC vector type 
+  modes for operand types. For example:
+  <programlisting><![CDATA[v8qi __builtin_ia32_paddb (v8qi, v8qi)
+v4hi __builtin_ia32_paddw (v4hi, v4hi)
+v2si __builtin_ia32_paddd (v2si, v2si)
+v2di __builtin_ia32_paddq (v2di, v2di)]]></programlisting></para>
+
+  <para>Note: A key difference between GCC builtins for i386 and Powerpc is 
+  that the x86 builtins have different names of each operation and type while the 
+  powerpc altivec builtins tend to have a single generatic builtin for  each 
+  operation, across a set of compatible operand types. </para>
+
+  <para>In GCC the Intel Intrinsic header (*intrin.h) files are implemented 
+  as a set of inline functions using the Intel Intrinsic API names and types. 
+  These functions are implemented as either GCC C vector extension code or via 
+  one or more GCC builtins for the i386 target. So lets take a look at some 
+  examples from GCC's SSE2 intrinsic header emmintrin.h:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_add_pd (__m128d __A, __m128d __B)
+{
+  return (__m128d) ((__v2df)__A + (__v2df)__B);
+}
+
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_add_sd (__m128d __A, __m128d __B)
+{
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+}]]></programlisting></para>
+
+  <para>Note that the  
+  <emphasis role="bold"><literal>_mm_add_pd</literal></emphasis> is implemented direct as C vector 
+  extension code., while 
+  <emphasis role="bold"><literal>_mm_add_sd</literal></emphasis> is implemented via the GCC builtin 
+  <emphasis role="bold"><literal>__builtin_ia32_addsd</literal></emphasis>. From the 
+  discussion above we know the <literal>_pd</literal> suffix 
+  indicates a packed vector double while the <literal>_sd</literal> suffix indicates a scalar double 
+  in a XMM register. </para>
+  
+  <xi:include href="sec_packed_vs_scalar_intrinsics.xml"/>
+  <xi:include href="sec_vec_or_not.xml"/>
+  <xi:include href="sec_crossing_lanes.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_intel_intrinsic_includes.xml
+++ b/Vector_Intrinsics/sec_intel_intrinsic_includes.xml
@ -0,0 +1,82 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_intel_intrinsic_includes">
+  <title>The structure of the intrinsic includes</title>
+  
+  <para>The GCC x86 intrinsic functions for vector were initially grouped by 
+  technology (MMX and SSE), which starts with MMX continues with SSE through 
+  SSE4.1 stacked like a set of Russian dolls.</para>
+
+  <para>Basically each higher layer include, needs typedefs and helper macros 
+  defined by the lower level intrinsic includes. mm_malloc.h simply provides 
+  wrappers for posix_memalign and free. Then it gets a little weird, starting 
+  with the crypto extensions:
+  
+  <programlisting><![CDATA[wmmintrin.h  (AES)	includes emmintrin.h]]></programlisting></para>
+  
+  <para>For AVX, AVX2, and AVX512 they must have decided 
+  that the Russian Dolls thing was getting out of hand. AVX et all is split 
+  across 14 files
+  
+  <programlisting><![CDATA[#include <avxintrin.h>
+#include <avx2intrin.h>
+#include <avx512fintrin.h>
+#include <avx512erintrin.h>
+#include <avx512pfintrin.h>
+#include <avx512cdintrin.h>
+#include <avx512vlintrin.h>
+#include <avx512bwintrin.h>
+#include <avx512dqintrin.h>
+#include <avx512vlbwintrin.h>
+#include <avx512vldqintrin.h>
+#include <avx512ifmaintrin.h>
+#include <avx512ifmavlintrin.h>
+#include <avx512vbmiintrin.h>
+#include <avx512vbmivlintrin.h>]]></programlisting>
+  
+  but they do not want the applications include these 
+  individually.</para>
+  
+  <para>So <emphasis role="bold">immintrin.h</emphasis> includes everything Intel vector, include all the 
+  AVX, AES, SSE and MMX flavors. 
+  <programlisting><![CDATA[#ifndef _IMMINTRIN_H_INCLUDED
+# error "Never use <avxintrin.h> directly; include <immintrin.h> instead."
+#endif]]></programlisting></para>
+
+  <para>So what is the net? The include structure provides some strong clues 
+  about the order that we should approach this effort.  For example if you need 
+  to intrinsic from SSE4 (smmintrin.h) we are likely to need to type definitions 
+  from SSE (emmintrin.h). So a bottoms up (MMX, SSE, SSE2, …) approach seems 
+  like the best plan of attack. Also saving the AVX parts for latter make sense, 
+  as most are just wider forms of operations that already exists in SSE.</para>
+
+  <para>We should use the same include structure to implement our PowerISA 
+  equivalent API headers. This will make porting easier (drop-in replacement) and 
+  should get the application running quickly on POWER. Then we are in a position 
+  to profile and analyze the resulting application. This will show any hot spots 
+  where the simple one-to-one transformation results in bottlenecks and 
+  additional tuning is needed. For these cases we should improve our tools (SDK 
+  MA/SCA) to identify opportunities for, and perhaps propose, alternative 
+  sequences that are better tuned to PowerISA and our micro-architecture.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_intel_intrinsic_types.xml
+++ b/Vector_Intrinsics/sec_intel_intrinsic_types.xml
@ -0,0 +1,89 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_intel_intrinsic_types">
+  <title>The types used for intrinsics</title>
+  
+  <para>The type system for Intel intrinsics is a little strange. For example 
+  from xmmintrin.h:
+  <programlisting><![CDATA[/* The Intel API is flexible enough that we must allow aliasing with other
+   vector types, and their scalar components.  */
+typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));
+
+/* Internal data types for implementing the intrinsics.  */
+typedef float __v4sf __attribute__ ((__vector_size__ (16)));]]></programlisting></para>
+
+  <para>So there is one set of types that are used in the function prototypes 
+  of the API, and the internal types that are used in the implementation. Notice 
+  the special attribute <literal>__may_alias__</literal>. From the GCC documentation:
+
+    <blockquote><para>
+    Accesses through pointers to types with this attribute are not subject 
+    to type-based alias analysis, but are instead assumed to be able to alias any 
+    other type of objects. ... This extension exists to support some vector APIs, 
+    in which pointers to one vector type are permitted to alias pointers to a 
+    different vector type.</para></blockquote></para>
+  
+  <para>So there are a 
+  couple of issues here: 1)  the API seem to force the compiler to assume 
+  aliasing of any parameter passed by reference. Normally the compiler assumes 
+  that parameters of different size do not overlap in storage, which allows more 
+  optimization. 2) the data type used at the interface may not be the correct 
+  type for the implied operation. So parameters of type 
+  <literal>__m128i</literal> (which is defined 
+  as vector long long) is also used for parameters and return values of vector 
+  [char | short | int ]. </para>
+
+  <para>This may not matter when using x86 built-in's but does matter when 
+  the implementation uses C vector extensions or in our case use PowerPC generic 
+  vector built-ins 
+  (<xref linkend="sec_powerisa_vector_intrinsics"/>). 
+  For the later cases the type must be correct for 
+  the compiler to generate the correct type (char, short, int, long) 
+  (<xref linkend="sec_api_implemented"/>) for the generic 
+  builtin operation. There is also concern that excessive use of 
+  <literal>__may_alias__</literal>
+  will limit compiler optimization. We are not sure how important this attribute 
+  is to the correct operation of the API.  So at a later stage we should 
+  experiment with removing it from our implementation for PowerPC</para>
+
+  <para>The good news is that PowerISA has good support for 128-bit vectors 
+  and (with the addition of VSX) all the required vector data (char, short, int, 
+  long, float, double) types. However Intel supports a wider variety of the 
+  vector sizes  than PowerISA does. This started with the 64-bit MMX vector 
+  support that preceded SSE and extends to 256-bit and 512-bit vectors of AVX, 
+  AVX2, and AVX512 that followed SSE.</para>
+
+  <para>Within the GCC Intel intrinsic implementation these are all 
+  implemented as vector attribute extensions of the appropriate  size (   
+  <literal>__vector_size__</literal> ({8 | 16 | 32, and 64}). For the PowerPC target  GCC currently 
+  only supports the native <literal>__vector_size__</literal> ( 16 ). These we can support directly 
+  in VMX/VSX registers and associated instructions. The GCC will compile with 
+  other   <literal>__vector_size__</literal> values, but the resulting types are treated as simple 
+  arrays of the element type. This does not allow the compiler to use the vector 
+  registers and vector instructions for these (nonnative) vectors.   So what is 
+  a programmer to do?</para>
+    
+  <xi:include href="sec_handling_mmx.xml"/>
+  <xi:include href="sec_handling_avx.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_more_examples.xml
+++ b/Vector_Intrinsics/sec_more_examples.xml
@ -0,0 +1,76 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_more_examples">
+  <title>Some more intrinsic examples</title>
+  
+  <para>The intrinsic 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvtpd_ps&amp;expand=1624">_mm_cvtpd_ps</link> 
+  converts a packed vector double into 
+  a packed vector single float. Since only 2 doubles fit into a 128-bit vector 
+  only 2 floats are returned and occupy only half (64-bits) of the XMM register. 
+  For this intrinsic the 64-bit are packed into the logical left half of the 
+  registers and the logical right half of the register is set to zero (as per the 
+  Intel <literal>cvtpd2ps</literal> instruction).</para>
+
+  <para>The PowerISA provides the VSX Vector round and Convert 
+  Double-Precision to Single-Precision format (xvcvdpsp) instruction. In the ABI 
+  this is <literal>vec_floato</literal> (vector double) .  
+  This instruction convert each double 
+  element then transfers converted element 0 to float element 1, and converted 
+  element 1 to float element 3. Float elements 0 and 2 are undefined (the 
+  hardware can do what ever). This does not match the expected results for 
+  <literal>_mm_cvtpd_ps</literal>.
+  <programlisting><![CDATA[vec_floato ({1.0, 2.0}) result = {<undefined>, 1.0, <undefined>, 2.0}
+_mm_cvtpd_ps  ({1.0, 2.0}) result = {1.0, 2.0, 0.0, 0.0}]]></programlisting></para>
+
+  <para>So we need to re-position the results to word elements 0 and 2, which 
+  allows a pack operation to deliver the correct format. Here the merge odd 
+  splats element 1 to 0 and element 3 to 2. The Pack operation combines the low 
+  half of each doubleword from the vector result and vector of zeros to generate 
+  the require format.
+  <programlisting><![CDATA[extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cvtpd_ps (__m128d __A)
+{
+	__v4sf result;
+	__v4si temp;
+	const __v4si vzero = {0,0,0,0};
+//	temp = (__v4si)vec_floato (__A); /* GCC 8 */
+
+    __asm__(
+		"xvcvdpsp %x0,%x1;\n"
+		: "=wa" (temp)
+		: "wa" (__A)
+        : );
+
+	temp = vec_mergeo (temp, temp);
+	result = (__v4sf)vec_vpkudum ((vector long)temp, (vector long)vzero);
+ 	return (result);
+}]]></programlisting></para>
+
+  <para>This  technique is also used to implement  
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cvttpd_epi32&amp;expand=1624,1859">_mm_cvttpd_epi32</link> 
+  which converts a packed vector double in to a packed vector int. The PowerISA instruction 
+  <literal>xvcvdpsxws</literal> uses a similar layout for the result as 
+  <literal>xvcvdpsp</literal> and requires the same fix up.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_other_intrinsic_examples.xml
+++ b/Vector_Intrinsics/sec_other_intrinsic_examples.xml
@ -0,0 +1,68 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_other_intrinsic_examples">
+  <title>Examples implemented using other intrinsics</title>
+  
+  <para>Some intrinsic implementations are defined in terms of other 
+  intrinsics. For example.
+  <programlisting><![CDATA[/* Create a vector with element [0] as F and the rest zero.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_set_sd (double __F)
+{
+  return __extension__ (__m128d){ __F, 0.0 };
+}
+
+/* Create a vector with element [0] as *P and the rest zero.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_load_sd (double const *__P)
+{
+  return _mm_set_sd (*__P);
+}]]></programlisting></para>
+
+  <para>This notion of using part (one fourth or half) of the SSE XMM 
+  register and leaving the rest unchanged (or forced to zero) is specific to SSE 
+  scalar operations and can generate some complicated (sub-optimal) PowerISA 
+  code.  In this case <emphasis role="bold"><literal>_mm_load_sd</literal></emphasis> 
+  passes the dereferenced double value  to 
+  <emphasis role="bold"><literal>_mm_set_sd</literal></emphasis> which 
+  uses C vector initializer notation to combine (merge) that 
+  double scalar value with a scalar 0.0 constant into a vector double.</para>
+
+  <para>While code like this should work as-is for PPC64LE, you should look 
+  at the generated code and assess if it is reasonable.  In this case the code 
+  is not awful (a load double splat, vector xor to generate 0.0s, then a 
+  <literal>xxmrghd</literal>
+  to combine __F and 0.0).  Other examples may generate sub-optimal code and 
+  justify a rewrite to PowerISA scalar or vector code (<link 
+  xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-
+  in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">GCC PowerPC 
+  AltiVec Built-in Functions</link> or inline assembler). </para>
+
+  <para>Net: try using the existing C code if you can, but check on what the 
+  compiler generates.  If the generated code is horrendous, it may be worth the 
+  effort to write a PowerISA specific equivalent. For codes making extensive use 
+  of MMX or SSE scalar intrinsics you will be better off rewriting to use 
+  standard C scalar types and letting the the GCC compiler handle the details 
+  (see <link linkend="sec_prefered_methods"/>).</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_packed_vs_scalar_intrinsics.xml
+++ b/Vector_Intrinsics/sec_packed_vs_scalar_intrinsics.xml
@ -0,0 +1,302 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_packed_vs_scalar_intrinsics">
+  <title>Packed vs scalar intrinsics</title>
+  
+  <para>So what is actually going on here? The vector code is clear enough if 
+  you know that '+' operator is applied to each vector element. The the intent of 
+  the builtin is a little less clear, as the GCC documentation for 
+  <literal>__builtin_ia32_addsd</literal> is not very 
+  helpful (nonexistent). So perhaps the 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&amp;expand=97">Intel Intrinsic Guide</link> 
+  will be more enlightening. To paraphrase:
+  <blockquote>
+
+    <para>From the 
+    <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_pd&amp;expand=97"><literal>_mm_add_dp</literal> description</link> ; 
+    for each double float 
+    element ([0] and [1] or bits [63:0] and [128:64]) for operands a and b are 
+    added and resulting vector is returned. </para>
+
+    <para>From the 
+    <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_add_sd&amp;expand=97,130"><literal>_mm_add_sd</literal> description</link> ; 
+    Add element 0 of first operand 
+    (a[0]) to element 0 of the second operand (b[0]) and return the packed vector 
+    double {(a[0] + b[0]), a[1]}. Or said differently the sum of the logical left 
+    most half of the the operands are returned in the logical left most half 
+    (element [0]) of the  result, along with the logical right half (element [1]) 
+    of the first operand (unchanged) in the logical right half of the result.</para></blockquote></para>
+
+  <para>So the packed double is easy enough but the scalar double details are 
+  more complicated. One source of complication is that while both Instruction Set 
+  Architectures (SSE vs VSX) support scalar floating point operations in vector 
+  registers the semantics are different. </para>
+
+  <itemizedlist>
+    <listitem>
+      <para>The vector bit and field numbering is different (reversed).
+      <itemizedlist>
+        <listitem>
+          <para>For Intel the scalar is always placed in the low order (right most) 
+          bits of the XMM register (and the low order address for load and store).</para>
+        </listitem>
+        
+        <listitem>
+          <para>For PowerISA and VSX, scalar floating point operations and Floating 
+          Point Registers (FPRs) are on the low numbered bits which is the left hand 
+          side of the vector / scalar register (VSR). </para>
+        </listitem>
+
+        <listitem>
+          <para>For the PowerPC64 ELF V2  little endian ABI we also make point of 
+          making the GCC vector extensions and vector built ins, appear to be little 
+          endian. So vector element 0 corresponds to the low order address and low 
+          order (right hand) bits of the vector register (VSR).</para>
+        </listitem>
+      </itemizedlist></para>
+    </listitem>
+    <listitem>
+      <para>The handling of the non-scalar part of the register for scalar 
+      operations are different.
+      <itemizedlist>
+        <listitem>
+          <para>For Intel ISA the scalar operations either leaves the high order part 
+          of the XMM vector unchanged or in some cases force it to 0.0.</para>
+        </listitem>
+        
+        <listitem>
+          <para>For PowerISA scalar operations on the combined FPR/VSR register leaves 
+          the remainder (right half of the VSR) <emphasis role="bold">undefined</emphasis>.</para>
+        </listitem>
+      </itemizedlist></para>
+    </listitem>
+  </itemizedlist>
+
+  <para>To minimize confusion and use consistent nomenclature, I will try to 
+  use the terms logical left and logical right elements based on the order they 
+  apprear in a C vector initializers and element index order. So in the vector 
+  <literal>(__v2df){1.0, 20.}</literal>, The value 1.0 is the in the logical left element [0] and 
+  the value 2.0 is logical right element [1].</para>
+
+  <para>So lets look at how to implement these intrinsics for the PowerISA. 
+  For example in this case we can use the GCC vector extension, like so:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_add_pd (__m128d __A, __m128d __B)
+{
+  return (__m128d) ((__v2df)__A + (__v2df)__B);
+}
+
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_add_sd (__m128d __A, __m128d __B)
+{
+  __A[0] = __A[0] + __B[0];
+  return (__A);
+}]]></programlisting></para>
+
+  <para>The packed double implementation operates on the vector as a whole. 
+  The scalar double implementation operates on and updates only [0] element of 
+  the vector and leaves the <literal>__A[1]</literal> element unchanged.  
+  Form this source the GCC 
+  compiler generates the following code for PPC64LE target.:</para>
+
+  <para>The packed vector double generated the corresponding VSX vector 
+  double add (xvadddp). But the scalar implementation is bit more complicated. 
+  <programlisting><![CDATA[0000000000000720 <test_add_pd>:
+ 720:	07 1b 42 f0 	xvadddp vs34,vs34,vs35
+	...
+
+0000000000000740 <test_add_sd>:
+ 740:	56 13 02 f0 	xxspltd vs0,vs34,1
+ 744:	57 1b 63 f0 	xxspltd vs35,vs35,1
+ 748:	03 19 60 f0 	xsadddp vs35,vs0,vs35
+ 74c:	57 18 42 f0 	xxmrghd vs34,vs34,vs35
+	...
+]]></programlisting></para>
+
+  <para>First the PPC64LE vector format, element [0] is not in the correct 
+  position for  the scalar operations. So the compiler generates vector splat 
+  double (<literal>xxspltd</literal>) instructions to copy elements <literal>__A[0]</literal> and 
+  <literal>__B[0]</literal> into position 
+  for the VSX scalar add double (xsadddp) that follows. However the VSX scalar 
+  operation leaves the other half of the VSR undefined (which does not match the 
+  expected Intel semantics). So the compiler must generates a vector merge high 
+  double (<literal>xxmrghd</literal>) instruction to combine the original 
+  <literal>__A[1]</literal> element (from <literal>vs34</literal>) 
+  with the scalar add result from <literal>vs35</literal> 
+  element [1]. This merge swings the scalar 
+  result from <literal>vs35[1]</literal> element into the 
+  <literal>vs34[0]</literal> position, while preserving the 
+  original <literal>vs34[1]</literal> (from <literal>__A[1]</literal>) 
+  element (copied to itself).<footnote><para>Fun 
+  fact: The vector registers in PowerISA are decidedly Big Endian. But we decided 
+  to make the PPC64LE ABI behave like a Little Endian system to make application 
+  porting easier. This require the compiler to manipulate the PowerISA vector 
+  instrinsic behind the the scenes to get the correct Little Endian results. For 
+  example the element selector [0|1] for <literal>vec_splat</literal> and the 
+  generation of <literal>vec_mergeh</literal> vs <literal>vec_mergel</literal> 
+  are reversed for the Little Endian.</para></footnote></para>
+
+  <para>This technique applies to packed and scalar intrinsics for the the 
+  usual arithmetic operators (add, subtract, multiply, divide). Using GCC vector 
+  extensions in these intrinsic implementations provides the compiler more 
+  opportunity to optimize the whole function. </para>
+
+  <para>Now we can look at a slightly more interesting (complicated) case. 
+  Square root (<literal>sqrt</literal>) is not a arithmetic operator in C and is usually handled 
+  with a library call or a compiler builtin. We really want to avoid a library 
+  calls and want to avoid any unexpected side effects. As you see below the 
+  implementation of 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_pd&amp;expand=4926"><literal>_mm_sqrt_pd</literal></link> and 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_sqrt_sd&amp;expand=4926,4956"><literal>_mm_sqrt_sd</literal></link> 
+  intrinsics are based on GCC x86 built ins.
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_sqrt_pd (__m128d __A)
+{
+  return (__m128d)__builtin_ia32_sqrtpd ((__v2df)__A);
+}
+
+/* Return pair {sqrt (B[0]), A[1]}.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_sqrt_sd (__m128d __A, __m128d __B)
+{
+  __v2df __tmp = __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_sqrtsd ((__v2df)__tmp);
+}]]></programlisting></para>
+
+  <para>For the packed vector sqrt, the PowerISA VSX has an equivalent vector 
+  double square root instruction and GCC provides the <literal>vec_sqrt</literal> builtin. But the 
+  scalar implementation involves an additional parameter and an extra move. 
+   This seems intended to mimick the propagation of the <literal>__A[1]</literal> input to the 
+  logical right half of the XMM result that we saw with <literal>_mm_add_sd above</literal>.</para>
+
+  <para>The instinct is to extract the low scalar (<literal>__B[0]</literal>) 
+  from operand <literal>__B</literal> 
+  and pass this to  the GCC <literal>__builtin_sqrt ()</literal> before recombining that scalar 
+  result with <literal>__A[1]</literal> for the vector result. Unfortunately C language standards 
+  force the compiler to call the libm sqrt function unless <literal>-ffast-math</literal> is 
+  specified. The <literal>-ffast-math</literal> option is not commonly used and we want to avoid the 
+  external library dependency for what should be only a few inline instructions. 
+  So this is not a good option.</para>
+
+  <para>Thinking outside the box; we do have an inline intrinsic for a 
+  (packed) vector double sqrt, that we just implemented. However we need to 
+  insure the other half of <literal>__B</literal> (<literal>__B[1]</literal>) 
+  does not cause an harmful side effects 
+  (like raising exceptions for NAN or  negative values). The simplest solution 
+  is to splat <literal>__B[0]</literal> to both halves of a temporary value before taking the 
+  <literal>vec_sqrt</literal>. Then this result can be combined with <literal>__A[1]</literal> to return the final 
+  result. For example:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_sqrt_pd (__m128d __A)
+{
+	  return (vec_sqrt (__A));
+}
+
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_sqrt_sd (__m128d __A, __m128d __B)
+{
+	  __m128d c;
+	  c = _mm_sqrt_pd(_mm_set1_pd (__B[0]));
+	  return (_mm_setr_pd (c[0], __A[1]));
+}]]></programlisting></para>
+  
+  <para>In this  example we use 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_set1_pd&amp;expand=4926,4956,4926,4956,4652"><literal>_mm_set1_pd</literal></link> 
+  to splat the scalar <literal>__B[0]</literal>, before passing that vector to our 
+  <literal>_mm_sqrt_pd</literal> implementation, 
+  then pass the sqrt result (<literal>c[0]</literal>)  with <literal>__A[1]</literal> to  
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_setr_pd&amp;expand=4679"><literal>_mm_setr_pd</literal></link> 
+  to combine the final result. You could also use the <literal>{c[0], __A[1]}</literal> 
+  initializer instead of <literal>_mm_setr_pd</literal>.</para>
+
+  <para>Now we can look at vector and scalar compares that add there own 
+  complication: For example, the Intel Intrinsic Guide for 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_pd&amp;expand=779,788,779"><literal>_mm_cmpeq_pd</literal></link> 
+  describes comparing double elements [0|1] and returning 
+  either 0s for not equal and 1s (<literal>0xFFFFFFFFFFFFFFFF</literal> 
+  or long long -1) for equal. The comparison result is intended as a select mask 
+  (predicates) for selecting or ignoring specific elements in later operations. 
+  The scalar version 
+  <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_cmpeq_sd&amp;expand=779,788"><literal>_mm_cmpeq_sd</literal></link> 
+  is similar except for the quirk 
+  of only comparing element [0] and combining the result with <literal>__A[1]</literal> to return 
+  the final vector result.</para>
+
+  <para>The packed vector implementation for PowerISA is simple as VSX 
+  provides the equivalent instruction and GCC provides the 
+  <literal>vec_cmpeq</literal> builtin 
+  supporting the vector double type. The technique of using scalar comparison 
+  operators on the <literal>__A[0]</literal> and <literal>__B[0]</literal> 
+  does not work as the C comparison operators 
+  return 0 or 1 results while we need the vector select mask (effectively 0 or 
+  -1). Also we need to watch for sequences that mix scalar floats and integers, 
+  generating if/then/else logic or requiring expensive transfers across register 
+  banks.</para>
+
+  <para>In this case we are better off using explicit vector built-ins for 
+  <literal>_mm_add_sd</literal> as and example. We can use <literal>vec_splat</literal> 
+  from element [0] to temporaries 
+  where we can safely use <literal>vec_cmpeq</literal> to generate the expect selector mask. Note 
+  that the <literal>vec_cmpeq</literal> returns a bool long type so we need the cast the result back 
+  to <literal>__v2df</literal>. Then use the 
+  <literal>(__m128d){c[0], __A[1]}</literal> initializer to combine the 
+  comparison result with the original <literal>__A[1]</literal> input and cast to the require 
+  interface type.  So we have this example:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpeq_pd (__m128d __A, __m128d __B)
+{
+  return ((__m128d)vec_cmpeq (__A, __B));
+}
+
+extern __inline  __m128d  __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpeq_sd(__m128d  __A, __m128d  __B)
+{
+	__v2df a, b, c;
+	/* PowerISA VSX does not allow partial (for just left double)
+	 * results. So to insure we don't generate spurious exceptions
+	 * (from the right double values) we splat the left double
+	 * before we to the operation. */
+	a = vec_splat(__A, 0);
+	b = vec_splat(__B, 0);
+	c = (__v2df)vec_cmpeq(a, b);
+	/* Then we merge the left double result with the original right
+	 * double from __A.  */
+	return ((__m128d){c[0], __A[1]});
+}]]></programlisting></para>
+
+  <para>Now lets look at a similar example that adds some surprising 
+  complexity. This is the compare not equal case so we should be able to find the 
+  equivalent vec_cmpne builtin:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpneq_pd (__m128d __A, __m128d __B)
+{
+  return (__m128d)__builtin_ia32_cmpneqpd ((__v2df)__A, (__v2df)__B);
+}
+
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpneq_sd (__m128d __A, __m128d __B)
+{
+  return (__m128d)__builtin_ia32_cmpneqsd ((__v2df)__A, (__v2df)__B);
+}]]></programlisting></para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_performance.xml
+++ b/Vector_Intrinsics/sec_performance.xml
@ -0,0 +1,49 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_performance">
+  <title>Performance</title>
+  
+  <para>The performance of a ported intrinsic depends on the specifics of the 
+  intrinsic and the context it is used in. Many of the SIMD operations have 
+  equivalent instructions in both architectures. For example the vector float and 
+  vector double match very closely. However the SSE and VSX scalars have subtle 
+  differences of how the scalar is positioned with the vector registers and what 
+  happens to the rest (non-scalar part) of the register (previously discussed in 
+  <xref linkend="sec_packed_vs_scalar_intrinsics"/>). 
+  This requires additional PowerISA instructions 
+  to preserve the non-scalar portion of the vector registers. This may or may not 
+  be important to the logic of the program being ported, but we have handle the 
+  case where it is.</para>
+
+  <para>This is where the context of now the intrinsic is used starts to 
+  matter. If the scalar intrinsics are used within a larger program the compiler 
+  may be able to eliminate the redundant register moves as the results are never 
+  used. In the other cases common set up (like permute vectors or bit masks) can 
+  be common-ed up and hoisted out of the loop. So it is very important to let the 
+  compiler do its job with higher optimization levels (<literal>-O3</literal>, 
+  <literal>-funroll-loops</literal>).</para>
+  
+  <xi:include href="sec_performance_sse.xml"/>
+  <xi:include href="sec_performance_mmx.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_performance_mmx.xml
+++ b/Vector_Intrinsics/sec_performance_mmx.xml
@ -0,0 +1,41 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_performance_mmx">
+  <title>Using MMX intrinsics</title>
+  
+  <para>MMX was the first and oldest SIMD extension and initially filled a 
+  need for wider (64-bit) integer and additional register. This is back when 
+  processors were 32-bit and 8 x 32-bit registers was starting to cramp our 
+  programming style. Now 64-bit processors, larger register sets, and 128-bit (or 
+  larger) vector SIMD extensions are common. There is simply no good reasons 
+  write new code using the (now) very limited MMX capabilities. </para>
+
+  <para>We recommend that existing MMX codes be rewritten to use the newer 
+  SSE  and VMX/VSX intrinsics or using the more portable GCC  builtin vector 
+  support or in the case of si64 operations use C scalar code. The MMX si64 
+  scalars which are just (64-bit) operations on long long int types and any 
+  modern C compiler can handle this type. The char short in SIMD operations 
+  should all be promoted to 128-bit SIMD operations on GCC builtin vectors. Both 
+  will improve cross platform portability and performance.</para>
+  
+</section>
+
--- a/Vector_Intrinsics/sec_performance_sse.xml
+++ b/Vector_Intrinsics/sec_performance_sse.xml
@ -0,0 +1,44 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_performance_sse">
+  <title>Using SSE float and double scalars</title>
+  
+  <para>SSE scalar float / double intrinsics  “hand” optimization is no 
+  longer necessary. This was important, when SSE was initially introduced, and 
+  compiler support was limited or nonexistent.  Also SSE scalar float / double 
+  provided additional (16) registers and IEEE754 compliance, not available from 
+  the 8087 floating point architecture that preceded it. So application 
+  developers where motivated to use SSE instruction versus what the compiler was 
+  generating at the time.</para>
+
+  <para>Modern compilers can now to generate and  optimize these (SSE 
+  scalar) instructions for Intel from C standard scalar code. Of course PowerISA 
+  supported IEEE754 float and double and had 32 dedicated floating point 
+  registers from the start (and now 64 with VSX). So replacing a Intel specific 
+  scalar intrinsic implementation with the equivalent C language scalar 
+  implementation is usually a win; allows the compiler to apply the latest 
+  optimization and tuning for the latest generation processor, and is portable to 
+  other platforms where the compiler can also apply the latest optimization and 
+  tuning for that processors latest generation.</para>
+   
+</section>
+
--- a/Vector_Intrinsics/sec_power_vector_permute_format.xml
+++ b/Vector_Intrinsics/sec_power_vector_permute_format.xml
@ -0,0 +1,147 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_power_vector_permute_format">
+  <title>Vector permute and formatting instructions</title>
+
+  <para>The vector Permute and formatting chapter follows and is an important 
+  one to study. These operation operation on the byte, halfword, word (and with 
+  2.07 doubleword) integer types . Plus special Pixel type. The shifts 
+  instructions in this chapter operate on the vector as a whole at either the bit 
+  or the byte (octet) level, This is an important chapter to study for moving 
+  PowerISA vector results into the vector elements that Intel Intrinsics 
+  expect:
+  
+  <literallayout><literal>6.8 Vector Permute and Formatting Instructions . . . . . . . . . . . 249
+6.8.1 Vector Pack and Unpack Instructions  . . . . . . . . . . . . . 249
+6.8.2 Vector Merge Instructions  . . . . . . . . . . . . . . . . . . 256
+6.8.3 Vector Splat Instructions  . . . . . . . . . . . . . . . . . . 259
+6.8.4 Vector Permute Instruction . . . . . . . . . . . . . . . . . . 260
+6.8.5 Vector Select Instruction  . . . . . . . . . . . . . . . . . . 261
+6.8.6 Vector Shift Instructions  . . . . . . . . . . . . . . . . . . 262</literal></literallayout></para>
+
+  <para>The Vector Integer instructions include the add / subtract / Multiply 
+  / Multiply Add/Sum / (no divide) operations for the standard integer types. 
+  There are instruction forms that  provide signed, unsigned, modulo, and 
+  saturate results for most operations. The PowerISA 2.07 extension add / 
+  subtract of 128-bit integers with carry and extend to 256, 512-bit and beyond , 
+  is included here. There are signed / unsigned compares across the standard 
+  integer types (byte, .. doubleword). The usual and bit-wise logical operations. 
+  And the SIMD shift / rotate instructions that operate on the vector elements 
+  for various types.
+
+  <literallayout><literal>6.9 Vector Integer Instructions  . . . . . . . . . . . . . . . . . . 264
+6.9.1 Vector Integer Arithmetic Instructions . . . . . . . . . . . . 264
+6.9.2 Vector Integer Compare Instructions. . . . . . . . . . . . . . 294
+6.9.3 Vector Logical Instructions  . . . . . . . . . . . . . . . . . 300
+6.9.4 Vector Integer Rotate and Shift Instructions . . . . . . . . . 302</literal></literallayout></para>
+
+  <para>The vector [single] float instructions are grouped into this chapter. 
+  This chapter does not include the double float instructions which are described 
+  in the VSX chapter. VSX also include additional float instructions that operate 
+  on the whole 64 register vector-scalar set.
+
+  <literallayout><literal>6.10 Vector Floating-Point Instruction Set . . . . . . . . . . . . . 306
+6.10.1 Vector Floating-Point Arithmetic Instructions . . . . . . . . 306
+6.10.2 Vector Floating-Point Maximum and Minimum Instructions  . . . 308
+6.10.3 Vector Floating-Point Rounding and Conversion Instructions. . 309
+6.10.4 Vector Floating-Point Compare Instructions  . . . . . . . . . 313
+6.10.5 Vector Floating-Point Estimate Instructions . . . . . . . . . 316</literal></literallayout></para>
+
+  <para>The vector XOR based instructions are new with PowerISA 2.07 (POWER8) 
+  and provide vector  crypto and check-sum operations:
+
+  <literallayout><literal>6.11 Vector Exclusive-OR-based Instructions  . . . . . . . . . . . . 318
+6.11.1 Vector AES Instructions . . . . . . . . . . . . . . . . . . . 318
+6.11.2 Vector SHA-256 and SHA-512 Sigma Instructions . . . . . . . . 320
+6.11.3 Vector Binary Polynomial Multiplication Instructions. . . . . 321
+6.11.4 Vector Permute and Exclusive-OR Instruction . . . . . . . . . 323</literal></literallayout></para>
+
+  <para>The vector gather and bit permute support bit level rearrangement of 
+  bits with in the vector. While the vector versions of the count leading zeros 
+  and population count are useful to accelerate specific algorithms.
+
+  <literallayout><literal>6.12 Vector Gather Instruction . . . . . . . . . . . . . . . . . . . 324
+6.13 Vector Count Leading Zeros Instructions . . . . . . . . . . . . 325
+6.14 Vector Population Count Instructions. . . . . . . . . . . . . . 326
+6.15 Vector Bit Permute Instruction  . . . . . . . . . . . . . . . . 327</literal></literallayout></para>
+
+  <para>The Decimal Integer add / subtract instructions complement the 
+  Decimal Floating-Point instructions. They can also be used to accelerated some 
+  binary to/from decimal conversions. The VSCR instruction provides access the 
+  the Non-Java mode floating-point control and the saturation status. These 
+  instruction are not normally of interest in porting Intel intrinsics.
+
+  <literallayout><literal>6.16 Decimal Integer Arithmetic Instructions . . . . . . . . . . . . 328
+6.17 Vector Status and Control Register Instructions . . . . . . . . 331</literal></literallayout></para>
+
+  <para>With PowerISA 2.07B (Power8) several major extension where added to 
+  the Vector Facility:</para>
+
+  <itemizedlist>
+    <listitem>
+      <para>Vector Crypto: Under “Vector Exclusive-OR-based Instructions 
+      Vector Exclusive-OR-based Instructions”, AES [inverse] Cipher, SHA 256 / 512 
+      Sigma, Polynomial Multiplication, and Permute and XOR instructions.</para>
+    </listitem>
+    <listitem>
+      <para>64-bit Integer; signed and unsigned add / subtract, signed and 
+      unsigned compare, Even / Odd 32 x 32 multiple with 64-bit product, signed / 
+      unsigned max / min, rotate and shift left/right.</para>
+    </listitem>
+    <listitem>
+      <para>Direct Move between GRPs and the FPRs / Left half of Vector 
+      Registers.</para>
+    </listitem>
+    <listitem>
+      <para>128-bit integer add / subtract with carry / extend, direct 
+      support for vector <literal>__int128</literal> and multiple precision arithmetic.</para>
+    </listitem>
+    <listitem>
+      <para>Decimal Integer add subtract for 31 digit BCD.</para>
+    </listitem>
+    <listitem>
+      <para>Miscellaneous SIMD extensions: Count leading Zeros, Population 
+      count, bit gather / permute, and vector forms of eqv, nand, orc.</para>
+    </listitem>
+  </itemizedlist>
+
+  <para>The rational for why these are included in the Vector Facilities 
+  (VMX) (vs Vector-Scalar Floating-Point Operations (VSX)) has more to do with 
+  how the instruction where encoded then with the type of operations or the ISA 
+  version of introduction. This is primarily a trade-off between the bits 
+  required for register selection vs bits for extended op-code space within in a 
+  fixed 32-bit instruction. Basically accessing 32 vector registers require 
+  5-bits per register, while accessing all 64 vector-scalar registers require 
+  6-bits per register. When you consider the most vector instructions require  3 
+   and some (select, fused multiply-add) require 4 register operand forms,  the 
+  impact on op-code space is significant. The larger register set of VSX was 
+  justified by queuing theory of larger HPC matrix codes using double float, 
+  while 32 registers are sufficient for most applications.</para>
+
+  <para>So by definition the VMX instructions are restricted to the original 
+  32 vector registers while VSX instructions are encoded to  access all 64 
+  floating-point scalar and vector double registers. This distinction can be 
+  troublesome when programming at the assembler level, but the compiler and 
+  compiler built-ins can hide most of this detail from the programmer. </para>
+  
+</section>
+
--- a/Vector_Intrinsics/sec_power_vmx.xml
+++ b/Vector_Intrinsics/sec_power_vmx.xml
@ -0,0 +1,67 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_power_vmx">
+  <title>The Vector Facility (VMX)</title>
+  
+ <para>The orginal VMX supported SIMD integer byte, halfword, and word, and 
+  single float data types within a separate (from GPR and FPR) bank of 32 x 
+  128-bit vector registers. These operations like to stay within their (SIMD) 
+  lanes except where the operation changes the element data size (integer 
+  multiply, pack, and unpack). </para>
+
+  <para>This is complimented by bit logical and shift / rotate / permute / 
+  merge instuctions that operate on the vector as a whole.  Some operation 
+  (permute, pack, merge, shift double, select) will select 128 bits from a pair 
+  of vectors (256-bits) and deliver 128-bit vector result. These instructions 
+  will cross lanes or multiple registers to grab fields and assmeble them into 
+  the single register result.</para>
+
+  <para>The PowerISA 2.07B Chapter 6. Vector Facility is organised starting 
+  with an overview (chapters 6.1- 6.6):
+
+  <literallayout><literal>6.1 Vector Facility Overview . . . . . . . . . . . . . . . . . . . . 227
+6.2 Chapter Conventions. . . . . . . . . . . . . . . . . . . . . . . 227
+6.2.1 Description of Instruction Operation . . . . . . . . . . . . . 227
+6.3 Vector Facility Registers  . . . . . . . . . . . . . . . . . . . 234
+6.3.1 Vector Registers . . . . . . . . . . . . . . . . . . . . . . . 234
+6.3.2 Vector Status and Control Register . . . . . . . . . . . . . . 234
+6.3.3 VR Save Register . . . . . . . . . . . . . . . . . . . . . . . 235
+6.4 Vector Storage Access Operations . . . . . . . . . . . . . . . . 235
+6.4.1 Accessing Unaligned Storage Operands . . . . . . . . . . . . . 237
+6.5 Vector Integer Operations  . . . . . . . . . . . . . . . . . . . 238
+6.5.1 Integer Saturation . . . . . . . . . . . . . . . . . . . . . . 238
+6.6 Vector Floating-Point Operations . . . . . . . . . . . . . . . . 240
+6.6.1 Floating-Point Overview  . . . . . . . . . . . . . . . . . . . 240
+6.6.2 Floating-Point Exceptions  . . . . . . . . . . . . . . . . . . 240
+6.7 Vector Storage Access Instructions . . . . . . . . . . . . . . . 242
+6.7.1 Storage Access Exceptions  . . . . . . . . . . . . . . . . . . 242
+6.7.2 Vector Load Instructions . . . . . . . . . . . . . . . . . . . 243
+6.7.3 Vector Store Instructions. . . . . . . . . . . . . . . . . . . 246
+6.7.4 Vector Alignment Support Instructions. . . . . . . . . . . . . 248</literal></literallayout></para>
+  
+  <para>Then a chapter on storage (load/store) access for vector and vector 
+  elements:</para>
+  
+  <xi:include href="sec_power_vector_permute_format.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_power_vsx.xml
+++ b/Vector_Intrinsics/sec_power_vsx.xml
@ -0,0 +1,186 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_power_vector_scalar_floatingpoint">
+  <title>Vector-Scalar Floating-Point Operations (VSX)</title>
+  
+  <para>With PowerISA 2.06 (POWER7) we extended the vector SIMD capabilities 
+  of the PowerISA:</para>
+  
+  <itemizedlist>
+    <listitem>
+      <para>Extend the available vector and floating-point scalar register 
+      sets from 32 registers each to a combined 64 x 64-bit scalar floating-point and 
+      64 x 128-bit vector registers.</para>
+    </listitem>
+    <listitem>
+      <para>Enable scalar double float operations on all 64 scalar 
+      registers.</para>
+    </listitem>
+    <listitem>
+      <para>Enable vector double and vector float operations for all 64 
+      vector registers.</para>
+    </listitem>
+    <listitem>
+      <para>Enable super-scalar execution of vector instructions and support 
+      2 independent vector floating point  pipelines for parallel execution of 4 x 
+      64-bit Floating point Fused Multiply Adds (FMAs) and 8 x 32-bit (FMAs) per 
+      cycle.</para>
+    </listitem>
+  </itemizedlist>
+
+  <para>With PowerISA 2.07 (POWER8) we added single-precision scalar 
+  floating-point instruction to VSX. This completes the floating-point 
+  computational set for VSX. This ISA release also clarified how these operate in 
+  the Little Endian storage model.</para>
+
+  <para>While the focus was on enhanced floating-point computation (for High 
+  Performance Computing),  VSX also extended  the ISA with additional storage 
+  access, logical, and permute (merge, splat, shift) instructions. This was 
+  necessary to extend these operations cover 64 VSX registers, and improves 
+  unaligned storage access for vectors  (not available in VMX).</para>
+
+  <para>The PowerISA 2.07B Chapter 7. Vector-Scalar Floating-Point Operations 
+  is organized starting with an introduction and overview (chapters 7.1- 7.5) . 
+  The early sections (7.1 and 7.2) describe the layout of the 64 VSX registers 
+  and how they relate (overlap and inter-operate) to the existing floating point 
+  scalar (FPRs) and (VMX VRs) vector registers.
+
+  <literallayout><literal>7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 317
+7.1.1 Overview of the Vector-Scalar Extension  . . . . . . . . . . . 317
+7.2 VSX Registers  . . . . . . . . . . . . . . . . . . . . . . . . . 318
+7.2.1 Vector-Scalar Registers  . . . . . . . . . . . . . . . . . . . 318
+7.2.2 Floating-Point Status and Control Register . . . . . . . . . . 321</literal></literallayout></para>
+
+  <para>The definitions given in “7.1.1.1 Compatibility with Category 
+  Floating-Point and Category Decimal Floating-Point Operations”, and 
+  “7.1.1.2 Compatibility with Category Vector Operations”
+    <blockquote>
+      <para>The instruction sets defined in Chapter 4.
+      Floating-Point Facility and Chapter 5. Decimal
+      Floating-Point retain their definition with one primary
+      difference. The FPRs are mapped to doubleword
+      element 0 of VSRs 0-31. The contents of doubleword 1
+      of the VSR corresponding to a source FPR specified
+      by an instruction are ignored. The contents of
+      doubleword 1 of a VSR corresponding to the target
+      FPR specified by an instruction are undefined.</para>
+      
+      <para>The instruction set defined in Chapter 6. Vector Facility
+      [Category: Vector], retains its definition with one
+      primary difference. The VRs are mapped to VSRs
+      32-63.</para></blockquote></para>
+
+  <note><para>The reference to scalar element 0 above is from the big endian 
+  register perspective of the ISA. In the PPC64LE ABI implementation, and for the 
+  purpose of porting Intel intrinsics, this is logical element 1.  Intel SSE 
+  scalar intrinsics operated on logical element [0],  which is in the wrong 
+  position for PowerISA FPU and VSX scalar floating-point  operations. Another 
+  important note is what happens to the other half of the VSR when you execute a 
+  scalar floating-point instruction (<emphasis>The contents of doubleword 1 of a VSR … 
+  are undefined.</emphasis>)</para></note>
+
+  <para>The compiler will hide some of this detail when generating code for 
+  little endian vector element [] notation and most vector built-ins. For example 
+  <literal>vec_splat (A, 0)</literal> is transformed for 
+  PPC64LE to <literal>xxspltd VRT,VRA,1</literal>. 
+  What the compiler <emphasis><emphasis role="bold">can not</emphasis></emphasis> 
+  hide is the different placement of scalars within vector registers.</para>
+
+  <para>Vector registers (VRs) 0-31 overlay and can be accessed from vector 
+  scalar registers (VSRs) 32-63. The ABI also specifies that VR2-13 are used to 
+  pass parameter and return values. In some cases the same (similar) operations 
+  exist in both VMX and VSX instruction forms, while in the other cases 
+  operations only exist for VMX (byte level permute and shift) or VSX (Vector 
+  double).</para>
+
+  <para>So resister selection that; avoids unnecessary vector moves, follows 
+  the ABI, while maintaining the correct instruction specific register numbering, 
+  can be tricky. The 
+  <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Machine-Constraints.html#Machine-Constraints">GCC register constraint</link> 
+  annotations for Inline 
+  assembler using vector instructions  is challenging, even for experts. So only 
+  experts should be writing assembler and then only in extraordinary 
+  circumstances. You should leave these details to the compiler (using vector 
+  extensions and vector built-ins) when ever possible.</para>
+
+  <para>The next sections get is into the details of floating point 
+  representation, operations, and exceptions. Basically the implementation 
+  details for the IEEE754R and C/C++ language standards that most developers only 
+  access via higher level APIs. So most programmers will not need this level of 
+  detail, but it is there if needed.
+
+  <literallayout><literal>7.3 VSX Operations . . . . . . . . . . . . . . . . . . . . . . . . . 326
+7.3.1 VSX Floating-Point Arithmetic Overview . . . . . . . . . . . . 326
+7.3.2 VSX Floating-Point Data  . . . . . . . . . . . . . . . . . . . 327
+7.3.3 VSX Floating-Point Execution Models  . . . . . . . . . . . . . 335
+7.4 VSX Floating-Point Exceptions  . . . . . . . . . . . . . . . . . 338
+7.4.1 Floating-Point Invalid Operation Exception . . . . . . . . . . 341
+7.4.2 Floating-Point Zero Divide Exception . . . . . . . . . . . . . 347
+7.4.3 Floating-Point Overflow Exception. . . . . . . . . . . . . . . 349
+7.4.4 Floating-Point Underflow Exception . . . . . . . . . . . . . . 351</literal></literallayout></para>
+
+  <para>Finally an overview the VSX storage access instructions for big and 
+  little endian and for aligned and unaligned data addresses. This included 
+  diagrams that illuminate the differences
+
+  <literallayout><literal>7.5 VSX Storage Access Operations  . . . . . . . . . . . . . . . . . 356
+7.5.1 Accessing Aligned Storage Operands . . . . . . . . . . . . . . 356
+7.5.2 Accessing Unaligned Storage Operands . . . . . . . . . . . . . 357
+7.5.3 Storage Access Exceptions  . . . . . . . . . . . . . . . . . . 358</literal></literallayout></para>
+
+  <para>Section 7.6 starts with a VSX instruction Set Summary which is the 
+  place to start to get an feel for the types and operations supported.  The 
+  emphasis on float-point, both scalar and vector (especially vector double), is 
+  pronounced. Many of the scalar and single-precision vector instruction look 
+  like duplicates of what we have seen in the Chapter 4 Floating-Point and 
+  Chapter 6 Vector facilities. The difference here is, new instruction encodings 
+  to access the full 64 VSX register space. </para>
+
+  <para>In addition there are small number of logical instructions are 
+  include to support predication (selecting / masking vector elements based on 
+  compare results). And set of permute, merge, shift, and splat instructions that 
+  operation on VSX word (float) and doubleword (double) elements. As mentioned 
+  about VMX section 6.8 these instructions are good to study as they are useful 
+  for realigning elements from PowerISA vector results to that required for Intel 
+  Intrinsics.
+
+  <literallayout><literal>7.6 VSX Instruction Set . . . . . . . . . . . . . . . . . . . . . .  359
+7.6.1 VSX Instruction Set Summary . . . . . . . . . . . . . . . . .  359
+7.6.1.1 VSX Storage Access Instructions . . . . . . . . . . . . . .  359
+7.6.1.2 VSX Move Instructions . . . . . . . . . . . . . . . . . . .  360
+7.6.1.3 VSX Floating-Point Arithmetic Instructions  . . . . . . . .  360
+7.6.1.4 VSX Floating-Point Compare Instructions . . . . . . . . . .  363
+7.6.1.5 VSX DP-SP Conversion Instructions . . . . . . . . . . . . .  364
+7.6.1.6 VSX Integer Conversion Instructions . . . . . . . . . . . .  364
+7.6.1.7 VSX Round to Floating-Point Integer Instructions  . . . . .  366
+7.6.1.8 VSX Logical Instructions. . . . . . . . . . . . . . . . . .  366
+7.6.1.9 VSX Permute Instructions. . . . . . . . . . . . . . . . . .  367
+7.6.2 VSX Instruction Description Conventions . . . . . . . . . . .  368
+7.6.3 VSX Instruction Descriptions  . . . . . . . . . . . . . . . .  392</literal></literallayout></para>
+
+  <para>The VSX Instruction Descriptions section contains the detail 
+  description for each VSX category instruction.  The table entries from the 
+  Instruction Set Summary are formatted in the document at hyperlinks to 
+  corresponding instruction description.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_powerisa.xml
+++ b/Vector_Intrinsics/sec_powerisa.xml
@ -0,0 +1,33 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_powerisa">
+  <title>The PowerISA</title>
+  
+  <para>The PowerISA is for historical reasons is organized at the top level 
+  by the distinction between older Vector Facility (Altivec / VMX) and the newer 
+  Vector-Scalar Floating-Point Operations (VSX). </para>
+ 
+  <xi:include href="sec_power_vmx.xml"/>
+  <xi:include href="sec_power_vsx.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_powerisa_vector_facilities.xml
+++ b/Vector_Intrinsics/sec_powerisa_vector_facilities.xml
@ -0,0 +1,46 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_powerisa_vector_facilities">
+  <title>PowerISA Vector facilities</title>
+  
+  <para>The PowerISA vector facilities (VMX and VSX) are extensive, but does 
+  not always provide a direct or obvious functional equivalent to the Intel 
+  Intrinsics. But being not obvious is not the same as imposible. It just 
+  requires some basic programing skills.</para>
+
+  <para>It is a good idea to have an overall understanding of the vector 
+  capabilities the PowerISA. You do not need to memorize every instructions but 
+  is helps to know where to look. Both the PowerISA and OpenPOWER ABI have a 
+  specific structure and organization that can help you find what you looking 
+  for. </para>
+
+  <para>It also helps to understand the relationship between the PowerISAs 
+  low level instructions and the higher abstraction of the vector intrinsics as 
+  defined by the OpenPOWER ABIs Vector Programming Interfaces and the the defacto 
+  standard of GCC's PowerPC AltiVec Built-in Functions.</para>
+  
+  <xi:include href="sec_powerisa.xml"/>
+  <xi:include href="sec_powerisa_vector_intrinsics.xml"/>
+  <xi:include href="sec_powerisa_vector_size_type.xml"/>
+
+</section>
+
--- a/Vector_Intrinsics/sec_powerisa_vector_intrinsics.xml
+++ b/Vector_Intrinsics/sec_powerisa_vector_intrinsics.xml
@ -0,0 +1,79 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_powerisa_vector_intrinsics">
+  <title>PowerISA Vector Intrinsics</title>
+  
+  <para>The 
+  <emphasis role="bold">OpenPOWER ELF V2 application binary interface (ABI): Chapter 6.</emphasis> 
+  <emphasis><emphasis role="bold">Vector Programming Interfaces</emphasis></emphasis> and 
+  <emphasis><emphasis role="bold">Appendix A. Predefined Functions for Vector 
+  Programming</emphasis></emphasis> document the current and proposed vector built-ins we expect all 
+  C/C++ compilers implement. </para>
+
+  <para>Some of these operations are endian sensitive and the compiler needs 
+  to make corresponding adjustments as  it generate code for endian sensitive 
+  built-ins. There is a good overview for this in the 
+  <emphasis role="bold">OpenPOWER ABI Section</emphasis>
+  <emphasis><emphasis role="bold">6.4. 
+  Vector Built-in Functions</emphasis></emphasis>.</para>
+
+  <para>Appendix A is organized (sorted) by built-in name, output type, then 
+  parameter types. Most built-ins are generic as the named the operation (add, 
+  sub, mul, cmpeq, ...) applies to multiple types. </para>
+
+  <para>So the build <literal>vec_add</literal> built-in applies to all the signed and unsigned 
+  integer types (char, short, in, and long) plus float and double floating-point 
+  types. The compiler looks at the parameter type to select the vector 
+  instruction (or instruction sequence) that implements the (add) operation on 
+  that type. The compiler infers the output result type from the operation and 
+  input parameters and will complain if the target variable type is not 
+  compatible. For example:
+  <programlisting><![CDATA[vector signed char vec_add (vector signed char, vector signed char);
+vector unsigned char vec_add (vector unsigned char, vector unsigned char);
+vector signed short vec_add (vector signed short, vector signed short);
+vector unsigned short vec_add (vector unsigned short, vector unsigned short);
+vector signed int vec_add (vector signed int, vector signed int);
+vector unsigned int vec_add (vector unsigned int, vector unsigned int);
+vector signed int vec_add (vector signed long, vector signed long);
+vector unsigned int vec_add (vector unsigned long, vector unsigned long);
+vector float vec_add (vector float, vector float);
+vector double vec_add (vector double, vector double);]]></programlisting></para>
+
+  <para>This is one key difference between PowerISA built-ins and Intel 
+  Intrinsics (Intel Intrinsics are not generic and include type information in 
+  the name). This is why it is so important to understand the vector element 
+  types and to add the appropriate type casts to get the correct results.</para>
+
+  <para>The defacto standard implementation is GCC as defined in the include 
+  file <literal>&lt;altivec.h&gt;</literal> and documented in the GCC online documentation in 
+  <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions">6.59.20 PowerPC 
+  AltiVec Built-in Functions</link>. The header file name and section title 
+  reflect the origin of the Vector Facility, but recent versions of GCC altivec.h 
+  include built-ins for newer PowerISA 2.06 and 2.07 VMX plus VSX extensions. 
+  This is a work in progress where your  (older) distro GCC compiler may not 
+  include built-ins for the latest PowerISA 3.0 or ABI edition. So before you use 
+  a built-in you find in the ABI Appendix A, check the specific 
+  <link xlink:href="https://gcc.gnu.org/onlinedocs/">GCC online documentation</link> for the 
+  GCC version you are using.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_powerisa_vector_size_type.xml
+++ b/Vector_Intrinsics/sec_powerisa_vector_size_type.xml
@ -0,0 +1,119 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_powerisa_vector_size_type">
+  <title>How vector elements change size and type</title>
+  
+  <para>Most vector built ins return the same vector type as the (first) 
+  input parameters, but there are exceptions. Examples include; conversions 
+  between types, compares , pack, unpack,  merge, and integer multiply 
+  operations.</para>
+
+  <para>Converting floats to from integer will change the type and something 
+  change the element size as well (double ↔ int and float ↔ long). For the 
+  VMX the conversions are always the same size (float ↔ [unsigned] int). But 
+  VSX allows conversion of 64-bit (long or double) to from 32-bit (float or 
+   int)  with the inherent size changes. The PowerISA VSX defines a 4 element 
+  vector layout where little endian elements 0, 2 are used for input/output and 
+  elements 1,3 are undefined. The OpenPOWER ABI Appendix A define 
+  <literal>vec_double</literal> and <literal>vec_float</literal> 
+  with even/odd and high/low extensions as program aids. These are not 
+  included in GCC 7 or earlier but are planned for GCC 8.</para>
+
+  <para>Compare operations produce either 
+  <literal>vector bool &lt;</literal>input element type<literal>&gt;</literal> 
+  (effectively bit masks) or predicates (the condition code for all and 
+  any are represented as an int truth variable). When a predicate compare (i.e. 
+  <literal>vec_all_eq</literal>, <literal>vec_any_gt</literal>), 
+  is used in a if statement,  the condition code is 
+  used directly in the conditional branch and the int truth value is not 
+  generated.</para>
+
+  <para>Pack operations pack integer elements into the next smaller (half) 
+  integer sized elements. Pack operations include signed and unsigned saturate 
+  and unsigned modulo forms. As the packed result will be half the size (in 
+  bits), pack instructions require 2 vectors (256-bits) as input and generate a 
+  single 128-bit vector results.
+  <programlisting><![CDATA[vec_vpkudum ({1, 2}, {101, 102}) result={1, 2, 101, 102}]]></programlisting></para>
+
+  <para>Unpack operations expand integer elements into the next larger size 
+  elements. The integers are always treated as signed values and sign-extended. 
+  The processor design avoids instructions that return multiple register values. 
+  So the PowerISA defines unpack-high and unpack low forms where instruction 
+  takes (the high or low) half of vector elements and extends them to fill the 
+  vector output. Element order is maintained and an unpack high / low sequence 
+  with same input vector has the effect of unpacking to a 256-bit result in two 
+  vector registers.
+  <programlisting><![CDATA[vec_vupkhsw ({1, 2, 3, 4}) result={1, 2}
+vec_vupkhsw ({-1, 2, -3, 4}) result={-1, 2}
+vec_vupklsw ({1, 2, 3, 4}) result={3, 4}
+vec_vupklsw ({-1, 2, -3, 4}) result={-3, 4}]]></programlisting></para>
+
+  <para>Merge operations resemble shuffling two (vectors) card decks 
+  together, alternating (elements) cards in the result.   As we are merging from 
+  2 vectors (256-bits) into 1 vector (128-bits) and the elements do not change 
+  size, we have merge high and merge low instruction forms for each (byte, 
+  halfword and word) integer type. The merge high operations alternate elements 
+  from the (vector register left) high half of the two input vectors. The merge 
+  low operation alternate elements from the (vector register right) low half of 
+  the two input vectors.</para>
+
+  <para>For PowerISA 2.07 we added vector merge word even / odd instructions. 
+  Instead of high or low elements the shuffle is from the even or odd number 
+  elements of the two input vectors. Passing the same vector to both inputs to 
+  merge produces splat like results for each doubleword half, which is handy in 
+  some convert operations.
+  <programlisting><![CDATA[vec_mrghd ({1, 2}, {101, 102}) result={1, 101}
+vec_mrgld ({1, 2}, {101, 102}) result={2, 102}
+
+vec_vmrghw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 2, 102}
+vec_vmrghw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 2, 2}
+vec_vmrglw ({1, 2, 3, 4}, {101, 102, 103, 104}) result={3, 103, 4, 104}
+vec_vmrglw ({1, 2, 3, 4}, {1, 2, 3, 4}) result={3, 3, 4, 4}
+
+
+vec_mergee ({1, 2, 3, 4}, {101, 102, 103, 104}) result={1, 101, 3, 103}
+vec_mergee ({1, 2, 3, 4}, {1, 2, 3, 4}) result={1, 1, 3, 3}
+vec_mergeo ({1, 2, 3, 4}, {101, 102, 103, 104}) result={2, 102, 4, 104}
+vec_mergeo ({1, 2, 3, 4}, {1, 2, 3, 4}) result={2, 2, 4, 4}]]></programlisting></para>
+
+  <para>Integer multiply has the potential to generate twice as many bits in 
+  the product as input. A multiply of 2 int (32-bit) values produces a long 
+  (64-bits). Normal C language * operations ignore this and discard the top 
+  32-bits of the result. However  in some computations it useful to preserve the 
+  double product precision for intermediate computation before reducing the final 
+  result back to the original precision.</para>
+
+  <para>The PowerISA VMX instruction set took the later approach ie keep all 
+  the product bits until the programmer explicitly asks for the truncated result. 
+  So the vector integer multiple are split into even/odd forms across signed and 
+  unsigned; byte, halfword and word inputs. This requires two instructions (given 
+  the same inputs) to generated the full vector  multiply across 2 vector 
+  registers and 256-bits. Again as POWER processors are super-scalar this pair of 
+  instructions should execute in parallel.</para>
+
+  <para>The set of expanded product values can either be used directly in 
+  further (doubled precision) computation or merged/packed into the single single 
+  vector at the smaller bit size. This is what the compiler will generate for C 
+  vector extension multiply of vector integer types.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_prefered_methods.xml
+++ b/Vector_Intrinsics/sec_prefered_methods.xml
@ -0,0 +1,57 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_prefered_methods">
+  <title>Prefered methods</title>
+  
+  <para>As we will see there are multiple ways to implement the logic of 
+  these intrinsics. Some implementation methods are preferred because they allow 
+  the compiler to select instructions and provided the most flexibility for 
+  optimization across the whole sequence. Other methods may be required to 
+  deliver a specific semantic or to deliver better optimization than the current 
+  compiler is capable of. Some methods are more portable across multiple 
+  compilers (GCC, LLVM, ...). All of this should be taken into consideration for 
+  each intrinsic implementation. In general we should use the following list as a 
+  guide to these decisions:</para>
+  
+  <orderedlist>
+    <listitem>
+      <para>Use C vector arithmetic, logical, dereference, etc., operators in 
+      preference to intrinsics.</para>
+    </listitem>
+    <listitem>
+      <para>Use the bi-endian interfaces from Appendix A of the ABI in 
+      preference to other intrinsics when available, as these are designed for 
+      portability among compilers.</para>
+    </listitem>
+    <listitem>
+      <para>Use other, less well documented intrinsics (such as 
+      <literal>__builtin_vsx_*</literal>) when no better facility is available, in preference to 
+      assembly.</para>
+    </listitem>
+    <listitem>
+      <para>If necessary, use inline assembly, but know what you're 
+      doing.</para>
+    </listitem>
+  </orderedlist>
+
+</section>
+
--- a/Vector_Intrinsics/sec_prepare.xml
+++ b/Vector_Intrinsics/sec_prepare.xml
@ -0,0 +1,66 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_prepare">
+  <title>Prepare yourself</title>
+  
+  <para>To port Intel intrinsics to POWER you will need to prepare yourself 
+  with knowledge of PowerISA vector facilities and how to access the associated 
+  documentation.</para>
+
+  <itemizedlist>
+    <listitem>
+      <para>
+        <link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Vector-Extensions.html#Vector-Extensions">GCC vector extention</link> 
+        syntax and usage. This is one of a set of GCC 
+        "<link xlink:href="https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/C-Extensions.html#C-Extensions">Extentions to the C language Family</link>” 
+        that the intrinsic header implementation depends 
+        on.  As many of the GCC intrinsics for x86 are implemented via C vector 
+        extensions, reading and understanding of this code is an important part of the 
+        porting process. </para>
+    </listitem>
+    <listitem>
+      <para>Intel (x86) intrinsic and type naming conventions and how to find 
+      more information. The intrinsic name encodes  some information about the 
+      vector size and type of the data, but the pattern is not always  obvious. 
+      Using the online 
+      <link xlink:href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#">Intel 
+      Intrinsic Guide</link> to look up the intrinsic by name is a good first 
+      step.</para>
+    </listitem>
+    <listitem>
+      <para>PowerISA Vector facilities. The Vector facilities of POWER8 are 
+      extensive and cover the usual types and usual operations. However it has a 
+      different history and organization from Intel.  Both (Intel and PowerISA) have 
+      their quirks and in some cases the mapping may not be obvious. So familiarizing 
+      yourself with the PowerISA Vector (VMX) and Vector Scalar Extensions (VSX) is 
+      important.</para>
+    </listitem>
+  </itemizedlist>
+  
+  <xi:include href="sec_gcc_vector_extensions.xml"/>
+  <xi:include href="sec_intel_intrinsic_functions.xml"/>
+  <xi:include href="sec_powerisa_vector_facilities.xml"/>
+  <xi:include href="sec_more_examples.xml"/>
+
+
+</section>
+
--- a/Vector_Intrinsics/sec_review_source.xml
+++ b/Vector_Intrinsics/sec_review_source.xml
@ -0,0 +1,64 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_review_source">
+  <title>Look at the source, Luke</title>
+  
+  <para>So if this is a code porting activity, where is the source? All the 
+  source code we need to look at is in the GCC source trees. You can either git 
+  (<link xlink:href="https://gcc.gnu.org/wiki/GitMirror">https://gcc.gnu.org/wiki/GitMirro</link>) 
+  the gcc source  or down load one of the 
+  recent AT source tars (for example: 
+  <link xlink:href="ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/">ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/ubuntu/dists/xenial/at10.0/</link>). 
+   You will find the intrinsic headers in the ./gcc/config/i386/ 
+  sub-directory.</para>
+
+  <para>If you have a Intel Linux workstation or laptop with GCC installed, 
+  you already have these headers, if you want to take a look:
+  <screen><prompt>$ </prompt><userinput>find /usr/lib -name '*mmintrin.h'</userinput>
+/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/wmmintrin.h
+/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/mmintrin.h
+/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/xmmintrin.h
+/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/emmintrin.h
+/usr/lib/gcc/x86_64-redhat-linux/4.4.4/include/tmmintrin.h
+...
+<prompt>$ </prompt></screen></para>
+
+  <para>But depending on the vintage of the distro, these may not be the 
+  latest versions of the headers. Looking at the header source will tell you a 
+  few things.: The include structure (what other headers are implicitly 
+  included). The types that are used at the API. And finally, how the API is 
+  implemented.</para>
+  
+  <para><literallayout>smmintrin.h	(SSE4.1)	includes tmmintrin,h
+tmmintrin.h (SSSE3) includes pmmintrin.h
+pmmintrin.h (SSE3)  includes emmintrin,h
+emmintrin.h (SSE2)  includes xmmintrin.h
+xmmintrin.h (SSE)   includes mmintrin.h and mm_malloc.h
+mmintrin.h  (MMX)</literallayout></para>
+
+  <xi:include href="sec_intel_intrinsic_includes.xml"/>
+  <xi:include href="sec_intel_intrinsic_types.xml"/>
+  <xi:include href="sec_api_implemented.xml"/>
+
+
+</section>
+
--- a/Vector_Intrinsics/sec_simple_examples.xml
+++ b/Vector_Intrinsics/sec_simple_examples.xml
@ -0,0 +1,62 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_simple_examples">
+  <title>Some simple examples</title>
+  
+  <para>For example; a vector double splat looks like this:
+  <programlisting><![CDATA[/* Create a vector with both elements equal to F.  */
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_set1_pd (double __F)
+{
+  return __extension__ (__m128d){ __F, __F };
+}]]></programlisting></para>
+  
+  <para>Another example:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_add_pd (__m128d __A, __m128d __B)
+{
+  return (__m128d) ((__v2df)__A + (__v2df)__B);
+}]]></programlisting></para>
+  
+  <para>Note in the example above the cast to __v2df for the operation. Both 
+  __m128d and __v2df are vector double, but __v2df does no have the <literal>__may_alias__</literal> 
+  attribute. And one more example:
+  <programlisting><![CDATA[extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_mullo_epi16 (__m128i __A, __m128i __B)
+{
+  return (__m128i) ((__v8hu)__A * (__v8hu)__B);
+}]]></programlisting></para>
+  
+  <para>Note this requires a cast for the compiler to generate the correct 
+  code for the intended operation. The parameters and result are the generic 
+  <literal>__m128i</literal>, which is a vector long long with the 
+  <literal>__may_alias__</literal> attribute. But 
+  operation is a vector multiply low unsigned short (<literal>__v8hu</literal>). So not only do we 
+  use the cast to drop the <literal>__may_alias__</literal> attribute but we also need to cast to 
+  the correct (vector unsigned short) type for the specified operation.</para>
+
+  <para>I have successfully copied these (and similar) source snippets over 
+  to the PPC64LE implementation unchanged. This of course assumes the associated 
+  types are defined and with compatible attributes.</para>
+
+</section>
+
--- a/Vector_Intrinsics/sec_vec_or_not.xml
+++ b/Vector_Intrinsics/sec_vec_or_not.xml
@ -0,0 +1,134 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Copyright (c) 2017 OpenPOWER Foundation
+  
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+  
+-->
+<section xmlns="http://docbook.org/ns/docbook"
+  xmlns:xi="http://www.w3.org/2001/XInclude"
+  xmlns:xlink="http://www.w3.org/1999/xlink"
+  version="5.0"
+  xml:id="sec_vec_or_not">
+  <title>To vec_not or not</title>
+  
+  <para>Well not exactly. Looking at the OpenPOWER ABI document we see a 
+  reference to 
+  <literal>vec_cmpne</literal> for all numeric types. But when we look in the current 
+  GCC 6 documentation we find that 
+  <literal>vec_cmpne</literal> is not on the list. So it is planned 
+  in the ABI, but not implemented yet.</para>
+
+  <para>Looking at the PowerISA 2.07B we find a VSX Vector Compare Equal to 
+  Double-Precision but no Not Equal. In fact we see only vector double compare 
+  instructions for greater than and greater than or equal in addition to the 
+  equal compare. Not only can't we find a not equal, there is no less than or 
+  less than or equal compares either.</para>
+
+  <para>So what is going on here? Partially this is the Reduced Instruction 
+  Set Computer (RISC) design philosophy. In this case the compiler can generate 
+  all the required compares using the existing vector instructions and simple 
+  transforms based on Boolean algebra. So 
+  <literal>vec_cmpne(A,B)</literal> is simply <literal>vec_not 
+  (vec_cmpeq(A,B))</literal>. And <literal>vec_cmplt(A,B)</literal> is simply 
+  <literal>vec_cmpgt(B,A)</literal> based on the 
+  identity A &lt; B <emphasis><emphasis role="bold">iff</emphasis></emphasis> B &gt; A. 
+  Similarly <literal>vec_cmple(A,B)</literal> is implemented as 
+  <literal>vec_cmpge(B,A)</literal>.</para>
+
+  <para>What a minute, there is no <literal>vec_not()</literal> either. Can not find it in the 
+  PowerISA, the OpenPOWER ABI, or the GCC PowerPC Altivec Built-in documentation. 
+  There is no <literal>vec_move()</literal> either! How can this possibly work?</para>
+
+  <para>This is RISC philosophy again. We can always use a logical 
+  instruction (like bit wise <emphasis role="bold">and</emphasis> or 
+  <emphasis role="bold">or</emphasis>) to effect a move given that we also have 
+  nondestructive 3 register instruction forms. In the PowerISA most instruction 
+  have two input registers and a separate result register. So if the result 
+  register number is  different from either input register then the inputs are 
+  not clobbered (nondestructive). Of course nothing prevents you from specifying 
+  the same register for both inputs or even all three registers (result and both 
+  inputs).  And some times it is useful.</para>
+
+  <para>The statement <literal>B = vec_or (A,A)</literal> is is effectively a vector move/copy 
+  from <literal>A</literal> to <literal>B</literal>. And <literal>A = vec_or (A,A)</literal> is obviously a 
+  <emphasis role="bold"><literal>nop</literal></emphasis> (no operation). In the the 
+  PowerISA defines the preferred <literal>nop</literal> and register move for vector registers in 
+  this way.</para>
+
+  <para>It is also useful to have hardware implement the logical operators 
+  <emphasis role="bold">nor</emphasis> (<emphasis role="bold">not or</emphasis>) 
+  and <emphasis role="bold">nand</emphasis> (<emphasis role="bold">not and</emphasis>).  
+  The PowerISA provides these instruction for 
+  fixed point and vector logical operation. So <literal>vec_not(A)</literal> 
+  can be implemented as <literal>vec_nor(A,A)</literal>. 
+  So looking at the  implementation of _mm_cmpne we propose the 
+  following:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpneq_pd (__m128d __A, __m128d __B)
+{
+  __m128d	temp = (__m128d)vec_cmpeq (__A, __B);
+  return ((__m128d)vec_nor (temp, temp));
+}
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpneq_sd (__m128d __A, __m128d __B)
+{
+	__v2df a, b, c;
+	a = vec_splat(__A, 0);
+	b = vec_splat(__B, 0);
+	c = (__v2df)vec_cmpeq(a, b);
+	c = (__v2df)vec_nor(c, c);
+	return ((__m128d){c[0], __A[1]});
+}]]></programlisting></para>
+
+  <para>The Intel Intrinsics also include the not forms of the relational 
+  compares:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpnlt_sd (__m128d __A, __m128d __B)
+{
+  return (__m128d)__builtin_ia32_cmpnltsd ((__v2df)__A, (__v2df)__B);
+}
+
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpnle_sd (__m128d __A, __m128d __B)
+{
+  return (__m128d)__builtin_ia32_cmpnlesd ((__v2df)__A, (__v2df)__B);
+}]]></programlisting></para>
+
+  <para>The PowerISA and OpenPOWER ABI, or GCC PowerPC Altivec Built-in 
+  documentation do not provide any direct equivalents to the  not greater than 
+  class of compares. Again you don't really need them if you know Boolean 
+  algebra. We can use identities like 
+  {<emphasis role="bold">not</emphasis> (A &lt; B) iff A &gt;= B} and 
+  {<emphasis role="bold">not</emphasis> (A 
+  &lt;= B) iff A &gt; B}. So the PPC64LE implementation follows:
+  <programlisting><![CDATA[extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpnlt_pd (__m128d __A, __m128d __B)
+{
+  return ((__m128d)vec_cmpge (__A, __B));
+}
+
+extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
+_mm_cmpnle_pd (__m128d __A, __m128d __B)
+{
+  return ((__m128d)vec_cmpgt (__A, __B));
+}]]></programlisting></para>
+
+  <para>These patterns repeat for the scalar version of the 
+  <emphasis role="bold">not</emphasis> compares. And 
+  in general the larger pattern described in this chapter applies to the other 
+  float and integer types with similar interfaces.</para>
+
+
+</section>
+
--- a/intrinsic.xml
+++ b/intrinsic.xml
--- a/pom.xml
+++ b/pom.xml
@ -0,0 +1,22 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project  xmlns="http://maven.apache.org/POM/4.0.0"
+  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+
+  <parent>
+     <groupId>org.openpowerfoundation.docs</groupId>
+     <artifactId>master-pom</artifactId>
+     <version>1.0.0-SNAPSHOT</version>
+     <relativePath>../Docs-Master/pom.xml</relativePath>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <artifactId>workgroup-pom</artifactId>
+  <packaging>pom</packaging>
+
+  <modules>
+    <!-- TODO: Add new documents are build in the project, add their directories to this list to
+               enable all document builds from the top level -->
+    <module>Vector_Intrinsics</module>
+  </modules>
+</project>