Finish chapter 2.

7 years ago · cd7e3f4a4b
parent 43c3a2851a
commit cd7e3f4a4b
1 changed files with 143 additions and 12 deletions
--- a/Intrinsics_Reference/ch_biendian.xml
+++ b/Intrinsics_Reference/ch_biendian.xml
@@ -1034,18 +1034,149 @@ register vector double vd = vec_splats(*double_ptr);</programlisting>
  </section>

  <section>
-    <title>Examples</title>
-    <para>filler</para>
-  </section>
-
-  <section>
-    <title>Limitations</title>
-    <para>
-      <code>vec_sld</code>
-    </para>
-    <para>
-      <code>vec_perm</code>
-    </para>
+    <title>Examples and Limitations</title>
+    <section>
+      <title>Unaligned vector access</title>
+      <para>
+	A common programming error is to cast a pointer to a base type
+	(such as <code>int</code>) to a pointer of the corresponding
+	vector type (such as <code>vector int</code>), and then
+	dereference the pointer.  This constitutes undefined behavior,
+	because it casts a pointer with a smaller alignment
+	requirement to a pointer with a larger alignment requirement.
+	Compilers may not produce the code you expect in the presence
+	of undefined behavior.
+      </para>
+      <para>
+	Thus, do not write the following:
+      </para>
+      <programlisting>  int a[4096];
+  vector int x = *((vector int *) a);</programlisting>
+      <para>
+	Instead, write this:
+      </para>
+      <programlisting>  int a[4096];
+  vector int x = vec_xl (0, a);</programlisting>
+    </section>
+    <section>
+      <title>vec_sld is not bi-endian</title>
+      <para>
+	One oddity in the bi-endian vector programming model is that
+	<code>vec_sld</code> has big-endian semantics for code
+	compiled for both big-endian and little-endian targets.  That
+	is, any code that uses <code>vec_sld</code> without guarding
+	it with a test on endianness is likely to be incorrect.
+      </para>
+      <para>
+	At the time that the bi-endian model was being developed, it
+	was discovered that existing code in several Linux packages
+	was using <code>vec_sld</code> in order to perform multiplies,
+	or to otherwise shift portions of base elements left.  A
+	straightforward little-endian implementation of
+	<code>vec_sld</code> would concatenate the two input vectors
+	in reverse order and shift bytes to the right.  This would
+	only give compatible results for <code>vector char</code>
+	types.  Those using this intrinsic as a cheap multiply, or to
+	shift bytes within larger elements, would see different
+	results on little-endian versus big-endian with such an
+	implementation.  Therefore it was decided that
+	<code>vec_sld</code> would not have a bi-endian
+	implementation.
+      </para>
+      <para>
+	<code>vec_sro</code> is not bi-endian for similar reasons.
+      </para>
+    </section>
+    <section>
+      <title>Limitations on bi-endianness of vec_perm</title>
+      <para>
+	The <code>vec_perm</code> intrinsic is bi-endian, provided
+	that it is used to reorder entire elements of the input
+	vectors.
+      </para>
+      <para>
+	To see why, consider the code generated for this example:
+      </para>
+      <programlisting>  vector int t;
+  vector int a = (vector int){0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f};
+  vector int b = (vector int){0x10111213, 0x14151617, 0x18191a1b, 0x1c1d1e1f};
+  vector char c = (vector char){0,1,2,3,28,29,30,31,12,13,14,15,20,21,22,23};
+  t = vec_perm (a, b, c);</programlisting>
+      <para>
+	For big endian, a compiler should generate:
+      </para>
+      <programlisting>  vperm  t,a,b,c</programlisting>
+      <para>
+	For little endian targeting a POWER8 system, a compiler should
+	generate:
+      </para>
+      <programlisting>  vnand  d,c,c
+  vperm  t,b,a,d</programlisting>
+      <para>
+	For little endian targeting a POWER9 system, a compiler should
+	generate:
+      </para>
+      <programlisting>  vpermr  t,b,a,c</programlisting>
+      <para>
+	Note that the <code>vpermr</code> instruction itself performs
+	the modification of the permute control vector (PCV)
+	<code>c</code> that requires a separate <code>vnand</code>
+	instruction on POWER8.  Because the hardware reads only the
+	bottom 5 bits of each element of the PCV, complementing the
+	PCV has the effect of subtracting each original element from 31.
+      </para>
+      <para>
+	Note also that the PCV <code>c</code> has element values that
+	are contiguous in groups of 4.  This selects entire elements
+	from the input vectors <code>a</code> and <code>b</code> to
+	reorder.  Thus the intent of the code is to select the first
+	integer element of <code>a</code>, the last integer element of
+	<code>b</code>, the last integer element of <code>a</code>,
+	and the second integer element of <code>b</code>, in that
+	order.
+      </para>
+      <para>
+	For little endian, each element of the original PCV is
+	subtracted from 31, giving the modified PCV
+	{31,30,29,28,3,2,1,0,19,18,17,16,11,10,9,8}.  Since the
+	elements appear in reverse order in a register when loaded
+	from little-endian memory, they appear left to right as
+	{8,9,10,11,16,17,18,19,0,1,2,3,28,29,30,31}.  So the following
+	<code>vperm</code> instruction will again select entire
+	elements using the groups of 4 contiguous bytes, and the
+	values of the integers will be reordered without compromising
+	each integer's contents.  The fact that the little-endian
+	result matches the big-endian result is left as an exercise to
+	the reader.
+      </para>
+      <para>
+	Now, suppose instead that the original PCV does not reorder
+	entire integers at once:
+      </para>
+      <programlisting>  vector char c = (vector char){0,20,31,4,7,17,6,19,30,3,2,8,9,13,5,22};</programlisting>
+      <para>
+	The result of the big-endian implementation would be:
+      </para>
+      <programlisting>  t = {0x00141f04, 0x07110613, 0x1e030208, 0x090d0516};</programlisting>
+      <para>
+	For little-endian, the modified PCV would be
+	{31,11,0,27,24,14,25,12,1,28,29,23,22,18,26,9}, appearing in
+	the register as
+	{9,26,18,22,23,29,28,1,12,25,14,24,27,0,11,31}.  The final
+	little-endian result would be
+      </para>
+      <programlisting>  t = {0x071c1703, 0x10051204, 0x0b01001d, 0x15060e0a};</programlisting>
+      <para>
+	which bears no resemblance to the big-endian result.
+      </para>
+      <para>
+	The lesson here is to use <code>vec_perm</code> only to
+	reorder entire elements of a vector.  If you must use
+	<code>vec_perm</code> for another purpose, your code must
+	include a test for endianness and separate algorithms for
+	big- and little-endian.
+      </para>
+    </section>
  </section>

 </chapter>
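
The vec_perm walkthrough in the hunk above can be checked without POWER hardware. The following is a minimal sketch in portable C, not the real intrinsic: it models a vector register as 16 bytes (byte 0 leftmost) and follows the code generation described in the text (vperm on big endian, vnand followed by vperm with swapped inputs on little endian). The names model_vperm, model_vec_perm, and the register-image helpers are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vperm: result byte i is byte (pcv[i] & 31) of the
   32-byte concatenation ra || rb.  Only the low 5 bits are used. */
static void model_vperm(const uint8_t ra[16], const uint8_t rb[16],
                        const uint8_t pcv[16], uint8_t rt[16])
{
    uint8_t cat[32];
    memcpy(cat, ra, 16);
    memcpy(cat + 16, rb, 16);
    for (int i = 0; i < 16; i++)
        rt[i] = cat[pcv[i] & 31];
}

/* Build a register image from four 32-bit elements.  On little endian
   the element order is reversed in the register; within each element
   the most significant byte is still leftmost. */
static void ints_to_reg(const uint32_t v[4], int little_endian,
                        uint8_t reg[16])
{
    for (int e = 0; e < 4; e++) {
        int slot = little_endian ? 3 - e : e;
        for (int b = 0; b < 4; b++)
            reg[4 * slot + b] = (uint8_t)(v[e] >> (24 - 8 * b));
    }
}

/* Inverse of ints_to_reg. */
static void reg_to_ints(const uint8_t reg[16], int little_endian,
                        uint32_t v[4])
{
    for (int e = 0; e < 4; e++) {
        int slot = little_endian ? 3 - e : e;
        v[e] = ((uint32_t)reg[4 * slot] << 24)
             | ((uint32_t)reg[4 * slot + 1] << 16)
             | ((uint32_t)reg[4 * slot + 2] << 8)
             |  (uint32_t)reg[4 * slot + 3];
    }
}

/* Build a register image from sixteen byte elements (the PCV). */
static void chars_to_reg(const uint8_t c[16], int little_endian,
                         uint8_t reg[16])
{
    for (int i = 0; i < 16; i++)
        reg[little_endian ? 15 - i : i] = c[i];
}

/* Hypothetical model of t = vec_perm(a, b, c) for vector int inputs. */
static void model_vec_perm(const uint32_t a[4], const uint32_t b[4],
                           const uint8_t c[16], int little_endian,
                           uint32_t t[4])
{
    uint8_t ra[16], rb[16], rc[16], rt[16];
    ints_to_reg(a, little_endian, ra);
    ints_to_reg(b, little_endian, rb);
    chars_to_reg(c, little_endian, rc);
    if (little_endian) {
        for (int i = 0; i < 16; i++)
            rc[i] = (uint8_t)~rc[i];      /* vnand d,c,c  */
        model_vperm(rb, ra, rc, rt);      /* vperm t,b,a,d */
    } else {
        model_vperm(ra, rb, rc, rt);      /* vperm t,a,b,c */
    }
    reg_to_ints(rt, little_endian, t);
}
```

Calling model_vec_perm with the a, b, and first c from the example, once with little_endian set to 0 and once with 1, should give the same four integers in both cases. With the second PCV, which splits integers apart, the two calls should give the differing big-endian and little-endian values listed in the text.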