vec_bperm picture #14

PC has requested a picture.

vec_bperm presents some unique issues when running in little-endian mode.
The function signatures of the builtin are:

vector unsigned char vec_bperm (vector unsigned char, vector unsigned char);
vector unsigned long long vec_bperm (vector unsigned __int128, vector unsigned char);
vector unsigned long long vec_bperm (vector unsigned long long, vector unsigned char);

The underlying instructions used for the implementation are, respectively:

vbpermq
vbpermq
vbpermd

Comparing the first two instances which use the same underlying implementation...

Define a constant permute vector:

vector unsigned char b = { 0x7F, 0x7E, 0x7D, 0x7C, 0x7B, 0x7A, 0x79, 0x78, 0x77, 0x76, 0x75, 0x74, 0x73, 0x72, 0x71, 0x70 };

...which seeks to collect the low-order 16 bits of the source vector.

Define vectors which have the same in-register representation:

        vector unsigned long long a64 = { 0xFFFFFFFFFFFFFFAA, 0xFFFFFFFFFFFFFFFF };
        vector unsigned char a8 = { 0xAA, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

The in-register (big-endian) representation for both is:

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAA

Since the same instruction is used identically in the implementation, the result is the same:

FFFFFFFFFFFFFFAAFFFFFFFFFFFFFFFF

This doesn't match the current description in the document, which says:

If bit index j is smaller than 128, bit i of r is set to the value of the j th bit of a.

I think the above statement should instead say:

If bit index j is smaller than 128, bit i of r[0] is set to the value of the j th bit of a.

...at least for a return type of vector unsigned long long.

The above statement would have to be changed as follows for a return type of vector unsigned char:

If bit index j is smaller than 128, bit i of r[8/9] is set to the value of the j th bit of a.

...or something like that. A simpler solution would be to define the return value in this case to be unsigned long long.

Questions:

Is the document in error with respect to the "r[0]" issue noted above?
Can the API be changed to reflect the correct return value of unsigned long long in all cases?

`vec_bperm` presents some unique issues when running in little-endian mode. The function signatures of the builtin are: ``` vector unsigned char vec_bperm (vector unsigned char, vector unsigned char); vector unsigned long long vec_bperm (vector unsigned __int128, vector unsigned char); vector unsigned long long vec_bperm (vector unsigned long long, vector unsigned char); ``` The underlying instructions used for the implementation are, respectively: - `vbpermq` - `vbpermq` - `vbpermd` Comparing the first two instances which use the same underlying implementation... Define a constant permute vector: ``` vector unsigned char b = { 0x7F, 0x7E, 0x7D, 0x7C, 0x7B, 0x7A, 0x79, 0x78, 0x77, 0x76, 0x75, 0x74, 0x73, 0x72, 0x71, 0x70 }; ``` ...which seeks to collect the low-order 16 bits of the source vector. Define vectors which have the same in-register representation: ``` vector unsigned long long a64 = { 0xFFFFFFFFFFFFFFAA, 0xFFFFFFFFFFFFFFFF }; vector unsigned char a8 = { 0xAA, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF }; ``` The in-register (big-endian) representation for both is: ``` FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAA ``` Since the same instruction is used identically in the implementation, the result is the same: ``` FFFFFFFFFFFFFFAAFFFFFFFFFFFFFFFF ``` This doesn't match the current description in the document, which says: > If bit index _j_ is smaller than 128, bit _i_ of **r** is set to the value of the _j_ th bit of **a**. I think the above statement should instead say: > If bit index _j_ is smaller than 128, bit _i_ of **r**[0] is set to the value of the _j_ th bit of **a**. ...at least for a return type of `vector unsigned long long`. The above statement would have to be changed as follows for a return type of `vector unsigned char`: > If bit index _j_ is smaller than 128, bit _i_ of **r**[8/9] is set to the value of the _j_ th bit of **a**. ...or something like that. A simpler solution would be to define the return value in this case to be `unsigned long long`. Questions: - Is the document in error with respect to the "**r**[0]" issue noted above? - Can the API be changed to reflect the correct return value of `unsigned long long` in all cases?

You're misunderstanding the endian restriction, I think. You have to provide data in big-endian order, and expect results in big-endian order, with all bit and byte designations defined in big-endian order. There is no expectation that providing little-endian data order will make any sense. The probable best to-do for this is to clarify this in the description.

The same is true for things like vec_pmsum_be. The data model for using such things is purely big-endian.

What we need to do is review all of these big-endian-only instructions and review the need for little-endian versions of them supported in a future version of the ISA. This may lead to creating new builtins that are guaranteed to handle both BE and LE usages.

You're misunderstanding the endian restriction, I think. You have to provide data in big-endian order, and expect results in big-endian order, with all bit and byte designations defined in big-endian order. There is no expectation that providing little-endian data order will make any sense. The probable best to-do for this is to clarify this in the description. The same is true for things like vec_pmsum_be. The data model for using such things is purely big-endian. What we need to do is review all of these big-endian-only instructions and review the need for little-endian versions of them supported in a future version of the ISA. This may lead to creating new builtins that are guaranteed to handle both BE and LE usages.

Actually, the "Endian considerations" already seem to describe this pretty adequately, but if you want to tickle that language, please suggest something.

This also applies to vec_cipher_be, vec_cipherlast_be, vec_gb, vec_ncipher_be, vec_ncipherlast_be, vec_pmsum_be, vec_sbox_be, and vec_shasigma_be. It's unfortunate that vec_bperm wasn't originally named vec_bperm_be, but it's too late to correct this oversight except by allowing an alias.

Actually, the "Endian considerations" already seem to describe this pretty adequately, but if you want to tickle that language, please suggest something. This also applies to vec_cipher_be, vec_cipherlast_be, vec_gb, vec_ncipher_be, vec_ncipherlast_be, vec_pmsum_be, vec_sbox_be, and vec_shasigma_be. It's unfortunate that vec_bperm wasn't originally named vec_bperm_be, but it's too late to correct this oversight except by allowing an alias.

How does this:

The in-register (big-endian) representation for both is:
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAA
Since the same instruction is used identically in the implementation, the result is the same:
FFFFFFFFFFFFFFAAFFFFFFFFFFFFFFFF

ever match this:

If bit index j is smaller than 128, bit i of r is set to the value of the j th bit of a.

Even the ISA doesn't match the language immediately above:

For each integer value i from 0 to 15, do the following.

Let index be the contents of byte element i of VR[VRB].
If index is less than 128, then the contents of bit index of VR[VRA] are placed into bit 48+i of doubleword element i of VR[VRT]. Otherwise, bit 48+i of doubleword element i of VR[VRT] is set to 0.

The contents of bits 0:47 of VR[VRT] are set to 0.
The contents of bits 64:127 of VR[VRT] are set to 0.

The instruction stores the results in the low-order bits of the first (high-order) doubleword element of the vector. So, the description in this document is not correct, even for BE. And, it is non-sensical when treated as vector unsigned char, because the results show up in elements [6..7] in BE.

As an example, this series of code attempts to BE-ize and un-BE-ize the use of the instruction on an LE system, here the goal is simply to produce the low-order (BE) 16 bits, in reverse:

vector unsigned char a8_le = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xA9 };
vector unsigned char pcv_uc = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
vector unsigned char a8_be = vec_perm (a8_le, a8_le, pcv_uc); out_uc_be (a8_be);
vector unsigned char b_le = { 0x7F, 0x7E, 0x7D, 0x7C, 0x7B, 0x7A, 0x79, 0x78, 0x77, 0x76, 0x75, 0x74, 0x73, 0x72, 0x71, 0x70 };
vector unsigned char b_be = vec_perm (b_le, b_le, pcv_uc); out_uc_be (b_be);
ruc = vec_bperm (a8_be, b_be); out_uc_be (ruc);
ruc = vec_perm (ruc, ruc, pcv_uc); out_uc (ruc);

(Note that I display the "BE" vectors in BE-order.) The output is:

ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff a9 
7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 70 
00 00 00 00 00 00 95 ff 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 95 ff 00 00 00 00 00 00 00 00

The low order 16 bits are FF A9, which becomes 95 FF when reversed.
Note that this result shows up in elements 6 and 7 of the vector unsigned char in both BE and LE (after swapping).

This simply does not match the current description in this document.

How does this: > The in-register (big-endian) representation for both is: > > ``` > FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAA > ``` > > Since the same instruction is used identically in the implementation, the result is the same: > > ``` > FFFFFFFFFFFFFFAAFFFFFFFFFFFFFFFF > ``` ever match this: > If bit index _j_ is smaller than 128, bit _i_ of **r** is set to the value of the _j_ th bit of **a**. Even the ISA doesn't match the language immediately above: > For each integer value _i_ from 0 to 15, do the following. > > Let _index_ be the contents of byte element _i_ of _VR[VRB]_. > > If _index_ is less than _128_, then the contents of bit _index_ of _VR[VRA]_ are placed into bit _48+i_ of doubleword element _i_ of _VR[VRT]_. Otherwise, bit _48+i_ of doubleword element _i_ of _VR[VRT]_ is set to _0_. > The contents of bits _0:47_ of _VR[VRT]_ are set to _0_. > The contents of bits _64:127_ of VR[VRT] are set to _0_. The instruction stores the results in the low-order bits of the **first** (high-order) **doubleword** element of the vector. So, the description in this document is not correct, even for BE. And, it is non-sensical when treated as `vector unsigned char`, because the results show up in elements [6..7] in BE. As an example, this series of code attempts to BE-ize and un-BE-ize the use of the instruction on an LE system, here the goal is simply to produce the low-order (BE) 16 bits, in reverse: ``` vector unsigned char a8_le = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xA9 }; vector unsigned char pcv_uc = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 }; vector unsigned char a8_be = vec_perm (a8_le, a8_le, pcv_uc); out_uc_be (a8_be); vector unsigned char b_le = { 0x7F, 0x7E, 0x7D, 0x7C, 0x7B, 0x7A, 0x79, 0x78, 0x77, 0x76, 0x75, 0x74, 0x73, 0x72, 0x71, 0x70 }; vector unsigned char b_be = vec_perm (b_le, b_le, pcv_uc); out_uc_be (b_be); ruc = vec_bperm (a8_be, b_be); out_uc_be (ruc); ruc = vec_perm (ruc, ruc, pcv_uc); out_uc (ruc); ``` (Note that I display the "BE" vectors in BE-order.) The output is: ``` ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff a9 7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 70 00 00 00 00 00 00 95 ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 95 ff 00 00 00 00 00 00 00 00 ``` The low order 16 bits are `FF A9`, which becomes `95 FF` when reversed. Note that this result shows up in elements 6 and 7 of the `vector unsigned char` in both BE and LE (after swapping). This simply does not match the current description in this document.

The description is definitely wrong, especially for vbpermq. This has been misdescribed since the original document was written. I will work on a better description for it that matches the ISA.

But please don't use LE examples, because they simply do not apply meaningfully to these instructions and it complicates everything. vec_bperm does not follow the bi-endian programming model.

The description is definitely wrong, especially for vbpermq. This has been misdescribed since the original document was written. I will work on a better description for it that matches the ISA. But please don't use LE examples, because they simply do not apply meaningfully to these instructions and it complicates everything. vec_bperm does not follow the bi-endian programming model.

For vbpermq, the only error is that instead of "bit i of r", the third bullet should say "bit i+48 of r". That leaves bits 0:47 and 64:127 as 0, as they should be.

For vbpermd, the first subbullet should replace "jth element of b" with "(i*8+j)th element of b." The second and third subbullets should replace "bit j" with "bit 56+j".

That should follow the ISA correctly. Thanks for pointing this out! Please let me know if you disagree with those changes.

For vbpermq, the only error is that instead of "bit i of r", the third bullet should say "bit i+48 of r". That leaves bits 0:47 and 64:127 as 0, as they should be. For vbpermd, the first subbullet should replace "jth element of b" with "(i*8+j)th element of b." The second and third subbullets should replace "bit j" with "bit 56+j". That should follow the ISA correctly. Thanks for pointing this out! Please let me know if you disagree with those changes.

Opened #47 to track this; I'll fix.

Description fixed in #47. Please let me know if a BE example now makes better sense.

Labels Milestones

vec_bperm picture #14