I was curious how vectorization works and how to write such code. To start simple, I chose 7-bit encoding as my test subject. I'm not expecting improved throughput in any way. My implementation works just fine until we go over 2^31, because the compare doesn't do an unsigned comparison. I would like to know how this could be tackled. Looking for some feedback, hints & improvements!
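For reference, the scalar loop this is meant to replace looks roughly like this (a sketch of the usual byte-at-a-time encoder, not the exact unrolled version I benchmark below):

static int Write7BitEncoded(uint value, Span<byte> output)
{
    // Emit 7 payload bits per byte, least-significant group first,
    // setting 0x80 on every byte that has another byte after it.
    int i = 0;
    while (value >= 0x80)
    {
        output[i++] = (byte)(value | 0x80);
        value >>= 7;
    }
    output[i++] = (byte)value;
    return i; // 1 to 5 bytes written
}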
Here is my implementation:
internal static class VariableInt
{
    internal static void Write7BitEncodedVector(uint value, Span<byte> output)
    {
        // Constants
        Vector128<uint> shiftBy = VariableInt.ReadVector<uint>(VariableInt.SHIFT);
        Vector128<byte> shuffle = VariableInt.ReadVector<byte>(VariableInt.SHUFFLE);
        Vector128<uint> mask = VariableInt.ReadVector<uint>(VariableInt.MASK);
        // A, B, C, D
        Vector128<uint> input = Vector128.Create(value);
        // A, B >> 7, C >> 14, D >> 21
        Vector128<uint> shift = Avx2.ShiftRightLogicalVariable(input, shiftBy);
        // If greater than or equal to 0x00000080, 0x00004000, 0x00200000, 0x10000000
        Vector128<int> compareResult = Sse2.CompareGreaterThan(shift.AsInt32(), mask.AsInt32());
        Vector128<uint> or = Sse2.And(compareResult.AsUInt32(), mask); // Get bit on 0x80
        Vector128<uint> final = Sse2.Or(shift, or); // Merge the "continue" flag, 0x80
        // Map every int's first byte to one integer
        Vector128<byte> shuffled = Ssse3.Shuffle(final.AsByte(), shuffle);
        ref uint outputRefAsUint = ref Unsafe.As<byte, uint>(ref MemoryMarshal.GetReference(output));
        // Write the shuffled bytes to the output in a single store
        outputRefAsUint = shuffled.AsUInt32().GetElement(0);
        // 7-bit encoding can have 5 bytes, write the last byte manually
        output[4] = (byte)(value >> 28);
    }

    private static Vector128<T> ReadVector<T>(ReadOnlySpan<sbyte> data) where T : struct =>
        Unsafe.As<T, Vector128<T>>(ref Unsafe.As<sbyte, T>(ref MemoryMarshal.GetReference(data)));

    private static ReadOnlySpan<sbyte> SHIFT => new sbyte[]
    {
        0, 0, 0, 0,  // shift by 0
        7, 0, 0, 0,  // shift by 7
        14, 0, 0, 0, // shift by 14
        21, 0, 0, 0  // shift by 21
    };

    private static ReadOnlySpan<sbyte> MASK => new sbyte[]
    {
        -128, 0, 0, 0, // 0x80
        -128, 0, 0, 0, // 0x80
        -128, 0, 0, 0, // 0x80
        -128, 0, 0, 0  // 0x80
    };

    private static ReadOnlySpan<sbyte> SHUFFLE => new sbyte[]
    {
        0, 4, 8, 12,
        -1, -1, -1, -1,
        -1, -1, -1, -1,
        -1, -1, -1, -1
    };
}
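Usage looks like this; note the method always writes five bytes, so the destination must be at least that long:

Span<byte> buffer = stackalloc byte[5];
VariableInt.Write7BitEncodedVector(300, buffer);
// buffer: AC 02 00 00 00 (300 = 0b10_0101100 -> 0x2C | 0x80, then 0x02)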
And since the point of vectorization is to improve throughput, let's throw in the benchmarks too! Nothing interesting here, really. Tested on an i7-7700HQ.
| Method | Value | Mean | Error | StdDev |
|-------------- |---------- |---------:|----------:|----------:|
| WriteUnrolled | 0 | 6.614 ns | 0.1574 ns | 0.1395 ns |
| WriteUnrolled | 128 | 7.523 ns | 0.1718 ns | 0.1607 ns |
| WriteUnrolled | 16384 | 7.697 ns | 0.1244 ns | 0.1164 ns |
| WriteUnrolled | 2097152 | 8.259 ns | 0.1460 ns | 0.1366 ns |
| WriteUnrolled | 268435456 | 8.857 ns | 0.0888 ns | 0.0787 ns |
| Method | Mean | Error | StdDev |
|------------ |---------:|----------:|----------:|
| WriteVector | 7.441 ns | 0.1753 ns | 0.1640 ns |
For the curious, the JITted assembly is as follows:
mov r8,1A4F0843780h
vmovupd xmm0,xmmword ptr [r8]
mov r8,1A4F0843790h
vmovupd xmm1,xmmword ptr [r8]
mov r8,1A4F0843770h
vmovupd xmm2,xmmword ptr [r8]
vmovd xmm3,ecx
vpbroadcastd xmm3,xmm3
vpsrlvd xmm0,xmm3,xmm0
vpcmpgtd xmm3,xmm0,xmm2
vpand xmm2,xmm3,xmm2
vpor xmm0,xmm0,xmm2
vpshufb xmm0,xmm0,xmm1
vmovd r8d,xmm0
mov dword ptr [rax],r8d
cmp edx,4
jbe 00007FFC21009B31 #Bounds Check
shr ecx,1Ch
mov byte ptr [rax+4],cl
add rsp,28h
ret
- Please don't change your code once answers start coming in. That gets confusing very fast for other users passing by. If you have a new version of your code you'd like to get reviewed, please post a new question. Feel free to include a link to this question for extra context. – Mast ♦, Jan 21, 2020 at 12:31
1 Answer
> My implementation works just fine, until we go over 2^31 due to compare not doing unsigned comparison.
The "incompleteness" of the set of comparisons is an old problem, and the workarounds are also old. Here are some options, with different trade-offs:
x >=u y
->max_u(x, y) == x
. With SSE4.1 or later, PMAXUD exists, so this is supported. It's greater than or equal, so the constants need to be adjusted.x >u y
->(x ^ INT_MIN) >s (y ^ INT_MIN)
. This even worked back in MMX. Might be considered "more strange". The XOR can be replaced by addition or subtraction, which may be useful if that results opportunities to merge those operations, but here it would not and then XOR is faster on average (as in, across different CPUs: Haswell can executepxor
on p015 butpaddd
only on p15, Ryzen has a similar sort of deal, for Skylake it doesn't matter).
Only in AVX512 was vpcmpud
(with a comparison predicate operand) added.
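Translated to the question's C# intrinsics, both workarounds might look like this (a sketch; the helper names are mine, and it hasn't been wired into the original routine):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class UnsignedCompare
{
    // x >=u y via max_u(x, y) == x; needs SSE4.1 for pmaxud.
    internal static Vector128<uint> GreaterThanOrEqual(Vector128<uint> x, Vector128<uint> y) =>
        Sse2.CompareEqual(Sse41.Max(x, y), x);

    // x >u y via the sign-flip trick: XOR both sides with int.MinValue,
    // then the signed compare yields the unsigned result. Plain SSE2.
    internal static Vector128<uint> GreaterThan(Vector128<uint> x, Vector128<uint> y)
    {
        Vector128<int> bias = Vector128.Create(int.MinValue);
        return Sse2.CompareGreaterThan(
            Sse2.Xor(x.AsInt32(), bias),
            Sse2.Xor(y.AsInt32(), bias)).AsUInt32();
    }
}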
    vmovd xmm3,ecx
    vpbroadcastd xmm3,xmm3
This pattern comes from `Vector128.Create(value)`. It may be fine, but if `value` originally comes from memory, then it would be better to try to get it broadcast directly from memory: broadcast-from-memory is "free" (no cost on top of the load itself), which `vmovd` and `vpbroadcastd` on top of the load are not (obviously, they're something rather than nothing). You could pass a pointer and use `Avx2.BroadcastScalarToVector128`. That wouldn't be good if the value didn't come from memory, though; forcing a store/reload isn't worth it.
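For instance, a pointer-taking overload might look like this (a sketch; the overload is hypothetical, and the caller must pin `value` if it lives on the managed heap):

// Hypothetical pointer-taking overload (requires unsafe code and AVX2).
internal static unsafe void Write7BitEncodedVector(uint* value, Span<byte> output)
{
    // Single vpbroadcastd xmm, dword ptr [mem]; no separate vmovd needed.
    Vector128<uint> input = Avx2.BroadcastScalarToVector128(value);
    // ... the rest of the routine is unchanged from the Span/uint version ...
}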
- Thank you for your answer! By mentioning the max value I got an idea: what if I instead checked for min values and then ORed them into the final value? And indeed, this works for all values! By doing this I was also able to get rid of the extra `vandpd` instruction (see the sketch after these comments). Also, regarding `Avx2.ShiftRightLogicalVariable`: I'm shifting the values by multiples of seven. I don't think the fixed-count shifts can do this? The constants are at the bottom if you didn't notice. – Joni, Jan 21, 2020 at 12:33
- Regarding the `vmovd` and `vpbroadcastd` pattern, I did recognize it but had no idea what the cause was. I changed it to `Avx2.BroadcastScalarToVector128`, but I don't think that makes any difference. Overall, after all the changes, the benchmarks only show noise, nothing drastic. – Joni, Jan 21, 2020 at 12:38
- @Joni taking the address of an operand and then loading from it is almost always bad; it will tend to get passed in a register, so that forces it to be stored just to reload it, possibly resulting in a load/store/reload sequence. Passing in the value by pointer may help, depending on where it originally comes from. – user555045, Jan 22, 2020 at 2:41
- Passing the operand produces `lea rcx,[rsp+30h]` followed by `vpbroadcastd xmm3,dword ptr [rsp+30h]`, while passing by reference (`ref uint`) gives `mov r8,1BFA5C837E0h` followed by `vmovupd xmm2,xmmword ptr [r8]` and `mov r8,rcx`, while passing by pointer (`uint*`) gives `vpbroadcastd xmm3,dword ptr [rcx]`, but that requires pinning the memory when it comes from the heap. – Joni, Jan 22, 2020 at 16:08
\$\begingroup\$ Actually, I made few overloads for passing by value and passing by ref which points to method that gets a uint* pointer. Looks like the JIT can better optimize out the overloads and inline them accordingly without extra instructions in between. Nothing catches my eye anymore at least and I think all the optimizations that can be done are there. \$\endgroup\$Joni– Joni2020年01月22日 16:30:27 +00:00Commented Jan 22, 2020 at 16:30