I was curious how vectorization works and how to write such code. To start simple, I chose 7-bit encoding as my test subject. I'm not expecting improved throughput in any way. My implementation works just fine until we go over 2^31, because the compare doesn't do an unsigned comparison. I would like to know how this could be tackled. Looking for some feedback, hints & improvements!
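For reference, the scalar loop this is meant to replace looks roughly like this (a sketch of the usual byte-at-a-time encoder, not the exact unrolled version I benchmark below):

static int Write7BitEncoded(uint value, Span<byte> output)
{
    // Emit 7 payload bits per byte, least-significant group first,
    // setting 0x80 on every byte that has another byte after it.
    int i = 0;
    while (value >= 0x80)
    {
        output[i++] = (byte)(value | 0x80);
        value >>= 7;
    }
    output[i++] = (byte)value;
    return i; // 1 to 5 bytes written
}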
Here is my implementation:
internal static class VariableInt
{
    internal static void Write7BitEncodedVector(uint value, Span<byte> output)
    {
        // Constants
        Vector128<uint> shiftBy = VariableInt.ReadVector<uint>(VariableInt.SHIFT);
        Vector128<byte> shuffle = VariableInt.ReadVector<byte>(VariableInt.SHUFFLE);
        Vector128<uint> mask = VariableInt.ReadVector<uint>(VariableInt.MASK);
        // A, B, C, D
        Vector128<uint> input = Vector128.Create(value);
        // A, B >> 7, C >> 14, D >> 21
        Vector128<uint> shift = Avx2.ShiftRightLogicalVariable(input, shiftBy);
        // If greater than or equal to 0x00000080, 0x00004000, 0x00200000, 0x10000000
        Vector128<int> compareResult = Sse2.CompareGreaterThan(shift.AsInt32(), mask.AsInt32());
        Vector128<uint> or = Sse2.And(compareResult.AsUInt32(), mask); // Get bit on 0x80
        Vector128<uint> final = Sse2.Or(shift, or); // Merge the "continue" flag, 0x80
        // Map every int's first byte to one integer
        Vector128<byte> shuffled = Ssse3.Shuffle(final.AsByte(), shuffle);
        ref uint outputRefAsUint = ref Unsafe.As<byte, uint>(ref MemoryMarshal.GetReference(output));
        // Write the shuffled bytes to the output in a single store
        outputRefAsUint = shuffled.AsUInt32().GetElement(0);
        // 7-bit encoding can have 5 bytes, write the last byte manually
        output[4] = (byte)(value >> 28);
    }

    private static Vector128<T> ReadVector<T>(ReadOnlySpan<sbyte> data) where T : struct =>
        Unsafe.As<T, Vector128<T>>(ref Unsafe.As<sbyte, T>(ref MemoryMarshal.GetReference(data)));

    private static ReadOnlySpan<sbyte> SHIFT => new sbyte[]
    {
        0, 0, 0, 0,  // shift by 0
        7, 0, 0, 0,  // shift by 7
        14, 0, 0, 0, // shift by 14
        21, 0, 0, 0  // shift by 21
    };

    private static ReadOnlySpan<sbyte> MASK => new sbyte[]
    {
        -128, 0, 0, 0, // 0x80
        -128, 0, 0, 0, // 0x80
        -128, 0, 0, 0, // 0x80
        -128, 0, 0, 0  // 0x80
    };

    private static ReadOnlySpan<sbyte> SHUFFLE => new sbyte[]
    {
        0, 4, 8, 12,
        -1, -1, -1, -1,
        -1, -1, -1, -1,
        -1, -1, -1, -1
    };
}
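Usage looks like this; note the method always writes five bytes, so the destination must be at least that long:

Span<byte> buffer = stackalloc byte[5];
VariableInt.Write7BitEncodedVector(300, buffer);
// buffer: AC 02 00 00 00 (300 = 0b10_0101100 -> 0x2C | 0x80, then 0x02)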
And since the point of vectorization is to improve throughput, let's throw in the benchmarks too! Nothing interesting here, really. Tested on an i7-7700HQ.
| Method | Value | Mean | Error | StdDev |
|-------------- |---------- |---------:|----------:|----------:|
| WriteUnrolled | 0 | 6.614 ns | 0.1574 ns | 0.1395 ns |
| WriteUnrolled | 128 | 7.523 ns | 0.1718 ns | 0.1607 ns |
| WriteUnrolled | 16384 | 7.697 ns | 0.1244 ns | 0.1164 ns |
| WriteUnrolled | 2097152 | 8.259 ns | 0.1460 ns | 0.1366 ns |
| WriteUnrolled | 268435456 | 8.857 ns | 0.0888 ns | 0.0787 ns |
| Method | Mean | Error | StdDev |
|------------ |---------:|----------:|----------:|
| WriteVector | 7.441 ns | 0.1753 ns | 0.1640 ns |
For the curious, the JITted assembly is as follows:
mov r8,1A4F0843780h
vmovupd xmm0,xmmword ptr [r8]
mov r8,1A4F0843790h
vmovupd xmm1,xmmword ptr [r8]
mov r8,1A4F0843770h
vmovupd xmm2,xmmword ptr [r8]
vmovd xmm3,ecx
vpbroadcastd xmm3,xmm3
vpsrlvd xmm0,xmm3,xmm0
vpcmpgtd xmm3,xmm0,xmm2
vpand xmm2,xmm3,xmm2
vpor xmm0,xmm0,xmm2
vpshufb xmm0,xmm0,xmm1
vmovd r8d,xmm0
mov dword ptr [rax],r8d
cmp edx,4
jbe 00007FFC21009B31 #Bounds Check
shr ecx,1Ch
mov byte ptr [rax+4],cl
add rsp,28h
ret
- Please don't change your code once answers start coming in. That gets confusing very fast for other users passing by. If you have a new version of your code you'd like to get reviewed, please post a new question. Feel free to include a link to this question for extra context. – Mast ♦, Jan 21, 2020 at 12:31
1 Answer
> My implementation works just fine, until we go over 2^31 due to compare not doing unsigned comparison.
The "incompleteness" of the set of comparisons is an old problem, and the workarounds are also old. Here are some options, with different trade-offs:
x >=u y
->max_u(x, y) == x
. With SSE4.1 or later, PMAXUD exists, so this is supported. It's greater than or equal, so the constants need to be adjusted.x >u y
->(x ^ INT_MIN) >s (y ^ INT_MIN)
. This even worked back in MMX. Might be considered "more strange". The XOR can be replaced by addition or subtraction, which may be useful if that results opportunities to merge those operations, but here it would not and then XOR is faster on average (as in, across different CPUs: Haswell can executepxor
on p015 butpaddd
only on p15, Ryzen has a similar sort of deal, for Skylake it doesn't matter).
Only in AVX512 was vpcmpud
(with a comparison predicate operand) added.
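Translated to the question's C# intrinsics, both workarounds might look like this (a sketch; the helper names are mine, and it hasn't been wired into the original routine):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class UnsignedCompare
{
    // x >=u y via max_u(x, y) == x; needs SSE4.1 for pmaxud.
    internal static Vector128<uint> GreaterThanOrEqual(Vector128<uint> x, Vector128<uint> y) =>
        Sse2.CompareEqual(Sse41.Max(x, y), x);

    // x >u y via the sign-flip trick: XOR both sides with int.MinValue,
    // then the signed compare yields the unsigned result. Plain SSE2.
    internal static Vector128<uint> GreaterThan(Vector128<uint> x, Vector128<uint> y)
    {
        Vector128<int> bias = Vector128.Create(int.MinValue);
        return Sse2.CompareGreaterThan(
            Sse2.Xor(x.AsInt32(), bias),
            Sse2.Xor(y.AsInt32(), bias)).AsUInt32();
    }
}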
    vmovd xmm3,ecx
    vpbroadcastd xmm3,xmm3
This pattern comes from `Vector128.Create(value)`. It may be fine, but if `value` originally comes from memory, then it would be better to try to get it broadcast directly from memory: broadcast-from-memory is "free" (no cost on top of the load itself), which `vmovd` and `vpbroadcastd` on top of the load are not (obviously, they're something rather than nothing). You could pass a pointer and use `Avx2.BroadcastScalarToVector128`. That wouldn't be good if the value didn't come from memory, though; forcing a store/reload isn't worth it.
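For instance, a pointer-taking overload might look like this (a sketch; the overload is hypothetical, and the caller must pin `value` if it lives on the managed heap):

// Hypothetical pointer-taking overload (requires unsafe code and AVX2).
internal static unsafe void Write7BitEncodedVector(uint* value, Span<byte> output)
{
    // Single vpbroadcastd xmm, dword ptr [mem]; no separate vmovd needed.
    Vector128<uint> input = Avx2.BroadcastScalarToVector128(value);
    // ... the rest of the routine is unchanged from the Span/uint version ...
}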
- Thank you for your answer! By mentioning the max value I got an idea: what if I instead checked for min values and then ORed them into the final value? And indeed, this works for all values! By doing this I was also able to get rid of the extra `vandpd` instruction (see the sketch after these comments). Also, regarding `Avx2.ShiftRightLogicalVariable`: I'm shifting the values by multiples of seven. I don't think the fixed-count shifts can do this? The constants are at the bottom if you didn't notice. – Joni, Jan 21, 2020 at 12:33
- Regarding the `vmovd` and `vpbroadcastd` pattern, I did recognize it but had no idea what the cause was. I changed it to `Avx2.BroadcastScalarToVector128`, but I don't think that makes any difference. Overall, after all the changes, the benchmarks only show noise, nothing drastic. – Joni, Jan 21, 2020 at 12:38
- @Joni taking the address of an operand and then loading from it is almost always bad; it will tend to get passed in a register, so that forces it to be stored just to reload it, possibly resulting in a load/store/reload sequence. Passing in the value by pointer may help, depending on where it originally comes from. – user555045, Jan 22, 2020 at 2:41
- Passing the operand produces `lea rcx,[rsp+30h]` followed by `vpbroadcastd xmm3,dword ptr [rsp+30h]`, while passing by reference (`ref uint`) gives `mov r8,1BFA5C837E0h` followed by `vmovupd xmm2,xmmword ptr [r8]` and `mov r8,rcx`, while passing by pointer (`uint*`) gives `vpbroadcastd xmm3,dword ptr [rcx]`, but that requires pinning the memory when it comes from the heap. – Joni, Jan 22, 2020 at 16:08
\$\begingroup\$ Actually, I made few overloads for passing by value and passing by ref which points to method that gets a uint* pointer. Looks like the JIT can better optimize out the overloads and inline them accordingly without extra instructions in between. Nothing catches my eye anymore at least and I think all the optimizations that can be done are there. \$\endgroup\$Joni– Joni2020年01月22日 16:30:27 +00:00Commented Jan 22, 2020 at 16:30