I came across this popular ~9-year-old SO question and decided to double-check its outcomes.
So, I have an AMD Ryzen 9 5950X, clang++ 10 and Linux; I copy-pasted the code from the question and here is what I got:
Sorted - 0.549702s:
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
Unsorted - 0.546554s:
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
I am pretty sure that the unsorted version turning out to be faster by 3 ms is just noise, but it seems it is not slower anymore.
So, what has changed in the CPU architecture (so that it is not an order of magnitude slower anymore)?
Here are results from multiple runs:
Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted: 0.542587 0.559719 0.53938 0.557909
Just in case, here is my main.cpp:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    // std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}
Update
With a larger number of elements (627680):
Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
I think the question is still relevant - almost no difference.
1 Answer
Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.
The rough C++ equivalent would be
sum += data[c] & -(data[c] >= 128);
The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
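A scalar rendering of that idea (an illustrative sketch in plain C++, not the compiler's actual output; it assumes arraySize is even) would be:

long long sum0 = 0, sum1 = 0;
for (unsigned c = 0; c < arraySize; c += 2)
{
    // -(data[c] >= 128) is all-ones when the test passes and all-zeros otherwise,
    // so the AND keeps or discards each element without any branch.
    sum0 += data[c]     & -(data[c]     >= 128);   // even elements
    sum1 += data[c + 1] & -(data[c + 1] >= 128);   // odd elements
}
long long sum = sum0 + sum1;   // combine the two running sums after the loop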
Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.

The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.
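To make the described instruction sequence concrete, here is a hand-written SSE2 intrinsics sketch of the same strategy: compare with pcmpgtd, mask with pand, sign-extend via the pxor/pcmpgtd/punpck trick, and keep two 64-bit accumulators. This is only an illustration of the technique, not clang's actual output; the function name sum_ge128_sse2 is invented for the example, and size is assumed to be a multiple of 4.

#include <emmintrin.h>   // SSE2 intrinsics

long long sum_ge128_sse2(const int* data, unsigned size)
{
    __m128i sum_lo = _mm_setzero_si128();            // accumulates elements 0,1 of each group of 4
    __m128i sum_hi = _mm_setzero_si128();            // accumulates elements 2,3 of each group of 4
    const __m128i threshold = _mm_set1_epi32(127);   // data[c] >= 128  <=>  data[c] > 127

    for (unsigned c = 0; c < size; c += 4)
    {
        __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + c));
        __m128i mask = _mm_cmpgt_epi32(v, threshold);   // pcmpgtd: all-ones where the test passes
        __m128i kept = _mm_and_si128(v, mask);          // pand: zero out non-matching elements

        // Sign-extend the four 32-bit values to 64 bits (the pxor/pcmpgtd/punpckldq idea):
        __m128i sign = _mm_cmpgt_epi32(_mm_setzero_si128(), kept);
        __m128i lo64 = _mm_unpacklo_epi32(kept, sign);  // elements 0 and 1, now 64-bit
        __m128i hi64 = _mm_unpackhi_epi32(kept, sign);  // elements 2 and 3, now 64-bit

        sum_lo = _mm_add_epi64(sum_lo, lo64);
        sum_hi = _mm_add_epi64(sum_hi, hi64);
    }

    // Combine the two running sums and then their two 64-bit lanes at the end.
    __m128i total = _mm_add_epi64(sum_lo, sum_hi);
    long long lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), total);
    return lanes[0] + lanes[1];
}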
- Also note that clang unrolls small loops by default (unlike GCC); if you want to see the simplest version of how it's vectorizing, use -fno-unroll-loops. godbolt.org/z/z6WYG9. (I threw in -march=nehalem to enable SSE4, including pmovsxdq sign-extension, to let it make the asm simpler than with manual sign extension. Strangely, even without it, it still only does 8 bytes at a time, not using punpckldq + punpckhdq to use the low and high halves of a load + compare result. To be fair, sometimes GCC shoots itself in the foot by not using narrower loads when it has to widen :/) – Peter Cordes, Mar 8, 2021 at 6:16
- Also, it would probably be better for clang's strategy (with SSE4.2 from -march=nehalem) to use pmovsxdq xmm, [mem] loads and widen the compare to 64-bit, instead of widening the compare result. GCC does 16-byte loads like I mentioned in my first comment. With SSE4 it takes 2 shuffles to sign-extend the high two masked elements (still probably worth it), and without SSE4 it's a pure win vs. clang to get twice as much work done with each pcmpgtd / pand on the initial data, and even the sign-extending can share some work between halves. godbolt.org/z/nWhz3n – Peter Cordes, Mar 8, 2021 at 6:23
- Anyway, so yes, the answer to this question is that it auto-vectorizes. As usual, compilers don't pick perfect strategies. (Although GCC's might be optimal for SSE2 or SSE4.) – Peter Cordes, Mar 8, 2021 at 6:25
- Also related: gcc optimization flag -O3 makes code slower than -O2 for this same code, where branchless (without vectorization) isn't profitable for the sorted case, and you need PGO (profile-guided optimization) to get GCC to make the optimal choice not to do if-conversion, if you're using an old GCC or compiling with -fno-tree-vectorize. – Peter Cordes, Mar 8, 2021 at 10:40
- So... compilers have gotten better over the years :) – Michael Dorgan, Mar 10, 2021 at 23:08
- -O1 includes the vectorization optimization? That's interesting!
- It needs -O2 to auto-vectorize, it seems, but even at -O1 it generates branchless scalar code: see the conditional move cmovle at line 40, where edx contains data[c] and r15d is zero.
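For reference, the branchless scalar code mentioned there comes from the compiler turning the if into a data-dependent select; the same thing can be written directly as a conditional expression, which the compiler is free to implement with a conditional move such as cmovle rather than a branch (a sketch of the idea, not guaranteed codegen):

long long sum = 0;
for (unsigned c = 0; c < arraySize; ++c)
    sum += (data[c] >= 128) ? data[c] : 0;   // only a data dependence, so cmov is possible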