Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
Ubi nihil vales, ibi nihil velis
Where you are worth nothing, there you should want nothing
— Arnold Geulincx, XVII century —
Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programming can be made easy-as-pie – make sure to read it.
With C++17 supporting[1] parallel versions of the std:: algorithms, there are quite a few people saying “hey, it became really simple to write parallel code!”.
Just as one example, [MSDN] wrote: “Only a few years ago, writing parallel code in C++ was a domain of the experts.” (implying that these days, to write parallel code, you don’t need to be an expert anymore).
[Cartoon: Inquisitive hare] I always had my extremely strong suspicions about this position being deadly wrong, but recently I made an experiment which demonstrates the Big Fat Dangers(tm) of implying that parallelization can be made as simple as just adding a policy parameter to your std:: call.
[1] well, at least on paper; to the best of my knowledge, both libstdc++ and libc++ do NOT support it yet (all my GCC/Clang compilers, as well as Godbolt, are choking on #include <execution>)
Task at Hand
Let’s consider the following very simple scenario: we have to calculate the sum of the elements in a million-element integer array. The code we’re starting from is as simple as
template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(begin, end, [&](int item) {
    x += item;
  });
  return x;
}
When running this code on my pretty old Windows box (compiled with MSVC VS2017 in Release mode[2]) – it takes about 450us.
[2] as noted above – as of now and to the best of my knowledge, MSVC is the only compiler+lib able to handle this kind of stuff; note that even in MSVC it is still “experimental”
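The post doesn’t show its measurement harness; below is a minimal sketch of how such a timing could look (the function name timed_seq_sum and the vector-of-ones input are my own hypothetical choices, not from the original):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

template<class Iterator>
size_t seq_calc_sum(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(begin, end, [&](int item) {
    x += item;
  });
  return x;
}

// Hypothetical harness: returns the sum, stores elapsed wall-clock
// microseconds into out_us (steady_clock is the right clock for intervals).
size_t timed_seq_sum(const std::vector<int>& v, long long& out_us) {
  auto start = std::chrono::steady_clock::now();
  size_t sum = seq_calc_sum(v.begin(), v.end());
  auto end = std::chrono::steady_clock::now();
  out_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  return sum;
}
```

In practice you’d run it several times and take the minimum (or mean with a spread, as the +- figures below suggest) to smooth out scheduler noise.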
Adding parallelism: Take 1 (badly broken)
First, we have to note that simply adding std::execution::par to our std::for_each() call will NOT work. If we try to write something like
//DON'T USE: WILL CAUSE WRONG RESULT
template<class Iterator>
size_t par_calc_sum_deadly_broken(Iterator begin, Iterator end) {
  size_t x = 0;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//data race; often causes wrong calculation result(!)
  });
  return x;
}
– it will compile and run, but we’ll easily get a wrong result (in my experiments with a million-element array, the result was wrong each and every time, but YMMV, which only makes things worse <sad-face />).
Adding parallelism: Take 2 (90x performance hit)
IF we were observant enough to notice this problem – and found a neat recommendation in [CppReference] – we’ll realize that in addition to specifying std::execution::par, we also have to use a std::mutex (or std::atomic) to make our program correct.
Ok, so our next (still very naive BTW) version would be along the following lines:
//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}
This does work correctly – and if we take a look at the task manager, we’ll see that it DOES use all the cores (4 physical x 2-due-to-HT = 8 logical ones in my case). And if we didn’t measure performance against the sequential version, we could think that everything is good here. Not so fast <sad-face />.
Measurements have shown that the function above takes about 40 milliseconds of wall-clock time; so instead of the expected speedup of about 4x-8x, we get about a 90x slowdown compared to the sequential one (BTW, if you have doubts and want to run it yourself – the code is available at [demo]).
To make things even worse, the code above is written strictly along the lines recommended in [CppReference] (actually, it is almost exactly the same code).
Adding parallelism: Take 3. Atomics to the rescue? (still 50x performance hit)
Ok, as a next step we could think “hey, mutexes are bad for performance, so we should use atomics instead”. So, we rewrite it as
//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_atomic(Iterator begin, Iterator end) {
  std::atomic<size_t> x{0};
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//changing it to x.fetch_add(item) doesn't change anything - nor should it
  });
  return x;
}
Well, it is still correct, AND (as expected) it is faster than our previous mutex-based version. The problem is that
It is still 50x slower than the sequential one
Oh, and BTW – replacing std::execution::par with std::execution::par_unseq (with an assertion on x.is_lock_free() to prevent potential deadlocks due to vectorization) didn’t make it any better.[3]
[3] it is unclear whether using std::atomic when it is_lock_free() is safe for par_unseq; IMO it is, but there are voices out there saying that formally it isn’t; in any case, MSVC currently doesn’t really implement par_unseq (it falls back to par), so as of now, it is a moot issue.
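Again, for toolchains without <execution>, the atomic-accumulator pattern (including the is_lock_free() check mentioned above) can be sketched with plain std::thread; the function name and chunking are my own assumptions:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Correct, lock-free-atomic accumulation - but still one contended
// read-modify-write on the shared cache line per element.
size_t sum_with_atomic(const std::vector<int>& v, unsigned nthreads) {
  std::atomic<size_t> x{0};
  assert(x.is_lock_free());  // the safety check discussed above
  std::vector<std::thread> threads;
  size_t chunk = v.size() / nthreads;
  for (unsigned t = 0; t < nthreads; ++t) {
    size_t from = t * chunk;
    size_t to = (t + 1 == nthreads) ? v.size() : from + chunk;
    threads.emplace_back([&, from, to] {
      for (size_t i = from; i < to; ++i)
        x.fetch_add(v[i], std::memory_order_relaxed);  // contended RMW
    });
  }
  for (auto& th : threads) th.join();
  return x.load();
}
```

Relaxed ordering is sufficient here because we only read x after all threads have been joined; it doesn’t rescue performance, though, since the cache-line ping-pong between cores remains.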
Results
| Box | non-parallelized | std::execution::par with std::mutex | std::execution::par with std::atomic | std::execution::par_unseq with std::atomic |
|---|---|---|---|---|
| #1 (4 physical, 8 logical cores) | 470±4us | 41200±900us (90x slower, 600x+ less power-efficient) | 23400±140us (50x slower, 300x+ less power-efficient) | 23400±140us (50x slower, 300x+ less power-efficient) |
| #2 (2 physical, 4 logical cores) | 900±150us | 52500±6000us (60x slower, 200x+ less power-efficient) | 25100±4500us (30x slower, 100x+ less power-efficient) | 21800±2800us (25x slower, 100x+ less power-efficient) |
As we can see, not only is our naive parallel code hopelessly inefficient, greatly increasing our CO2 footprint for absolutely no reason,[4] it also punishes users with more powerful boxes (more strictly, it seems that the more cores we have, the more penalty we get; this will start making sense in the second part of this mini-series).
[4] the desire to utilize all cores, or to make the program parallel, does not qualify as a reason
Intermediate Conclusions and To Be Continued
As we have seen from the results above, naive attempts to make our code parallel (while having no clue about the way parallel code works) can easily cause HUGE problems, ranging from wrong results and even crashes, to having even correct programs slow down by factors of 50x-90x.
In other words (arguing with the quote from [MSDN] cited in the very beginning of this post):
Writing parallel code in C++ is still a domain of the experts.[5]
[Cartoon: Surprised hare] OTOH, the point of the exercise above is NOT to say that writing efficient code with parallel std:: functions is impossible (it is possible). However, to do it, it is still necessary to understand what we’re doing. An explanation of what-is-wrong-with-the-code-above is planned for the next post of mine (spoiler: it is all about overhead, which happens to be waaay too high in the code above).
[5] It MIGHT change with async stuff such as [HPX], but current support for parallel algos in std:: (except for std::reduce()), which forces us to mix calculations with thread sync, is not going to make writing correct-and-high-performance programs any simpler <sigh />
Acknowledgement
Cartoons by Sergey Gordeev IRL from Gordeev Animation Graphics, Prague.