Job Title: Sarcastic Architect
Hobbies: Thinking Aloud, Arguing with Managers, Annoying HRs,
Calling a Spade a Spade, Keeping Tongue in Cheek
Disclaimer: if you’re doing parallel programming for a living – please ignore this post (this stuff will be way way way too obvious for you). However, if you’re just about to believe claims that parallel programs can be written just by adding a few things here and there to your usual single-threaded code – make sure to read it.
In my previous post, we observed pretty bad results for our calculations when we tried to use mutexes and even atomics to parallelize them. OTOH, it was promised to show how parallel <algorithm> CAN be used both correctly and efficiently (that is, IF you need it, which is a separate question); this is what we’ll start discussing within this post.
Parallel Code Which Made It 90x Slower on 8-Core Box
As mentioned in the previous post, the code below – while being correct – was observed to be REALLY REALLY slow (90x slower than the original, that’s while using all 8 cores instead of one):
//DON'T USE: DEADLY SLOW
template<class Iterator>
size_t par_calc_sum_mutex(Iterator begin, Iterator end) {
  size_t x = 0;
  std::mutex m;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    std::lock_guard<std::mutex> guard(m);
    x += item;
  });
  return x;
}
And BTW, while replacing the mutex with an atomic did improve things, it was still 50x slower than the original serialized code.
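For reference, here is a sketch of what that atomic-based variant looks like (reconstructed here purely for illustration – the exact code is in the previous post and may differ in details):

//DON'T USE EITHER: BETTER THAN MUTEX, BUT STILL MUCH SLOWER THAN SEQUENTIAL
//(a reconstruction for illustration; the exact code is in the previous post)
template<class Iterator>
size_t par_calc_sum_atomic(Iterator begin, Iterator end) {
  std::atomic<size_t> x = 0;
  std::for_each(std::execution::par, begin, end, [&](int item) {
    x += item;//atomic read-modify-write on EVERY element -> heavy contention
  });
  return x;
}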
The Big Fat Overhead Problem
The problem we’re facing here is actually a problem of overhead. It is just that the cost of a thread context switch (which is likely to happen pretty often as we’re trying to obtain the lock_guard) is sooo huge,1 that the addition (which costs about 1 CPU cycle2) happens to be about zero-cost compared to the costs of synchronization. That’s exactly what happens here – and in this extreme case, we have this generic problem amplified sooooo badly that ALL the calculations are actually serialized by this mutex (so effectively, there is no parallelism at all3).
1 in general, it can be within 2000-1’000’000 CPU cycles [NoBugs2016], though here it is on the lower side of things, as cache invalidation doesn’t really happen
2 NB: this is the wrong place to go into a long discussion of whether it is 1 CPU cycle or whether we should speak about it statistically being 3/4th of a CPU cycle
3 however, if not for the overhead, 100% serialization would mean only that we’re not gaining anything from going parallel; it is the overhead which is responsible for 90x slowdown
Tackling The Overhead Manually
Let’s see what we could do to speed things up; note that it is NOT intended to be code-for-real-world-use – but rather code-to-provide-some-“feeling”-of-what-is-going-on. While performance-wise it is a huge improvement over the previous code, compared to alternatives-which-we’ll-see-later it is certainly NOT the best option.
//DON'T USE: UGLY, ERROR-PRONE, AND A-BIT-SLOWER-THAN-OPTIMAL
size_t TestFuncManualParSum::run(RandomAccessIterator begin, RandomAccessIterator end) {
  std::atomic<size_t> x = 0;
  constexpr int NCHUNKS = 128;
  assert( (end-begin) % NCHUNKS == 0 );//too lazy to handle it for sample code
  RandomAccessIterator starts[NCHUNKS];
  size_t sz = (end - begin) / NCHUNKS;
  for (int i = 0; i < NCHUNKS; ++i) {
    starts[i] = begin + sz * i;
    assert(starts[i] < end);
  }
  std::for_each(std::execution::par, starts, starts + NCHUNKS, [&](RandomAccessIterator start) {
    size_t y = 0;
    for (auto it = start; it < start + sz; ++it) {
      y += *it;//NO synchronization here
    }
    x += y;//synchronization
  });
  return x;
}
Associative Property: “Within an expression containing two or more occurrences in a row of the same associative operator, the order in which the operations are performed does not matter as long as the sequence of the operands is not changed.” – Wikipedia

The basic idea here is to avoid synchronization on each and every element; instead, we’re splitting our array into 128 chunks – and running for_each over these large chunks (and within each chunk, we’re calculating a partial sum in a local variable y, avoiding any synchronization until the very end of the chunk).4 Let’s note that in the code above, we’re actually relying on the sum being associative to achieve the speed improvement (we DO change the order of additions compared to sequential code).
If we run this code – we’ll see that it is ACTUALLY FASTER than the sequential one (in my measurements – see also the table below – it was faster by a factor of almost-2x on an 8-core box, while using all 8 cores).
This is a HUGE improvement over our previous 90x-slower-than-sequential code – AND is the first time we’ve got it faster than sequential.
4 As noted above – it is NOT production-level code, so I didn’t bother with creating an iterator-which-jumps-over-each-sz-elements, and created an ugly temporary array instead; however, it should be good enough to illustrate the point
Enter std::reduce()
Well, while we did manage to improve speed by using multiple cores in our previous code <phew />, it is NOT what we really want to write in our app-level code (that’s to put it mildly). Fortunately, there is a function which will do pretty much the same thing (actually, even a bit better) for us: it is std::reduce().
The point of std::reduce() is that it – just like our code above – exploits the fact that the operation-which-is-used-for-reduce (default is ‘+’) is associative.5 Then, the whole thing we wrote above can be written as follows (it is MUCH less ugly, MUCH less error-prone, and a tiny bit more efficient):
//DO USE
size_t TestFuncReduceParSum::run(Iterator begin, Iterator end) {
  return std::reduce(std::execution::par, begin, end, (size_t)0);
  //NB: (size_t)0 specifies not only an initial value, but also the TYPE of the accumulator
  //    which ensures that accumulation is done in terms of size_t, and NOT in terms of *begin
  //    This, in turn, is necessary to keep our calculation EXACTLY the same as our previous ones
}
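Just to make the example fully self-contained, here is a minimal sketch of calling it directly (the headers and the toy data below are my own assumptions for illustration, not something from the measured code):

#include <execution>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
  std::vector<int> v(1000000, 1);//1M elements, each equal to 1
  size_t sum = std::reduce(std::execution::par, v.begin(), v.end(), (size_t)0);
  std::printf("%zu\n", sum);//prints 1000000
  return 0;
}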
Interestingly enough, when I ran the code with std::reduce(), I observed that it is merely 2-3% faster6 than the manual parallelization which we used above. While it is certainly not a proof, it is an indication that the methods used within std::reduce() are conceptually similar to the stuff-we-wrote-ourselves-above.7 (For a purely conceptual illustration, see the sketch after the footnotes below.) This approximation, in turn, can be useful to “feel” what we can realistically expect from std::reduce(). After all, there is NO magic involved, and everything-what-happens-in-the-world can (and IMO SHOULD) be explained at least at a very high level of abstraction (and BTW, if you have a better explanation – without going into “we’re experts, just trust us to do things right” – please LMK).
5 or at least almost-associative, which happens with floats, more on it in the next post
6 actually, given the standard deviation being 3%, it qualifies as “barely noticeable”
7 except for “work stealing” which is most likely used by std::reduce(), but I don’t think we’re ready to discuss work stealing at this point
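And to illustrate the “conceptually similar” point above, here is a purely conceptual divide-and-conquer sketch of a parallel reduction over ‘+’ (my own illustration only – NOT the code of any real standard library, which will differ significantly, not least because of thread pools and work stealing):

//CONCEPTUAL SKETCH ONLY - NOT how std::reduce() is actually implemented
#include <cstddef>
#include <future>
#include <iterator>

template<class It>
size_t conceptual_par_reduce(It begin, It end, size_t grain = 100000) {
  size_t len = std::distance(begin, end);
  if (len <= grain) {//chunk is small enough - just sum it sequentially
    size_t sum = 0;
    for (It it = begin; it != end; ++it)
      sum += *it;
    return sum;
  }
  It mid = begin;
  std::advance(mid, len / 2);//split the range into two halves
  auto right = std::async(std::launch::async,//NB: spawns a thread per split - conceptual only
                          conceptual_par_reduce<It>, mid, end, grain);
  size_t left = conceptual_par_reduce(begin, mid, grain);//left half on this thread
  return left + right.get();//relies on '+' being associative
}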
Per-Element Mutations
In terms of std::reduce() we can express LOTS of parallelizable things (a bit more on related trickery in the next post), but all of them are inherently limited to our array-being-processed being unmodified in the process (in other words, std::reduce() can be seen as a “pure” function). However, in C++ we’re NOT confined to the world of “pure” functions such as std::reduce(), so we MAY consider certain mutation operations over our large collection. However, having learned from our experience with mutexes, we WON’T try to modify more-than-one-single-element of our collection in our parallel algorithm. And as long as all our mutations are limited to only one element of our array – we are good to go without mutexes / atomics / etc. <phew />:
void TestFuncForEachParModify::run(Iterator begin, Iterator end) {
  std::for_each(std::execution::par, begin, end, [](int& item) {
    item += 2;//modifies ONLY the current element - no synchronization needed
  });
}
In general, it is The Right Way(tm) of doing things parallel; however – in this particular case of ultra-simple ops (which happen to cause our app to consume all the memory bandwidth, at least on my box) – the parallel version still happens to be slower than the sequential one(!).8 This is NOT to say that parallel mutations are always slower than serial ones – but rather that while with some experience it MIGHT be possible to predict which parallelization WON’T work (one such example is our mutex-based calculation above), it is next-to-impossible to predict which parallelization WILL work for sure; hence – whatever we think about our parallelization, it is necessary to test its performance (and under conditions which are as close as possible to real-world ones).
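To be a bit more specific about “test its performance”: below is a minimal sketch of the kind of measurement I mean (wall-clock only, with a made-up payload; real-world testing should use your actual data, multiple runs, and ideally CPU-time and power measurements too):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
  std::vector<int> data(10'000'000, 1);//made-up payload; use YOUR real data here

  auto t0 = std::chrono::steady_clock::now();
  std::for_each(std::execution::par, data.begin(), data.end(), [](int& item) {
    item += 2;
  });
  auto t1 = std::chrono::steady_clock::now();

  auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
  std::printf("parallel for_each: %lld us\n", (long long)us);
  return 0;
}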
Results
| | Cores Used | Wall-Clock Time (us) | CPU Time (us) | Wall-Clock Compared to Sequential | Power Consumption Compared to Sequential (guesstimate) |
| --- | --- | --- | --- | --- | --- |
| "Pure" (mutation-free) Calculation | | | | | |
| sequential (for) | 1 | 2960+-20 | 2960 | 1x | 1x |
| sequential (for_each) | 1 | 4480+-50 [9] | 4480 | 1.5x slower | 1.5x higher |
| std::accumulate() [10] | 1 | 2940+-30 | 2940 | 1x | 1x |
| std::reduce(seq) | 1 | 2960+-60 | 2960 | 1x | 1x |
| naive std::mutex [11] | 8 | 201’000+-4000 | 1’600’000 | 70x slower | 300x higher |
| naive std::atomic [11] | 8 | 68’600+-20 | 550’000 | 25x slower | 100x higher |
| manual chunking | 8 | 1620+-50 | 13000 | 1.82x faster | 2.5x higher |
| std::reduce(par) | 8 | 1580+-50 | 12600 | 1.87x faster | 2.5x higher |
| Mutation | | | | | |
| sequential (for) | 1 | 3310+-20 | 3310 | 1x | 1x |
| sequential (for_each) | 1 | 3300+-4 | 3300 | 1x | 1x |
| sequential (in-place transform) | 1 | 4510+-30 | 4510 | 1.36x slower [12] | 1.3x higher |
| manual chunking | 8 | 3540+-90 | 28300 | 1.07x slower | 4x higher |
| for_each(par) | 8 | 3520+-100 | 28100 | 1.06x slower | 4x higher |
8 once again, manually-parallelized version performs pretty much the same as the best-library-provided one, see [demo2] for source
9 To the best of my understanding, it SHOULD NOT be the case; hopefully it is merely a flaw in MSVC handling of lambdas
11 from previous post, see also [demo2]
12 probably – but not 100% sure – this is another under-optimization by MSVC
Hey fellas, don’t be jealous
When they made him they broke the mould
So charismatic with an automatic
Never prematurely shooting his load
My observations from the table above:
- All sequential methods are equivalent (all the differences are within measurement errors).
- The only exceptions are related to sequential for_each() and in-place transform() – but hopefully they’re merely a flaw in MSVC optimizing lambdas out; OTOH, the for_each() case does highlight the risks of using lambdas in performance-critical code even in 2018(!).
- Out of the parallel methods – those which serialize too often (“naive” versions above) fail BADLY performance-wise.
- Those parallel methods which serialize “rarely enough” MAY (or may not) outperform the serialized version latency-wise, but are always losing energy-consumption-wise; in fact, it is a UNIVERSAL observation that parallel versions are (almost-)always losing to the serial ones in terms of power consumed and CO2 footprint. Which in turn means that – as long as serial code provides good-enough response times from the end-user perspective – DON’T PARALLELIZE; you’ll do a favour BOTH to your user (she will be able to use the cores for something else, AND will pay less in electricity bills), AND to the planet. Or from a bit different perspective – out of all the premature optimizations, premature parallelization is the most premature one.
- mutexes are evil;13 even atomics are to be avoided as long as possible
- Even IF we’re mutex-free, parallelization MAY slow things down. So after you found the bottleneck and parallelized – make sure to re-test performance – and under close-to-real-world conditions too.
- FWIW, the effects of our manual parallelization look very close to those of std::reduce(); this is NOT an argument for doing things manually – but a way to get a very-simplified idea of how std::reduce() does its magic internally.
13 especially so as you’re reading this post (see disclaimer at the very beginning)
Intermediate Conclusions and To Be Continued
In the previous post, we took a look at “how NOT to write parallel programs” (which is apparently where LOTS of parallel programming tutorials will lead you). In this one, we have taken a look at “how efficient parallel programs CAN be written”. In the next (and hopefully last-in-this-mini-series) post we’ll discuss some guidelines-and-hints for NOVICE parallel programmers, which MAY allow speeding up their programs here and there, without the risk of jeopardizing program correctness – and without the risk of slowing things down by 50x too. Sure, it won’t be the-very-best optimization – but it MAY bring SOME benefits to the table (and it is rather low-risk too, so as a whole it MAY fly).
Acknowledgement
Cartoons by Sergey GordeevIRL from Gordeev Animation Graphics, Prague.