Parallelism is achieved with a simple call to tbb::parallel_invoke(). However, I can't break a speedup factor of about 2x, which is already reached with 3-4 parallel HW threads, even though 8 parallel HW threads are used. Advice on this is highly appreciated.
template<class In>
void insertion_sort(In b, In e) { // pretty standard insertion sort for smaller pieces
    for(auto i = b; i != e; ++i) {
        auto key = *i;
        auto j = i - 1;
        for(; j != b - 1 && *j > key; --j)
            *(j + 1) = *j;
        *(j + 1) = key;
    }
}
template <class In>
void merge(In b, In m, In e) { // this is the merger, doing the important job
    std::vector<typename In::value_type> left(b, m);
    std::vector<typename In::value_type> right(m, e);
    // sentinel guards for the two ranges
    left.push_back(std::numeric_limits<typename In::value_type>::max());
    right.push_back(std::numeric_limits<typename In::value_type>::max());
    auto itl = std::begin(left);
    auto itr = std::begin(right);
    while (b != e) {
        if(*itl <= *itr)
            *b++ = *itl++;
        else
            *b++ = *itr++;
    }
}
template <class In>
void merge_sort(In b, In e) { // serial merge sort, used for pieces smaller than 500 elements
    if(b < e - 1) {
        auto dis = std::distance(b, e) / 2;
        In m = b + dis;
        if(dis > 10) {
            merge_sort(b, m);
            merge_sort(m, e);
            merge(b, m, e);
        }
        else // switch to insertion sort for pieces < 10
            insertion_sort(b, e);
    }
}
template <class In>
void merge_sort_parallel(In b, In e) { // parallel merge_sort
    if(b != e - 1) {
        auto dis = std::distance(b, e);
        In m = b + dis / 2;
        if(dis > 500) {
            tbb::parallel_invoke([&] () { merge_sort_parallel(b, m); },
                                 [&] () { merge_sort_parallel(m, e); });
        }
        else {
            merge_sort(b, m);
            merge_sort(m, e);
        }
        merge(b, m, e);
    }
}
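For reference, a minimal driver sketch (names and sizes here are illustrative, not my exact harness; it assumes the functions above are in scope and that <tbb/parallel_invoke.h> was included for them):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
    // ~1M random ints, all below the numeric_limits<int>::max() sentinel
    std::vector<int> v(1 << 20);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);
    for (auto& x : v) x = dist(gen);

    auto t0 = std::chrono::steady_clock::now();
    merge_sort_parallel(std::begin(v), std::end(v));
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "sorted: " << std::boolalpha << std::is_sorted(std::begin(v), std::end(v))
              << ", took " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}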
If something is unclear, please ask.
2 Answers
It's due to the algorithm itself. After all the stuff you do in parallel, you call merge in serial.
merge(b, m, e);
No matter how fast you sort the fragments, there's one big serial merge at the end that only uses one core. In addition, halves get merged on only 2 cores, quarters get merged on only 4 cores, etc.
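To put rough numbers on it (a back-of-the-envelope model, assuming ideal scaling of everything except the merges): with $k$ cores and $n$ elements, the top $\log_2 k$ merge levels cannot use all cores, so

$$T_\text{par} \gtrsim \underbrace{n + \frac{n}{2} + \dots + \frac{2n}{k}}_{\text{top } \log_2 k \text{ levels}} + \frac{n}{k}\left(\log_2 n - \log_2 k\right), \qquad T_\text{ser} = n \log_2 n.$$

For $n = 2^{20}$ and $k = 8$ that is roughly $4n$ versus $20n$: a ceiling of about 5x, no matter how fast the leaf pieces are sorted.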
See this question on Stack Overflow for more detail:
- That happens in the last parts and, as you said, is a (well, not really) constant factor. However, I included them in my calculations and they don't push efficiency up that much. One can observe them quite well in the task manager. – inf, Jan 13, 2012 at 22:07
- If you find and fix other problems, you'll get a somewhat better speed-up, but I think the algorithm is the biggest limiting factor. If you add more cores, the serial portion will take a higher percentage of the overall time. – Craig, Jan 14, 2012 at 14:21
- That serial portion is only n steps out of n*log2(n) steps. In other words, instead of the ideal n/8 + n/8 + ... + n/8 steps (repeated log2(n) times), you have n + n/2 + n/4 + n/8 + ... + n/8, which is still about 5x the non-parallel speed (for n = 2^20). – xan, Jan 18, 2012 at 4:11
- @xan I don't follow your math, but it sounds like you're not thinking about it the right way. Serial takes O(n log n). A perfect algorithm would take O(n/k log n) across k cores. The serial merge at the end takes O(n). How big is O(n) compared to O(n/k log n)? It depends on both n and k, obviously, so saying 5x is simplistic. Notice that the slowdown is worse with more cores, and it's more complex when it comes to the size of the collection. – Craig, Jan 18, 2012 at 17:35
- @Craig I'm thinking about it in terms of the original question, which involves parallel mergesort for big n and k = 8. The 5x speed-up estimate for parallelizing the algorithm (if memory weren't a bottleneck) comes from using n = 2^20 and k = 8. Instead of n*log2(n) = 20n steps, the parallel version would take 4n steps, because 4n = n + n/2 + n/4 + n/8 + ... + n/8, where the "..." stands for 15 more n/8 terms. 20n to 4n is the 5x speed-up. Bigger n would give a bigger theoretical speed-up, and, sure, it's more complicated if k also varies. – xan, Jan 18, 2012 at 18:05
Maybe you're at a memory bandwidth limit. I've done similar experiments with Java threads on an 8-core machine (2 quad-core Xeons). Non-memory tasks scaled very well, but memory-intensive tasks did not. Try something more computational just to see whether memory is a limiting factor for you, too.
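For example, a quick probe along those lines (a sketch of mine; sum_sines and probe are hypothetical names): time a purely compute-bound task split across the same tbb::parallel_invoke machinery and see whether it scales where the sort does not.

#include <tbb/parallel_invoke.h>
#include <cmath>
#include <cstddef>

// Pure FPU work with no memory traffic: if this scales with cores
// while the sort does not, memory bandwidth is the likely culprit.
double sum_sines(std::size_t lo, std::size_t hi) {
    double s = 0.0;
    for (std::size_t i = lo; i < hi; ++i)
        s += std::sin(static_cast<double>(i));
    return s;
}

double probe(std::size_t n) {
    double a = 0.0, b = 0.0;
    tbb::parallel_invoke([&] { a = sum_sines(0, n / 2); },
                         [&] { b = sum_sines(n / 2, n); });
    return a + b;
}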
Or, in the same spirit, try to make your code less memory-intensive, either with a higher optimization level or by hand-optimizing. For instance, the inner merge loop appears to do three memory reads per iteration when it could do one (plus one write):
typename In::value_type kl = *itl++;
typename In::value_type kr = *itr++;
while (b != e) {
    if(kl <= kr) {
        *b++ = kl;
        kl = *itl++;
    }
    else {
        *b++ = kr;
        kr = *itr++;
    }
}
UPDATE: I tried experimenting with the code myself on 8 cores and found:

1. In a debug build, the insertion sort triggers some asserts, because `b - 1` can go off the end of the base vector. I just used `std::sort()` instead of insertion sort.
2. My code suggestion above doesn't make any difference. Either the compiler or the CPU must already be avoiding the extra fetching I was trying to get rid of.
3. It did make a difference to avoid the `push_back` calls in `merge()`. Apparently they cause a reallocation of the vectors. Instead of using a sentinel, I added bounds checking to the merge loop (sketched below). I guess an alternative would be to use `reserve` to make the vectors big enough before the copy and append.
4. I also tried parallelizing merge to see whether it really was an issue doing it serially. I could only see how to split the work into two parts (not recursively), by merging n/2 elements from the beginning and n - n/2 elements from the end in parallel. That didn't make any difference in total timing values, which further supports memory access being the bottleneck (though an error on my part is quite likely).
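A sketch of that bounds-checked merge (reconstructed, not the exact code I timed; the name merge_bounded is made up, and it assumes <algorithm> for std::copy):

template <class In>
void merge_bounded(In b, In m, In e) { // no sentinels, so no reallocating push_back
    std::vector<typename In::value_type> left(b, m);
    std::vector<typename In::value_type> right(m, e);
    auto itl = std::begin(left),  itle = std::end(left);
    auto itr = std::begin(right), itre = std::end(right);
    while (itl != itle && itr != itre)
        *b++ = (*itl <= *itr) ? *itl++ : *itr++;
    b = std::copy(itl, itle, b); // drain whichever side remains
    std::copy(itr, itre, b);
}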
Below is the parallel merge code:
template <class In>
void merge_partial(In itl, In itle, In itr, In itre, In b, In e, int n, bool le) {
    auto itm = b; // output cursor (e is unused, kept for symmetry)
    for (; n != 0; n--) {
        if (le ? (*itl <= *itr) : (*itl >= *itr)) {
            *itm++ = *itl++;
            if (itl == itle) {          // left exhausted: all the rest from the right
                for (--n; n != 0; --n)  // --n first: one element was already copied above
                    *itm++ = *itr++;
                break;
            }
        }
        else {
            *itm++ = *itr++;
            if (itr == itre) {          // right exhausted: all the rest from the left
                for (--n; n != 0; --n)
                    *itm++ = *itl++;
                break;
            }
        }
    }
}
...
if(dis > 500) {
    tbb::parallel_invoke([&] () { merge_sort_parallel(b, m); },
                         [&] () { merge_sort_parallel(m, e); });
    // assumes In is the same type as the vectors' iterators
    std::vector<typename In::value_type> left(b, m);
    std::vector<typename In::value_type> right(m, e);
    auto n = std::distance(b, e);
    auto itl = std::begin(left);
    auto itle = std::end(left);
    auto itr = std::begin(right);
    auto itre = std::end(right);
    tbb::parallel_invoke(
        [&] () { merge_partial(itl, itle, itr, itre, b, e, n / 2, true); },
        [&] () { merge_partial(std::reverse_iterator<In>(itle), std::reverse_iterator<In>(itl),
                               std::reverse_iterator<In>(itre), std::reverse_iterator<In>(itr),
                               std::reverse_iterator<In>(e), std::reverse_iterator<In>(b),
                               n - n / 2, false); });
}
- I do indeed believe it is a cache/memory bandwidth problem. I checked the Intel TBB book, which has a small chapter about caches, and there they explicitly mention that merge sort, due to its out-of-place nature, is probably not as good for parallelism as quicksort. – inf, Jan 12, 2012 at 17:37
- Thanks for the late edit. Concerning 1), yes, I think it should be `auto i = b + 1` in the first loop; that should avoid the potential over-the-edge case. Concerning 2), that is indeed quite a bottleneck; I didn't think about that in the first place, thanks. – inf, Jan 23, 2012 at 10:11
- Note that for larger n, the cache lines for threads might coincide. For ints, which are smaller than cache lines, the serial version automatically performs some pre-fetching, whereas the parallel version has a chance to trash the pre-fetched values or force single writes that might have been combined in the serial version. Taking some care that threads either do not start at the same cache line, or wait a little to give the other threads time to move on to different cache lines, might help performance. But that's just a theoretical thought; I haven't tried it out yet. – Sjoerd, Jan 25, 2012 at 3:24
- @Sjoerd I would be highly interested in a deeper explanation, maybe in a separate answer? – inf, Jan 26, 2012 at 20:28
- … threads/jobs to do parallel work. Is there not some overhead to build and maintain this list? I would limit it to 4 times as many jobs as you have HW threads (a complete guess as a starting point). So let's say 32; then I would make it stop doing parallel sorts after 4 (maybe 5) recursive levels of calling the parallel version.