Parallel Merge Sort

Question 1

Parallism is achieved by using a simple call to tbb::parallel_invoke(). However I can't break a speedup factor of x2 (after ~3-4 parallel HW threads), although 8 parallel HW threads are used. Advice on this is highly appreciated.

template<class In>
void insertion_sort(In b, In e) { // pretty standard insertion sort for smaller pieces
 for(auto i = b; i != e; ++i) {
 auto key = *i;
 auto j = i - 1;
 for(; j != b - 1 && *j > key; --j)
 *(j + 1) = *j;
 *(j + 1) = key;
 }
}
template <class In>
void merge(In b, In m, In e) { // this is the merger, doing the important job
 std::vector<typename In::value_type> left(b, m);
 std::vector<typename In::value_type> right(m, e);
 // guards for the two ranges
 left.push_back(std::numeric_limits<typename In::value_type>::max());
 right.push_back(std::numeric_limits<typename In::value_type>::max());
 auto itl = std::begin(left);
 auto itr = std::begin(right);
 while (b != e) {
 if(*itl <= *itr)
 *b++ = *itl++;
 else
 *b++ = *itr++;
 }
}
template <class In>
void merge_sort(In b, In e) { // serial merge_sort, used for pieces smaller < 500
 if(b < e - 1) {
 auto dis = std::distance(b, e) / 2;
 In m = b + dis;
 if(dis > 10) {
 merge_sort(b, m);
 merge_sort(m, e);
 merge(b, m, e);
 }
 //switch to insertion sort for pieces < 10
 else
 insertion_sort(b,e);
 }
}
template <class In>
void merge_sort_parallel(In b, In e) { // merge_sort parallel 
 if(b != e - 1) {
 auto dis = std::distance(b, e);
 In m = b + dis / 2;
 if(dis > 500) {
 tbb::parallel_invoke([&] () { merge_sort_parallel(b, m); },
 [&] () { merge_sort_parallel(m, e); });
 }
 else {
 merge_sort(b, m);
 merge_sort(m, e);
 }
 merge(b, m, e);
 }
}

If something is not understandable, please ask.

Question 2

What is the type and size of the data set that you used in your benchmarks which failed to yield a 2x speedup?

Question 3

@MikeNakis very simple ints

Question 4

Yes, and how many?

Question 5

@MikeNakis up to 1 billion, so I think that basic thread overhead is not the case.

Question 6

If you have 500 as your cut off for parallelism and you start with 1 Billion integers then you are spawning 2^21 (approx) threads Jobs to do parallel work. Is there not some overhead to build and maintain this list? I would limit it to 4 times as many jobs as you have HW threads (a complete guess as a starting point). So lets say 32 then I would make it stop doing parallel sorts after 4 (maybe 5) recursive levels of calling the parallel version.

Question 7

It's due to the algorithm itself. After all the stuff you do in parallel, you call merge in serial.

merge(b, m, e);

No matter how fast you sort the fragments, there's one big serial merge at the end that only uses one core. In addition, halves get merged on only 2 cores, quarters get merged on only 4 cores, etc.

See this question on Stack Overflow for more detail:

https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance

Question 8

That happens in the last parts and is like you said a constant(well, not really) factor. However I included them in my calculations and they won't put efficiency up that much. One can observe them quite well from taskmanager.

Question 9

If you find and fix other problems, you'll get a somewhat better speed-up, but I think that the algorithm is the biggest limiting factor. If you add more cores, the serial portion will take a higher percentage of the overall time.

Question 10

That serial portion is only n steps out of n*log2(n) steps. In other words, instead of the ideal n/8 + n/8 + n/8 + n/8 + ... + n/8 steps (repeated log2(n) times), you have n + n/2 + n/4 + n/8 + ... + n/8, which is still about 5x the non-parallel speed (for n=2^20).

Question 11

@xan I don't follow your math, but it sounds like you're not thinking about it the right way. Serial takes O(n log n). A perfect algorithm would take O(n/k log n) across k cores. The serial merge at the end takes O(n). How big is O(n) compared to O(n/k log n)? It depends on both n and k obviously - so saying 5x is simplistic. Notice that the slowdown is worse with more cores, and it's more complex when it comes to the size of the collection.

Question 12

@Craig. I'm thinking about it in terms of the original question involving parallel mergesort for big n and k=8. The 5x speed-up estimate for parallelizing the algorithm (if memory wasn't a bottleneck) comes from using n=2^20 and k=8. Instead of n*log2(n) = 20n steps, the parallel version would take have 4n steps because 4n = n + n/2 + n/4 + n/8 + ... + n/8. The "..." is 15 more n/8 terms. 20n to 4n is the 5x speed-up. Bigger n would give a bigger theoretical speed-up, and, sure, it's more complicated if k also varies.

Question 13

Maybe you're at a memory bandwidth limit. I've done similar experiments with Java threads on an 8 core machine (2 quad core Xeons). Non-memory tasks scaled very well, but memory intensive tasks did not. Try something more computation just to see if memory is a limiting factor for you, too.

Or, in the same spirit, try to make your code less memory intensive by using a higher optimization level or hand-optimizing. For instance, the inner merge loop appears to do three memory reads each iteration when it could do one (plus one write).

 typename In::value_type kl = *itl++;
 typename In::value_type kr = *itr++;
 while (b != e) {
 if(kl <= kr) {
 *b++ = kl;
 kl = *itl++;
 }
 else {
 *b++ = kr;
 kr = *itr++;
 }
 }

UPDATE: I tried experimenting with the code myself on 8 cores and found:

In debug build, the insertion sort will trigger some asserts because b-1 can go off the end of the base vector. I just used std::sort() instead of insertion sort.
My above code suggestion doesn't make any difference. Either the compiler or the CPU must already be avoiding the extra fetching I was trying to get rid of.
It did make a difference to avoid the push_back calls in merge(). Apparently they cause a reallocation of the vectors. Instead of using a sentinel, I added bounds checking to the merge loop. I guess an alternative would be to use reserve to make the vectors big enough before the copy and append.
I also tried parallelizing merge to see if it really was an issue doing it serially. I could only see to split the work into two parts (not recursively) by merging n/2 elements from the beginning and n-n/2 elements from the end in parallel. That didn't make any difference in total timing values, which further supports memory access being the bottleneck (though an error on my part is quite likely).

Below is the parallel merge code:

 template <class In>
 void merge_partial(In itl, In itle, In itr, In itre, In b, In e, int n, bool le) {
 auto itm = b;
 for (;n != 0;n--){
 if( le ? (*itl <= *itr) : (*itl >= *itr)) {
 *itm++ = *itl++;
 if (itl == itle) { // all the rest from the right
 for (;n != 0;n--)
 *itm++ = *itr++;
 break;
 }
 }
 else {
 *itm++ = *itr++;
 if (itr == itre) { // all the rest from the left
 for (;n != 0;n--)
 *itm++ = *itl++;
 break;
 }
 }
 }
 }

...

 if(dis > 500) {
 tbb::parallel_invoke([&] () { merge_sort_parallel(b, m); },
 [&] () { merge_sort_parallel(m, e); });
 std::vector<typename In::value_type> left(b, m);
 std::vector<typename In::value_type> right(m, e);
 auto n = std::distance(b, e);
 auto itl = std::begin(left);
 auto itle = std::end(left);
 auto itr = std::begin(right);
 auto itre = std::end(right);
 tbb::parallel_invoke(
 [&] () { merge_partial(itl, itle, itr, itre, b, e, n / 2, true); },
 [&] () { merge_partial(reverse_iterator<In>(itle), reverse_iterator<In>(itl),
 reverse_iterator<In>(itre), reverse_iterator<In>(itr), 
 reverse_iterator<In>(e), reverse_iterator<In>(b),
 n - n / 2, false); });
 }

Question 14

I do indeed believe it is a cache/memory bandwidth problem. I checked the Intel tbb book, where they have a small chapter about cache and there they do explicitly mention that merge sort is, due to his out of place nature, probably not as good as quicksort for parallism.

Question 15

thanks, for late edit. Concerning 1), yes I think it should be auto i = b + 1 in the first loop, that should avoid the potential overtheedge case. Concerning 2), that is indeed quite a bottleneck, didn't think about that in the first place, thanks.

Question 16

Note that for larger n, the cache lines for threads might coincide. For ints, which are smaller than cache lines, the serial version automatically performs some pre-fetching, where the parallel version has a chance to trash the pre-fetched values or force single writes that might have been combined in the serial version. Some care to ensure that either threads do not start at the same cache line or wait some time to give the other threads time to move to different cache lines, might help performance. But that's just a theoretical thought, I haven't tried it out yet.

Question 17

@Sjoerd I would be highly interested in a deeper explanation - mabye in a speratore answer?

Craig Craig 5383 silver badges9 bronze badges · Answer 1 · 2012-01-13 21:58:11Z

6

\$\begingroup\$

It's due to the algorithm itself. After all the stuff you do in parallel, you call merge in serial.

merge(b, m, e);

No matter how fast you sort the fragments, there's one big serial merge at the end that only uses one core. In addition, halves get merged on only 2 cores, quarters get merged on only 4 cores, etc.

See this question on Stack Overflow for more detail:

https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance

Share

edited May 23, 2017 at 12:40

Community's user avatar

Community Bot

1

answered Jan 13, 2012 at 21:58

Craig's user avatar

Craig Craig

5383 silver badges9 bronze badges

\$\endgroup\$

6

\$\begingroup\$ That happens in the last parts and is like you said a constant(well, not really) factor. However I included them in my calculations and they won't put efficiency up that much. One can observe them quite well from taskmanager. \$\endgroup\$

inf
– inf

2012年01月13日 22:07:59 +00:00
Commented Jan 13, 2012 at 22:07
\$\begingroup\$ If you find and fix other problems, you'll get a somewhat better speed-up, but I think that the algorithm is the biggest limiting factor. If you add more cores, the serial portion will take a higher percentage of the overall time. \$\endgroup\$

Craig
– Craig

2012年01月14日 14:21:29 +00:00
Commented Jan 14, 2012 at 14:21
\$\begingroup\$ That serial portion is only n steps out of n*log2(n) steps. In other words, instead of the ideal n/8 + n/8 + n/8 + n/8 + ... + n/8 steps (repeated log2(n) times), you have n + n/2 + n/4 + n/8 + ... + n/8, which is still about 5x the non-parallel speed (for n=2^20). \$\endgroup\$

xan
– xan

2012年01月18日 04:11:35 +00:00
Commented Jan 18, 2012 at 4:11
\$\begingroup\$ @xan I don't follow your math, but it sounds like you're not thinking about it the right way. Serial takes O(n log n). A perfect algorithm would take O(n/k log n) across k cores. The serial merge at the end takes O(n). How big is O(n) compared to O(n/k log n)? It depends on both n and k obviously - so saying 5x is simplistic. Notice that the slowdown is worse with more cores, and it's more complex when it comes to the size of the collection. \$\endgroup\$

Craig
– Craig

2012年01月18日 17:35:23 +00:00
Commented Jan 18, 2012 at 17:35
\$\begingroup\$ @Craig. I'm thinking about it in terms of the original question involving parallel mergesort for big n and k=8. The 5x speed-up estimate for parallelizing the algorithm (if memory wasn't a bottleneck) comes from using n=2^20 and k=8. Instead of n*log2(n) = 20n steps, the parallel version would take have 4n steps because 4n = n + n/2 + n/4 + n/8 + ... + n/8. The "..." is 15 more n/8 terms. 20n to 4n is the 5x speed-up. Bigger n would give a bigger theoretical speed-up, and, sure, it's more complicated if k also varies. \$\endgroup\$

xan
– xan

2012年01月18日 18:05:14 +00:00
Commented Jan 18, 2012 at 18:05

| Show 1 more comment

xan xan 1711 silver badge7 bronze badges · Answer 2 · 2012-01-12 04:38:12Z

Maybe you're at a memory bandwidth limit. I've done similar experiments with Java threads on an 8 core machine (2 quad core Xeons). Non-memory tasks scaled very well, but memory intensive tasks did not. Try something more computation just to see if memory is a limiting factor for you, too.

Or, in the same spirit, try to make your code less memory intensive by using a higher optimization level or hand-optimizing. For instance, the inner merge loop appears to do three memory reads each iteration when it could do one (plus one write).

 typename In::value_type kl = *itl++;
 typename In::value_type kr = *itr++;
 while (b != e) {
 if(kl <= kr) {
 *b++ = kl;
 kl = *itl++;
 }
 else {
 *b++ = kr;
 kr = *itr++;
 }
 }

UPDATE: I tried experimenting with the code myself on 8 cores and found:

In debug build, the insertion sort will trigger some asserts because b-1 can go off the end of the base vector. I just used std::sort() instead of insertion sort.
My above code suggestion doesn't make any difference. Either the compiler or the CPU must already be avoiding the extra fetching I was trying to get rid of.
It did make a difference to avoid the push_back calls in merge(). Apparently they cause a reallocation of the vectors. Instead of using a sentinel, I added bounds checking to the merge loop. I guess an alternative would be to use reserve to make the vectors big enough before the copy and append.
I also tried parallelizing merge to see if it really was an issue doing it serially. I could only see to split the work into two parts (not recursively) by merging n/2 elements from the beginning and n-n/2 elements from the end in parallel. That didn't make any difference in total timing values, which further supports memory access being the bottleneck (though an error on my part is quite likely).

Below is the parallel merge code:

 template <class In>
 void merge_partial(In itl, In itle, In itr, In itre, In b, In e, int n, bool le) {
 auto itm = b;
 for (;n != 0;n--){
 if( le ? (*itl <= *itr) : (*itl >= *itr)) {
 *itm++ = *itl++;
 if (itl == itle) { // all the rest from the right
 for (;n != 0;n--)
 *itm++ = *itr++;
 break;
 }
 }
 else {
 *itm++ = *itr++;
 if (itr == itre) { // all the rest from the left
 for (;n != 0;n--)
 *itm++ = *itl++;
 break;
 }
 }
 }
 }

...

 if(dis > 500) {
 tbb::parallel_invoke([&] () { merge_sort_parallel(b, m); },
 [&] () { merge_sort_parallel(m, e); });
 std::vector<typename In::value_type> left(b, m);
 std::vector<typename In::value_type> right(m, e);
 auto n = std::distance(b, e);
 auto itl = std::begin(left);
 auto itle = std::end(left);
 auto itr = std::begin(right);
 auto itre = std::end(right);
 tbb::parallel_invoke(
 [&] () { merge_partial(itl, itle, itr, itre, b, e, n / 2, true); },
 [&] () { merge_partial(reverse_iterator<In>(itle), reverse_iterator<In>(itl),
 reverse_iterator<In>(itre), reverse_iterator<In>(itr), 
 reverse_iterator<In>(e), reverse_iterator<In>(b),
 n - n / 2, false); });
 }

I do indeed believe it is a cache/memory bandwidth problem. I checked the Intel tbb book, where they have a small chapter about cache and there they do explicitly mention that merge sort is, due to his out of place nature, probably not as good as quicksort for parallism.
thanks, for late edit. Concerning 1), yes I think it should be auto i = b + 1 in the first loop, that should avoid the potential overtheedge case. Concerning 2), that is indeed quite a bottleneck, didn't think about that in the first place, thanks.
Note that for larger n, the cache lines for threads might coincide. For ints, which are smaller than cache lines, the serial version automatically performs some pre-fetching, where the parallel version has a chance to trash the pre-fetched values or force single writes that might have been combined in the serial version. Some care to ensure that either threads do not start at the same cache line or wait some time to give the other threads time to move to different cache lines, might help performance. But that's just a theoretical thought, I haven't tried it out yet.
@Sjoerd I would be highly interested in a deeper explanation - mabye in a speratore answer?

Stack Exchange Network

Parallel Merge Sort

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Parallel Merge Sort

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions