Parallel Quicksort

Question 1

As my parallel mergesort implementation was very memory dependent, I wanted to write a parallel quicksort one. Here it comes:

template<class In>
In partition(In b, In e) {
 // create uniform distribuiton for random engine
 std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));
 // call static random engine to get index of random pivot
 tbb::spin_mutex::scoped_lock lock;
 lock.acquire(mutex);
 auto rand_pivot_index = rand_distribution(rand_engine);
 lock.release();
 // put pivot to last place in range
 std::swap(*(b + rand_pivot_index), *(e - 1));
 // save pivot value
 auto pivot = *(e - 1);
 auto pivotiterator = b;
 // sort the range
 for(auto it = b; it != e - 1; ++it) {
 if(*it < pivot) {
 std::swap(*it, *pivotiterator);
 ++pivotiterator;
 }
 }
 // put pivot to its according position and return position
 std::swap(*(pivotiterator), *(e - 1));
 return pivotiterator;
}
template<class In>
void quick_sort_parallel(In b, In e) {
 if (b != e) {
 In m = partition(b, e);
 // switch to different sort on worst case or smaller ranges
 if(std::distance(b, m) > 500) {
 tbb::parallel_invoke([&] { quick_sort_parallel(b, m); }, 
 [&] { quick_sort_parallel(m + 1, e); });
 }
 else 
 merge_sort_parallel(b, e); //defined somewhere else, pretty standard
 }
}

Question 2

Code

I would suggest using RAII for your lock.

Instead of:

std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));
// call static random engine to get index of random pivot
tbb::spin_mutex::scoped_lock lock;
lock.acquire(mutex);
auto rand_pivot_index = rand_distribution(rand_engine);
lock.release();
// put pivot to last place in range
std::swap(*(b + rand_pivot_index), *(e - 1));

Use:

std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));
// call static random engine to get index of random pivot
int rand_pivot_index;
{
 tbb::spin_mutex::scoped_lock lock(mutex);
 rand_pivot_index = rand_distribution(rand_engine);
}
// put pivot to last place in range
std::swap(*(b + rand_pivot_index), *(e - 1));

This ensures that if rand_distribution throws, the lock is released correctly.

Algorithm

Choice of pivot is important in any quick-sort algorithm, and even more so in the parallel case, when you want to garuntee a good distribution of load over cores by avoiding unbalanced recursion.

A good basic strategy is to choose the median of k random elements as your pivot.

Taking this further, using a median of medians algorithm can guarantee your quicksort a O(n log n) running time.

The correct approach and value of k will largely depend on the inputs your function will expect to take (e.g. adapt k based on size on input).

Question 3

Of course RAII is the way to manage locks. Today I would do it all a little bit different anyway. Concerning the median of median thing, I don't understand how you can guarantee O(n log n) on an array of only dupes.

Question 4

If your input can have non distinct elements then you also need your partition step to split into three. 1: (< pivot), 2: (= pivot), 3: (> pivot). Then recurse on 1 and 3. This will get you to O(n log n).

Question 5

from where rand_engine come from?

Question 6

@MORTAL from the original question - it's an unimportant implementation detail.

Question 7

Actually, how your pseudo-random number generator is declared does matter, for depending on how it is declared, you may avoid using a lock to generate random numbers. Generating a random number is alreadu an expensive operation, so locking them makes it really expensive.

If you make your pseudo-random number generator thread_local, you don't need to lock anything anymore before generating a random number, it will be inherently thread-safe:

thread_local std::random_device rd;
thread_local std::mt19937 gen(rd());

Moreover, if you also make thread_local your std::uniform_int_distribution, you won't have to update it every time you call partition, but only once in each thread. That said, you can't initialize it with its "range" anymore, you have to pass it when you call it:

// Initialize it once
thread_local std::uniform_int_distribution<int> rand_distribution{};
// Call it
using param_type = std::uniform_int_distribution<int>::param_type;
auto rand_pivot_index = rand_distribution(gen, {0, std::distance(b, e - 1)});

If I didn't make any error while writing the code (untested), this should be thread-safe, lock-free and faster than your original solution :)

Question 8

That's a good idea these days. Guess thread_local wasn't even supported back in 2012.

Question 9

@inf Oh yeah, the question is becoming a bit old. I know that GCC and CLang have supported an almost equivalent __thread for a while, but I guess it took a bit longer to MSVC :)

cmh cmh 4953 silver badges9 bronze badges · Answer 1 · 2012-04-13 13:12:36Z

Code

I would suggest using RAII for your lock.

Instead of:

std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));
// call static random engine to get index of random pivot
tbb::spin_mutex::scoped_lock lock;
lock.acquire(mutex);
auto rand_pivot_index = rand_distribution(rand_engine);
lock.release();
// put pivot to last place in range
std::swap(*(b + rand_pivot_index), *(e - 1));

Use:

std::uniform_int_distribution<int> rand_distribution(0, std::distance(b, e - 1));
// call static random engine to get index of random pivot
int rand_pivot_index;
{
 tbb::spin_mutex::scoped_lock lock(mutex);
 rand_pivot_index = rand_distribution(rand_engine);
}
// put pivot to last place in range
std::swap(*(b + rand_pivot_index), *(e - 1));

This ensures that if rand_distribution throws, the lock is released correctly.

Algorithm

Choice of pivot is important in any quick-sort algorithm, and even more so in the parallel case, when you want to garuntee a good distribution of load over cores by avoiding unbalanced recursion.

A good basic strategy is to choose the median of k random elements as your pivot.

Taking this further, using a median of medians algorithm can guarantee your quicksort a O(n log n) running time.

The correct approach and value of k will largely depend on the inputs your function will expect to take (e.g. adapt k based on size on input).

Of course RAII is the way to manage locks. Today I would do it all a little bit different anyway. Concerning the median of median thing, I don't understand how you can guarantee O(n log n) on an array of only dupes.
If your input can have non distinct elements then you also need your partition step to split into three. 1: (< pivot), 2: (= pivot), 3: (> pivot). Then recurse on 1 and 3. This will get you to O(n log n).
@MORTAL from the original question - it's an unimportant implementation detail.

Morwenn Morwenn 20.2k3 gold badges69 silver badges132 bronze badges · Answer 2 · 2016-04-06 13:56:28Z

Actually, how your pseudo-random number generator is declared does matter, for depending on how it is declared, you may avoid using a lock to generate random numbers. Generating a random number is alreadu an expensive operation, so locking them makes it really expensive.

If you make your pseudo-random number generator thread_local, you don't need to lock anything anymore before generating a random number, it will be inherently thread-safe:

thread_local std::random_device rd;
thread_local std::mt19937 gen(rd());

Moreover, if you also make thread_local your std::uniform_int_distribution, you won't have to update it every time you call partition, but only once in each thread. That said, you can't initialize it with its "range" anymore, you have to pass it when you call it:

// Initialize it once
thread_local std::uniform_int_distribution<int> rand_distribution{};
// Call it
using param_type = std::uniform_int_distribution<int>::param_type;
auto rand_pivot_index = rand_distribution(gen, {0, std::distance(b, e - 1)});

If I didn't make any error while writing the code (untested), this should be thread-safe, lock-free and faster than your original solution :)

That's a good idea these days. Guess thread_local wasn't even supported back in 2012.
@inf Oh yeah, the question is becoming a bit old. I know that GCC and CLang have supported an almost equivalent __thread for a while, but I guess it took a bit longer to MSVC :)

Stack Exchange Network

Parallel Quicksort

2 Answers 2

Code

Algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Parallel Quicksort

2 Answers 2

Code

Algorithm

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions