
I have a large amount of data which I need to sort: several million arrays, each with tens of thousands of values. What I'm wondering is the following:

Is it better to implement a parallel sorting algorithm on the GPU and run it across all the arrays,

OR

implement a single-threaded algorithm, like quicksort, and assign each GPU thread a different array?

Obviously speed is the most important factor. For a single-threaded sorting algorithm, memory is a limiting factor. I've already tried to implement a recursive quicksort, but it doesn't seem to work for large amounts of data, so I'm assuming there is a memory issue.

The data type to be sorted is long, so I don't believe a radix sort would be possible, because the binary representation of the numbers would be too long.

Any pointers would be appreciated.

asked Jul 13, 2013 at 16:35
  • @JackOLantern Thanks for the link. I guess I wasn't clear: both options will be run on the GPU. Commented Jul 13, 2013 at 17:14
  • @HansRudel Are all of those arrays of the same length? Even if they are the same length, if you want to use the thread-per-array approach, you should use an algorithm that always executes the same steps regardless of how (un)sorted the data is. You will want to avoid thread divergence caused either by differences in array length or by differences in how the sorting algorithm executes for different data. Commented Jul 13, 2013 at 18:19
  • @RoBiK Yeah, they are of differing lengths. I've finally got a quicksort algorithm written and working on the CPU, but I haven't tried it on the GPU. Times for the CPU version are 6 seconds for 16,000 values, which isn't very good. So if I've understood your comment, I should generate all the lists first, then implement a parallel version and sort a single row at a time? Commented Jul 13, 2013 at 18:24

1 Answer


Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
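For a single array, the thrust call is about as simple as it gets. Here is a minimal sketch (the array size and host data are just placeholders) for sorting 64-bit values on the device; thrust dispatches to a radix sort under the hood for primitive integer keys:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <vector>

    int main()
    {
        // Placeholder host data; in practice this would be one of your arrays.
        std::vector<long long> h_data(50000, 0);

        // Copy to the device and sort.
        thrust::device_vector<long long> d_data(h_data.begin(), h_data.end());
        thrust::sort(d_data.begin(), d_data.end());

        // Copy back if the sorted result is needed on the host.
        thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
        return 0;
    }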

Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
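One common way to get the "vectorized" effect in thrust is to pack all the arrays into one buffer, tag each element with the index of the array (segment) it came from, and do two back-to-back stable sorts. This is just a sketch of that idea, with illustrative names:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    // values:      all arrays concatenated into one device buffer
    // segment_ids: for each element, the index of the array it belongs to
    void segmented_sort(thrust::device_vector<long long>& values,
                        thrust::device_vector<int>&       segment_ids)
    {
        // Pass 1: sort by value, dragging the segment tags along.
        thrust::stable_sort_by_key(values.begin(), values.end(), segment_ids.begin());

        // Pass 2: stable sort by segment tag; stability preserves the per-segment
        // value ordering established in pass 1, so each array ends up sorted.
        thrust::stable_sort_by_key(segment_ids.begin(), segment_ids.end(), values.begin());
    }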

Generally I would say the same thing about CPU sorting code. Don't write your own.

EDIT: One more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread). There are two related reasons for this:

  1. Most of the fast sorting work has been done along the lines of your first method, not the second.
  2. The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing, minimizing branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder if the same criticism could be levelled at the first approach, but since the libraries I mention are written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of the SIMT architecture (see the CUB sketch below).
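To illustrate the CUB per-thread-block idea, here is a rough sketch in which each block cooperatively radix-sorts one fixed-size tile held in registers. The block/tile sizes are arbitrary, and real arrays of differing lengths would need padding or a segmented approach, so treat this as the shape of the solution rather than a drop-in kernel:

    #include <cub/cub.cuh>

    #define BLOCK_THREADS    128
    #define ITEMS_PER_THREAD 16    // 128 * 16 = 2048 keys sorted per block

    __global__ void block_sort_kernel(long long *d_data)
    {
        typedef cub::BlockRadixSort<long long, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSort;
        __shared__ typename BlockRadixSort::TempStorage temp_storage;

        // Each block owns one contiguous tile (one "array") of 2048 keys.
        long long *tile = d_data + (size_t)blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

        // Load a blocked arrangement of keys into registers.
        long long keys[ITEMS_PER_THREAD];
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
            keys[i] = tile[threadIdx.x * ITEMS_PER_THREAD + i];

        // Cooperative radix sort across the whole thread block.
        BlockRadixSort(temp_storage).Sort(keys);

        // Write the sorted keys back in the same blocked arrangement.
        for (int i = 0; i < ITEMS_PER_THREAD; ++i)
            tile[threadIdx.x * ITEMS_PER_THREAD + i] = keys[i];
    }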
answered Jul 13, 2013 at 19:26
  • Thanks very much for the links. I'm not 100% sure if these are supported by CUDAFY, but I will dig around and see. It's taken several hours to get quicksort working properly, so your comment "Don't write your own" is sound advice, especially since quicksort is supposed to be an easy algorithm. Thanks again for your help, much appreciated. Commented Jul 14, 2013 at 0:48
  • I think none of them are managed code, so using them would be more challenging (from Cudafy/C#). If you really want max sort performance, however, I think it might make sense to consider building an unmanaged DLL or something like that, i.e. stepping outside of your Cudafy sandbox. I had a project I was working on where sort perf was critical, and in the end, by incorporating the modernGPU stuff, I was able to get ~2x even over thrust performance. It might make sense to not require everything to be cudafy-ready. But yes, it is more work. Commented Jul 14, 2013 at 1:19
  • Would it be possible to speak to you in one of the chat rooms, please? This is the C# room, but we can make a separate one if it's busy. chat.stackoverflow.com/rooms/7/c Commented Jul 15, 2013 at 16:42
