10

I have an array of struct and I need to sort this array according to a property of the struct (N). The object looks like this:

 struct OBJ
 { 
 int N; //sort array of OBJ with respect to N
 OB *c; //OB is another struct
 } 

The array size is small, about 512 elements, but the size of every element is big therefore I cannot copy the array to shared memory.

What is the simplest and "good" way to sort this array? I do not need a complex algorithm that require a lot of time to implement (since the number of elements in the array is small) I just need a simple algorithm.

Note: I have read some papers about sorting algorithms using GPUs, but the speed gain from these papers only show up when the size of the array is very big. Therefore I did not try to implement their algorithms because the size of my array is small. I only need a simple way to parallel sort my array. Thanks.

Ashwin Nanjappa
78.9k90 gold badges222 silver badges298 bronze badges
asked Mar 13, 2011 at 11:05
1
  • For sorting just one small array, you won't be able to saturate the GPU either way. Might be easier to sort CPU side, especially if you've got other kernels to run in the meantime. However, if you have lots of small lists (with up to ~500 items as yours is) a similar problem is order-independent transparency. See this. You'd want to pull a list of keys and indices, sort those, then use the indices to reorder or just read in sorted order. Commented Mar 4, 2015 at 14:34

4 Answers 4

6

What means "big" and "small" ?

By "big" I assume you mean something of>1M elements, while small --- small enough to actually fit in shared memory (probably <1K elements). If my understanding of "small" matches yours, I would try the following:

  • Use only a single block to sort the array (it can be then a part of some bigger CUDA kernel)
  • Bitonic sort is one of good appraches which can be adopted for parallel algorithm.

Some pages on bitonic sort:

  • Bitonic sort (nice explanation, applet to visualise and java source which does not take too much space)
  • Wikipedia (a bit too short explanation for my taste, but more source codes - some abstract language and Java)
  • NVIDIA code Samples (A sample source in CUDA. I think it is a bit ovefocused on killing bank conflicts. I believe the simpler code may actually perform faster)

I once also implemented a bubble sort (lol!) for a single warp to sort arrays of 32 elements. Thanks to its simplicity it did not perform that bad actually. A well tuned bitonic sort will still perform faster though.

answered Mar 13, 2011 at 11:51
2
  • 8
    "I once also implemented a bubble sort " - you are so going to developer hell! Commented Mar 27, 2011 at 13:14
  • Insertion and bubble sort are great on GPUs, even up towards 100 items. Low SIMD divergence. Bitonic's good but you have to have 2^n items or not many below (i.e. 513 and it's comparably quite slow). Commented Mar 4, 2015 at 14:41
2

Why exactly are you heading towards CUDA? I mean, it smells like your problem is not one of those, CUDA is very good at. You just want to sort an array of 512 Elements and let some pointers refer to another location. This is nothing fancy, use a simple serial algorithm for that, e.g. Quicksort, Heapsort or Mergesort.

Additionally, think about the overhead it takes to copy data from your Heap/Stack to your CUDA device. Using CUDA just makes sense, when the calculations are intense enough so that COMPUTING_TIME_ON_CUDA+COPY_DATA_FROM_HEAP_TO_CUDA_DEVICE+COPY_DATA_FROM_CUDA_DEVICE_TO_HEAP < COMPUTING_TIME_ON_HOST_CPU.

Besides, CUDA is immersely powerful at math calculations with big vectors and matrices and rather simple data-types (numbers) because it is one of the problems that often arise on a GPU: Calculating graphics.

answered Mar 13, 2011 at 12:21
2

Use the sorting calls available in the CUDPP or the Thrust library.

If you use cudppSort, note that it only works with integers or floats. To sort your array of structures, you can first sort the keys along with an index array. Later, you can use the sorted index array to move the structures to their final sorted location. I have described how to do this for the cudppCompact compaction algorithm in a blog post here. The steps are similar for sorting an array of structs using cudppSort.

Konard
3,09436 silver badges28 bronze badges
answered Mar 14, 2011 at 2:51
1
  • 1
    cudpp has merge sort and radix sort, by the way. Commented Nov 22, 2016 at 11:03
0

Yes I would totally agree, the overhead of sorting small arrays (<5k elements) kills the possible speedup you will achieve with a "fine-tuned" parallel sorting algorithm implemented in CUDA. I would prefer CPU based sorting for such a small size...

answered Mar 13, 2011 at 17:34
1
  • 10
    Sometimes you have a bigger problem being solved on CUDA and those (sometimes several) small arrays that need sorting can be a "byproduct" of your code. In those scenario it is perfectly valid to do it on CUDA, rather than sending that stuff to host, using CPU and then back to GPU. Commented Mar 13, 2011 at 17:56

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.