I wanted to compare radix sort to quicksort for values limited to 0..127, so I implemented this:
#include <assert.h>
#include <stdlib.h>

void bin_radix_sort(int *a, const size_t size, int digits) {
    assert(digits % 2 == 0); // an even number of passes leaves the result in a
    int *b = malloc(size * sizeof(int));
    for (int exp = 0; exp < digits; exp++) {
        // Count elements
        size_t count[2] = {0};
        for (size_t i = 0; i < size; i++)
            count[(a[i] >> exp) & 1]++;
        // Cumulative sum
        count[1] += count[0];
        // Build output array
        for (int i = size - 1; i >= 0; i--)
            b[--count[(a[i] >> exp) & 1]] = a[i];
        int *p = a; a = b; b = p;
    }
    free(b);
}
I was able to compare it to qsort with:
#include <time.h>

struct timespec start;

void tic() {
    timespec_get(&start, TIME_UTC);
}

double toc() {
    struct timespec stop;
    timespec_get(&stop, TIME_UTC);
    return stop.tv_sec - start.tv_sec
         + (stop.tv_nsec - start.tv_nsec) * 1e-9;
}
int cmpfunc (const void * a, const void * b) {
    return ( *(int*)a - *(int*)b );
}
#include <math.h>
#include <stdio.h>

int main(void) {
    const size_t n = 1024 * 1024 * 50;
    printf("Init memory (%zu MB)...\n", n / 1024 / 1024 * sizeof(int));
    int *data = calloc(n, sizeof(int));
    printf("Sorting n = %zu data elements...\n", n);
    size_t O;

    tic();
    O = n * log(n);
    qsort(data, n, sizeof(data[0]), cmpfunc);
    printf("%zu %lf s\n", O, toc());

    int d = 6;
    tic();
    O = d * (n + 2);
    bin_radix_sort(data, n, d);
    printf("%zu %lf s\n", O, toc());
}
This gives me this output:
$ gcc -Os bench.c -lm
$ ./a.out
Init memory (200 MB)...
Sorting n = 52428800 data elements...
931920169 1.600909 s
314572812 0.963436 s
I guess my code needs a code review, since I was expecting it to be six times faster than quicksort.
2 Answers
Well, I guarantee that qsort doesn't start with a heap allocation. Benchmark several different sizes of array and graph the results, to see where the lines hit the Y-axis: how much of your benchmark is just measuring the speed of malloc?
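One way to take the allocation out of the measurement is to let the caller own the scratch buffer. A minimal sketch of such a variant (the bin_radix_sort_buf name and signature are my invention, not part of the original code):

#include <stddef.h>

/* Hypothetical variant of bin_radix_sort: the caller supplies the
 * scratch buffer b, so tic()/toc() no longer include malloc's cost.
 * As in the original, digits must be even so the result ends in a. */
void bin_radix_sort_buf(int *a, int *b, size_t size, int digits) {
    for (int exp = 0; exp < digits; exp++) {
        size_t count[2] = {0};
        for (size_t i = 0; i < size; i++)         // count 0-bits and 1-bits
            count[(a[i] >> exp) & 1]++;
        count[1] += count[0];                     // cumulative sum
        for (size_t i = size; i-- > 0; )          // stable build, back to front
            b[--count[(a[i] >> exp) & 1]] = a[i];
        int *p = a; a = b; b = p;                 // swap input/output roles
    }
}

Allocate the buffer once before tic() and free it after toc(); graphing time against n for both versions should then show how much of the gap was the allocation.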
size_t count[2] = {0};
Depending on the smartness of your compiler, an array that doesn't have to be an array might be a big performance hit. Arrays are often stored in memory, on the stack, as opposed to scalar variables, which can be stored in registers without any real cleverness on the compiler's part. Plus, in this case, your code seems to be needlessly convoluted by the use of an array instead of two different variables count0 and count1. Compare:
for (int exp = 0; exp < digits; ++exp) {
    // Count elements
    size_t count0 = 0;
    size_t count1 = 0;
    for (size_t i = 0; i < size; ++i) {
        if ((a[i] >> exp) & 1) {
            count1 += 1;
        } else {
            count0 += 1;
        }
    }
    // Cumulative sum
    count1 += count0;
    // Build output array
    for (int i = size - 1; i >= 0; --i) {
        if ((a[i] >> exp) & 1) {
            b[--count1] = a[i];
        } else {
            b[--count0] = a[i];
        }
    }
    int *p = a; a = b; b = p;
}
After rewriting like this, it becomes apparent that after the first loop, count0 + count1 == size; and after the "Cumulative sum" step, count1 == size. So we can eliminate half of the code.
size_t count0 = 0;
size_t count1 = size;
for (size_t i = 0; i < size; ++i) {
    if (((a[i] >> exp) & 1) == 0) {
        count0 += 1;
    }
}
// Build output array
for (int i = size - 1; i >= 0; --i) {
    if ((a[i] >> exp) & 1) {
        b[--count1] = a[i];
    } else {
        b[--count0] = a[i];
    }
}
Then, the "Build output array" step is doing the exact same workload (a[i] >> exp) & 1
a second time! That seems like a fruitful source of optimization. What if you folded the second loop into the first loop, something like this?
for (int exp = 0; exp < digits; ++exp) {
    size_t up = 0;
    size_t down = size;
    for (size_t i = 0; i < size; ++i) {
        int x = a[i];
        if ((x >> exp) & 1) {
            b[--down] = x;
        } else {
            b[up++] = x;
        }
    }
    assert(up == down);
    // Now elements [up..size) are in reversed order,
    // so we need to flip them back around.
    reverse_array(b + up, b + size);
    int *temp = a; a = b; b = temp;
}
Writing reverse_array is left as an exercise for the reader.
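For reference, one way to do that exercise, assuming reverse_array(first, last) reverses the half-open range [first, last) as the call above suggests:

/* Reverses the half-open range [first, last) in place. */
static void reverse_array(int *first, int *last) {
    while (first < last) {
        --last;
        int tmp = *first;
        *first = *last;
        *last = tmp;
        ++first;
    }
}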
I'd be interested to see your benchmark results for this "improved" algorithm.
- 0.27 s for your optimized algorithm. Changing the array for two scalar variables does not change anything. Now we have the expected 6x speedup over quicksort :) – nowox, Apr 29, 2020 at 22:14
- Heap allocation doesn't cost much; its effect is negligible on my machine. And yes, I don't like this need for an O(n) space buffer. It would be better to do it in place. – nowox, Apr 29, 2020 at 22:17
Limitations on cmpfunc() with qsort()

Presently the code only has a zero-filled array to sort, so cmpfunc() is OK.
This makes for an interesting performance test.
If the array was populated with [0..127] as suggested in the question, cmpfunc() is still OK.

If the array was populated with [INT_MIN...INT_MAX], cmpfunc() is UB.
For qsort to perform and complete, the following is required:

The function shall return an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second. C17dr § 7.22.5.2 3
Unfortunately *(int*)a - *(int*)b is prone to overflow, which is undefined behavior, and to returning a difference with the wrong sign: for example, with ia = INT_MAX and ib = -1, the subtraction overflows.
int cmpfunc (const void * a, const void * b) {
    return ( *(int*)a - *(int*)b ); // UB
}
Suggest a stable alternative:
int cmpfunc (const void * a, const void * b) {
    int ia = *((const int *)a);
    int ib = *((const int *)b);
    return (ia > ib) - (ia < ib);
}
int *data = calloc(n, sizeof(int));

I'd expect some random filler for data[]. Was it the intention to compare performance with the special case of all-zero values?
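For what it's worth, a minimal sketch of the kind of filler that comment asks about, restricted to [0..127] as in the question (the helper name is mine):

#include <stdlib.h>

/* Hypothetical helper: fill data[] with pseudo-random values in
 * [0..127], so both sorts see non-degenerate input. */
static void fill_random_0_127(int *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] = rand() & 127;   /* equivalent to rand() % 128 */
}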