I wanted to compare radix sort to quicksort for values limited to 0..127, so I implemented this:
#include <assert.h>
#include <stdlib.h>

void bin_radix_sort(int *a, const size_t size, int digits) {
    assert(digits % 2 == 0); // an even number of passes leaves the result in a
    int *b = malloc(size * sizeof(int));
    for (int exp = 0; exp < digits; exp++) {
        // Count elements
        size_t count[2] = {0};
        for (size_t i = 0; i < size; i++)
            count[(a[i] >> exp) & 1]++;
        // Cumulative sum
        count[1] += count[0];
        // Build output array
        for (int i = size - 1; i >= 0; i--)
            b[--count[(a[i] >> exp) & 1]] = a[i];
        int *p = a; a = b; b = p;
    }
    free(b);
}
I was able to compare it to qsort with:
#include <time.h>

struct timespec start;

void tic() {
    timespec_get(&start, TIME_UTC);
}

double toc() {
    struct timespec stop;
    timespec_get(&stop, TIME_UTC);
    return stop.tv_sec - start.tv_sec
         + (stop.tv_nsec - start.tv_nsec) * 1e-9;
}
int cmpfunc (const void * a, const void * b) {
    return ( *(int*)a - *(int*)b );
}
#include <math.h>
#include <stdio.h>

int main(void) {
    const size_t n = 1024 * 1024 * 50;
    printf("Init memory (%zu MB)...\n", n / 1024 / 1024 * sizeof(int));
    int *data = calloc(n, sizeof(int));
    printf("Sorting n = %zu data elements...\n", n);
    size_t O;

    tic();
    O = n * log(n);
    qsort(data, n, sizeof(data[0]), cmpfunc);
    printf("%zu %lf s\n", O, toc());

    int d = 6;
    tic();
    O = d * (n + 2);
    bin_radix_sort(data, n, d);
    printf("%zu %lf s\n", O, toc());
}
This gives me this output:
$ gcc -Os bench.c -lm
$ ./a.out
Init memory (200 MB)...
Sorting n = 52428800 data elements...
931920169 1.600909 s
314572812 0.963436 s
I guess my code needs a code review, since I was expecting it to be six times faster than quicksort.
2 Answers
Well, I guarantee that qsort doesn't start with a heap allocation. Benchmark several different sizes of array and graph the results, to see where the lines hit the Y-axis: how much of your benchmark is just measuring the speed of malloc?
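One way to take the allocation out of the measurement is to let the caller own the scratch buffer. A minimal sketch of such a variant (the bin_radix_sort_buf name and signature are my invention, not part of the original code):

#include <stddef.h>

/* Hypothetical variant of bin_radix_sort: the caller supplies the
 * scratch buffer b, so tic()/toc() no longer include malloc's cost.
 * As in the original, digits must be even so the result ends in a. */
void bin_radix_sort_buf(int *a, int *b, size_t size, int digits) {
    for (int exp = 0; exp < digits; exp++) {
        size_t count[2] = {0};
        for (size_t i = 0; i < size; i++)         // count 0-bits and 1-bits
            count[(a[i] >> exp) & 1]++;
        count[1] += count[0];                     // cumulative sum
        for (size_t i = size; i-- > 0; )          // stable build, back to front
            b[--count[(a[i] >> exp) & 1]] = a[i];
        int *p = a; a = b; b = p;                 // swap input/output roles
    }
}

Allocate the buffer once before tic() and free it after toc(); graphing time against n for both versions should then show how much of the gap was the allocation.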
size_t count[2] = {0};
Depending on the smartness of your compiler, an array that doesn't have to be an array might be a big performance hit. Arrays are often stored in memory, on the stack, as opposed to scalar variables, which can be stored in registers without any real cleverness on the compiler's part. Plus, in this case, your code seems to be needlessly convoluted by the use of an array instead of two different variables count0 and count1. Compare:
for (int exp = 0; exp < digits; ++exp) {
    // Count elements
    size_t count0 = 0;
    size_t count1 = 0;
    for (size_t i = 0; i < size; ++i) {
        if ((a[i] >> exp) & 1) {
            count1 += 1;
        } else {
            count0 += 1;
        }
    }
    // Cumulative sum
    count1 += count0;
    // Build output array
    for (int i = size - 1; i >= 0; --i) {
        if ((a[i] >> exp) & 1) {
            b[--count1] = a[i];
        } else {
            b[--count0] = a[i];
        }
    }
    int *p = a; a = b; b = p;
}
After rewriting like this, it becomes apparent that after the first loop, count0 + count1 == size; and after the "Cumulative sum" step, count1 == size. So we can eliminate half of the code.
size_t count0 = 0;
size_t count1 = size;
for (size_t i = 0; i < size; ++i) {
    if (((a[i] >> exp) & 1) == 0) {
        count0 += 1;
    }
}
// Build output array
for (int i = size - 1; i >= 0; --i) {
    if ((a[i] >> exp) & 1) {
        b[--count1] = a[i];
    } else {
        b[--count0] = a[i];
    }
}
Then, the "Build output array" step is doing the exact same workload (a[i] >> exp) & 1
a second time! That seems like a fruitful source of optimization. What if you folded the second loop into the first loop, something like this?
for (int exp = 0; exp < digits; ++exp) {
    size_t up = 0;
    size_t down = size;
    for (size_t i = 0; i < size; ++i) {
        int x = a[i];
        if ((x >> exp) & 1) {
            b[--down] = x;
        } else {
            b[up++] = x;
        }
    }
    assert(up == down);
    // Now elements [up..size) are in reversed order,
    // so we need to flip them back around.
    reverse_array(b + up, b + size);
    int *temp = a; a = b; b = temp;
}
Writing reverse_array is left as an exercise for the reader.
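For reference, one way to do that exercise, assuming reverse_array(first, last) reverses the half-open range [first, last) as the call above suggests:

/* Reverses the half-open range [first, last) in place. */
static void reverse_array(int *first, int *last) {
    while (first < last) {
        --last;
        int tmp = *first;
        *first = *last;
        *last = tmp;
        ++first;
    }
}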
I'd be interested to see your benchmark results for this "improved" algorithm.
- 0.27 s for your optimized algorithm. Changing the array for two scalar variables does not change anything. Now we have the expected 6x speedup over quicksort :) – nowox, Apr 29, 2020 at 22:14
- Heap allocation doesn't cost much; its effect is negligible on my machine. And yes, I don't like this need for an O(n) space buffer. It would be better to do it in place. – nowox, Apr 29, 2020 at 22:17
Limitations on cmpfunc() with qsort()

Presently the code only has a zero-filled array to sort, so cmpfunc() is OK.
This makes for an interesting performance test.
If the array was populated with [0..127] as suggested in the question, cmpfunc() is still OK.

If the array was populated with [INT_MIN...INT_MAX], cmpfunc() is UB.
For qsort to perform and complete, the following is required:

The function shall return an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second. C17dr § 7.22.5.2 3
Unfortunately *(int*)a - *(int*)b is prone to overflow, which is undefined behavior, and to returning a difference with the wrong sign: for example, with ia = INT_MAX and ib = -1, the subtraction overflows.
int cmpfunc (const void * a, const void * b) {
    return ( *(int*)a - *(int*)b ); // UB
}
Suggest a stable alternative:
int cmpfunc (const void * a, const void * b) {
    int ia = *((const int *)a);
    int ib = *((const int *)b);
    return (ia > ib) - (ia < ib);
}
int *data = calloc(n, sizeof(int));

I'd expect some random filler for data[]. Was it the intention to compare performance with the special case of all-zero values?
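For what it's worth, a minimal sketch of the kind of filler that comment asks about, restricted to [0..127] as in the question (the helper name is mine):

#include <stdlib.h>

/* Hypothetical helper: fill data[] with pseudo-random values in
 * [0..127], so both sorts see non-degenerate input. */
static void fill_random_0_127(int *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] = rand() & 127;   /* equivalent to rand() % 128 */
}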