As requested by @Reinderien, this is the complete gcc command that is invoked when the radix sort is compiled by gcc:
building 'pyorama.algs.radix_sort' extension
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe -fwrapv -D__USE_MINGW_ANSI_STDIO=1 -D_WIN32_WINNT=0x0601 -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe -fwrapv -D__USE_MINGW_ANSI_STDIO=1 -D_WIN32_WINNT=0x0601 -DNDEBUG -DCIMGUI_DEFINE_ENUMS_AND_STRUCTS=True -DCGLTF_IMPLEMENTATION=True -DSTB_IMAGE_IMPLEMENTATION=True -DSTBI_FAILURE_USERMSG=True -DCGLM_CLIPSPACE_INCLUDE_ALL=True -DCGLM_ALL_UNALIGNED=True -I./pyorama/algs -I./pyorama/libs/include -IC:/msys64/mingw64/include/python3.9 -c ./pyorama/algs/radix_sort.c -o build/temp.mingw_x86_64-3.9/./pyorama/algs/radix_sort.o -w -std=c11 -O3 -ffast-math -march=native
And here is a link to the results (.csv) as well as a graph of the data:
As requested by @Reinderien, this is the complete gcc command that is invoked when the radix sort is compiled by gcc:
building 'pyorama.algs.radix_sort' extension
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe -fwrapv -D__USE_MINGW_ANSI_STDIO=1 -D_WIN32_WINNT=0x0601 -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe -fwrapv -D__USE_MINGW_ANSI_STDIO=1 -D_WIN32_WINNT=0x0601 -DNDEBUG -DCIMGUI_DEFINE_ENUMS_AND_STRUCTS=True -DCGLTF_IMPLEMENTATION=True -DSTB_IMAGE_IMPLEMENTATION=True -DSTBI_FAILURE_USERMSG=True -DCGLM_CLIPSPACE_INCLUDE_ALL=True -DCGLM_ALL_UNALIGNED=True -I./pyorama/algs -I./pyorama/libs/include -IC:/msys64/mingw64/include/python3.9 -c ./pyorama/algs/radix_sort.c -o build/temp.mingw_x86_64-3.9/./pyorama/algs/radix_sort.o -w -std=c11 -O3 -ffast-math -march=native
And here is a link to the results (.csv) as well as a graph of the data:
I am trying to optimize the following (now C) radix sort code for use in my game engine library. The basis for this code was inspired by this youtube video on a C++ implementation called skasort
that used templating to sort STL containers. The current first pass code sorts in-place in most-significant-byte (MSB) order and is designed to handle [u]int[8/16/32/64]_t
types as well as float/f32
, double/f64
and ``char *`char *
strings of variable lengths. Here is the raw C code (header-only) I wrote that my cython library is wrapping:
The benchmarking test I have is writtingwritten in cython and runs radix sorts on each of the supported types (with the char * test being on random alphanumeric strings ranging from 20-50 chars in length), on a custom vector-like container from n=2^1
to n=2^24 items
. I can add the benchmarking code if needed; in the meantime, the code for the game engine repo can be found here. The performance of this code is slower than I would like, ranging from 5x
faster in the simple uint8_t
case to about 0.9x
the speed of the radix sort in the worst cases where the sort code needs to be recursively called (the 64-bit types and the variable length strings). I have the following questions:
I am trying to optimize the following (now C) radix sort code for use in my game engine library. The basis for this code was inspired by this youtube video on a C++ implementation called skasort
that used templating to sort STL containers. The current first pass code sorts in-place in most-significant-byte (MSB) order and is designed to handle [u]int[8/16/32/64]_t
types as well as float/f32
, double/f64
and ``char *` strings of variable lengths. Here is the raw C code (header-only) I wrote that my cython library is wrapping:
The benchmarking test I have is writting in cython and runs radix sorts on each of the supported types (with the char * test being on random alphanumeric strings ranging from 20-50 chars in length), on a custom vector-like container from n=2^1
to n=2^24 items
. I can add the benchmarking code if needed; in the meantime, the code for the game engine repo can be found here. The performance of this code is slower than I would like, ranging from 5x
faster in the simple uint8_t
case to about 0.9x
the speed of the radix sort in the worst cases where the sort code needs to be recursively called (the 64-bit types and the variable length strings). I have the following questions:
I am trying to optimize the following (now C) radix sort code for use in my game engine library. The basis for this code was inspired by this youtube video on a C++ implementation called skasort
that used templating to sort STL containers. The current first pass code sorts in-place in most-significant-byte (MSB) order and is designed to handle [u]int[8/16/32/64]_t
types as well as float/f32
, double/f64
and char *
strings of variable lengths. Here is the raw C code (header-only) I wrote that my cython library is wrapping:
The benchmarking test I have is written in cython and runs radix sorts on each of the supported types (with the char * test being on random alphanumeric strings ranging from 20-50 chars in length), on a custom vector-like container from n=2^1
to n=2^24 items
. I can add the benchmarking code if needed; in the meantime, the code for the game engine repo can be found here. The performance of this code is slower than I would like, ranging from 5x
faster in the simple uint8_t
case to about 0.9x
the speed of the radix sort in the worst cases where the sort code needs to be recursively called (the 64-bit types and the variable length strings). I have the following questions:
- 302
- 1
- 2
- 13
I am trying to optimize the following cython(now C) radix sort code I have been working on the past week to be usedfor use in my game engine for tasks such as sorting sprites by z-depth/distance from camera for renderinglibrary. The basis for this code was inspired by this youtube video on a C++ implementation called skasort
that used templating to sort STL containers. The current first pass code sorts in-place in most-significant-byte (MSB) order and is designed to handle [u]int[8/16/32/64]_t
types as well as float/f32
, double/f64
and ``char *` strings of variable lengths. I intend to use this code purely within cython and have removed any python object overhead.
The codeHere is split into a header (.pxd) file as follows:
from pyorama.libs.c cimport *
cpdef enum RadixSortType:
RADIX_SORT_TYPE_U8
RADIX_SORT_TYPE_I8
RADIX_SORT_TYPE_U16
RADIX_SORT_TYPE_I16
RADIX_SORT_TYPE_U32
RADIX_SORT_TYPE_I32
RADIX_SORT_TYPE_U64
RADIX_SORT_TYPE_I64
RADIX_SORT_TYPE_F32
RADIX_SORT_TYPE_F64
RADIX_SORT_TYPE_STR
ctypedef cmp_func_t BackupCmpFuncC
cdef int cmp_func_u8(const void *a, const void *b) nogil
cdef int cmp_func_i8(const void *a, const void *b) nogil
cdef int cmp_func_u16(const void *a, const void *b) nogil
cdef int cmp_func_i16(const void *a, const void *b) nogil
cdef int cmp_func_u32(const void *a, const void *b) nogil
cdef int cmp_func_i32(const void *a, const void *b) nogil
cdef int cmp_func_u64(const void *a, const void *b) nogil
cdef int cmp_func_i64(const void *a, const void *b) nogil
cdef int cmp_func_f32(const void *a, const void *b) nogil
cdef int cmp_func_f64(const void *a, const void *b) nogil
cdef int cmp_func_str(const void *a, const void *b) nogil
ctypedef uint8_t (* RadixKeyFuncC)(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u8(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i8(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u16(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i16(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u32(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i32(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u64(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i64(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_f32(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_f64(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_str(void *item, size_t byte_offset) nogil
cdef void c_radix_sort(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, RadixSortType type_, RadixKeyFuncC key_func=*) nogil
And an implementationthe raw C code (.pyxheader-only) file as followsI wrote that my cython library is wrapping:
DEF RADIX = 256
DEF RADIX_THRESHOLD = 128
ctypedef struct RadixPartitionTableC:
size_t[RADIX] counts
size_t[RADIX + 1] prefix_sums
size_t[RADIX + 1] shifted_sums
cdef void table_clear(RadixPartitionTableC *table) nogil:
memset(table.counts, 0, sizeof(size_t) * RADIX)
#include memset(table<stdint.prefix_sums, 0, sizeof(size_t) * (RADIX + 1))h>
#include memset(table<stdbool.shifted_sums, 0, sizeof(size_t) * (RADIX + 1))
cdef void item_swap(void *items, size_t a, size_t b, size_t size) nogil:
cdef:
size_t i
uint8_t *a_ptr = <uint8_t *>items + (a * size)
uint8_t *b_ptr = <uint8_t *>items + (b * size)
for i in range(size):
a_ptr[i], b_ptr[i] = b_ptr[i], a_ptr[i]
cdef int cmp_func_u8(const void *a, const void *b) nogil:
cdef:
uint8_t a_i = (<uint8_t *>a)[0]
uint8_t b_i = (<uint8_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_i8(const void *a, const void *b) nogil:
cdef:
int8_t a_i = (<int8_t *>a)[0]
int8_t b_i = (<int8_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0h>
cdef int cmp_func_u16(const void *a, const void *b) nogil:
cdef:
uint16_t a_i = (<uint16_t *>a)[0]
uint16_t b_i = (<uint16_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i:#define returnRADIX 1256
else:#define returnRADIX_THRESHOLD 0128
cdef int#define cmp_func_i16CMP_CHECK(const void *aa, const voidb, *bcmp_type) nogil:
cdef:\
cmp_type a_i = *(cmp_type *)a; int16_t\
cmp_type a_ib_i = *(<int16_tcmp_type *>a*)[0]
b; \
if(a_i < b_i) \
{ \
int16_t b_i =return (<int16_t-1; *>b)[0]\
} \
else if(a_i <> b_i: return) -1\
{ \
elif a_ireturn >1; b_i:\
} return\
else 1\
{ \
else: return 00; \
} \
cdef#define CMP_FUNC(a, b, cmp_type, cmp_suffix) \
int cmp_func_u32cmp_func_##cmp_suffix(const void *a, const void *b) nogil:\
{ cdef:\
uint32_t a_i = CMP_CHECK(<uint32_t *>a)[0]
uint32_t b_i =a, (<uint32_tb, *>bcmp_type)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1\
else: return} 0\
cdef int cmp_func_i32(const void *a, const voidtypedef *b)enum nogil:RadixSortType
cdef:{
RADIX_SORT_TYPE_U8,
int32_tRADIX_SORT_TYPE_I8,
a_i = (<int32_t *>a)[0]RADIX_SORT_TYPE_U16,
RADIX_SORT_TYPE_I16,
int32_tRADIX_SORT_TYPE_U32,
b_i = (<int32_t *>b)[0]RADIX_SORT_TYPE_I32,
if a_iRADIX_SORT_TYPE_U64,
< b_i: return -1RADIX_SORT_TYPE_I64,
elif a_iRADIX_SORT_TYPE_F32,
> b_i: return 1RADIX_SORT_TYPE_F64,
else: returnRADIX_SORT_TYPE_STR,
} 0RadixSortType;
cdeftypedef int cmp_func_u64(* CmpFuncC)(const void *a*, const void *b*) nogil:
cdef:;
uint64_t a_itypedef =uint8_t (<uint64_t* *>aRadixKeyFuncC)[0]
uint64_t b_i = (<uint64_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
void else:*item, returnsize_t 0byte_offset);
cdef int cmp_func_i64(const void *a, const void *b) nogil:
typedef struct cdef:RadixPartitionTableC
int64_t a_i = (<int64_t *>a)[0]{
int64_t b_i = (<int64_tsize_t *>b)[0]counts[RADIX];
if a_i <size_t b_i:prefix_sums[RADIX return+ -11];
elif a_i >size_t b_i:shifted_sums[RADIX return+ 11];
else: return} 0RadixPartitionTableC;
cdef int cmp_func_f32(const void *a, const voidtable_clear(RadixPartitionTableC *b*table) nogil:
cdef:{
float a_imemset(table->counts, =0, sizeof(<float *>asize_t)[0]
* RADIX);
float b_imemset(table->prefix_sums, =0, sizeof(<float *>bsize_t)[0]
if a_i <* b_i:(RADIX return+ -1));
elifmemset(table->shifted_sums, a_i0, >sizeof(size_t) b_i:* return(RADIX + 1));
else: return 0}
cdef intvoid cmp_func_f64item_swap(const void *a*items, constsize_t voida, *b)size_t nogil:b, size_t size)
{
cdef:
uint8_t *a_ptr = (uint8_t *)items + (a * size);
double a_iuint8_t *b_ptr = (<doubleuint8_t *>a*)[0]items + (b * size);
uint8_t tmp;
double b_ifor(size_t i = (<double0; *>bi < size; i++)[0]
if{
a_i < b_i: return -1 tmp = a_ptr[i];
elif a_i > b_i: returna_ptr[i] 1= b_ptr[i];
else: return 0 b_ptr[i] = tmp;
}
}
cdefCMP_FUNC(a, intb, cmp_func_struint8_t, u8)
CMP_FUNC(consta, voidb, *aint8_t, consti8)
CMP_FUNC(a, voidb, *b)uint16_t, nogil:u16)
CMP_FUNC(a, b, int16_t, cdef:i16)
CMP_FUNC(a, b, uint32_t, u32)
CMP_FUNC(a, b, int32_t, i32)
CMP_FUNC(a, b, charuint64_t, *a_iu64)
CMP_FUNC(a, =b, (<charint64_t, **>ai64)[0]
CMP_FUNC(a, b, float, f32)
CMP_FUNC(a, b, double, f64)
int cmp_func_str(const charvoid *b_i*a, =const (<charvoid **>b*b)[0]
{
return strcmp(a_i*(char **)a, b_i*(char **)b);
}
cdef uint8_t radix_key_func_u8(void *item, size_t byte_offset) nogil:
{
return (<uint8_t(uint8_t *>item*)item + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_i8(void *item, size_t byte_offset) nogil:
{
return (<uint8_t(uint8_t *>item*)item + byte_offset)[0] + 128128;
}
cdef uint8_t radix_key_func_u16(void *item, size_t byte_offset) nogil:
{
return (<uint8_t(uint8_t *>item*)item + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_i16(void *item, size_t byte_offset) nogil:
cdef:{
uint16_t value
value = (<uint16_t(uint16_t *>item*)item)[0] + 3276832768;
return (<uint8_t(uint8_t *>&value*)&value + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_u32(void *item, size_t byte_offset) nogil:
{
return (<uint8_t(uint8_t *>item*)item + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_i32(void *item, size_t byte_offset) nogil:
cdef:{
uint32_t value value = (<int32_t(int32_t *>item*)item)[0] + <int32_t>2147483648(int32_t)2147483648;
return (<uint8_t(uint8_t *>&value*)&value + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_u64(void *item, size_t byte_offset) nogil:
{
return (<uint8_t(uint8_t *>item*)item + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_i64(void *item, size_t byte_offset) nogil:
cdef:{
uint64_t value value = (<int64_t(int64_t *>item*)item)[0] + <int64_t>9223372036854775808(int64_t)9223372036854775808;
return (<uint8_t(uint8_t *>&value*)&value + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_f32(void *item, size_t byte_offset) nogil:
cdef:{
uint32_t value
uint32_t mask
uint32_t shifted_value
value = (<uint32_t(uint32_t *>item*)[0]item)[0];
uint32_t mask = -<int32_t>(int32_t)(value >> <uint32_t>31(uint32_t)31) | <uint32_t>2147483648(uint32_t)2147483648;
uint32_t shifted_value = value ^ maskmask;
return (<uint8_t(uint8_t *>*)(&shifted_value) + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_f64(void *item, size_t byte_offset) nogil:
cdef:{
uint64_t value
uint64_t mask
uint64_t shifted_value
value = (<uint64_t(uint64_t *>item*)[0]item)[0];
uint64_t mask = -<int64_t>(int64_t)(value >> <uint64_t>63(uint64_t)63) | <uint64_t>9223372036854775808(uint64_t)9223372036854775808;
uint64_t shifted_value = value ^ maskmask;
return (<uint8_t(uint8_t *>*)(&shifted_value) + byte_offset)[0][0];
}
cdef uint8_t radix_key_func_str(void *item, size_t byte_offset) nogil:
{
return ((<uint8_t(uint8_t **>item**)item)[0] + byte_offset)[0][0];
}
cdef void c_radix_sort(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, RadixSortType type_, RadixKeyFuncC key_func=NULL) nogil:
cdef:
size_t start, size_t iend, j,size_t kstart_offset, m
uint8_t item
RadixSortType type_, size_tRadixKeyFuncC countkey_func)
{
size_t offset = 0byte_offset;
RadixPartitionTableCsize_t tablenum_bytes;
bool flip_byte_order size_t[RADIX]= countstrue;
bool use_null_term size_t= part_startfalse;
size_tCmpFuncC part_endcmp_func;
if switch(type_)
== RADIX_SORT_TYPE_U8: {
c_radix_sort_u8(items,case item_size,RADIX_SORT_TYPE_U8:
start, end, start_offset, 0)
elif type_ == RADIX_SORT_TYPE_I8: byte_offset = 0;
c_radix_sort_i8(items, item_size, start, end, start_offset,num_bytes 0= sizeof(uint8_t);
elif type_ == RADIX_SORT_TYPE_U16:
key_func = radix_key_func_u8;
c_radix_sort_u16(items, item_size, start, end, start_offset, 1)
elif type_ ==cmp_func RADIX_SORT_TYPE_I16:= cmp_func_u8;
c_radix_sort_i16(items, item_size, start, end, start_offset, 1)break;
elif type_ == RADIX_SORT_TYPE_U32 case RADIX_SORT_TYPE_I8:
c_radix_sort_u32(items, item_size, start, end, start_offset,byte_offset 3)= 0;
elif type_ == RADIX_SORT_TYPE_I32:
num_bytes = c_radix_sort_i32sizeof(items,int8_t);
item_size, start, end, start_offset, 3)
elif type_ == RADIX_SORT_TYPE_U64:key_func = radix_key_func_i8;
c_radix_sort_u64(items, item_size, start, end, start_offset,cmp_func 7)= cmp_func_i8;
elif type_ == RADIX_SORT_TYPE_I64:
break;
c_radix_sort_i64(items, item_size, start, end, start_offset, 7)case RADIX_SORT_TYPE_U16:
elif type_ == RADIX_SORT_TYPE_F32:
byte_offset = 1;
c_radix_sort_f32(items, item_size, start, end, start_offset, 3)
elif type_ ==num_bytes RADIX_SORT_TYPE_F64:= sizeof(uint16_t);
c_radix_sort_f64(items, item_size, start, end, start_offset,key_func 7)= radix_key_func_u16;
elif type_ == RADIX_SORT_TYPE_STR:
cmp_func = cmp_func_u16;
c_radix_sort_str(items, item_size, start, end, start_offset, 0)
cdef void c_radix_sort_u8(void *items, size_t item_size,break;
size_t start, size_t end, size_t start_offset, size_t byte_offset)case nogilRADIX_SORT_TYPE_I16:
c_radix_sort_byte(
items,byte_offset item_size,= start,1;
end, start_offset,
byte_offset, num_bytes = sizeof(uint8_tint16_t), True, False,;
radix_key_func_u8, cmp_func_u8,
key_func )
= radix_key_func_i16;
cdef void c_radix_sort_i8(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_tcmp_func byte_offset)= nogil:cmp_func_i16;
c_radix_sort_byte(
items,break;
item_size, start, end, start_offset, case RADIX_SORT_TYPE_U32:
byte_offset, sizeof(int8_t), True, False, byte_offset = 3;
radix_key_func_i8, cmp_func_i8,
num_bytes = sizeof(uint32_t);
cdef void c_radix_sort_u16(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_tkey_func byte_offset)= nogil:radix_key_func_u32;
c_radix_sort_byte(
items, item_size, start,cmp_func end,= start_offset,cmp_func_u32;
byte_offset, sizeof(uint16_t), True, False, break;
radix_key_func_u16,case cmp_func_u16,RADIX_SORT_TYPE_I32:
)
cdef void c_radix_sort_i16(void *items, size_t item_size, size_t start, size_tbyte_offset end,= size_t3;
start_offset, size_t byte_offset) nogil:
c_radix_sort_byte num_bytes = sizeof(int32_t);
items, item_size, start, end, start_offset,key_func = radix_key_func_i32;
byte_offset, sizeof(int16_t), True, False, cmp_func = cmp_func_i32;
radix_key_func_i16, cmp_func_i16, break;
)
cdef void c_radix_sort_u32(void *items, size_tcase item_size,RADIX_SORT_TYPE_U64:
size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
byte_offset c_radix_sort_byte(= 7;
items, item_size, start, end, start_offset,num_bytes = sizeof(uint64_t);
byte_offset, sizeof(uint32_t), True, False, key_func = radix_key_func_u64;
radix_key_func_u32, cmp_func_u32,
cmp_func = )
cmp_func_u64;
cdef void c_radix_sort_i32(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_tbreak;
byte_offset) nogil:
c_radix_sort_byte( case RADIX_SORT_TYPE_I64:
items, item_size, start, end, start_offset,byte_offset = 7;
byte_offset, num_bytes = sizeof(int32_tint64_t),;
True, False,
radix_key_func_i32, cmp_func_i32, key_func = radix_key_func_i64;
)
cdef void c_radix_sort_u64(void *items, size_t item_size, size_t start, size_tcmp_func end,= size_tcmp_func_i64;
start_offset, size_t byte_offset) nogil: break;
c_radix_sort_byte( case RADIX_SORT_TYPE_F32:
items, item_size, start, end, start_offset,byte_offset = 3;
byte_offset, num_bytes = sizeof(uint64_tfloat),;
True, False,
radix_key_func_u64, cmp_func_u64, key_func = radix_key_func_f32;
)
cdef void c_radix_sort_i64(void *items, size_t item_size, size_t start, size_tcmp_func end,= size_tcmp_func_f32;
start_offset, size_t byte_offset) nogil: break;
c_radix_sort_byte( case RADIX_SORT_TYPE_F64:
items, item_size, start, end, start_offset,byte_offset = 7;
byte_offset, num_bytes = sizeof(int64_tdouble), True, False,;
radix_key_func_i64, cmp_func_i64,
key_func )
= radix_key_func_f64;
cdef void c_radix_sort_f32(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_tcmp_func byte_offset)= nogil:cmp_func_f64;
c_radix_sort_byte(
items,break;
item_size, start, end, start_offset, case RADIX_SORT_TYPE_STR:
byte_offset, sizeof(float), True, False, byte_offset = 0;
radix_key_func_f32, cmp_func_f32,
num_bytes = sizeof(char *);
cdef void c_radix_sort_f64(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_tflip_byte_order byte_offset)= nogil:false;
c_radix_sort_byte(
items,use_null_term item_size,= start,true;
end, start_offset,
byte_offset, sizeof(double), True,key_func False,= radix_key_func_str;
radix_key_func_f64, cmp_func_f64,
cmp_func )
= cmp_func_str;
cdef void c_radix_sort_str(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_tbreak;
byte_offset) nogil: }
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(char *)num_bytes, Falseflip_byte_order, True,
radix_key_func_struse_null_term, cmp_func_strkey_func, cmp_func
);
}
cdefinline void c_radix_sort_byte(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset, size_t num_bytes, bint flip_byte_order, bint use_null_term, RadixKeyFuncCsize_t key_funcbyte_offset, BackupCmpFuncC cmp_func) nogil:
size_t num_bytes, bool flip_byte_order, cdef:
size_tbool i
use_null_term, RadixKeyFuncC key_func, CmpFuncC cmp_func)
{
void *item_ptr*item_ptr;
uint8_t itemitem;
size_t totaltotal;
size_t prefix_sumprefix_sum;
size_t shifted_sumshifted_sum;
size_t a, bb;
RadixPartitionTableC tabletable;
size_t countcount;
size_t table_starttable_start;
size_t table_endtable_end;
int8_t byte_order = -1 if flip_byte_order else? -1 : 1;
table_clear(&table);
for(size_t i in= range(start,start; endi < end; i++):
{
item_ptr = (<uint8_t(uint8_t *>items*)items) + (i * item_size);
item = key_func(item_ptr, byte_offset);
table.counts[item] += 11;
}
if(use_null_term:)
{
table.counts[0] = 00;
}
total = 00;
for(size_t i in= range(RADIX0; i < RADIX; i++):
{
total += table.counts[i]counts[i];
table.prefix_sums[i + 1] += totaltotal;
table.shifted_sums[i] = table.prefix_sums[i + 1]1];
}
for(size_t i in= range(start,start; endi < end; i++):
{
while(true)
True: {
item_ptr = (<uint8_t(uint8_t *>items*)items) + (i * item_size);
item = key_func(item_ptr, byte_offset);
prefix_sum = table.prefix_sums[item]prefix_sums[item];
shifted_sum = table.shifted_sums[item]shifted_sums[item];
if(prefix_sum == shifted_sum:)
{
break break;
}
a = ii;
b = start + prefix_sumprefix_sum;
item_swap(items, a, b, item_size);
table.prefix_sums[item] += 11;
}
}
if not (!flip_byte_order and&& byte_offset == num_bytes and not&& !use_null_term:)
{
return return;
elif}
else if(flip_byte_order and&& byte_offset == 0 and not&& !use_null_term:)
{
return return;
}
else:
{
total = 00;
for(size_t i in= range(RADIX0; i < RADIX; i++):
{
count = table.counts[i]counts[i];
table_start = start + totaltotal;
table_end = start + total + countcount;
total += countcount;
if(count >= RADIX_THRESHOLD:)
{
c_radix_sort_byte(items, item_size, table_start, table_end, start_offset, byte_offset + byte_order, num_bytes, flip_byte_order, use_null_term, key_func, cmp_func);
elif}
else if(count > 1:)
{
qsort(items + (item_size * table_start) + start_offset, table_end - table_start, item_size, cmp_func);
}
}
}
}
The benchmarking test I have is writting in cython and runs radix sorts on each of the supported types (with the char * test being on random alphanumeric strings ranging from 20-50 chars in length), on a custom vector-like container from n=2^1
to n=2^24 items
. I can add the benchmarking code if needed; in the meantime, the code for the game engine repo can be found here. The performance of this code is slower than I would like, ranging from 5x
faster in the simple uint8_t
case to about 0.9x
the speed of the radix sort in the worst cases where the sort code needs to be recursively called (the 64-bit types and the variable length strings). I have the following questions:
- The code swaps items in-place, leading to a bunch of
memcpy
calls when items need to be swapped. Would pointer swapping and sorting out of place provide a meaningful performance benefit? This would come at the cost of allocating additional heap memory, which I was striving to avoid based on that above lecture. - I have heard about "loop unrolling" being a possible technique to improve performance. How would I go about implementing this? Would SIMD be needed to do this? Any resources on this would be helpful.
- I would like to generalize this to a series of passes based on struct members to do more complex sorts based on these primitive types (e.g. sorting players by distance from a target, alphabetizing user names, etc.). I would appreciate any ideas for a convenient, user-friendly API that could handle arbitrary structs.
- The code
(削除) The code is essentially already C-code. Would rewriting in C and making a cython wrapper provide any meaningful performance improvement to help the compiler optimize further? The code is already being compiled with reasonable compiler arguments in gcc (This point is essentially already C-code. Would rewritingmoot now that the code has been rewritten in C for both convenience (macros to deduplicate code) and makingreviewability (cython is a cython wrapper provide any meaningful performance improvement to help the compiler optimize further?niche language). The code is alreadystill being compiled with reasonable compiler arguments in gcc (-std=c11
,-O3
,-ffast-math
,-march=native
). (削除ここまで)-std=c11
,-O3
,-ffast-math
,-march=native
)the same flags and the performance is essentially unchanged.
I am trying to optimize the following cython radix sort code I have been working on the past week to be used in my game engine for tasks such as sorting sprites by z-depth/distance from camera for rendering. The basis for this code was inspired by this youtube video on a C++ implementation called skasort
that used templating to sort STL containers. The current first pass code sorts in-place in most-significant-byte (MSB) order and is designed to handle [u]int[8/16/32/64]_t
types as well as float/f32
, double/f64
and ``char *` strings of variable lengths. I intend to use this code purely within cython and have removed any python object overhead.
The code is split into a header (.pxd) file as follows:
from pyorama.libs.c cimport *
cpdef enum RadixSortType:
RADIX_SORT_TYPE_U8
RADIX_SORT_TYPE_I8
RADIX_SORT_TYPE_U16
RADIX_SORT_TYPE_I16
RADIX_SORT_TYPE_U32
RADIX_SORT_TYPE_I32
RADIX_SORT_TYPE_U64
RADIX_SORT_TYPE_I64
RADIX_SORT_TYPE_F32
RADIX_SORT_TYPE_F64
RADIX_SORT_TYPE_STR
ctypedef cmp_func_t BackupCmpFuncC
cdef int cmp_func_u8(const void *a, const void *b) nogil
cdef int cmp_func_i8(const void *a, const void *b) nogil
cdef int cmp_func_u16(const void *a, const void *b) nogil
cdef int cmp_func_i16(const void *a, const void *b) nogil
cdef int cmp_func_u32(const void *a, const void *b) nogil
cdef int cmp_func_i32(const void *a, const void *b) nogil
cdef int cmp_func_u64(const void *a, const void *b) nogil
cdef int cmp_func_i64(const void *a, const void *b) nogil
cdef int cmp_func_f32(const void *a, const void *b) nogil
cdef int cmp_func_f64(const void *a, const void *b) nogil
cdef int cmp_func_str(const void *a, const void *b) nogil
ctypedef uint8_t (* RadixKeyFuncC)(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u8(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i8(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u16(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i16(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u32(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i32(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_u64(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_i64(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_f32(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_f64(void *item, size_t byte_offset) nogil
cdef uint8_t radix_key_func_str(void *item, size_t byte_offset) nogil
cdef void c_radix_sort(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, RadixSortType type_, RadixKeyFuncC key_func=*) nogil
And an implementation (.pyx) file as follows:
DEF RADIX = 256
DEF RADIX_THRESHOLD = 128
ctypedef struct RadixPartitionTableC:
size_t[RADIX] counts
size_t[RADIX + 1] prefix_sums
size_t[RADIX + 1] shifted_sums
cdef void table_clear(RadixPartitionTableC *table) nogil:
memset(table.counts, 0, sizeof(size_t) * RADIX)
memset(table.prefix_sums, 0, sizeof(size_t) * (RADIX + 1))
memset(table.shifted_sums, 0, sizeof(size_t) * (RADIX + 1))
cdef void item_swap(void *items, size_t a, size_t b, size_t size) nogil:
cdef:
size_t i
uint8_t *a_ptr = <uint8_t *>items + (a * size)
uint8_t *b_ptr = <uint8_t *>items + (b * size)
for i in range(size):
a_ptr[i], b_ptr[i] = b_ptr[i], a_ptr[i]
cdef int cmp_func_u8(const void *a, const void *b) nogil:
cdef:
uint8_t a_i = (<uint8_t *>a)[0]
uint8_t b_i = (<uint8_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_i8(const void *a, const void *b) nogil:
cdef:
int8_t a_i = (<int8_t *>a)[0]
int8_t b_i = (<int8_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_u16(const void *a, const void *b) nogil:
cdef:
uint16_t a_i = (<uint16_t *>a)[0]
uint16_t b_i = (<uint16_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_i16(const void *a, const void *b) nogil:
cdef:
int16_t a_i = (<int16_t *>a)[0]
int16_t b_i = (<int16_t *>b)[0]
ifa_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_u32(const void *a, const void *b) nogil:
cdef:
uint32_t a_i = (<uint32_t *>a)[0]
uint32_t b_i = (<uint32_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_i32(const void *a, const void *b) nogil:
cdef:
int32_t a_i = (<int32_t *>a)[0]
int32_t b_i = (<int32_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_u64(const void *a, const void *b) nogil:
cdef:
uint64_t a_i = (<uint64_t *>a)[0]
uint64_t b_i = (<uint64_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_i64(const void *a, const void *b) nogil:
cdef:
int64_t a_i = (<int64_t *>a)[0]
int64_t b_i = (<int64_t *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_f32(const void *a, const void *b) nogil:
cdef:
float a_i = (<float *>a)[0]
float b_i = (<float *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_f64(const void *a, const void *b) nogil:
cdef:
double a_i = (<double *>a)[0]
double b_i = (<double *>b)[0]
if a_i < b_i: return -1
elif a_i > b_i: return 1
else: return 0
cdef int cmp_func_str(const void *a, const void *b) nogil:
cdef:
char *a_i = (<char **>a)[0]
char *b_i = (<char **>b)[0]
return strcmp(a_i, b_i)
cdef uint8_t radix_key_func_u8(void *item, size_t byte_offset) nogil:
return (<uint8_t *>item + byte_offset)[0]
cdef uint8_t radix_key_func_i8(void *item, size_t byte_offset) nogil:
return (<uint8_t *>item + byte_offset)[0] + 128
cdef uint8_t radix_key_func_u16(void *item, size_t byte_offset) nogil:
return (<uint8_t *>item + byte_offset)[0]
cdef uint8_t radix_key_func_i16(void *item, size_t byte_offset) nogil:
cdef:
uint16_t value
value = (<uint16_t *>item)[0] + 32768
return (<uint8_t *>&value + byte_offset)[0]
cdef uint8_t radix_key_func_u32(void *item, size_t byte_offset) nogil:
return (<uint8_t *>item + byte_offset)[0]
cdef uint8_t radix_key_func_i32(void *item, size_t byte_offset) nogil:
cdef:
uint32_t value value = (<int32_t *>item)[0] + <int32_t>2147483648
return (<uint8_t *>&value + byte_offset)[0]
cdef uint8_t radix_key_func_u64(void *item, size_t byte_offset) nogil:
return (<uint8_t *>item + byte_offset)[0]
cdef uint8_t radix_key_func_i64(void *item, size_t byte_offset) nogil:
cdef:
uint64_t value value = (<int64_t *>item)[0] + <int64_t>9223372036854775808
return (<uint8_t *>&value + byte_offset)[0]
cdef uint8_t radix_key_func_f32(void *item, size_t byte_offset) nogil:
cdef:
uint32_t value
uint32_t mask
uint32_t shifted_value
value = (<uint32_t *>item)[0]
mask = -<int32_t>(value >> <uint32_t>31) | <uint32_t>2147483648
shifted_value = value ^ mask
return (<uint8_t *>(&shifted_value) + byte_offset)[0]
cdef uint8_t radix_key_func_f64(void *item, size_t byte_offset) nogil:
cdef:
uint64_t value
uint64_t mask
uint64_t shifted_value
value = (<uint64_t *>item)[0]
mask = -<int64_t>(value >> <uint64_t>63) | <uint64_t>9223372036854775808
shifted_value = value ^ mask
return (<uint8_t *>(&shifted_value) + byte_offset)[0]
cdef uint8_t radix_key_func_str(void *item, size_t byte_offset) nogil:
return ((<uint8_t **>item)[0] + byte_offset)[0]
cdef void c_radix_sort(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, RadixSortType type_, RadixKeyFuncC key_func=NULL) nogil:
cdef:
size_t i, j, k, m
uint8_t item
size_t count
size_t offset = 0
RadixPartitionTableC table
size_t[RADIX] counts
size_t part_start
size_t part_end
if type_ == RADIX_SORT_TYPE_U8:
c_radix_sort_u8(items, item_size, start, end, start_offset, 0)
elif type_ == RADIX_SORT_TYPE_I8:
c_radix_sort_i8(items, item_size, start, end, start_offset, 0)
elif type_ == RADIX_SORT_TYPE_U16:
c_radix_sort_u16(items, item_size, start, end, start_offset, 1)
elif type_ == RADIX_SORT_TYPE_I16:
c_radix_sort_i16(items, item_size, start, end, start_offset, 1)
elif type_ == RADIX_SORT_TYPE_U32:
c_radix_sort_u32(items, item_size, start, end, start_offset, 3)
elif type_ == RADIX_SORT_TYPE_I32:
c_radix_sort_i32(items, item_size, start, end, start_offset, 3)
elif type_ == RADIX_SORT_TYPE_U64:
c_radix_sort_u64(items, item_size, start, end, start_offset, 7)
elif type_ == RADIX_SORT_TYPE_I64:
c_radix_sort_i64(items, item_size, start, end, start_offset, 7)
elif type_ == RADIX_SORT_TYPE_F32:
c_radix_sort_f32(items, item_size, start, end, start_offset, 3)
elif type_ == RADIX_SORT_TYPE_F64:
c_radix_sort_f64(items, item_size, start, end, start_offset, 7)
elif type_ == RADIX_SORT_TYPE_STR:
c_radix_sort_str(items, item_size, start, end, start_offset, 0)
cdef void c_radix_sort_u8(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(uint8_t), True, False,
radix_key_func_u8, cmp_func_u8,
)
cdef void c_radix_sort_i8(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(int8_t), True, False,
radix_key_func_i8, cmp_func_i8,
)
cdef void c_radix_sort_u16(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(uint16_t), True, False,
radix_key_func_u16, cmp_func_u16,
)
cdef void c_radix_sort_i16(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(int16_t), True, False,
radix_key_func_i16, cmp_func_i16,
)
cdef void c_radix_sort_u32(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(uint32_t), True, False,
radix_key_func_u32, cmp_func_u32,
)
cdef void c_radix_sort_i32(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(int32_t), True, False,
radix_key_func_i32, cmp_func_i32,
)
cdef void c_radix_sort_u64(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(uint64_t), True, False,
radix_key_func_u64, cmp_func_u64,
)
cdef void c_radix_sort_i64(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(int64_t), True, False,
radix_key_func_i64, cmp_func_i64,
)
cdef void c_radix_sort_f32(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(float), True, False,
radix_key_func_f32, cmp_func_f32,
)
cdef void c_radix_sort_f64(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(double), True, False,
radix_key_func_f64, cmp_func_f64,
)
cdef void c_radix_sort_str(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset) nogil:
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, sizeof(char *), False, True,
radix_key_func_str, cmp_func_str,
)
cdef void c_radix_sort_byte(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset, size_t num_bytes, bint flip_byte_order, bint use_null_term, RadixKeyFuncC key_func, BackupCmpFuncC cmp_func) nogil:
cdef:
size_t i
void *item_ptr
uint8_t item
size_t total
size_t prefix_sum
size_t shifted_sum
size_t a, b
RadixPartitionTableC table
size_t count
size_t table_start
size_t table_end
int8_t byte_order = -1 if flip_byte_order else 1
table_clear(&table)
for i in range(start, end):
item_ptr = (<uint8_t *>items) + (i * item_size)
item = key_func(item_ptr, byte_offset)
table.counts[item] += 1
ifuse_null_term:
table.counts[0] = 0
total = 0
for i in range(RADIX):
total += table.counts[i]
table.prefix_sums[i + 1] += total
table.shifted_sums[i] = table.prefix_sums[i + 1]
for i in range(start, end):
while True:
item_ptr = (<uint8_t *>items) + (i * item_size)
item = key_func(item_ptr, byte_offset)
prefix_sum = table.prefix_sums[item]
shifted_sum = table.shifted_sums[item]
ifprefix_sum == shifted_sum:
break
a = i
b = start + prefix_sum
item_swap(items, a, b, item_size)
table.prefix_sums[item] += 1
if not flip_byte_order and byte_offset == num_bytes and not use_null_term:
return
elif flip_byte_order and byte_offset == 0 and not use_null_term:
return
else:
total = 0
for i in range(RADIX):
count = table.counts[i]
table_start = start + total
table_end = start + total + count
total += count
ifcount >= RADIX_THRESHOLD:
c_radix_sort_byte(items, item_size, table_start, table_end, start_offset, byte_offset + byte_order, num_bytes, flip_byte_order, use_null_term, key_func, cmp_func)
elif count > 1:
qsort(items + (item_size * table_start) + start_offset, table_end - table_start, item_size, cmp_func)
The benchmarking test I have runs radix sorts on each of the supported types (with the char * test being on random alphanumeric strings ranging from 20-50 chars in length), on a custom vector-like container from n=2^1
to n=2^24 items
. I can add the benchmarking code if needed; in the meantime, the code for the game engine repo can be found here. The performance of this code is slower than I would like, ranging from 5x
faster in the simple uint8_t
case to about 0.9x
the speed of the radix sort in the worst cases where the sort code needs to be recursively called (the 64-bit types and the variable length strings). I have the following questions:
- The code swaps items in-place, leading to a bunch of
memcpy
calls when items need to be swapped. Would pointer swapping and sorting out of place provide a meaningful performance benefit? This would come at the cost of allocating additional heap memory, which I was striving to avoid based on that above lecture. - I have heard about "loop unrolling" being a possible technique to improve performance. How would I go about implementing this? Would SIMD be needed to do this? Any resources on this would be helpful.
- I would like to generalize this to a series of passes based on struct members to do more complex sorts based on these primitive types (e.g. sorting players by distance from a target, alphabetizing user names, etc.). I would appreciate any ideas for a convenient, user-friendly API that could handle arbitrary structs.
- The code is essentially already C-code. Would rewriting in C and making a cython wrapper provide any meaningful performance improvement to help the compiler optimize further? The code is already being compiled with reasonable compiler arguments in gcc (
-std=c11
,-O3
,-ffast-math
,-march=native
).
I am trying to optimize the following (now C) radix sort code for use in my game engine library. The basis for this code was inspired by this youtube video on a C++ implementation called skasort
that used templating to sort STL containers. The current first pass code sorts in-place in most-significant-byte (MSB) order and is designed to handle [u]int[8/16/32/64]_t
types as well as float/f32
, double/f64
and ``char *` strings of variable lengths. Here is the raw C code (header-only) I wrote that my cython library is wrapping:
#include <stdint.h>
#include <stdbool.h>
#define RADIX 256
#define RADIX_THRESHOLD 128
#define CMP_CHECK(a, b, cmp_type) \
cmp_type a_i = *(cmp_type *)a; \
cmp_type b_i = *(cmp_type *)b; \
if(a_i < b_i) \
{ \
return -1; \
} \
else if(a_i > b_i) \
{ \
return 1; \
} \
else \
{ \
return 0; \
} \
#define CMP_FUNC(a, b, cmp_type, cmp_suffix) \
int cmp_func_##cmp_suffix(const void *a, const void *b) \
{ \
CMP_CHECK(a, b, cmp_type) \
} \
typedef enum RadixSortType
{
RADIX_SORT_TYPE_U8,
RADIX_SORT_TYPE_I8,
RADIX_SORT_TYPE_U16,
RADIX_SORT_TYPE_I16,
RADIX_SORT_TYPE_U32,
RADIX_SORT_TYPE_I32,
RADIX_SORT_TYPE_U64,
RADIX_SORT_TYPE_I64,
RADIX_SORT_TYPE_F32,
RADIX_SORT_TYPE_F64,
RADIX_SORT_TYPE_STR,
} RadixSortType;
typedef int (* CmpFuncC)(const void *, const void *);
typedef uint8_t (* RadixKeyFuncC)(void *item, size_t byte_offset);
typedef struct RadixPartitionTableC
{
size_t counts[RADIX];
size_t prefix_sums[RADIX + 1];
size_t shifted_sums[RADIX + 1];
} RadixPartitionTableC;
void table_clear(RadixPartitionTableC *table)
{
memset(table->counts, 0, sizeof(size_t) * RADIX);
memset(table->prefix_sums, 0, sizeof(size_t) * (RADIX + 1));
memset(table->shifted_sums, 0, sizeof(size_t) * (RADIX + 1));
}
void item_swap(void *items, size_t a, size_t b, size_t size)
{
uint8_t *a_ptr = (uint8_t *)items + (a * size);
uint8_t *b_ptr = (uint8_t *)items + (b * size);
uint8_t tmp;
for(size_t i = 0; i < size; i++)
{
tmp = a_ptr[i];
a_ptr[i] = b_ptr[i];
b_ptr[i] = tmp;
}
}
CMP_FUNC(a, b, uint8_t, u8)
CMP_FUNC(a, b, int8_t, i8)
CMP_FUNC(a, b, uint16_t, u16)
CMP_FUNC(a, b, int16_t, i16)
CMP_FUNC(a, b, uint32_t, u32)
CMP_FUNC(a, b, int32_t, i32)
CMP_FUNC(a, b, uint64_t, u64)
CMP_FUNC(a, b, int64_t, i64)
CMP_FUNC(a, b, float, f32)
CMP_FUNC(a, b, double, f64)
int cmp_func_str(const void *a, const void *b)
{
return strcmp(*(char **)a, *(char **)b);
}
uint8_t radix_key_func_u8(void *item, size_t byte_offset)
{
return ((uint8_t *)item + byte_offset)[0];
}
uint8_t radix_key_func_i8(void *item, size_t byte_offset)
{
return ((uint8_t *)item + byte_offset)[0] + 128;
}
uint8_t radix_key_func_u16(void *item, size_t byte_offset)
{
return ((uint8_t *)item + byte_offset)[0];
}
uint8_t radix_key_func_i16(void *item, size_t byte_offset)
{
uint16_t value = ((uint16_t *)item)[0] + 32768;
return ((uint8_t *)&value + byte_offset)[0];
}
uint8_t radix_key_func_u32(void *item, size_t byte_offset)
{
return ((uint8_t *)item + byte_offset)[0];
}
uint8_t radix_key_func_i32(void *item, size_t byte_offset)
{
uint32_t value = ((int32_t *)item)[0] + (int32_t)2147483648;
return ((uint8_t *)&value + byte_offset)[0];
}
uint8_t radix_key_func_u64(void *item, size_t byte_offset)
{
return ((uint8_t *)item + byte_offset)[0];
}
uint8_t radix_key_func_i64(void *item, size_t byte_offset)
{
uint64_t value = ((int64_t *)item)[0] + (int64_t)9223372036854775808;
return ((uint8_t *)&value + byte_offset)[0];
}
uint8_t radix_key_func_f32(void *item, size_t byte_offset)
{
uint32_t value = ((uint32_t *)item)[0];
uint32_t mask = -(int32_t)(value >> (uint32_t)31) | (uint32_t)2147483648;
uint32_t shifted_value = value ^ mask;
return ((uint8_t *)(&shifted_value) + byte_offset)[0];
}
uint8_t radix_key_func_f64(void *item, size_t byte_offset)
{
uint64_t value = ((uint64_t *)item)[0];
uint64_t mask = -(int64_t)(value >> (uint64_t)63) | (uint64_t)9223372036854775808;
uint64_t shifted_value = value ^ mask;
return ((uint8_t *)(&shifted_value) + byte_offset)[0];
}
uint8_t radix_key_func_str(void *item, size_t byte_offset)
{
return (((uint8_t **)item)[0] + byte_offset)[0];
}
void c_radix_sort(void *items, size_t item_size,
size_t start, size_t end, size_t start_offset,
RadixSortType type_, RadixKeyFuncC key_func)
{
size_t byte_offset;
size_t num_bytes;
bool flip_byte_order = true;
bool use_null_term = false;
CmpFuncC cmp_func;
switch(type_)
{
case RADIX_SORT_TYPE_U8:
byte_offset = 0;
num_bytes = sizeof(uint8_t);
key_func = radix_key_func_u8;
cmp_func = cmp_func_u8;
break;
case RADIX_SORT_TYPE_I8:
byte_offset = 0;
num_bytes = sizeof(int8_t);
key_func = radix_key_func_i8;
cmp_func = cmp_func_i8;
break;
case RADIX_SORT_TYPE_U16:
byte_offset = 1;
num_bytes = sizeof(uint16_t);
key_func = radix_key_func_u16;
cmp_func = cmp_func_u16;
break;
case RADIX_SORT_TYPE_I16:
byte_offset = 1;
num_bytes = sizeof(int16_t);
key_func = radix_key_func_i16;
cmp_func = cmp_func_i16;
break;
case RADIX_SORT_TYPE_U32:
byte_offset = 3;
num_bytes = sizeof(uint32_t);
key_func = radix_key_func_u32;
cmp_func = cmp_func_u32;
break;
case RADIX_SORT_TYPE_I32:
byte_offset = 3;
num_bytes = sizeof(int32_t);
key_func = radix_key_func_i32;
cmp_func = cmp_func_i32;
break;
case RADIX_SORT_TYPE_U64:
byte_offset = 7;
num_bytes = sizeof(uint64_t);
key_func = radix_key_func_u64;
cmp_func = cmp_func_u64;
break;
case RADIX_SORT_TYPE_I64:
byte_offset = 7;
num_bytes = sizeof(int64_t);
key_func = radix_key_func_i64;
cmp_func = cmp_func_i64;
break;
case RADIX_SORT_TYPE_F32:
byte_offset = 3;
num_bytes = sizeof(float);
key_func = radix_key_func_f32;
cmp_func = cmp_func_f32;
break;
case RADIX_SORT_TYPE_F64:
byte_offset = 7;
num_bytes = sizeof(double);
key_func = radix_key_func_f64;
cmp_func = cmp_func_f64;
break;
case RADIX_SORT_TYPE_STR:
byte_offset = 0;
num_bytes = sizeof(char *);
flip_byte_order = false;
use_null_term = true;
key_func = radix_key_func_str;
cmp_func = cmp_func_str;
break;
}
c_radix_sort_byte(
items, item_size, start, end, start_offset,
byte_offset, num_bytes, flip_byte_order,
use_null_term, key_func, cmp_func
);
}
inline void c_radix_sort_byte(void *items, size_t item_size, size_t start, size_t end, size_t start_offset, size_t byte_offset, size_t num_bytes, bool flip_byte_order,
bool use_null_term, RadixKeyFuncC key_func, CmpFuncC cmp_func)
{
void *item_ptr;
uint8_t item;
size_t total;
size_t prefix_sum;
size_t shifted_sum;
size_t a, b;
RadixPartitionTableC table;
size_t count;
size_t table_start;
size_t table_end;
int8_t byte_order = flip_byte_order ? -1 : 1;
table_clear(&table);
for(size_t i = start; i < end; i++)
{
item_ptr = ((uint8_t *)items) + (i * item_size);
item = key_func(item_ptr, byte_offset);
table.counts[item] += 1;
}
if(use_null_term)
{
table.counts[0] = 0;
}
total = 0;
for(size_t i = 0; i < RADIX; i++)
{
total += table.counts[i];
table.prefix_sums[i + 1] += total;
table.shifted_sums[i] = table.prefix_sums[i + 1];
}
for(size_t i = start; i < end; i++)
{
while(true)
{
item_ptr = ((uint8_t *)items) + (i * item_size);
item = key_func(item_ptr, byte_offset);
prefix_sum = table.prefix_sums[item];
shifted_sum = table.shifted_sums[item];
if(prefix_sum == shifted_sum)
{
break;
}
a = i;
b = start + prefix_sum;
item_swap(items, a, b, item_size);
table.prefix_sums[item] += 1;
}
}
if(!flip_byte_order && byte_offset == num_bytes && !use_null_term)
{
return;
}
else if(flip_byte_order && byte_offset == 0 && !use_null_term)
{
return;
}
else
{
total = 0;
for(size_t i = 0; i < RADIX; i++)
{
count = table.counts[i];
table_start = start + total;
table_end = start + total + count;
total += count;
if(count >= RADIX_THRESHOLD)
{
c_radix_sort_byte(items, item_size, table_start, table_end, start_offset, byte_offset + byte_order, num_bytes, flip_byte_order, use_null_term, key_func, cmp_func);
}
else if(count > 1)
{
qsort(items + (item_size * table_start) + start_offset, table_end - table_start, item_size, cmp_func);
}
}
}
}
The benchmarking test I have is writting in cython and runs radix sorts on each of the supported types (with the char * test being on random alphanumeric strings ranging from 20-50 chars in length), on a custom vector-like container from n=2^1
to n=2^24 items
. I can add the benchmarking code if needed; in the meantime, the code for the game engine repo can be found here. The performance of this code is slower than I would like, ranging from 5x
faster in the simple uint8_t
case to about 0.9x
the speed of the radix sort in the worst cases where the sort code needs to be recursively called (the 64-bit types and the variable length strings). I have the following questions:
- The code swaps items in-place, leading to a bunch of
memcpy
calls when items need to be swapped. Would pointer swapping and sorting out of place provide a meaningful performance benefit? This would come at the cost of allocating additional heap memory, which I was striving to avoid based on that above lecture. - I have heard about "loop unrolling" being a possible technique to improve performance. How would I go about implementing this? Would SIMD be needed to do this? Any resources on this would be helpful.
- I would like to generalize this to a series of passes based on struct members to do more complex sorts based on these primitive types (e.g. sorting players by distance from a target, alphabetizing user names, etc.). I would appreciate any ideas for a convenient, user-friendly API that could handle arbitrary structs.
(削除) The code is essentially already C-code. Would rewriting in C and making a cython wrapper provide any meaningful performance improvement to help the compiler optimize further? The code is already being compiled with reasonable compiler arguments in gcc (This point is moot now that the code has been rewritten in C for both convenience (macros to deduplicate code) and reviewability (cython is a niche language). The code is still being compiled with the same flags and the performance is essentially unchanged.-std=c11
,-O3
,-ffast-math
,-march=native
). (削除ここまで)