I have tried to write a function like memcpy
. It copies sizeof(long)
bytes at a time.
What surprised me is how inefficient it is. It's just 17% more efficient than the naivest implementation with -O3
. With optimizations off it's a lot faster than the naivest, so perhaps the compiler is doing this automatically?
It will copy 1 byte at a time until one of the addresses is aligned, then it will copy sizeof(long)
and then 1 byte at a time to avoid writing beyond the bounds.
How can I make this faster?
#include <stdlib.h>
#define THRESHOLD sizeof(long)
static size_t min(size_t a, size_t b)
{
return (a > b) ? a : b;
}
static void big_copy(void *dest, const void *src, size_t iterations)
{
long *d = dest;
const long *s = src;
size_t eight = iterations / 8;
size_t single = iterations % 8;
while(eight > 0){
*d++ = *s++;
*d++ = *s++;
*d++ = *s++;
*d++ = *s++;
*d++ = *s++;
*d++ = *s++;
*d++ = *s++;
*d++ = *s++;
--eight;
}
while(single > 0){
*d++ = *s++;
--single;
}
}
static void small_copy(void *dest, const void *src, size_t iterations)
{
char *d = dest;
const char *s = src;
while(iterations > 0){
*d++ = *s++;
--iterations;
}
}
void *copy_memory(void *dest, const void *src, size_t size)
{
//Small size is handled here
if(size < THRESHOLD){
small_copy(dest, src, size);
return dest;
}
//Start copying 8 bytes as soon as one of the pointers is aligned
size_t bytes_to_align = min((size_t)dest % sizeof(long), (size_t)src % sizeof(long));
void *position = dest;
//Align
if(bytes_to_align > 0){
small_copy(position, src, bytes_to_align);
position = (char *)position + bytes_to_align;
src = (char *)src + bytes_to_align;
size -= bytes_to_align;
}
//How many iterations can be done
size_t safe_big_iterations = size / sizeof(long);
size_t remaining_bytes = size % sizeof(long);
//Copy most bytes here
big_copy(position, src, safe_big_iterations);
position = (char *)position + safe_big_iterations * sizeof(long);
src = (char *)src + safe_big_iterations * sizeof(long);
//Process the remaining bytes
small_copy(position, src, remaining_bytes);
return dest;
}
-
\$\begingroup\$ I think it should be noted that in the Internet, this very code has been presented as an example of a not obvious UB. See blog.regehr.org/archives/1307 , especially section Chunking Optimizations Are Broken. I’m a layman, but the concern seems to me to be either well-founded. \$\endgroup\$gaazkam– gaazkam2017年05月23日 21:54:55 +00:00Commented May 23, 2017 at 21:54
1 Answer 1
The last time I saw source for a C run-time-library implementation of memcpy
(Microsoft's compiler in the 1990s), it used the algorithm you describe: but it was written in assembly. It might (my memory is uncertain) have used rep movsd
in the inner loop.
Your code says, //Start copying 8 bytes as soon as one of the pointers is aligned
. When you're performance-testing you should know (because that's when you might expect the best performance) whether both buffers are aligned.
On the subject of alignment there as an interesting (but unrelated to your question) question here on StackOverflow: Why speed of memcpy() drops dramatically every 4KB?
I vaguely understand what kind of an effect you're looking for in your code. I don't know what assembler your compiler is actually producing.
The accepted answer to this StackOverflow question demonstrates the kind of assembly that is used nowadays: Very fast memcpy for image processing?
Explore related questions
See similar questions with these tags.