memcpy() implementation

Question 1

I have tried to write a function like memcpy. It copies sizeof(long) bytes at a time.

What surprised me is how inefficient it is. It's just 17% more efficient than the naivest implementation with -O3. With optimizations off it's a lot faster than the naivest, so perhaps the compiler is doing this automatically?

It will copy 1 byte at a time until one of the addresses is aligned, then it will copy sizeof(long) and then 1 byte at a time to avoid writing beyond the bounds.

How can I make this faster?

#include <stdlib.h>
#define THRESHOLD sizeof(long)
static size_t min(size_t a, size_t b)
{
 return (a > b) ? a : b;
}
static void big_copy(void *dest, const void *src, size_t iterations)
{
 long *d = dest;
 const long *s = src;
 size_t eight = iterations / 8;
 size_t single = iterations % 8;
 while(eight > 0){
 *d++ = *s++;
 *d++ = *s++;
 *d++ = *s++;
 *d++ = *s++;
 *d++ = *s++;
 *d++ = *s++;
 *d++ = *s++;
 *d++ = *s++;
 --eight;
 }
 while(single > 0){
 *d++ = *s++;
 --single;
 }
}
static void small_copy(void *dest, const void *src, size_t iterations)
{
 char *d = dest;
 const char *s = src;
 while(iterations > 0){
 *d++ = *s++;
 --iterations;
 }
}
void *copy_memory(void *dest, const void *src, size_t size)
{
 //Small size is handled here
 if(size < THRESHOLD){
 small_copy(dest, src, size);
 return dest;
 }
 //Start copying 8 bytes as soon as one of the pointers is aligned
 size_t bytes_to_align = min((size_t)dest % sizeof(long), (size_t)src % sizeof(long));
 void *position = dest;
 //Align
 if(bytes_to_align > 0){
 small_copy(position, src, bytes_to_align);
 position = (char *)position + bytes_to_align;
 src = (char *)src + bytes_to_align;
 size -= bytes_to_align;
 }
 //How many iterations can be done
 size_t safe_big_iterations = size / sizeof(long);
 size_t remaining_bytes = size % sizeof(long);
 //Copy most bytes here
 big_copy(position, src, safe_big_iterations);
 position = (char *)position + safe_big_iterations * sizeof(long);
 src = (char *)src + safe_big_iterations * sizeof(long);
 //Process the remaining bytes
 small_copy(position, src, remaining_bytes);
 return dest;
}

Question 2

I think it should be noted that in the Internet, this very code has been presented as an example of a not obvious UB. See blog.regehr.org/archives/1307 , especially section Chunking Optimizations Are Broken. I’m a layman, but the concern seems to me to be either well-founded.

Question 3

The last time I saw source for a C run-time-library implementation of memcpy (Microsoft's compiler in the 1990s), it used the algorithm you describe: but it was written in assembly. It might (my memory is uncertain) have used rep movsd in the inner loop.

Your code says, //Start copying 8 bytes as soon as one of the pointers is aligned. When you're performance-testing you should know (because that's when you might expect the best performance) whether both buffers are aligned.

On the subject of alignment there as an interesting (but unrelated to your question) question here on StackOverflow: Why speed of memcpy() drops dramatically every 4KB?

I vaguely understand what kind of an effect you're looking for in your code. I don't know what assembler your compiler is actually producing.

The accepted answer to this StackOverflow question demonstrates the kind of assembly that is used nowadays: Very fast memcpy for image processing?

ChrisW ChrisW 13k1 gold badge35 silver badges76 bronze badges · Accepted Answer · 2014-02-06 19:13:13Z

The last time I saw source for a C run-time-library implementation of memcpy (Microsoft's compiler in the 1990s), it used the algorithm you describe: but it was written in assembly. It might (my memory is uncertain) have used rep movsd in the inner loop.

Your code says, //Start copying 8 bytes as soon as one of the pointers is aligned. When you're performance-testing you should know (because that's when you might expect the best performance) whether both buffers are aligned.

On the subject of alignment there as an interesting (but unrelated to your question) question here on StackOverflow: Why speed of memcpy() drops dramatically every 4KB?

I vaguely understand what kind of an effect you're looking for in your code. I don't know what assembler your compiler is actually producing.

The accepted answer to this StackOverflow question demonstrates the kind of assembly that is used nowadays: Very fast memcpy for image processing?

Stack Exchange Network

memcpy() implementation

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

memcpy() implementation

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions