I have written an implementation of memcpy
function and I want to know how good is it and if it gets the most out of the processor capabilities or not.
The reason for this is because I am writing my operating system (which will be of course reviewed on parts here on Code Review) it is supposed to work efficiently on both x86_64
and x86
machines
#include <stddef.h>
#include <stdint.h>
void *memcpy(void *dest, const void *src,size_t n)
{
if (n==0)
return dest;
#if defined(__x86_64__) || defined(_M_X64)
size_t i=0;
if(n>=8)
while(i<n/8*8){
*(uint64_t*)(dest + i) = *(uint64_t*)(src+i);
i+=8;
}
if(n-i>=4){
*(uint32_t*)(dest + i) = *(uint32_t*)(src+i);
i+=4;
}
if(n-i>=2){
*(uint16_t*)(dest+i) = *(uint16_t*)(src+i);
i+=2;
}
if(n-i>=1)
*(uint8_t*)(dest+i) = *(uint8_t*)(src+i);
#elif defined(__i386) || defined(_M_IX86)
size_t i=0;
if(n>=4)
while(i<n/4*4){
*(uint32_t*)(dest + i) = *(uint32_t*)(src+i);
i+=4;
}
if(n-i>=2){
*(uint16_t*)(dest+i) = *(uint16_t*)(src+i);
i+=2;
}
if(n-i>=1)
*(uint8_t*)(dest+i) = *(uint8_t*)(src+i);
#endif
return dest;
}
I made some aggressive testing with memcmp()
to check for correct data transmission and valgrind to check for memory leaks and It passed all the tests. I didn't post the testing code because I think it could be useless since I don't want it to be reviewed.
1 Answer 1
Not very fast
Your memcpy()
implementation is not really better than a standard byte by byte copy. Even though you attempt to copy more bytes at a time, the limiting factor isn't actually the number of bytes you copy per instruction.
If you research the various memcpy()
implementations there are for x86 targets, you will find a wealth of information about how to get faster speeds. I think the simplest thing for you to do is to just use the simple "rep movsb" implementation.
My own benchmarks
I ran your version against the following two versions. One is a straightforward byte by byte copy, and the other is just using "rep movsb", which on modern processors is highly optimized:
void *memcpy2(void *dst, const void *src,size_t n)
{
size_t i;
for (i=0;i<n;i++)
*(char *) dst++ = *(char *) src++;
return dst;
}
void *memcpy3(void *dst, const void *src, size_t n)
{
void *ret = dst;
asm volatile("rep movsb" : "+D" (dst) : "c"(n), "S"(src) : "cc", "memory");
return ret;
}
My results for copying 2GB (32-bit host):
OP's function : 3.74 sec
memcpy2 (naive): 3.74 sec
memcpy3 (movsb): 2.96 sec
Explore related questions
See similar questions with these tags.
memmove
to me.src + n
remains constant. Please double check if this is indeed the right code. \$\endgroup\$src + n
should besrc + i
on every line except the last. This code is broken. Did you test it? \$\endgroup\$n
is 1? You will write todst[-4]
at some point. \$\endgroup\$n > 1024
. Did you write some basic unit tests and a benchmark againstmemcpy
fromstring.h
? \$\endgroup\$