Re-implementing memcpy

Question 1

I have written an implementation of the memcpy function, and I want to know how good it is and whether it gets the most out of the processor's capabilities.

The reason is that I am writing my own operating system (which will of course be reviewed in parts here on Code Review), and it is supposed to work efficiently on both x86_64 and x86 machines.

#include <stddef.h>
#include <stdint.h>

void *memcpy(void *dest, const void *src, size_t n)
{
    if (n == 0)
        return dest;
#if defined(__x86_64__) || defined(_M_X64)
    size_t i = 0;
    if (n >= 8)
        while (i < n / 8 * 8) {
            *(uint64_t *)(dest + i) = *(uint64_t *)(src + i);
            i += 8;
        }
    if (n - i >= 4) {
        *(uint32_t *)(dest + i) = *(uint32_t *)(src + i);
        i += 4;
    }
    if (n - i >= 2) {
        *(uint16_t *)(dest + i) = *(uint16_t *)(src + i);
        i += 2;
    }
    if (n - i >= 1)
        *(uint8_t *)(dest + i) = *(uint8_t *)(src + i);
#elif defined(__i386) || defined(_M_IX86)
    size_t i = 0;
    if (n >= 4)
        while (i < n / 4 * 4) {
            *(uint32_t *)(dest + i) = *(uint32_t *)(src + i);
            i += 4;
        }
    if (n - i >= 2) {
        *(uint16_t *)(dest + i) = *(uint16_t *)(src + i);
        i += 2;
    }
    if (n - i >= 1)
        *(uint8_t *)(dest + i) = *(uint8_t *)(src + i);
#endif
    return dest;
}

I did some aggressive testing with memcmp() to check that the data is copied correctly, and ran valgrind to check for memory errors, and it passed all the tests. I didn't post the testing code because I don't want it to be reviewed.
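For illustration only, a check of the kind described above might look like the sketch below. It is not the actual test code: the names `copy_fn` and `check_copy`, the 16 guard bytes, and the offset range 0..7 are all assumptions made for this example. It runs the function under test for every length up to `max_len` and every combination of source and destination offset, then compares the whole destination buffer (guard bytes included) against the result of the library memcpy().

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names for this sketch; not from the original post. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t n);

/* Returns 1 if `copy` matches the library memcpy for every length in
 * [0, max_len] and every src/dest offset in [0, 8), otherwise 0. */
int check_copy(copy_fn copy, size_t max_len)
{
    enum { PAD = 16 };                  /* guard bytes on each side */
    size_t cap = max_len + 2 * PAD;
    unsigned char *src = malloc(cap);
    unsigned char *dst = malloc(cap);
    unsigned char *ref = malloc(cap);
    int ok = (src && dst && ref);

    /* Fill the source with a non-repeating-looking pattern. */
    for (size_t i = 0; ok && i < cap; i++)
        src[i] = (unsigned char)(i * 31u + 7u);

    for (size_t n = 0; ok && n <= max_len; n++) {
        for (size_t s = 0; ok && s < 8; s++) {
            for (size_t d = 0; ok && d < 8; d++) {
                memset(dst, 0xAA, cap);
                memset(ref, 0xAA, cap);
                memcpy(ref + PAD + d, src + PAD + s, n);   /* expected */
                void *r = copy(dst + PAD + d, src + PAD + s, n);
                /* Comparing the full buffers also catches writes that
                 * land outside the requested range. */
                ok = (r == (void *)(dst + PAD + d)) &&
                     memcmp(dst, ref, cap) == 0;
            }
        }
    }
    free(src);
    free(dst);
    free(ref);
    return ok;
}
```

Since the function above is itself named memcpy, it would have to be built under a different name (or in a separate translation unit) before being passed in, for example `check_copy(my_memcpy, 256)` with a hypothetical `my_memcpy`.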

Question 2

Doesn't look like memmove to me. src + n remains constant. Please double check if this is indeed the right code.

Question 3

Actually, if I call it like this: `memcpy(a,b,c)`, then c is passed by value, so it never changes. src never changes either, since it is also passed by value. I am sure this is the right code.

Question 4

src + n should be src + i on every line except the last. This code is broken. Did you test it?

Question 5

It still has bugs. It looks to me like your index can go negative. For example, what happens when n is 1? You will write to dst[-4] at some point.

Question 6

Also, on i386, your index will overflow for n > 1024. Did you write some basic unit tests and a benchmark against memcpy from string.h?

Question 7

Not very fast

Your memcpy() implementation is not really better than a standard byte by byte copy. Even though you attempt to copy more bytes at a time, the limiting factor isn't actually the number of bytes you copy per instruction.

If you research the various memcpy() implementations for x86 targets, you will find a wealth of information on how to get faster speeds. I think the simplest thing for you to do is to use the "rep movsb" implementation.

My own benchmarks

I ran your version against the following two versions. One is a straightforward byte by byte copy, and the other is just using "rep movsb", which on modern processors is highly optimized:

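void *memcpy2(void *dst, const void *src, size_t n)
{
    /* Byte-by-byte copy through char pointers (no arithmetic on void *). */
    char *d = dst;
    const char *s = src;
    size_t i;
    for (i = 0; i < n; i++)
        *d++ = *s++;
    return dst;    /* memcpy returns the original destination pointer */
}
void *memcpy3(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    /* rep movsb modifies the destination, source and count registers,
       so all three are declared as read-write operands. */
    asm volatile("rep movsb" : "+D" (dst), "+S" (src), "+c" (n) : : "cc", "memory");
    return ret;
}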

My results for copying 2GB (32-bit host):

OP's function : 3.74 sec
memcpy2 (naive): 3.74 sec
memcpy3 (movsb): 2.96 sec
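The timing harness itself is not shown above. A minimal sketch of how such a comparison could be set up follows; the names `copy_fn` and `time_copy`, the use of clock(), and the repeated-buffer approach are assumptions for this example, not the code that produced the numbers above. The empty asm statement is a GCC/Clang compiler barrier that keeps the copies from being optimized away.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical names for this sketch. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t n);

/* Copies a buf_size-byte buffer `reps` times with `copy` and returns the
 * elapsed CPU time in seconds, or -1.0 if allocation fails. */
double time_copy(copy_fn copy, size_t buf_size, unsigned reps)
{
    unsigned char *src = malloc(buf_size);
    unsigned char *dst = malloc(buf_size);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0x5A, buf_size);
    memset(dst, 0, buf_size);

    clock_t start = clock();
    for (unsigned r = 0; r < reps; r++) {
        copy(dst, src, buf_size);
        /* Compiler barrier (GCC/Clang syntax): forces each copy to happen. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Under these assumptions, a total of 2 GB could be reached with, say, a 64 MiB buffer and 32 repetitions: `time_copy(memcpy3, (size_t)64 << 20, 32)`. Running each candidate several times and taking the best result helps reduce noise.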

JS1 · Accepted Answer · 2015-09-21 02:10:39Z

