Commonmark migration

edited Jun 10, 2020 at 13:24

Not very fast

Your memcpy() implementation is not really better than a standard byte-by-byte copy. Even though you attempt to copy more bytes at a time, the limiting factor isn't actually the number of bytes you copy per instruction.

If you research the various memcpy() implementations for x86 targets, you will find a wealth of information on how to get faster copies. I think the simplest thing for you to do is to use the plain "rep movsb" implementation.

My own benchmarks

I ran your version against the following two versions. One is a straightforward byte-by-byte copy, and the other just uses "rep movsb", which is highly optimized on modern processors:

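#include <stddef.h>

/* Naive byte-by-byte copy. */
void *memcpy2(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    size_t i;

    for (i = 0; i < n; i++)
        d[i] = s[i];
    return dst;    /* like memcpy, return the original destination */
}

/* "rep movsb" copy (GCC/Clang inline asm, x86 only). */
void *memcpy3(void *dst, const void *src, size_t n)
{
    void *ret = dst;

    /* rep movsb advances the D and S registers and counts the C register
       down to zero, so all three are read-write operands. */
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(n) : : "memory");
    return ret;
}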

My results for copying 2 GB (32-bit host):

OP's function   : 3.74 sec
memcpy2 (naive) : 3.74 sec
memcpy3 (movsb) : 2.96 sec
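
The harness behind these numbers isn't shown here. As a rough sketch of how such a comparison could be timed, assuming standard C `clock()` and a buffer copied repeatedly, something like the following would do. The names `copy_fn`, `copy_bytewise` and `time_copy` are made up for illustration, and `copy_bytewise` is only a stand-in for whichever routine you want to measure:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy routine under test (hypothetical name). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-in for a naive byte-by-byte copy. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/*
 * Copies `size` bytes `reps` times with `fn` and returns the elapsed CPU
 * time in seconds, or -1.0 if allocation fails or the result is wrong.
 */
static double time_copy(copy_fn fn, size_t size, int reps)
{
    assert(fn != NULL);

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    double secs = -1.0;

    if (src && dst) {
        memset(src, 0xA5, size);    /* touch the pages before timing */
        memset(dst, 0, size);

        clock_t start = clock();
        for (int i = 0; i < reps; i++)
            fn(dst, src, size);
        clock_t end = clock();

        /* Only report a time if the copy actually produced the right bytes. */
        if (memcmp(dst, src, size) == 0)
            secs = (double)(end - start) / CLOCKS_PER_SEC;
    }
    free(src);
    free(dst);
    return secs;
}
```

Calling `time_copy(copy_bytewise, 64u << 20, 32)` would time 32 copies of a 64 MB buffer, 2 GB in total, and passing the library `memcpy` gives a baseline. Treat the numbers with care: `clock()` is coarse, and an optimizing compiler may rewrite or fold the copy loop, so it is worth checking the generated code before drawing conclusions.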

Fixed assembly macro to be more correct.

edited Sep 21, 2015 at 17:21

JS1


answered Sep 21, 2015 at 2:10

JS1

