#Not very fast#
Your memcpy() implementation is not really better than a standard byte-by-byte copy. Even though you attempt to copy more bytes at a time, the limiting factor isn't actually the number of bytes you copy per instruction.
If you research the various memcpy() implementations that exist for x86 targets, you will find a wealth of information about how to get faster speeds. I think the simplest thing for you to do is to just use the plain "rep movsb" implementation.
#My own benchmarks#
I ran your version against the following two versions. One is a straightforward byte-by-byte copy, and the other just uses "rep movsb", which is highly optimized on modern processors:
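
    /* Naive byte-by-byte copy. */
    void *memcpy2(void *dst, const void *src, size_t n)
    {
        char *d = dst;
        const char *s = src;
        size_t i;

        for (i = 0; i < n; i++)
            *d++ = *s++;
        return dst;    /* memcpy() returns the original destination */
    }

    /* Copy with "rep movsb" (GCC extended asm, x86 only). */
    void *memcpy3(void *dst, const void *src, size_t n)
    {
        void *ret = dst;

        /* rep movsb changes the destination, source, and count registers,
           so all three are read-write operands. */
        asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(n) : : "cc", "memory");
        return ret;
    }
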
My results for copying 2 GB (32-bit host):

    OP's function  : 3.74 sec
    memcpy2 (naive): 3.74 sec
    memcpy3 (movsb): 2.96 sec
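
For anyone who wants to repeat this kind of comparison, here is a rough sketch of a timing harness. It is not the exact program I used for the numbers above: the names `copy_fn`, `copy_bytes`, and `time_copy` are made up for this example, and the choice of `clock()` as the timer and of copying one block repeatedly are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy routine under test (hypothetical name). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Plain byte-by-byte copy, used here as a baseline candidate. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/*
 * Copy `block` bytes `reps` times with `fn` and return the elapsed CPU
 * time in seconds, or -1.0 if allocation fails or the copy is wrong.
 */
static double time_copy(copy_fn fn, size_t block, unsigned reps)
{
    unsigned char *src;
    unsigned char *dst;
    double secs = -1.0;

    assert(fn != NULL);
    src = malloc(block);
    dst = malloc(block);

    if (src != NULL && dst != NULL) {
        clock_t start, end;
        size_t i;
        unsigned r;

        for (i = 0; i < block; i++)
            src[i] = (unsigned char)(i * 31u + 7u);
        memset(dst, 0, block);    /* touch the pages before timing */

        start = clock();
        for (r = 0; r < reps; r++)
            fn(dst, src, block);
        end = clock();

        /* Verify the last copy; this also gives the result a use. */
        if (memcmp(dst, src, block) == 0)
            secs = (double)(end - start) / CLOCKS_PER_SEC;
    }
    free(src);
    free(dst);
    return secs;
}
```

A call such as `time_copy(memcpy3, (size_t)1 << 20, 2048)` would move 2 GB in total as 2048 passes over a 1 MiB block. Keep in mind that a block that small may stay in cache, so the numbers will not necessarily match a single pass over a 2 GB buffer.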