SIMD memcpy assembler implementation

Question 1

I am fairly rusty with assembler, let alone the AT&T syntax. I would appreciate it if someone with more experience could please review the following memcpy implementation. Note that this will only ever be called with aligned data.

inline static void alignedMemcpySSE(void *dst, const void * src, size_t length)
{
#if defined(__x86_64__) || defined(__i386__)
 size_t rem = (7 - ((length & 0x7F) >> 4)) * 10;
 void * end = dst + (length & ~0x7F);
 __asm__ __volatile__ (
 // save the registers we intend to alter, failure to do so causes problems
 // when gcc -O3 is used
 "push %[dst]\n\t"
 "push %[src]\n\t"
 "push %[end]\n\t"
 "cmp %[dst],%[end] \n\t"
 "je remain_%= \n\t"
 // perform 128 byte SIMD block copy
 "loop_%=: \n\t"
 "vmovaps 0x00(%[src]),%%xmm0 \n\t"
 "vmovaps 0x10(%[src]),%%xmm1 \n\t"
 "vmovaps 0x20(%[src]),%%xmm2 \n\t"
 "vmovaps 0x30(%[src]),%%xmm3 \n\t"
 "vmovaps 0x40(%[src]),%%xmm4 \n\t"
 "vmovaps 0x50(%[src]),%%xmm5 \n\t"
 "vmovaps 0x60(%[src]),%%xmm6 \n\t"
 "vmovaps 0x70(%[src]),%%xmm7 \n\t"
 "vmovntdq %%xmm0,0x00(%[dst]) \n\t"
 "vmovntdq %%xmm1,0x10(%[dst]) \n\t"
 "vmovntdq %%xmm2,0x20(%[dst]) \n\t"
 "vmovntdq %%xmm3,0x30(%[dst]) \n\t"
 "vmovntdq %%xmm4,0x40(%[dst]) \n\t"
 "vmovntdq %%xmm5,0x50(%[dst]) \n\t"
 "vmovntdq %%xmm6,0x60(%[dst]) \n\t"
 "vmovntdq %%xmm7,0x70(%[dst]) \n\t"
 "add $0x80,%[dst] \n\t"
 "add $0x80,%[src] \n\t"
 "cmp %[dst],%[end] \n\t"
 "jne loop_%= \n\t"
 "remain_%=: \n\t"
 // copy any remaining 16 byte blocks
#ifdef __x86_64__
 "leaq (%%rip), %[end]\n\t"
 "add $10,%[end] \n\t"
#else
 "call .+5 \n\t"
 "pop %[end] \n\t"
 "add $8,%[end] \n\t"
#endif
 "add %[rem],%[end] \n\t"
 "jmp *%[end] \n\t"
 // jump table
 "vmovaps 0x60(%[src]),%%xmm0 \n\t"
 "vmovntdq %%xmm0,0x60(%[dst]) \n\t"
 "vmovaps 0x50(%[src]),%%xmm1 \n\t"
 "vmovntdq %%xmm1,0x50(%[dst]) \n\t"
 "vmovaps 0x40(%[src]),%%xmm2 \n\t"
 "vmovntdq %%xmm2,0x40(%[dst]) \n\t"
 "vmovaps 0x30(%[src]),%%xmm3 \n\t"
 "vmovntdq %%xmm3,0x30(%[dst]) \n\t"
 "vmovaps 0x20(%[src]),%%xmm4 \n\t"
 "vmovntdq %%xmm4,0x20(%[dst]) \n\t"
 "vmovaps 0x10(%[src]),%%xmm5 \n\t"
 "vmovntdq %%xmm5,0x10(%[dst]) \n\t"
 "vmovaps 0x00(%[src]),%%xmm6 \n\t"
 "vmovntdq %%xmm6,0x00(%[dst]) \n\t"
 // alignment as the previous two instructions are only 4 bytes
 "nop\n\t"
 "nop\n\t"
 // restore the registers
 "pop %[end]\n\t"
 "pop %[src]\n\t"
 "pop %[dst]\n\t"
 :
 : [dst]"r" (dst),
 [src]"r" (src),
 [end]"c" (end),
 [rem]"d" (rem)
 : "xmm0",
 "xmm1",
 "xmm2",
 "xmm3",
 "xmm4",
 "xmm5",
 "xmm6",
 "xmm7",
 "memory"
 );
 //copy any remaining bytes
 for(size_t i = (length & 0xF); i; --i)
 ((uint8_t *)dst)[length - i] =
 ((uint8_t *)src)[length - i];
#else
 memcpy(dst, src, length);
#endif
}

Question 2

Wouldn’t YMM be faster? Or are 16-byte loads the fastest it gets internally?

Question 3

Through experimentation AVX is marginally faster, but I need this to operate on systems without AVX. SSE2 is the most modern extension I can target.

Question 4

Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.

David Wohlferd · Accepted Answer · 2018-05-17 21:50:09Z

"save the registers we intend to alter, failure to do so causes problems when gcc -O3 is used"

It should be possible to avoid the pushes and pops. If the asm changes the values of its operands, those operands cannot be declared as plain "inputs" (which is where this code currently has them). Quoting the GCC docs:

"Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement."

There are a variety of ways to deal with this. Simply moving the ones you actually change to the "outputs" section and changing them to "+r" might be sufficient. Note that this will indeed change the values, which might affect your "remaining bytes" code.
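As a rough illustration of the "+r" approach, here is a minimal sketch, not the code from the question. The helper name copy16_blocks and its block-count parameter are made up for this example, and it assumes GCC or Clang on an SSE2-capable x86 target with both pointers 16-byte aligned. Because the pointers and the counter are declared as read-write operands, the compiler knows they change and no push/pop is needed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the question): copies `blocks` 16-byte
 * blocks from src to dst. Both pointers are assumed 16-byte aligned. */
static void copy16_blocks(void *dst, const void *src, size_t blocks)
{
    assert(((uintptr_t)dst & 0xF) == 0 && ((uintptr_t)src & 0xF) == 0);
#if defined(__SSE2__)
    if (blocks == 0)
        return;
    __asm__ __volatile__ (
        "1:                        \n\t"
        "movdqa  (%[src]), %%xmm0  \n\t"  /* aligned 16-byte load          */
        "movntdq %%xmm0, (%[dst])  \n\t"  /* non-temporal 16-byte store    */
        "add     $16, %[src]       \n\t"  /* the asm modifies src ...      */
        "add     $16, %[dst]       \n\t"  /* ... dst ...                   */
        "dec     %[n]              \n\t"  /* ... and the counter           */
        "jnz     1b                \n\t"
        "sfence                    \n\t"  /* order the streaming stores    */
        /* "+r": read-write operands, so the compiler expects them to change */
        : [dst] "+r" (dst), [src] "+r" (src), [n] "+r" (blocks)
        :
        : "xmm0", "memory", "cc");
#else
    /* Portable fallback when SSE2 is not available at compile time. */
    memcpy(dst, src, blocks * 16);
#endif
}
```

Note that dst, src and blocks here are the function's own local copies, so the caller's values are untouched. In the question's function the same change would leave dst and src advanced past the copied blocks, which is why the trailing byte loop would need adjusting.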

[rem]"d" (rem)

Is there some reason you need this value in that specific register? Letting the optimizers pick the registers tends to be a better plan.

"let alone the AT&T syntax"

If you prefer the Intel syntax for writing asm (and what rational person doesn't?), gcc has the -masm=intel option.

And lastly, have you tried writing this in C using intrinsics? There are a couple of places in your code where I wonder if a different arrangement of instructions might give better results, but I can't say without trying it or running an analysis tool.

It's hard for humans to optimally arrange assembler instructions for modern processors.
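For comparison, an intrinsics version might look roughly like the sketch below. This is an assumption about how one could write it, not a tested drop-in replacement: the function name aligned_memcpy_sse2 is made up, and it relies on the question's guarantee that dst and src are 16-byte aligned. It uses only SSE2 intrinsics (_mm_load_si128, _mm_stream_si128, _mm_sfence) and leaves instruction scheduling and register allocation to the compiler:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name. Assumes dst and src are both 16-byte aligned. */
static void aligned_memcpy_sse2(void *dst, const void *src, size_t length)
{
    assert(((uintptr_t)dst & 0xF) == 0 && ((uintptr_t)src & 0xF) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = length / 16;          /* number of whole 16-byte blocks */

    /* Main loop: 128 bytes per iteration, all loads first, then all stores. */
    for (; blocks >= 8; blocks -= 8, s += 8, d += 8) {
        __m128i x[8];
        for (int i = 0; i < 8; ++i)
            x[i] = _mm_load_si128(s + i);
        for (int i = 0; i < 8; ++i)
            _mm_stream_si128(d + i, x[i]);
    }

    /* Remaining whole 16-byte blocks (at most seven). */
    for (; blocks > 0; --blocks, ++s, ++d)
        _mm_stream_si128(d, _mm_load_si128(s));

    /* Make the non-temporal stores globally visible in order. */
    _mm_sfence();

    /* Trailing bytes (fewer than 16); d and s now point at the tail. */
    memcpy(d, s, length & 0xF);
#else
    memcpy(dst, src, length);
#endif
}
```

Whether this beats the hand-written asm is something only a benchmark on the target machines can answer, but it does compile unchanged with GCC, Clang and MSVC-style toolchains that provide emmintrin.h, which may help when the same code has to run on both Windows and Linux.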

Thank you! Excellent information here. I bound rem to the rdx register to ensure its instruction length is predictable for the calculated relative jump. The reason for needing this is that I am diagnosing a problem with Windows VM vs Linux VM memory copy performance, and I need the exact same code running in both environments for testing.
I have updated the sample code after making the suggested changes, and it works great. I also managed to figure out how to calculate the offset for the jump instead of using hard-coded values, which allows the use of "r" instead of "d".
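A different way to avoid depending on instruction lengths at all is to let the compiler build the jump. The sketch below is not the updated code referred to above. It swaps the computed jump for a plain C switch with fallthrough, using SSE2 intrinsics. The name copy_remaining_blocks and its count parameter are made up for illustration, and it assumes both pointers are 16-byte aligned and point at the start of the leftover region after the 128-byte loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copies `count` (0..7) trailing 16-byte blocks. */
static void copy_remaining_blocks(void *dst, const void *src, unsigned count)
{
    assert(count <= 7);
    assert(((uintptr_t)dst & 0xF) == 0 && ((uintptr_t)src & 0xF) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    /* Enter at the case matching the block count and fall through to the
     * end, mirroring the descending 0x60..0x00 layout of the asm table. */
    switch (count) {
    case 7: _mm_stream_si128(d + 6, _mm_load_si128(s + 6)); /* fall through */
    case 6: _mm_stream_si128(d + 5, _mm_load_si128(s + 5)); /* fall through */
    case 5: _mm_stream_si128(d + 4, _mm_load_si128(s + 4)); /* fall through */
    case 4: _mm_stream_si128(d + 3, _mm_load_si128(s + 3)); /* fall through */
    case 3: _mm_stream_si128(d + 2, _mm_load_si128(s + 2)); /* fall through */
    case 2: _mm_stream_si128(d + 1, _mm_load_si128(s + 1)); /* fall through */
    case 1: _mm_stream_si128(d + 0, _mm_load_si128(s + 0)); /* fall through */
    default: break;
    }
    _mm_sfence();
#else
    memcpy(dst, src, (size_t)count * 16);
#endif
}
```

With the question's variables, count would be (length & 0x7F) >> 4. The compiler is free to turn the switch into a jump table or a chain of branches, so the generated code may differ between toolchains, which matters if byte-identical code on both platforms is a hard requirement.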
