\$\begingroup\$

I am fairly rusty with assembler, let alone the AT&T syntax. I would appreciate it if someone with more experience could please review the following memcpy implementation. Note that this will only ever be called with aligned data.

inline static void alignedMemcpySSE(void *dst, const void * src, size_t length)
{
#if defined(__x86_64__) || defined(__i386__)
 size_t rem = (7 - ((length & 0x7F) >> 4)) * 10;
 void * end = dst + (length & ~0x7F);
 __asm__ __volatile__ (
 // save the registers we intend to alter, failure to do so causes problems
 // when gcc -O3 is used
 "push %[dst]\n\t"
 "push %[src]\n\t"
 "push %[end]\n\t"
 "cmp %[dst],%[end] \n\t"
 "je remain_%= \n\t"
 // perform 128 byte SIMD block copy
 "loop_%=: \n\t"
 "vmovaps 0x00(%[src]),%%xmm0 \n\t"
 "vmovaps 0x10(%[src]),%%xmm1 \n\t"
 "vmovaps 0x20(%[src]),%%xmm2 \n\t"
 "vmovaps 0x30(%[src]),%%xmm3 \n\t"
 "vmovaps 0x40(%[src]),%%xmm4 \n\t"
 "vmovaps 0x50(%[src]),%%xmm5 \n\t"
 "vmovaps 0x60(%[src]),%%xmm6 \n\t"
 "vmovaps 0x70(%[src]),%%xmm7 \n\t"
 "vmovntdq %%xmm0,0x00(%[dst]) \n\t"
 "vmovntdq %%xmm1,0x10(%[dst]) \n\t"
 "vmovntdq %%xmm2,0x20(%[dst]) \n\t"
 "vmovntdq %%xmm3,0x30(%[dst]) \n\t"
 "vmovntdq %%xmm4,0x40(%[dst]) \n\t"
 "vmovntdq %%xmm5,0x50(%[dst]) \n\t"
 "vmovntdq %%xmm6,0x60(%[dst]) \n\t"
 "vmovntdq %%xmm7,0x70(%[dst]) \n\t"
 "add $0x80,%[dst] \n\t"
 "add $0x80,%[src] \n\t"
 "cmp %[dst],%[end] \n\t"
 "jne loop_%= \n\t"
 "remain_%=: \n\t"
 // copy any remaining 16 byte blocks
#ifdef __x86_64__
 "leaq (%%rip), %[end]\n\t"
 "add $10,%[end] \n\t"
#else
 "call .+5 \n\t"
 "pop %[end] \n\t"
 "add $8,%[end] \n\t"
#endif
 "add %[rem],%[end] \n\t"
 "jmp *%[end] \n\t"
 // jump table
 "vmovaps 0x60(%[src]),%%xmm0 \n\t"
 "vmovntdq %%xmm0,0x60(%[dst]) \n\t"
 "vmovaps 0x50(%[src]),%%xmm1 \n\t"
 "vmovntdq %%xmm1,0x50(%[dst]) \n\t"
 "vmovaps 0x40(%[src]),%%xmm2 \n\t"
 "vmovntdq %%xmm2,0x40(%[dst]) \n\t"
 "vmovaps 0x30(%[src]),%%xmm3 \n\t"
 "vmovntdq %%xmm3,0x30(%[dst]) \n\t"
 "vmovaps 0x20(%[src]),%%xmm4 \n\t"
 "vmovntdq %%xmm4,0x20(%[dst]) \n\t"
 "vmovaps 0x10(%[src]),%%xmm5 \n\t"
 "vmovntdq %%xmm5,0x10(%[dst]) \n\t"
 "vmovaps 0x00(%[src]),%%xmm6 \n\t"
 "vmovntdq %%xmm6,0x00(%[dst]) \n\t"
 // alignment as the previous two instructions are only 4 bytes
 "nop\n\t"
 "nop\n\t"
 // restore the registers
 "pop %[end]\n\t"
 "pop %[src]\n\t"
 "pop %[dst]\n\t"
 :
 : [dst]"r" (dst),
 [src]"r" (src),
 [end]"c" (end),
 [rem]"d" (rem)
 : "xmm0",
 "xmm1",
 "xmm2",
 "xmm3",
 "xmm4",
 "xmm5",
 "xmm6",
 "xmm7",
 "memory"
 );
 //copy any remaining bytes
 for(size_t i = (length & 0xF); i; --i)
 ((uint8_t *)dst)[length - i] =
 ((uint8_t *)src)[length - i];
#else
 memcpy(dst, src, length);
#endif
}
301_Moved_Permanently
asked May 17, 2018 at 9:39
\$\endgroup\$
  • Wouldn't YMM be faster? Or are 16-byte loads the fastest it gets internally? Commented May 18, 2018 at 6:09
  • Through experimentation, AVX is marginally faster, but I need this to operate on systems without AVX. SSE2 is the most modern extension I can target. Commented May 18, 2018 at 8:44
  • Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Commented May 18, 2018 at 9:45

1 Answer

\$\begingroup\$

save the registers we intend to alter, failure to do so causes problems when gcc -O3 is used

It should be possible to avoid the pushes and pops. If you change the value of an operand, you cannot declare it as input-only (which is how this code currently declares all of its operands). Quoting the docs:

Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement.

There are a variety of ways to deal with this. Simply moving the operands you actually change to the "outputs" section and changing their constraints to "+r" might be sufficient. Note that the C variables will then really hold the modified values after the asm statement, which might affect your "remaining bytes" code.
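As a minimal sketch of the "+r" approach (not the asker's routine): the hypothetical helper `copy16_blocks` below copies a number of 16-byte blocks and lets the asm advance both pointers and the counter. Because all three are declared as read-write operands, the compiler knows they change, so no push/pop is needed. The function name, the plain `movaps` loop, and the `memcpy` fallback are my assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy `blocks` 16-byte blocks between 16-byte aligned
 * buffers. dst, src and blocks are modified inside the asm, so they are
 * listed as "+r" (read-write) operands instead of input-only operands. */
static void copy16_blocks(void *dst, const void *src, size_t blocks)
{
    assert(((uintptr_t)dst & 0xF) == 0);
    assert(((uintptr_t)src & 0xF) == 0);
    if (blocks == 0)
        return;
#if defined(__SSE2__)
    __asm__ __volatile__ (
        "1:                       \n\t"
        "movaps (%[src]), %%xmm0  \n\t"   /* aligned 16-byte load       */
        "movntdq %%xmm0, (%[dst]) \n\t"   /* non-temporal 16-byte store */
        "add $16, %[src]          \n\t"
        "add $16, %[dst]          \n\t"
        "dec %[n]                 \n\t"
        "jnz 1b                   \n\t"
        "sfence                   \n\t"   /* order the streaming stores */
        : [dst] "+r" (dst), [src] "+r" (src), [n] "+r" (blocks)
        :
        : "xmm0", "memory", "cc");
    /* From here on, dst and src point past the copied data and blocks is 0. */
#else
    /* Assumption: fall back to memcpy where SSE2 is not available. */
    memcpy(dst, src, blocks * 16);
#endif
}
```

Any tail handling that runs after such a statement has to account for the advanced pointers, for example by saving the original `dst` and `src` in separate variables beforehand.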

[rem]"d" (rem)

Is there some reason you need this value in that specific register? Letting the optimizers pick the registers tends to be a better plan.

let alone the AT&T syntax

If you prefer the Intel syntax for writing asm (and what rational person doesn't?), GCC has the -masm=intel option.

And lastly, have you tried writing this in C using intrinsics? There are a couple of places in your code where I wonder if a different arrangement of instructions might give better results, but I can't say without trying it or running an analysis tool.
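For comparison, here is a rough sketch of what an intrinsics version could look like, assuming SSE2 and 16-byte aligned pointers. The name `aligned_memcpy_sse2` is hypothetical, and the loop is deliberately left simple so the compiler can decide on unrolling and scheduling; I have not benchmarked it against the hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_sfence */
#endif

/* Hypothetical intrinsics variant: aligned loads plus non-temporal stores
 * for every full 16-byte block, then memcpy for the last length % 16 bytes. */
static void aligned_memcpy_sse2(void *dst, const void *src, size_t length)
{
#if defined(__SSE2__)
    assert(((uintptr_t)dst & 0xF) == 0);
    assert(((uintptr_t)src & 0xF) == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = length / 16;

    for (size_t i = 0; i < blocks; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();   /* make the streaming stores globally visible in order */

    size_t done = blocks * 16;
    memcpy((unsigned char *)dst + done,
           (const unsigned char *)src + done,
           length - done);
#else
    /* Assumption: plain memcpy on targets without SSE2. */
    memcpy(dst, src, length);
#endif
}
```

Since the pointers are never modified in place, the tail copy can index from the original `dst` and `src` without any saving or restoring.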

It's hard for humans to optimally arrange assembler instructions for modern processors.

answered May 17, 2018 at 21:50
\$\endgroup\$
  • Thank you! Excellent information here. I bound rem to the rdx register to ensure its instruction length is predictable for the calculated relative jump. The reason for needing this is that I am diagnosing a problem with Windows VM vs Linux VM memory copy performance, and I need the exact same code running in both environments for testing. Commented May 18, 2018 at 8:47
  • I have updated the sample code after making the suggested changes, and it works great. I also managed to figure out how to calculate the offset for the jump instead of using hard-coded values, which allows the use of "r" instead of "d". Commented May 18, 2018 at 9:37
