x86 strcpy implementation

Question 1

I got about 4 days of assembly knowledge, so I need a review on this strcpy function and if it can be done better (at least I have the feeling).

Full code (with the test included):

.data
 s:
 .asciz "Hello world!"
.bss
 .lcomm destination, 4
.text
.globl main
main:
 nop
 pushl $s
 pushl $destination
 call __strcpy
 addl 8,ドル %esp
 pushl $destination
 call puts
 addl 4,ドル %esp
 ret
.globl __strcpy
.type __strcpy, @function
__strcpy:
 movl 0ドルxFFFF, %ecx
 movl 4(%esp), %edi
 movl 8(%esp), %esi
cpy:
 cmpl 0,ドル (%esi)
 je done
 movsb
 loop cpy
done:
 ret

GitHub

Parts that I feel can be optimized:

Because the done label just executes the ret instruction:
- cmpl 0,ドル (%esi)
- je done
Because the rep instruction-family seems a like better approach:
- movsb
- loop cpy

Question 2

Why not use loopne instead of the jump?

Question 3

You can use Bit Twidding Hack to determine if int32 or int64 has no zero bytes: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord

If it has not you can copy whole int32 or int64. So it will be 4 operations for searching zero byte in 8 bytes (in int64 case). It looks like true optimization.

char * strcpy(char * dst, const char * src)
{
 char * origin = dst;
 while (!((((*(uint64_t *)src) - 0x0101010101010101ULL) 
 & ~(*(uint64_t *)src) & 0x8080808080808080ULL)))
 {
 *(uint64_t *)dst = *(uint64_t *)src;
 src += 8;
 dst += 8;
 }
 while (*dst++ = *src++)
 ;
 return origin;
}

Simple strcpy implementation uses 8 compares to zero and 8 byte copyings for each 8 bytes of source string. My implementation uses 4 operation for checking for zeros and 1 operation to copy for 8 bytes. So we have 5ops vs 16ops. Not all ops have same speeds so it is not easy to compare real speedup. We need some benchmarks, is anyone free?

Question 4

Sorry for the late reply. This looks good to me, thanks for your effort.

Question 5

This violates strict aliasing and could create exactly the same kind of problem as the custom memcpy that caused the problem in gcc, strict-aliasing, and horror stories.

Question 6

@PeterCordes don't you think char * is allowed to cast to any type without breaking strict aliasing rule? It is an exception to this rule.

Question 7

It's ok to read a long object with a char*, but it's not guaranteed to be ok to read a char object with a long*. You might be ok here for strict aliasing, in practice on real compilers, because you'd normally only use strcpy on actual string data, not on other objects cast to char* the way memcpy gets used. But I'm not sure about even that, according to the letter of the law. char str[16] is an array object, not a pointer, so accessing arr[10] might not count as a char* access that's allowed to alias the copying you did with this function.

Question 8

Apart from strict-aliasing, though, this function has two fatal flaws, one fixable: uint64_t has a higher alignment requirement than char, so this could easily fault from unaligned access on some CPUs, and even on x86 gcc assumes correct alignment when auto-vectorizing: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?.

Question 9

Your __strcpy assembler function seems to

copy at most 0xffff bytes
not to copy the terminating 0円

A normal string copy would copy any number of bytes and would include the 0円.

It is often good to learn from what the compiler gives you from a C function. You can do this with the -S option to gcc. For example this line will compile code.c into assembler file code.s

gcc -S -o code.s -O3 -Wall code.c

The -O3 sets the optimization level (high).

Also if you omit the length check, you should be able to arrange your loop so that there is only one branch instruction (branches are expensive).

Question 10

I apologize for my lack of GCC fluency in advance - I have only used MASM and RosAsm for x86, but I will try to translate. This review will be in top-to-bottom order, not in order of importance.

The first thing I would do is evaluate whether you really need to use cdecl calling convention. If you're only calling your function from asm, it makes sense to pass the source and destination in esi and edi, respectively, rather than putting them on the stack and then loading them.

Next, instead of movl %ecx, 0ドルxFFFF, I would do:

xorl %ecx, %ecx
decl %ecx

xoring a register with itself is such a common idiom that some processors use it as an optimization hint. All it does is set the register to zero, with a smaller opcode & operand. The dec after the xor makes the register all 1 bits no matter how many bits your register actually has. Some day when all of our GPRs are 128 bits, some poor sap that is updating assembly code will thank you for that =D.

Alternatively, you can forget about ecx being a limit altogether. No matter what arbitrary limit you set on the size of the string, it will either (1) not be big enough for someone someday, or (2) be small enough that an access violation (or worse: no access violation) will occur before you actually reach that limit. Either way, that is really only a nominal protection of data integrity.

Now, the string. There seem to be some inconsistencies in how you're treating its terminating null character. You're using cmpl to find four bytes of 0, then using movsb to only copy/advance esi by 1. Normally, strings are only guaranteed to be terminated by a number of null bytes equal to the character size, although in practice there are probably at least 2-3 to get the next datum to be dword-aligned. What that means for you is that your code will fail to detect the end of ~3/4 of normal, null-terminated, ascii strings, and keep copying until it finally causes an access violation.

But that's not all. Notice that you're fetching a dword at esi with the cmpl instruction, and that advancing that pointer by 1 at every iteration will make the pointer not dword-aligned 3/4 of the time. Loading non-aligned data takes two fetches instead of one, so for every 4 bytes of string, your cmpl instruction alone needs 7 fetches from memory. Furthermore, after fetching the data and discarding it with cmpl, you fetch it again with movsb, a total of 11 memory loads per dword of data.

To reduce that number, you should load the data into a register, do your test for the null terminator on that register, then store the data to the destination. I see that someone else has pointed you to the bit-hack that will let you test all 4 bytes of the dword at once, so if you can follow that, do so, but here's a less efficient way that demonstrates my point very clearly:

cpy:
 lodsb ; fetch one byte from [esi++] to al
 test %al, %al ; set the z flag if al is 0
 stosb ; store the byte, even if it's 0, doesn't affect flags
 jnz cpy ; or loopnz if you still want to use a size limit in ecx

That will only require 4 fetches per dword, and is perfectly readable (as far as asm goes...). The bit twiddling hack does it with just one.

With regard to your instinct that rep family instructions would be better, I believe you're sadly incorrect. If you learn any other assembly language and try to use strings, you'll certainly appreciate the effort Intel originally put into providing special string instructions, but they haven't really optimized or extended those instructions with the same zeal as one might like. Further, you can't repnz movsb, because movsb doesn't set any flags. If you ever do end up repne scasb, remember to jecxz =D.

This might seem like a long list of complaints, but for having first seen assembly 4 days before writing this, it's remarkable that you can do anything useful. Cheers.

k06a k06a 2281 silver badge7 bronze badges · Accepted Answer · 2013-08-28 07:23:01Z

9

\$\begingroup\$

You can use Bit Twidding Hack to determine if int32 or int64 has no zero bytes: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord

If it has not you can copy whole int32 or int64. So it will be 4 operations for searching zero byte in 8 bytes (in int64 case). It looks like true optimization.

char * strcpy(char * dst, const char * src)
{
 char * origin = dst;
 while (!((((*(uint64_t *)src) - 0x0101010101010101ULL) 
 & ~(*(uint64_t *)src) & 0x8080808080808080ULL)))
 {
 *(uint64_t *)dst = *(uint64_t *)src;
 src += 8;
 dst += 8;
 }
 while (*dst++ = *src++)
 ;
 return origin;
}

Simple strcpy implementation uses 8 compares to zero and 8 byte copyings for each 8 bytes of source string. My implementation uses 4 operation for checking for zeros and 1 operation to copy for 8 bytes. So we have 5ops vs 16ops. Not all ops have same speeds so it is not easy to compare real speedup. We need some benchmarks, is anyone free?

Share

edited Feb 15, 2014 at 13:46

answered Aug 28, 2013 at 7:23

k06a's user avatar

k06a k06a

2281 silver badge7 bronze badges

\$\endgroup\$

6

\$\begingroup\$ Sorry for the late reply. This looks good to me, thanks for your effort. \$\endgroup\$

Fallen
– Fallen

2013年09月02日 12:25:30 +00:00
Commented Sep 2, 2013 at 12:25
\$\begingroup\$ This violates strict aliasing and could create exactly the same kind of problem as the custom memcpy that caused the problem in gcc, strict-aliasing, and horror stories. \$\endgroup\$

Peter Cordes
– Peter Cordes

2018年05月19日 05:07:31 +00:00
Commented May 19, 2018 at 5:07
\$\begingroup\$ @PeterCordes don't you think char * is allowed to cast to any type without breaking strict aliasing rule? It is an exception to this rule. \$\endgroup\$

k06a
– k06a

2018年05月19日 20:45:45 +00:00
Commented May 19, 2018 at 20:45
\$\begingroup\$ It's ok to read a long object with a char*, but it's not guaranteed to be ok to read a char object with a long*. You might be ok here for strict aliasing, in practice on real compilers, because you'd normally only use strcpy on actual string data, not on other objects cast to char* the way memcpy gets used. But I'm not sure about even that, according to the letter of the law. char str[16] is an array object, not a pointer, so accessing arr[10] might not count as a char* access that's allowed to alias the copying you did with this function. \$\endgroup\$

Peter Cordes
– Peter Cordes

2018年05月19日 20:59:35 +00:00
Commented May 19, 2018 at 20:59
\$\begingroup\$ Apart from strict-aliasing, though, this function has two fatal flaws, one fixable: uint64_t has a higher alignment requirement than char, so this could easily fault from unaligned access on some CPUs, and even on x86 gcc assumes correct alignment when auto-vectorizing: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?. \$\endgroup\$

Peter Cordes
– Peter Cordes

2018年05月19日 21:03:26 +00:00
Commented May 19, 2018 at 21:03

| Show 1 more comment

Stack Exchange Network

x86 strcpy implementation

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

x86 strcpy implementation

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions