I got about 4 days of assembly knowledge, so I need a review on this strcpy
function and if it can be done better (at least I have the feeling).
Full code (with the test included):
.data
s:
.asciz "Hello world!"
.bss
.lcomm destination, 4
.text
.globl main
main:
nop
pushl $s
pushl $destination
call __strcpy
addl 8,ドル %esp
pushl $destination
call puts
addl 4,ドル %esp
ret
.globl __strcpy
.type __strcpy, @function
__strcpy:
movl 0ドルxFFFF, %ecx
movl 4(%esp), %edi
movl 8(%esp), %esi
cpy:
cmpl 0,ドル (%esi)
je done
movsb
loop cpy
done:
ret
Parts that I feel can be optimized:
Because the
done
label just executes theret
instruction:cmpl 0,ドル (%esi)
je done
Because the
rep
instruction-family seems a like better approach:movsb
loop cpy
3 Answers 3
You can use Bit Twidding Hack to determine if int32
or int64
has no zero bytes: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
If it has not you can copy whole int32
or int64
. So it will be 4 operations for searching zero byte in 8 bytes (in int64
case). It looks like true optimization.
char * strcpy(char * dst, const char * src)
{
char * origin = dst;
while (!((((*(uint64_t *)src) - 0x0101010101010101ULL)
& ~(*(uint64_t *)src) & 0x8080808080808080ULL)))
{
*(uint64_t *)dst = *(uint64_t *)src;
src += 8;
dst += 8;
}
while (*dst++ = *src++)
;
return origin;
}
Simple strcpy
implementation uses 8 compares to zero and 8 byte copyings for each 8 bytes of source string. My implementation uses 4 operation for checking for zeros and 1 operation to copy for 8 bytes. So we have 5ops vs 16ops. Not all ops have same speeds so it is not easy to compare real speedup. We need some benchmarks, is anyone free?
-
\$\begingroup\$ Sorry for the late reply. This looks good to me, thanks for your effort. \$\endgroup\$Fallen– Fallen2013年09月02日 12:25:30 +00:00Commented Sep 2, 2013 at 12:25
-
\$\begingroup\$ This violates strict aliasing and could create exactly the same kind of problem as the custom
memcpy
that caused the problem in gcc, strict-aliasing, and horror stories. \$\endgroup\$Peter Cordes– Peter Cordes2018年05月19日 05:07:31 +00:00Commented May 19, 2018 at 5:07 -
\$\begingroup\$ @PeterCordes don't you think
char *
is allowed to cast to any type without breaking strict aliasing rule? It is an exception to this rule. \$\endgroup\$k06a– k06a2018年05月19日 20:45:45 +00:00Commented May 19, 2018 at 20:45 -
\$\begingroup\$ It's ok to read a
long
object with achar*
, but it's not guaranteed to be ok to read achar
object with along*
. You might be ok here for strict aliasing, in practice on real compilers, because you'd normally only usestrcpy
on actual string data, not on other objects cast tochar*
the waymemcpy
gets used. But I'm not sure about even that, according to the letter of the law.char str[16]
is an array object, not a pointer, so accessingarr[10]
might not count as achar*
access that's allowed to alias the copying you did with this function. \$\endgroup\$Peter Cordes– Peter Cordes2018年05月19日 20:59:35 +00:00Commented May 19, 2018 at 20:59 -
\$\begingroup\$ Apart from strict-aliasing, though, this function has two fatal flaws, one fixable:
uint64_t
has a higher alignment requirement thanchar
, so this could easily fault from unaligned access on some CPUs, and even on x86 gcc assumes correct alignment when auto-vectorizing: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?. \$\endgroup\$Peter Cordes– Peter Cordes2018年05月19日 21:03:26 +00:00Commented May 19, 2018 at 21:03
Your __strcpy
assembler function seems to
- copy at most 0xffff bytes
- not to copy the terminating 0円
A normal string copy would copy any number of bytes and would include the 0円.
It is often good to learn from what the compiler gives you from a C function.
You can do this with the -S option to gcc
. For example this line will
compile code.c
into assembler file code.s
gcc -S -o code.s -O3 -Wall code.c
The -O3 sets the optimization level (high).
Also if you omit the length check, you should be able to arrange your loop so that there is only one branch instruction (branches are expensive).
I apologize for my lack of GCC fluency in advance - I have only used MASM and RosAsm for x86, but I will try to translate. This review will be in top-to-bottom order, not in order of importance.
The first thing I would do is evaluate whether you really need to use cdecl calling convention. If you're only calling your function from asm, it makes sense to pass the source and destination in esi
and edi
, respectively, rather than putting them on the stack and then loading them.
Next, instead of movl %ecx, 0ドルxFFFF
, I would do:
xorl %ecx, %ecx
decl %ecx
xor
ing a register with itself is such a common idiom that some processors use it as an optimization hint. All it does is set the register to zero, with a smaller opcode & operand. The dec
after the xor
makes the register all 1 bits no matter how many bits your register actually has. Some day when all of our GPRs are 128 bits, some poor sap that is updating assembly code will thank you for that =D.
Alternatively, you can forget about ecx
being a limit altogether. No matter what arbitrary limit you set on the size of the string, it will either (1) not be big enough for someone someday, or (2) be small enough that an access violation (or worse: no access violation) will occur before you actually reach that limit. Either way, that is really only a nominal protection of data integrity.
Now, the string. There seem to be some inconsistencies in how you're treating its terminating null character. You're using cmpl
to find four bytes of 0, then using movsb
to only copy/advance esi
by 1. Normally, strings are only guaranteed to be terminated by a number of null bytes equal to the character size, although in practice there are probably at least 2-3 to get the next datum to be dword-aligned. What that means for you is that your code will fail to detect the end of ~3/4 of normal, null-terminated, ascii strings, and keep copying until it finally causes an access violation.
But that's not all. Notice that you're fetching a dword at esi
with the cmpl
instruction, and that advancing that pointer by 1 at every iteration will make the pointer not dword-aligned 3/4 of the time. Loading non-aligned data takes two fetches instead of one, so for every 4 bytes of string, your cmpl
instruction alone needs 7 fetches from memory. Furthermore, after fetching the data and discarding it with cmpl
, you fetch it again with movsb
, a total of 11 memory loads per dword of data.
To reduce that number, you should load the data into a register, do your test for the null terminator on that register, then store the data to the destination. I see that someone else has pointed you to the bit-hack that will let you test all 4 bytes of the dword at once, so if you can follow that, do so, but here's a less efficient way that demonstrates my point very clearly:
cpy:
lodsb ; fetch one byte from [esi++] to al
test %al, %al ; set the z flag if al is 0
stosb ; store the byte, even if it's 0, doesn't affect flags
jnz cpy ; or loopnz if you still want to use a size limit in ecx
That will only require 4 fetches per dword, and is perfectly readable (as far as asm goes...). The bit twiddling hack does it with just one.
With regard to your instinct that rep
family instructions would be better, I believe you're sadly incorrect. If you learn any other assembly language and try to use strings, you'll certainly appreciate the effort Intel originally put into providing special string instructions, but they haven't really optimized or extended those instructions with the same zeal as one might like. Further, you can't repnz movsb
, because movsb
doesn't set any flags. If you ever do end up repne scasb
, remember to jecxz
=D.
This might seem like a long list of complaints, but for having first seen assembly 4 days before writing this, it's remarkable that you can do anything useful. Cheers.
loopne
instead of the jump? \$\endgroup\$