This is my best effort at converting a 32-bit integer in EAX to an 8-character ASCII string (result in RDI). It works accurately up to 99,999,999. Higher values could be handled using an xmm register. The byte order in RDI is correct (e.g. if copied directly to video RAM, the most significant digit is displayed first). No expensive DIV instructions or memory accesses are required. Can this be improved further?
mov ebx, 0xCCCCCCCD
xor rdi, rdi
.loop:
mov ecx, eax ; save original number
mul ebx ; divide by 10 using agner fog's 'magic number'
shr edx, 3 ;
mov eax, edx ; store it back into eax
lea edx, [edx*4 + edx] ; multiply by 10
lea edx, [edx*2 - '0'] ; and ascii it
sub ecx, edx ; subtract from original number to get remainder
shl rdi, 8 ; shift in to least significant byte
or rdi, rcx ;
test eax, eax
jnz .loop
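For readers unfamiliar with the 'magic number' step, here is a minimal C sketch of what the mul / shr pair computes. The function name div10_magic is made up for illustration; the constant and the shift count are taken from the code above.

```c
#include <stdint.h>

/* Model of `mul ebx` followed by `shr edx, 3` (hypothetical helper name).
   MUL leaves the high 32 bits of the 64-bit product in EDX, so the two
   instructions together compute (n * 0xCCCCCCCD) >> 35, i.e. n / 10. */
uint32_t div10_magic(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;  /* mul ebx          */
    uint32_t high = (uint32_t)(product >> 32);     /* value left in EDX */
    return high >> 3;                              /* shr edx, 3        */
}
```

0xCCCCCCCD is 2^35 / 10 rounded up, which is why the quotient comes out of the high half after three more shifts. For example, div10_magic(1234) gives 123, the same as 1234 / 10.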
- What constitutes an "improvement"? Smaller code? Faster code? Error checking? I can make it smaller by using div, faster (on my hw) by moving things around, and checking for overflow would be trivial. There are some comments that could be improved too. – David Wohlferd, Sep 30, 2016 at 10:32
- Faster is the criterion. – poby, Sep 30, 2016 at 12:20
1 Answer
There are all sorts of things that LOOK like they should perform better, but I can only get a tiny bit of improvement. And that may disappear (or worsen) on other hardware:
mov ebx, 0xCCCCCCCD
xor rdi, rdi
.loop:
mov ecx, eax ; save original number
mul ebx ; divide by 10 using agner fog's 'magic number'
shr edx, 3 ;
mov eax, edx ; store quotient for next loop
lea edx, [edx*4 + edx] ; multiply by 10
shl rdi, 8 ; make room for byte
lea edx, [edx*2 - '0'] ; finish *10 and convert to ascii
sub ecx, edx ; subtract from original number to get remainder
lea rdi, [rdi + rcx] ; store next byte
test eax, eax
jnz .loop
The 2 changes are:
- Move shl rdi to a 'better' place
- Use lea instead of or to set rdi
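When reordering instructions like this, it may help to have a reference model to compare results against. The following is only a sketch in C of what the loop computes, with made-up function names (pack_digits, packed_to_string). It uses an add where the original used or; the two give the same result here because the shift has just cleared the low byte.

```c
#include <stdint.h>

/* Hypothetical model of the loop: packs the decimal digits of n into a
   64-bit value. Only meaningful for n <= 99,999,999 (8 digits); larger
   values would shift digits out of the top of the 64-bit result. */
uint64_t pack_digits(uint32_t n)
{
    uint64_t packed = 0;                                  /* xor rdi, rdi */
    do {
        uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35); /* n / 10 */
        uint32_t ch = n - (q * 10u - '0');                /* remainder + '0' */
        packed <<= 8;                                     /* shl rdi, 8 */
        packed += ch;                  /* lea rdi, [rdi + rcx]; low byte is 0,
                                          so this equals packed | ch */
        n = q;
    } while (n != 0);                                     /* test / jnz */
    return packed;
}

/* Writes the packed value out byte by byte, lowest byte first, which is
   what a 64-bit store of RDI would do on x86 (little-endian). */
void packed_to_string(uint64_t packed, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)((packed >> (8 * i)) & 0xFF);
    out[8] = '\0';
}
```

The last digit produced is the most significant one, and it lands in the lowest byte, so a little-endian store puts it first in memory. For 42 the packed value is 0x3234, which reads back as the string "42".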
- I understand moving the shl rdi, but how is lea faster than or? – poby, Sep 30, 2016 at 23:35
- Strictly speaking, the answer is "I don't know why it's faster, I just tried it and it works." The or instruction (presumably) uses a different pipeline, or has different dependencies, than lea. Given the "out-of-order" execution of today's processors, it's hard for mere mortals to understand how anything works anymore. That's (one reason) why most people use higher level languages. The people who write the compilers' optimizers understand this crap so the rest of us don't have to. The days when you could just dash off a few lines of asm that performed better than the C output are long gone. – David Wohlferd, Oct 1, 2016 at 1:08
- Why use lea here at all? Why not just add rdi, rcx? Also, xor rdi, rdi should be xor edi, edi since that will implicitly clear the upper 32 bits but is shorter because it doesn't require the REX prefix. – Cody Gray, Jan 2, 2017 at 15:06