\$\begingroup\$

This is my best effort at converting a 32-bit integer in EAX to an 8-character ASCII string (result in RDI). It is accurate for values up to 99,999,999, since RDI only has room for eight bytes. Higher values could be handled using an XMM register. The byte order in RDI is correct (e.g. if RDI is copied directly to video RAM, the most significant digit is displayed first). No expensive DIV instructions or memory accesses are required. Can this be improved further?

 mov ebx, 0xCCCCCCCD 
 xor rdi, rdi
.loop:
 mov ecx, eax ; save original number
 mul ebx ; edx:eax = eax * 0xCCCCCCCD (Agner Fog's 'magic number' for dividing by 10)
 shr edx, 3 ; edx = original number / 10
 mov eax, edx ; store the quotient back into eax
 lea edx, [edx*4 + edx] ; quotient * 5
 lea edx, [edx*2 - '0'] ; quotient * 10, minus '0' so the subtraction below yields ASCII
 sub ecx, edx ; original - quotient*10 + '0' = remainder as an ASCII digit
 shl rdi, 8 ; make room in the least significant byte
 or rdi, rcx ; insert the new digit
 test eax, eax
 jnz .loop 
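
For checking the arithmetic away from the target machine, here is a minimal C model of the loop above. It is only a sketch: the function names div10_magic and u32_to_ascii8 are made up for illustration and are not part of the routine, and the model says nothing about timing.

```c
#include <stdint.h>

/* Hypothetical helper: models "mul ebx" followed by "shr edx, 3".
   The high 32 bits of the 64-bit product land in EDX, so the two
   instructions together amount to a right shift by 32 + 3 = 35. */
static uint32_t div10_magic(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Hypothetical model of the whole loop; the return value plays the role of RDI. */
static uint64_t u32_to_ascii8(uint32_t n)
{
    uint64_t packed = 0;                          /* xor rdi, rdi */
    do {
        uint32_t q = div10_magic(n);              /* quotient for the next pass */
        uint32_t ch = n - (q * 10u - '0');        /* remainder, already biased to ASCII */
        packed = (packed << 8) | ch;              /* shl rdi, 8 ; or rdi, rcx */
        n = q;
    } while (n != 0);                             /* test eax, eax ; jnz .loop */
    return packed;
}
```

In this model, u32_to_ascii8(12345678) returns 0x3837363534333231: the low byte is '1', so a little-endian 8-byte store writes the text in reading order. A nine-digit input loses its ones digit off the top of the 64-bit value, which is where the 99,999,999 limit comes from.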
asked Sep 29, 2016 at 19:47
\$\endgroup\$
  • \$\begingroup\$ What constitutes an "improvement?" Smaller code? Faster code? Error checking? I can make it smaller by using div, faster (on my hw) by moving things around, and checking for overflow would be trivial. There are some comments that could be improved too. \$\endgroup\$ Commented Sep 30, 2016 at 10:32
  • \$\begingroup\$ Faster is the criterion. \$\endgroup\$ Commented Sep 30, 2016 at 12:20

1 Answer

\$\begingroup\$

There are all sorts of things that LOOK like they should perform better, but I could only get a tiny improvement, and even that may disappear (or get worse) on other hardware:

 mov ebx, 0xCCCCCCCD 
 xor rdi, rdi
.loop:
 mov ecx, eax ; save original number
 mul ebx ; edx:eax = eax * 0xCCCCCCCD (Agner Fog's 'magic number' for dividing by 10)
 shr edx, 3 ; edx = original number / 10
 mov eax, edx ; store quotient for next loop
 lea edx, [edx*4 + edx] ; quotient * 5
 shl rdi, 8 ; make room for byte
 lea edx, [edx*2 - '0'] ; finish *10 and fold in the ASCII bias
 sub ecx, edx ; subtract from original number to get remainder as an ASCII digit
 lea rdi, [rdi + rcx] ; store next byte
 test eax, eax
 jnz .loop 

The two changes are:

  • Move shl rdi, 8 earlier in the loop, ahead of the second lea
  • Use lea instead of or to combine the new byte into rdi
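
Swapping or for lea (which is an addition) does not change the result here, because the shift has just cleared the low byte of rdi and the new digit fits in that byte, so no carry can occur. A small C sketch of that equivalence, with made-up function names and assuming the appended value is at most 0xFF:

```c
#include <stdint.h>

/* Hypothetical models of the two ways to append a byte; names are illustrative. */

/* shl rdi, 8 ; or rdi, rcx */
static uint64_t append_with_or(uint64_t packed, uint8_t ch)
{
    return (packed << 8) | ch;
}

/* shl rdi, 8 ; lea rdi, [rdi + rcx]   (add rdi, rcx computes the same sum) */
static uint64_t append_with_add(uint64_t packed, uint8_t ch)
{
    return (packed << 8) + ch;
}
```

This only shows that the results match; it does not explain any speed difference. One real difference between the instructions is that lea leaves the flags alone, whereas or and add write them.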
answered Sep 30, 2016 at 15:42
\$\endgroup\$
  • \$\begingroup\$ I understand moving the shl rdi but how is lea faster than or? \$\endgroup\$ Commented Sep 30, 2016 at 23:35
  • \$\begingroup\$ Strictly speaking, the answer is "I don't know why it's faster, I just tried it and it works." or (presumably) uses a different pipeline, or has different dependencies, than lea. Given the out-of-order execution of today's processors, it's hard for mere mortals to understand how anything works anymore. That's (one reason) why most people use higher-level languages. The people who write compiler optimizers understand this crap so the rest of us don't have to. The days when you could just dash off a few lines of asm that performed better than the C output are long gone. \$\endgroup\$ Commented Oct 1, 2016 at 1:08
  • \$\begingroup\$ Why use lea here at all? Why not just add rdi, rcx? Also, xor rdi, rdi should be xor edi, edi since that will implicitly clear the upper 32 bits but is shorter because it doesn't require the REX prefix. \$\endgroup\$ Commented Jan 2, 2017 at 15:06
