Integer-to-ASCII algorithm (x86 assembly)

Question 1

This is my best effort at converting a 32-bit integer in EAX to an 8-character ASCII string (result in RDI). It works accurately for values up to 99,999,999. Higher values could be handled using an XMM register. The byte order in RDI is correct (e.g. if copied directly to video RAM, the most significant digit is displayed first). No expensive DIV instructions or memory accesses are required. Can this be improved further?

 mov ebx, 0xCCCCCCCD ; magic constant for unsigned division by 10
 xor rdi, rdi ; clear the result
.loop:
 mov ecx, eax ; save original number
 mul ebx ; divide by 10 using Agner Fog's 'magic number'
 shr edx, 3 ; edx = quotient
 mov eax, edx ; store it back into eax
 lea edx, [edx*4 + edx] ; multiply quotient by 5
 lea edx, [edx*2 - '0'] ; finish the multiply by 10 and fold in the ASCII offset
 sub ecx, edx ; subtract from original number to get remainder + '0'
 shl rdi, 8 ; make room in the least significant byte
 or rdi, rcx ; insert the ASCII digit
 test eax, eax
 jnz .loop
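
For readers who want to check the arithmetic, here is a minimal C model of the loop above. It is a sketch rather than a replacement for the assembly. The function names (div10, pack_digits, packed_to_string) are made up for illustration, and the last helper assumes a little-endian host such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* n / 10 without DIV: take the high half of n * 0xCCCCCCCD (what MUL leaves
   in EDX) and shift it right by 3, i.e. a total right shift of 35 bits. */
uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Mirrors the loop: peel off the low digit, shift the accumulator left by
   one byte, and insert the ASCII digit into the low byte. */
uint64_t pack_digits(uint32_t n)
{
    assert(n <= 99999999u);   /* a ninth digit would be shifted out of 64 bits */
    uint64_t packed = 0;      /* plays the role of RDI */
    do {
        uint32_t q = div10(n);
        uint32_t ascii = n - (q * 10u - '0');  /* remainder + '0', as in the asm */
        packed = (packed << 8) | ascii;
        n = q;
    } while (n != 0);
    return packed;
}

/* Store the packed value the way a 64-bit store to memory would on x86
   (little-endian assumed) and NUL-terminate it. buf must hold 9 bytes. */
void packed_to_string(uint64_t packed, char buf[9])
{
    memcpy(buf, &packed, 8);
    buf[8] = '\0';
}
```

Because the least significant digit is produced first and shifted upward on every pass, the most significant digit ends up in the low byte, which is the first byte in memory. Values with fewer than 8 digits leave the remaining high bytes zero.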

Question 2

What constitutes an "improvement"? Smaller code? Faster code? Error checking? I can make it smaller by using div, faster (on my hardware) by moving things around, and checking for overflow would be trivial. Some of the comments could be improved too.

Question 3

Faster is the criterion.

Question 4

There are all sorts of things that LOOK like they should perform better, but I can only get a tiny bit of improvement. And that may disappear (or worsen) on other hardware:

 mov ebx, 0xCCCCCCCD
 xor rdi, rdi
.loop:
 mov ecx, eax ; save original number
 mul ebx ; divide by 10 using Agner Fog's 'magic number'
 shr edx, 3 ; edx = quotient
 mov eax, edx ; store quotient for next loop
 lea edx, [edx*4 + edx] ; start multiply by 10 (quotient * 5)
 shl rdi, 8 ; make room for byte
 lea edx, [edx*2 - '0'] ; finish *10 and convert to ascii
 sub ecx, edx ; subtract from original number to get remainder
 lea rdi, [rdi + rcx] ; store next byte
 test eax, eax
 jnz .loop

The 2 changes are:

Move shl rdi to a 'better' place
Use lea instead of or to set rdi
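
The second change cannot alter the result: after the shift the low byte of rdi is zero, and rcx holds a single ASCII digit, so or, add, and lea all produce the same value. A minimal C sketch of that equivalence follows. The function names are hypothetical; acc stands in for rdi and ascii for rcx. It says nothing about which form is faster.

```c
#include <stdint.h>

/* One accumulator step written two ways. After the shift the low 8 bits of
   acc are zero, so OR-ing in a value below 256 equals adding it. */
uint64_t step_or(uint64_t acc, uint64_t ascii)
{
    return (acc << 8) | ascii;   /* shl rdi, 8 / or rdi, rcx */
}

uint64_t step_add(uint64_t acc, uint64_t ascii)
{
    return (acc << 8) + ascii;   /* shl rdi, 8 / lea rdi, [rdi + rcx] */
}
```

Any speed difference between the two forms therefore has to come from how the processor schedules the instructions, not from what they compute.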

Question 5

I understand moving the shl rdi but how is lea faster than or?

Question 6

Strictly speaking, the answer is "I don't know why it's faster, I just tried it and it works." The or instruction (presumably) uses a different pipeline, or has different dependencies, than lea. Given the out-of-order execution of today's processors, it's hard for mere mortals to understand how anything works anymore. That's (one reason) why most people use higher-level languages. The people who write the compilers' optimizers understand this crap so the rest of us don't have to. The days when you could just dash off a few lines of asm that performed better than the C output are long gone.

Question 7

Why use lea here at all? Why not just add rdi, rcx? Also, xor rdi, rdi should be xor edi, edi since that will implicitly clear the upper 32 bits but is shorter because it doesn't require the REX prefix.

David Wohlferd · Accepted Answer · 2016-09-30 15:42:37Z


Comments on the answer (Questions 5 to 7 above): poby (Sep 30, 2016 at 23:35), David Wohlferd (Oct 1, 2016 at 1:08), Cody Gray (Jan 2, 2017 at 15:06).
