This is my best effort to code an x86 32 bits to get an unsigned int (4 bytes) and convert it into a string. My strategy was successively divide the number by ten and fulfill the remainder of the div
according its position in the heap pre-allocated string. I am assuming the bigger value has 10 digits at most.
File i2a.s
:
# convert integer to ASCII
.section .data
num:
.long 123456789
str:
# 0123456789A
.string " \n"
len:
.long .-str
.section .text
.globl _start
_start:
# last index of str
lea str+0xa, %edi
# initialize eax
mov num, %eax
loop:
# divide
xor %edx, %edx
mov 0ドルxa, %ecx
div %ecx # quocient: EAX / remainder: EDX
# convert remainder to ASCII
add 0ドルx30, %edx
# store it in str[i--]
mov %dl, -1(%edi)
dec %edi
# if quocient is zero, we're done.
cmp 0,ドル %eax
je print
# if i == 0, we're done.
cmp str, %edi
je print
# repeat.
jmp loop
print:
# print content of str
mov 4,ドル %eax
mov 1,ドル %ebx
lea str, %ecx
# where the generated string starts?
# str[%edi]
sub $str, %edi
add %edi, %ecx
# what is the final length of str?
# len -= %edi
mov len, %edx
sub %edi, %edx
int 0ドルx80
exit:
mov 1,ドル %eax
mov %edx, %ebx
int 0ドルx80
I know there are some criticism about div
and its slow execution. Please, fell free to criticize it as well as to point out caveats, tips, and tricks.
Simple script I'm using to build it: (it will create bin
directory in cwd)
#! /bin/bash
# binaries output directory (objects and executables are placed there)
OUTDIR=bin
[ -d $OUTDIR ] || mkdir $OUTDIR
as --32 1ドル.s -o $OUTDIR/1ドル.o \
&& ld -m elf_i386 $OUTDIR/1ドル.o -o $OUTDIR/1ドル \
&& $OUTDIR/1ドル
Usase:
./build i2a
1 Answer 1
It’s been a good many years since I wrote significant code in x86 assembly language, but I’ll wade in here.
Yes, div
is a fat pig of an instruction. It not only takes a ridiculous number of clock cycles, but it uses many of the dispatch ports, clogging the CPU so it can’t do much else while waiting on the result. See the links in this S.O. answer.
This is because division is difficult. You could look up how to transform division by a constant (10 in this case) into multiplication and code that.
But, successive division/modulo by 10 is now I learned it.
The biggest speed benefit will be from allowing pipelining to work well and not stalling. Sometimes you code two or three different copies of the loop and interweave them so that there is more independent calculations to work on at the same time. To that end, you might start by dividing by 10000 once and then doing high digits and low digits separately in interwoven loops.
But, using the div
instruction makes that difficult because it has pre-targeted registers that it uses, and worse, the instruction does not pipeline.
I think the fastest would be to use the div-less algorithm with vector SIMD instructions to compute multiple digits at the same time.
sketch of an idea
Avoid all division/modulo and instead build up the decimal representation using decimal arithmetic!
A 64-bit value can also be treated as 8 bytes; each byte is a digit (not packed as BCD nibbles, but one per byte). You can add them by adding the 64-bit integers — each byte will add, and there can be no caries outside of the byte because the largest result is 18 and you can hold up to 255.
In particular, doubling such a value means adding it to itself. The hard part, which I’ll explain later, is following that up with carry propagation.
So, start with a 8-digit unpacked decimal set to all zeros; call that Accumulator.
Loop while In != 0 :
Take your normal integer (called In) and mask off the lowest bit, getting a result that is either 0 or 1. Add that to the least significant byte of Accumulator. Note that you have no if
here, no branching. It will simply add zero accomplishing nothing, or add one. You can do that add using a 32-bit or any other sized int aliased with the Accumulator, not just a byte, because you know the carry will not happen out of the byte. I mention this because byte arithmetic is slightly slower than 32/64.
Then, shift In right one bit.
Add Accumulator to itself, using the aliasing trick I mentioned before.
Process the carries.
[end of loop]
Convert Accumulator to an ASCII representation by adding 0x30 to each byte. This can be a constant of 64 bits, so you do this in one add.
So, how do you propagate the carries? Use SSE instructions.
Load the Accumulator into an SSE register. Use an instruction which compares each (byte) element against 10 and produces a packed bitmap of the results. There will be 16 bits, indicating which bytes are too large for decimal.
Use that bitmap as a mask to subtract 10 from each enabled byte. You can use the same SSE register loaded with all 10's for both steps.
Shift the bitmap left by one bit. Now use that as a mask to add 1 to every enabled byte.
Now, saving the SSE register back to RAM and loading it again will slow it down. So do the doubling (adding the register to itself) in SSE, also. Adding the new In bit can be done with separate instructions, or rolled into the carry step by setting the least bit (or not) after shifting left. Whichever is the least amount of jumping through hoops or avoiding a branch.
mov 5, %eax
instead ofmov EAX, 5
. :) \$\endgroup\$