Convert unsigned Integer to ASCII in x86 assembly

Question 1

This is my best effort to code an x86 32 bits to get an unsigned int (4 bytes) and convert it into a string. My strategy was successively divide the number by ten and fulfill the remainder of the div according its position in the heap pre-allocated string. I am assuming the bigger value has 10 digits at most.

File i2a.s:

# convert integer to ASCII
.section .data
num:
 .long 123456789
str:
 # 0123456789A
 .string " \n"
len:
 .long .-str
.section .text
 .globl _start
_start:
 # last index of str
 lea str+0xa, %edi
 # initialize eax
 mov num, %eax
loop:
 # divide
 xor %edx, %edx
 mov 0ドルxa, %ecx
 div %ecx # quocient: EAX / remainder: EDX
 # convert remainder to ASCII
 add 0ドルx30, %edx
 # store it in str[i--]
 mov %dl, -1(%edi)
 dec %edi
 # if quocient is zero, we're done.
 cmp 0,ドル %eax
 je print
 # if i == 0, we're done.
 cmp str, %edi
 je print
 # repeat.
 jmp loop
print:
# print content of str
 mov 4,ドル %eax
 mov 1,ドル %ebx
 lea str, %ecx
# where the generated string starts?
# str[%edi]
 sub $str, %edi
 add %edi, %ecx
# what is the final length of str?
# len -= %edi
 mov len, %edx
 sub %edi, %edx
 int 0ドルx80
exit:
 mov 1,ドル %eax
 mov %edx, %ebx
 int 0ドルx80

I know there are some criticism about div and its slow execution. Please, fell free to criticize it as well as to point out caveats, tips, and tricks.

Simple script I'm using to build it: (it will create bin directory in cwd)

#! /bin/bash
# binaries output directory (objects and executables are placed there)
OUTDIR=bin
[ -d $OUTDIR ] || mkdir $OUTDIR
as --32 1ドル.s -o $OUTDIR/1ドル.o \
 && ld -m elf_i386 $OUTDIR/1ドル.o -o $OUTDIR/1ドル \
 && $OUTDIR/1ドル

Usase:

./build i2a

Question 2

@DavidWohlferd Comments are for seeking clarification to the question. Please put all suggestions for improvements in answers.

Question 3

The title does not correspond to the code (a2i vs. i2a)

Question 4

Do you have the title the wrong way around? It looks like you're converting an unsigned integer to an ASCII string.

Question 5

The title was flipped, I'm converting from unsigned int 32 bits to ASCII. I will change it.

Question 6

@TobySpeight title reflects the backwards notation in his asm format: mov 5, %eax instead of mov EAX, 5. :)

Question 7

It’s been a good many years since I wrote significant code in x86 assembly language, but I’ll wade in here.

Yes, div is a fat pig of an instruction. It not only takes a ridiculous number of clock cycles, but it uses many of the dispatch ports, clogging the CPU so it can’t do much else while waiting on the result. See the links in this S.O. answer.

This is because division is difficult. You could look up how to transform division by a constant (10 in this case) into multiplication and code that.

But, successive division/modulo by 10 is now I learned it.

The biggest speed benefit will be from allowing pipelining to work well and not stalling. Sometimes you code two or three different copies of the loop and interweave them so that there is more independent calculations to work on at the same time. To that end, you might start by dividing by 10000 once and then doing high digits and low digits separately in interwoven loops.

But, using the div instruction makes that difficult because it has pre-targeted registers that it uses, and worse, the instruction does not pipeline.

I think the fastest would be to use the div-less algorithm with vector SIMD instructions to compute multiple digits at the same time.

sketch of an idea

Avoid all division/modulo and instead build up the decimal representation using decimal arithmetic!

A 64-bit value can also be treated as 8 bytes; each byte is a digit (not packed as BCD nibbles, but one per byte). You can add them by adding the 64-bit integers — each byte will add, and there can be no caries outside of the byte because the largest result is 18 and you can hold up to 255.

In particular, doubling such a value means adding it to itself. The hard part, which I’ll explain later, is following that up with carry propagation.

So, start with a 8-digit unpacked decimal set to all zeros; call that Accumulator.

Loop while In != 0 :

Take your normal integer (called In) and mask off the lowest bit, getting a result that is either 0 or 1. Add that to the least significant byte of Accumulator. Note that you have no if here, no branching. It will simply add zero accomplishing nothing, or add one. You can do that add using a 32-bit or any other sized int aliased with the Accumulator, not just a byte, because you know the carry will not happen out of the byte. I mention this because byte arithmetic is slightly slower than 32/64.

Then, shift In right one bit.

Add Accumulator to itself, using the aliasing trick I mentioned before.

Process the carries.

[end of loop]

Convert Accumulator to an ASCII representation by adding 0x30 to each byte. This can be a constant of 64 bits, so you do this in one add.

So, how do you propagate the carries? Use SSE instructions.

Load the Accumulator into an SSE register. Use an instruction which compares each (byte) element against 10 and produces a packed bitmap of the results. There will be 16 bits, indicating which bytes are too large for decimal.

Use that bitmap as a mask to subtract 10 from each enabled byte. You can use the same SSE register loaded with all 10's for both steps.

Shift the bitmap left by one bit. Now use that as a mask to add 1 to every enabled byte.

Now, saving the SSE register back to RAM and loading it again will slow it down. So do the doubling (adding the register to itself) in SSE, also. Adding the new In bit can be done with separate instructions, or rolled into the carry step by setting the least bit (or not) after shifting left. Whichever is the least amount of jumping through hoops or avoiding a branch.

JDługosz JDługosz 11.7k19 silver badges40 bronze badges · Answer 1 · 2018-05-08 21:37:35Z

It’s been a good many years since I wrote significant code in x86 assembly language, but I’ll wade in here.

Yes, div is a fat pig of an instruction. It not only takes a ridiculous number of clock cycles, but it uses many of the dispatch ports, clogging the CPU so it can’t do much else while waiting on the result. See the links in this S.O. answer.

This is because division is difficult. You could look up how to transform division by a constant (10 in this case) into multiplication and code that.

But, successive division/modulo by 10 is now I learned it.

The biggest speed benefit will be from allowing pipelining to work well and not stalling. Sometimes you code two or three different copies of the loop and interweave them so that there is more independent calculations to work on at the same time. To that end, you might start by dividing by 10000 once and then doing high digits and low digits separately in interwoven loops.

But, using the div instruction makes that difficult because it has pre-targeted registers that it uses, and worse, the instruction does not pipeline.

I think the fastest would be to use the div-less algorithm with vector SIMD instructions to compute multiple digits at the same time.

sketch of an idea

Avoid all division/modulo and instead build up the decimal representation using decimal arithmetic!

A 64-bit value can also be treated as 8 bytes; each byte is a digit (not packed as BCD nibbles, but one per byte). You can add them by adding the 64-bit integers — each byte will add, and there can be no caries outside of the byte because the largest result is 18 and you can hold up to 255.

In particular, doubling such a value means adding it to itself. The hard part, which I’ll explain later, is following that up with carry propagation.

So, start with a 8-digit unpacked decimal set to all zeros; call that Accumulator.

Loop while In != 0 :

Take your normal integer (called In) and mask off the lowest bit, getting a result that is either 0 or 1. Add that to the least significant byte of Accumulator. Note that you have no if here, no branching. It will simply add zero accomplishing nothing, or add one. You can do that add using a 32-bit or any other sized int aliased with the Accumulator, not just a byte, because you know the carry will not happen out of the byte. I mention this because byte arithmetic is slightly slower than 32/64.

Then, shift In right one bit.

Add Accumulator to itself, using the aliasing trick I mentioned before.

Process the carries.

[end of loop]

Convert Accumulator to an ASCII representation by adding 0x30 to each byte. This can be a constant of 64 bits, so you do this in one add.

So, how do you propagate the carries? Use SSE instructions.

Load the Accumulator into an SSE register. Use an instruction which compares each (byte) element against 10 and produces a packed bitmap of the results. There will be 16 bits, indicating which bytes are too large for decimal.

Use that bitmap as a mask to subtract 10 from each enabled byte. You can use the same SSE register loaded with all 10's for both steps.

Shift the bitmap left by one bit. Now use that as a mask to add 1 to every enabled byte.

Now, saving the SSE register back to RAM and loading it again will slow it down. So do the doubling (adding the register to itself) in SSE, also. Adding the new In bit can be done with separate instructions, or rolled into the carry step by setting the least bit (or not) after shifting left. Whichever is the least amount of jumping through hoops or avoiding a branch.

Stack Exchange Network

Convert unsigned Integer to ASCII in x86 assembly

1 Answer 1

sketch of an idea

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Convert unsigned Integer to ASCII in x86 assembly

1 Answer 1

sketch of an idea

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions