Print decimal integer

Question 1

This is an x86-64 Linux program to print a non-negative decimal integer. I would appreciate any simple optimizations for size and readability. I am aware that division by a constant is usually done by multiplication by the magic constant reciprocal, but I used div here for simplicity.

; nasm -f elf64 print_dec.asm && ld print_dec.o -o print_dec
global _start
section .text
; print non-negative decimal integer in rdi
print_dec: 
 push rbp ; callee-saved 
 mov rbp, rsp ; save sp
 mov rcx, 10 ; divisor base
 mov rax, rdi ; dividend from arg0
 
L1:
 xor rdx, rdx ; zero upper dividend
 div rcx ; unsigned divide rdx:rax by rcx
 ; rax := quotient, rdx := remainder
 add rdx, '0' ; convert digit to ASCII
 push rdx ; push remainder digit
 cmp rax, 0 
 jne L1 ; do while (rax != 0)
L2:
 mov rax, 1 ; call number for write
 mov rdi, 1 ; write to stdout (fd=1)
 mov rsi, rsp ; use char on stack
 mov rdx, 1 ; write 1 char
 syscall
 
 add rsp, 8 ; "pop" stack 
 cmp rbp, rsp ; do while (stack still has digits)
 jne L2 
 pop rbp
 ret
_start:
 mov rdi, 1234 ; int to print
 call print_dec
 mov eax, 60 ; exit call number
 xor rdi, rdi ; exit code 0
 syscall

Before I wrote the code, I wrote my simple stack-based algorithm (pretty much the only way I think you could write it) in C first and cheated a bit by looking at the godbolt output. But I don't know how to force the compiler to use push and pop as I did because it just does indirect addressing like mov QWORD PTR [rsp+8+rbx*8], rdx. So the "stack" is implemented in C and not the x86 stack. If there is a way to, please let me know.

// print non-negative integer in decimal
#include <stdio.h>
typedef unsigned long long ull;
void print_dec(ull x)
{
 ull stack[20];
 ull sp = 0; // not actual x86 sp
 ull bp = sp;
 ull d;
 do // run at least once to print 0
 {
 d = x % 10; 
 stack[sp++] = d + '0'; // push
 x /= 10;
 } while (x);
 do 
 {
 d = stack[--sp]; // pop
 putc(d, stdout);
 } while (sp != bp);
}
int main() 
{
 print_dec(1234);
}

Question 2

That stack on the stack is (ironically?) an unusual technique, but OK.

Various 64-bit instructions can be replaced with an equivalent 32-bit instruction, which typically makes them smaller.

mov rcx, 10 can be mov ecx, 10
xor rdx, rdx can be xor edx, edx
In this context, add rdx, '0' can be add edx, '0', since rdx always has a small value here
mov rax, 1 can be mov eax, 1 etc

For mov rcx, 10 and similar instructions, that saves 2 bytes each (the REX prefix, but also using a different opcode that doesn't need a Mod/RM byte). The other instructions just lose their REX prefix and become 1 byte smaller.

Comparison to zero can be simplified:

cmp rax, 0 
jne L1

Can be:

test rax, rax 
jne L1

That only saves one byte.

Before the syscall that writes the byte, there are 3 separate loads of the value 1, two of them can be replaced with a mov between registers, making them 3 bytes smaller each. These days that probably doesn't cost more time either, although that wouldn't have been relevant relative to the huge cost of a syscall anyway.

Question 3

Thanks. I'm curious why the compiler will use a local array on the stack but not push/pop.

Question 4

@qwr compilers have never done that kind of locally-unbalanced stack manipulation[citation needed] but it makes even less sense now since with the SysV AMD64 ABI it is almost always an ABI violation to do it (and it definitely requires a frame pointer which is typically avoided). You can get away with it here, probably.

Question 5

I would appreciate any simple optimizations for size and readability.

Memory reduction

Rather than ull stack[20];, only a byte array is needed.

Avoid UB

The magic number 20 is insufficient when ull/unsigned long long is more than 64-bits. C only requires that type to be at least 64-bit.

An alternative is to scale stack[] size by the bit-width of ull.

// Size to encode an unsigned type as a "decimal" string: value_bit_with*log2(10) + 1
#define LOG2_10_N 28
#define LOG2_10_D 93
#define ULL_DECIMAL_TEXT_SIZE (sizeof(ull)*CHAR_BIT*LOG2_10_N/LOG2_10_D + 1)

One output call

Rather than form the digits forwards in stack[], save them starting from the end of stack[] and then print once.

unsigned char stack[ULL_DECIMAL_TEXT_SIZE + 1]; // +1 for a string
unsigned sp = ULL_DECIMAL_TEXT_SIZE;
stack[ULL_DECIMAL_TEXT_SIZE] = 0;
do {
 stack[--sp] = (int)(x % 10) + '0'; // push
 x /= 10;
} while (x);
fputs(stack + sp, stdout);

Question 6

You are right about me assuming long long would only be 64 bit, which I can document. I don't know if the fputs call would work in assembly unless there was a way to push just bytes onto the x86 stack? I like the symmetry of push/pop anyhow.

Question 7

@qwr Rather than document assuming long long would only be 64 bit, use typedef uint64_t ull; instead of unsigned long long (or adjust stack[] as answered).

Question 8

By the way, I figured that doing something like char *sp = stack;, then push as *sp = d; ++sp; was more representative of what a stack would do with rsp. I think this is *(sp++)? Or equivalent with your clever idea of "push-down" stack (up/down is perspective).

Question 9

If sp is actually a pointer to the end of the stack like the name suggests, I can have one syscall write(1, sp, digit_count)

score 3 · Answer 1 · 2023-02-10 00:27:30Z

That stack on the stack is (ironically?) an unusual technique, but OK.

Various 64-bit instructions can be replaced with an equivalent 32-bit instruction, which typically makes them smaller.

mov rcx, 10 can be mov ecx, 10
xor rdx, rdx can be xor edx, edx
In this context, add rdx, '0' can be add edx, '0', since rdx always has a small value here
mov rax, 1 can be mov eax, 1 etc

For mov rcx, 10 and similar instructions, that saves 2 bytes each (the REX prefix, but also using a different opcode that doesn't need a Mod/RM byte). The other instructions just lose their REX prefix and become 1 byte smaller.

Comparison to zero can be simplified:

cmp rax, 0 
jne L1

Can be:

test rax, rax 
jne L1

That only saves one byte.

Before the syscall that writes the byte, there are 3 separate loads of the value 1, two of them can be replaced with a mov between registers, making them 3 bytes smaller each. These days that probably doesn't cost more time either, although that wouldn't have been relevant relative to the huge cost of a syscall anyway.

Thanks. I'm curious why the compiler will use a local array on the stack but not push/pop.
@qwr compilers have never done that kind of locally-unbalanced stack manipulation[citation needed] but it makes even less sense now since with the SysV AMD64 ABI it is almost always an ABI violation to do it (and it definitely requires a frame pointer which is typically avoided). You can get away with it here, probably.

chux chuxchux 36.4k2 gold badges43 silver badges96 bronze badges · Answer 2 · 2023-02-14 13:40:38Z

I would appreciate any simple optimizations for size and readability.

Memory reduction

Rather than ull stack[20];, only a byte array is needed.

Avoid UB

The magic number 20 is insufficient when ull/unsigned long long is more than 64-bits. C only requires that type to be at least 64-bit.

An alternative is to scale stack[] size by the bit-width of ull.

// Size to encode an unsigned type as a "decimal" string: value_bit_with*log2(10) + 1
#define LOG2_10_N 28
#define LOG2_10_D 93
#define ULL_DECIMAL_TEXT_SIZE (sizeof(ull)*CHAR_BIT*LOG2_10_N/LOG2_10_D + 1)

One output call

Rather than form the digits forwards in stack[], save them starting from the end of stack[] and then print once.

unsigned char stack[ULL_DECIMAL_TEXT_SIZE + 1]; // +1 for a string
unsigned sp = ULL_DECIMAL_TEXT_SIZE;
stack[ULL_DECIMAL_TEXT_SIZE] = 0;
do {
 stack[--sp] = (int)(x % 10) + '0'; // push
 x /= 10;
} while (x);
fputs(stack + sp, stdout);

You are right about me assuming long long would only be 64 bit, which I can document. I don't know if the fputs call would work in assembly unless there was a way to push just bytes onto the x86 stack? I like the symmetry of push/pop anyhow.
@qwr Rather than document assuming long long would only be 64 bit, use typedef uint64_t ull; instead of unsigned long long (or adjust stack[] as answered).
By the way, I figured that doing something like char *sp = stack;, then push as *sp = d; ++sp; was more representative of what a stack would do with rsp. I think this is *(sp++)? Or equivalent with your clever idea of "push-down" stack (up/down is perspective).
If sp is actually a pointer to the end of the stack like the name suggests, I can have one syscall write(1, sp, digit_count)

Stack Exchange Network

Print decimal integer

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Print decimal integer

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions