This is an x86-64 Linux program to print a non-negative decimal integer. I would appreciate any simple optimizations for size and readability. I am aware that division by a constant is usually done by multiplication by the magic constant reciprocal, but I used div
here for simplicity.
; nasm -f elf64 print_dec.asm && ld print_dec.o -o print_dec
global _start
section .text
; print non-negative decimal integer in rdi
print_dec:
push rbp ; callee-saved
mov rbp, rsp ; save sp
mov rcx, 10 ; divisor base
mov rax, rdi ; dividend from arg0
L1:
xor rdx, rdx ; zero upper dividend
div rcx ; unsigned divide rdx:rax by rcx
; rax := quotient, rdx := remainder
add rdx, '0' ; convert digit to ASCII
push rdx ; push remainder digit
cmp rax, 0
jne L1 ; do while (rax != 0)
L2:
mov rax, 1 ; call number for write
mov rdi, 1 ; write to stdout (fd=1)
mov rsi, rsp ; use char on stack
mov rdx, 1 ; write 1 char
syscall
add rsp, 8 ; "pop" stack
cmp rbp, rsp ; do while (stack still has digits)
jne L2
pop rbp
ret
_start:
mov rdi, 1234 ; int to print
call print_dec
mov eax, 60 ; exit call number
xor rdi, rdi ; exit code 0
syscall
Before I wrote the code, I wrote my simple stack-based algorithm (pretty much the only way I think you could write it) in C first and cheated a bit by looking at the godbolt output. But I don't know how to force the compiler to use push
and pop
as I did because it just does indirect addressing like mov QWORD PTR [rsp+8+rbx*8], rdx
. So the "stack" is implemented in C and not the x86 stack. If there is a way to, please let me know.
// print non-negative integer in decimal
#include <stdio.h>
typedef unsigned long long ull;
void print_dec(ull x)
{
ull stack[20];
ull sp = 0; // not actual x86 sp
ull bp = sp;
ull d;
do // run at least once to print 0
{
d = x % 10;
stack[sp++] = d + '0'; // push
x /= 10;
} while (x);
do
{
d = stack[--sp]; // pop
putc(d, stdout);
} while (sp != bp);
}
int main()
{
print_dec(1234);
}
2 Answers 2
That stack on the stack is (ironically?) an unusual technique, but OK.
Various 64-bit instructions can be replaced with an equivalent 32-bit instruction, which typically makes them smaller.
mov rcx, 10
can bemov ecx, 10
xor rdx, rdx
can bexor edx, edx
- In this context,
add rdx, '0'
can beadd edx, '0'
, sincerdx
always has a small value here mov rax, 1
can bemov eax, 1
etc
For mov rcx, 10
and similar instructions, that saves 2 bytes each (the REX prefix, but also using a different opcode that doesn't need a Mod/RM byte). The other instructions just lose their REX prefix and become 1 byte smaller.
Comparison to zero can be simplified:
cmp rax, 0 jne L1
Can be:
test rax, rax
jne L1
That only saves one byte.
Before the syscall that writes the byte, there are 3 separate loads of the value 1, two of them can be replaced with a mov between registers, making them 3 bytes smaller each. These days that probably doesn't cost more time either, although that wouldn't have been relevant relative to the huge cost of a syscall anyway.
-
\$\begingroup\$ Thanks. I'm curious why the compiler will use a local array on the stack but not push/pop. \$\endgroup\$qwr– qwr2023年02月10日 21:03:42 +00:00Commented Feb 10, 2023 at 21:03
-
\$\begingroup\$ @qwr compilers have never done that kind of locally-unbalanced stack manipulation[citation needed] but it makes even less sense now since with the SysV AMD64 ABI it is almost always an ABI violation to do it (and it definitely requires a frame pointer which is typically avoided). You can get away with it here, probably. \$\endgroup\$user555045– user5550452023年02月10日 21:24:19 +00:00Commented Feb 10, 2023 at 21:24
I would appreciate any simple optimizations for size and readability.
Memory reduction
Rather than ull stack[20];
, only a byte array is needed.
Avoid UB
The magic number 20 is insufficient when ull/unsigned long long
is more than 64-bits. C only requires that type to be at least 64-bit.
An alternative is to scale stack[]
size by the bit-width of ull
.
// Size to encode an unsigned type as a "decimal" string: value_bit_with*log2(10) + 1
#define LOG2_10_N 28
#define LOG2_10_D 93
#define ULL_DECIMAL_TEXT_SIZE (sizeof(ull)*CHAR_BIT*LOG2_10_N/LOG2_10_D + 1)
One output call
Rather than form the digits forwards in stack[]
, save them starting from the end of stack[]
and then print once.
unsigned char stack[ULL_DECIMAL_TEXT_SIZE + 1]; // +1 for a string
unsigned sp = ULL_DECIMAL_TEXT_SIZE;
stack[ULL_DECIMAL_TEXT_SIZE] = 0;
do {
stack[--sp] = (int)(x % 10) + '0'; // push
x /= 10;
} while (x);
fputs(stack + sp, stdout);
-
\$\begingroup\$ You are right about me assuming long long would only be 64 bit, which I can document. I don't know if the fputs call would work in assembly unless there was a way to push just bytes onto the x86 stack? I like the symmetry of push/pop anyhow. \$\endgroup\$qwr– qwr2023年02月14日 19:55:04 +00:00Commented Feb 14, 2023 at 19:55
-
\$\begingroup\$ @qwr Rather than document assuming
long long
would only be 64 bit, usetypedef uint64_t ull;
instead ofunsigned long long
(or adjuststack[]
as answered). \$\endgroup\$chux– chux2023年02月14日 20:40:33 +00:00Commented Feb 14, 2023 at 20:40 -
\$\begingroup\$ By the way, I figured that doing something like
char *sp = stack;
, then push as*sp = d; ++sp;
was more representative of what a stack would do with rsp. I think this is*(sp++)
? Or equivalent with your clever idea of "push-down" stack (up/down is perspective). \$\endgroup\$qwr– qwr2024年09月10日 03:31:29 +00:00Commented Sep 10, 2024 at 3:31 -
\$\begingroup\$ If sp is actually a pointer to the end of the stack like the name suggests, I can have one
syscall write(1, sp, digit_count)
\$\endgroup\$qwr– qwr2024年09月10日 03:47:24 +00:00Commented Sep 10, 2024 at 3:47