Functions to simplify printing strings and numbers in amd64 Assembly

Question 1

I'm new to Assembly, and this is my very first "project" in Assembly. I wanted to store data (numbers) on the stack, then access and display them. Eventually, this "experiment" of mine grew to become a tiny library of 3 functions.

There are 3 functions:-

push_ASCII: For ASCII strings
push_int32_as_ASCII: For int32 values
clean_stack: To clean stack after the previous two functions

Brief Explanation

The objective of the library is to simplify printing data. Two functions push data onto the stack. All the pushed data can be printed using a single sys_write call. Finally, the third function cleans up the stack.

The two push functions pack data and start populating the stack from where the RSP is pointing till it runs out of bytes to populate.

The RSP decrements by 8 bytes (on a 64-bit architecture), and a total of ceil(total_length_of_string/8) push operations are performed. Thus, it's possible for the RSP to point at memory filled with 0's (null characters) after return from the first two functions.
The RBX register is used for this reason to offset the RSP while accessing the string on the stack.

If, during a call to either of the two push functions, there's an existing string on the stack, the functions automatically pack data onto the stack to avoid any null character in the middle of the string on the stack.

After push_ASCII or push_int32_as_ASCII has been called, the stack is ready and a sys_write function can be performed using [RSP + RBX] for the address of the string, with R8 for the number of bytes to print. Once a sys_write has been performed, the pushed string can be popped using clean_stack. This function pops as many bytes as indicated by R8. Again since the RSP only moves by 8 bytes (64-bit architecture), it performs a total of ceil(R8/8) pop operations.

The Code

section .text
; =================================================================================
; PUSH_ASCII
; ---------------------------------------------------------------------------------
; Function to push existing string (ASCII) to the stack
;
; Input:-
; > RSI holding address to string to be pushed
; > RDX holding length of string
; > R8 holding the length of the string on the stack so far
; Output:-
; > (RSP + RBX) points to the beginning of the string on the stack
; > R8 holds the length of the string on the stack so far
;
; (Uses RAX, RBX, RCX, R9, and R10 registers internally)
; ---------------------------------------------------------------------------------
;
push_ASCII:
 pop r9 ; Since the stack is altered, the return address is saved
 
 lea rbx, [rsi + rdx - 1] ; Load address of the last char into RBX
 mov rcx, r8 ; Load existing length of string
 add r8, rdx ; Update value of R8 with current string length
 and rcx, 7 ; Perform modulo operation on RCX with 8
 test rcx, rcx 
 jz .b0 ; Jump to .b0 if RCX is a multiple of 8
 pop rax ; Pop previous stack push as it contains null characters
 neg rcx 
 add rcx, 8 ; Find number of null characters
 mov r10, rcx 
; Loop to right shift RAX to eliminate all null characters
.l0:
 shr rax, 8
 dec rcx
 test rcx, rcx
 jnz .l0
; end loop
 mov rcx, r10
 jmp .l1
 
.b0:
 xor rax, rax
 mov rcx, 8
; Loop to get each character of string backwards and put in RAX; then push to stack
.l1:
 shl rax, 8
 mov r10b, [rbx]
 add al, r10b
 dec rcx
 test rcx, rcx
 jnz .b1
 push rax
 xor rax, rax
 mov rcx, 8
.b1:
 dec rbx
 mov r10, rsi
 dec r10
 cmp rbx, r10
 jnz .l1
; end loop
; Calculate value for RBX; Check if RAX still has to be pushed
 and rcx, 7
 mov rbx, rcx
 test rcx, rcx
 jz .b2
; Loop to "left-adjust" RAX
.l2:
 shl rax, 8
 dec rcx
 test rcx, rcx
 jnz .l2
; end loop
; If RAX is not empty, push to stack
.b2:
 test rax, rax
 jz .b3
 
 push rax
.b3:
 jmp r9 ; Jump to previously stored return address
;
; =================================================================================
; =================================================================================
; PUSH_INT32_AS_ASCII
; ---------------------------------------------------------------------------------
; Function to push a 32-bit integer to the stack as an ASCII string
;
; Input:- 
; > RDI holding the number
; > R8 holding the length of the string on the stack so far
; Output:-
; > (RSP + RBX) points to the beginning of the string on the stack
; > R8 holds the length of the string on the stack so far
;
; (Uses RAX, RBX, RCX, RDI, R9, R10, and R11 internally)
; ---------------------------------------------------------------------------------
;
push_int32_as_ASCII: 
 pop r9 ; Store return address
 mov eax, edi ; Copy received 32-bit number
 mov ebx, 0xCCCCCCCD ; Agner Fog's magic number
 mov rcx, r8
 and rcx, 7 ; Perform modulo operation on RCX with 8
 test rcx, rcx
 jz .b0 ; Jump to .b0 if RCX is divisible by 8
 pop r10 ; Pop previous stack push as it contains null characters
 neg rcx
 add rcx, 8 ; Calculate number of null character
 mov r11, rcx
; Loop to right shift R10 to eliminate all null characters
.l0:
 shr r10, 8
 dec rcx
 test rcx, rcx
 jnz .l0
; end loop
 mov rcx, r11
 jmp .l1
.b0:
 xor r10, r10
 mov rcx, 8
; If number is negative, negate it
.b1:
 mov r11, rdi
 test r11, r11
 jns .l1
 neg eax
; Loop to convert decimal number to ASCII; push to stack
.l1:
 shl r10, 8
 mov edi, eax ; save original number
 mul ebx ; divide by 10 using agner fog's 'magic number'
 shr edx, 3 ;
 mov eax, edx ; store quotient for next loop
 lea edx, [edx*4 + edx] ; multiply by 10
 lea edx, [edx*2 - '0'] ; finish *10 and convert to ascii
 sub edi, edx ; subtract from original number to get remainder
 inc r8 ; Update R8
 lea r10, [r10 + rdi] ; Store current digit (in ASCII)
 dec rcx
 test rcx, rcx
 jnz .b2
 push r10
 xor r10, r10
 mov rcx, 8
.b2:
 test eax, eax
 jnz .l1
; end loop
; If given number was negative, add '-' sign
 test r11, r11
 jns .b3
 shl r10, 8
 lea r10, [r10 + '-']
 dec rcx
 inc r8
; Calculate value for RBX; Check if R10 still has to be pushed
.b3:
 and rcx, 7
 mov rbx, rcx
 test rcx, rcx
 jz .b4
; Loop to "left-adjust" R10
.l2:
 shl r10, 8
 dec rcx
 test rcx, rcx
 jnz .l2
; end loop 
; Push R10 if not empty
.b4:
 test r10, r10
 jz .b5
 push r10
.b5:
 jmp r9 ; Return to previously stored return address
;
; =================================================================================
; =================================================================================
; CLEAN_STACK
; ---------------------------------------------------------------------------------
; Function to "clean" the stack after push_ASCII or push_int32_as_ASCII calls
; Input:-
; > R8 holding the length of string pushed to the stack so far
; Output:-
; Nil
;
; (Uses RAX, RCX, and R9 internally)
; ---------------------------------------------------------------------------------
;
clean_stack:
 pop r9 ; Store return address
 test r8, r8 ; Check if R8 is 0 for early exit
 jz .b0
; Calculate number of pop operations ( ceil(R8/8) )
 mov rcx, r8
 shr rcx, 3 ; Divide RCX (holds same value as R8) by 8
 mov rax, r8
 and rax, 7
 test rax, rax
 jz .l0
 inc rcx
; Loop to pop stack 
.l0:
 pop rax
 dec rcx
 test rcx, rcx
 jnz .l0
; end loop
.b0:
 jmp r9 ; Return to previously-stored return address
;
; =================================================================================

The code works fine. I just want to get some expert opinions on the implementation or standard practices. But most importantly, if I'm doing something that's absolutely looked down upon in assembly coding (:p).

Link to the GitHub repo: https://github.com/ghost-1608/Assembly-Print-Header

Answer 1 · user555045 · 2023-08-07 22:37:57Z

Most arithmetic operations affect the flags

So for example after and rcx, 7, you don't need a test rcx, rcx to check whether rcx became zero, the and already set or reset the zero flag according to the result. By the way you may as well use and ecx, 7, saving 1 byte, which is not very significant but .. you may as well.

dec also sets the zero flag according to its result (but leaves the carry flag untouched), so you don't need a test after it either. dec rcx may be slightly slower than sub rcx, 1 on some CPUs. For some time that quirk only affected irrelevant CPUs, but unfortunately Intel reused some of those microarchitectures for their E-cores. sub rcx, 1, of course, also sets the zero flag according to its result.
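As a minimal sketch of the point above (a made-up standalone program for NASM on Linux, not taken from the question), the loop below branches directly on the flags left by sub and and, with no separate test. The label names, the sample value and the build command are assumptions for illustration only.

```nasm
; Hypothetical standalone example (NASM, Linux x86-64).
; Assumed build: nasm -f elf64 flags.asm && ld flags.o -o flags
section .text
global _start
_start:
    mov     rax, 0x1122334455667788
    mov     ecx, 3                  ; shift out the three lowest bytes
.shift_byte:
    shr     rax, 8
    sub     ecx, 1                  ; SUB sets ZF when ECX reaches zero ...
    jnz     .shift_byte             ; ... so the branch needs no TEST
                                    ; RAX = 0x0000001122334455 here
    and     eax, 7                  ; AND sets ZF from its result as well
    jz      .exit                   ; taken only if the low 3 bits were all zero
    nop                             ; placeholder for work in the non-zero case
.exit:
    mov     edi, eax                ; exit status = low 3 bits (5 for this value)
    mov     eax, 60                 ; sys_exit
    syscall
```

Running it and checking the exit status (for example with echo $? in a shell) should show 5, since 0x55 & 7 = 5.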

Various 64-bit instructions can use the 32-bit variant

Since writes to 32-bit registers zero-extend to the corresponding 64-bit registers, often times you can use them and save a little bit of space. For example, xor rax, rax wastes a byte compared to xor eax, eax while being otherwise identical. mov rcx, 8 wastes two bytes compared to mov ecx, 8.

Saving a couple of bytes of code is not super important, but probably better on average, slightly reducing code fetch and potentially decoding time, depending on how dense the code is. There are some cases where you're better off wasting some space too though, it's not as simple as "smaller is always better".
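A small sketch of the zero-extension rule, written as a hypothetical standalone NASM program for Linux (the register choices and values are made up for the example). It also shows the contrast with 8-bit writes, which leave the rest of the register alone:

```nasm
; Hypothetical standalone example (NASM, Linux x86-64).
section .text
global _start
_start:
    mov     rax, -1                 ; RAX = 0xFFFFFFFFFFFFFFFF
    xor     eax, eax                ; 32-bit write: all 64 bits of RAX are now 0
    mov     rcx, -1
    mov     ecx, 8                  ; 32-bit write: RCX = 8, upper half zeroed
    mov     rdx, -1
    mov     dl, 8                   ; 8-bit write: RDX = 0xFFFFFFFFFFFFFF08
    lea     edi, [rax + rcx]        ; 0 + 8
    mov     eax, 60                 ; sys_exit
    syscall                         ; exit status 8
```

The exit status is 8 only because both 32-bit writes cleared the upper halves; had RAX or RCX kept their old upper bits, the sum would differ.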

ceil(R8/8)

If you add 7 before dividing by 8:

lea rcx, [r8 + 7]
shr rcx, 3

.. then you don't need to do a conditional increment if the bottom 3 bits of r8 were not zero, effectively adding 7 has done that.

There is an edge case if r8 can be very close to the maximum unsigned 64-bit integer, but for a string length that's not a reasonable case.

Popping the return address and `jmp`-ing to it

This seems to be a central idea of this code. It's a cute trick, but unfortunately not a good one, sorry. The problem is that it defeats return address prediction, which is what keeps calling and returning from functions from being as slow as arbitrary (often hard to predict) indirect branches. Popping the return address and jumping to it has two effects. Locally, it turns a fast ret into a usually slower jump to what is, as far as the CPU is concerned, some arbitrary unknown address (unless you repeatedly return to the same address). It also leaves the return address on the return address predictor stack, so the next ret is predicted to jump there, which is wrong, and the mispredictions continue if multiple returns are done in sequence.

This kind of trick also wouldn't work with stack unwinding for exception handling, but you're not supplying stack unwinding information anyway.

Thanks a lot! Yes, I realised popping the return address and jmp-ing isn't the best idea. In fact, I used it only because I didn't find any other solution. How would you suggest structuring the library since I'm making changes to the stack at a global level?
@ghost A classic solution is to require the caller to allocate the stack space, which the library function then writes data into. The caller has to decide on the size of the allocation; a typical approach is to pick some reasonable limit, e.g. 11 bytes (may as well round it up to 16) are sufficient to hold the result of converting a 32-bit integer to decimal.
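The caller-allocates approach described in this comment could look roughly like the following. This is only a sketch under stated assumptions: u32_to_dec and BUF_SIZE are names invented here, the routine handles unsigned values only, and it uses a plain div rather than the question's multiply-by-magic-number division.

```nasm
; Hypothetical sketch (NASM, Linux x86-64); u32_to_dec and BUF_SIZE are
; names invented for this example.
BUF_SIZE equ 16

section .text
global _start

; IN (edi = value, rsi = buffer of BUF_SIZE bytes)
; OUT (rsi = first digit, rdx = digit count) MOD (rax, rcx, r8)
u32_to_dec:
    mov     eax, edi
    mov     ecx, 10
    lea     r8, [rsi + BUF_SIZE]    ; one past the end of the buffer
    mov     rsi, r8                 ; digits are stored back to front
.next_digit:
    xor     edx, edx
    div     ecx                     ; EAX = quotient, EDX = remainder
    add     edx, '0'
    dec     rsi
    mov     [rsi], dl
    test    eax, eax
    jnz     .next_digit
    mov     rdx, r8
    sub     rdx, rsi                ; length = end - first digit
    ret                             ; plain ret, return prediction stays intact

_start:
    sub     rsp, BUF_SIZE           ; the caller reserves the space ...
    mov     edi, 4294967295
    mov     rsi, rsp
    call    u32_to_dec
    mov     eax, 1                  ; sys_write
    mov     edi, 1                  ; stdout
    syscall                         ; prints 4294967295 (no newline)
    add     rsp, BUF_SIZE           ; ... and the caller releases it
    mov     eax, 60                 ; sys_exit
    xor     edi, edi
    syscall
```

Because the callee never moves RSP past its own return address, it can end with an ordinary ret, and the caller decides when the buffer is released.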

Answer 2 · Sep Roland · 2023-08-08 21:38:09Z

If, during a call to either of the two push functions, there's an existing string on the stack, the functions automatically pack data onto the stack to avoid any null character in the middle of the string on the stack.

If it weren't for this ability to concatenate strings on the stack, I would have suggested much simpler code for the push_ASCII and push_int32_as_ASCII routines: aligning the start of the string at [RSP] (not requiring any RBX) and allowing any or some garbage bytes behind the string. A decent simplification in push_int32_as_ASCII is still possible if you would do the number conversion separately and have it followed by a call to push_ASCII that after all already exists for the purpose of pushing a string to the stack...

Although I understand the routines that use the actual push instruction, I don't agree with using a loop of pop instructions in the clean_stack code. Just calculate the number of bytes (multiple of 8) that you need to remove from the stack and adjust RSP with a single addition:

; IN (r8) OUT () MOD (rax,r9)
clean_stack:
 pop r9 ; Remove return address
 lea rax, [r8 + 7] ; Calculate the next higher multiple of 8 (unless it was
 and rax, -8 ; already a multiple of 8, then it stays unmodified)
 add rsp, rax ; Remove the string from the stack
 push r9 ; Restore return address
 ret

I just want to get some expert opinions on the implementation or standard practices. But most importantly, if I'm doing something that's absolutely looked down upon in assembly coding (:p).

You are using many loops, even for things that don't need one (see below).
You forget that you can read/write stack memory just like any other memory. You don't have to use push and pop per se.
Sometimes the code is hard to follow and should get streamlined, eliminating some jumps like this one: jmp .l1 .b0:.
Naming your labels .b0, .b1, etc makes it harder than necessary to understand the program, and especially labels like .l0 and .l1 need to be condemned for their poor readability.
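To illustrate the remark about treating the stack as ordinary memory, here is a minimal sketch (a hypothetical standalone NASM program for Linux, not the question's library) that reserves space with a single sub, fills it with plain mov stores, and releases it with a single add:

```nasm
; Hypothetical standalone example (NASM, Linux x86-64).
section .text
global _start
_start:
    sub     rsp, 16                 ; reserve 16 bytes; no PUSH involved
    mov     byte [rsp],     'H'     ; plain stores relative to RSP
    mov     byte [rsp + 1], 'i'
    mov     byte [rsp + 2], 10      ; newline
    mov     eax, 1                  ; sys_write
    mov     edi, 1                  ; stdout
    mov     rsi, rsp                ; the string starts exactly at RSP
    mov     edx, 3                  ; length in bytes
    syscall
    add     rsp, 16                 ; release the space; no POP loop
    mov     eax, 60                 ; sys_exit
    xor     edi, edi
    syscall
```

Since the string begins at RSP itself, no extra offset register is needed to locate it.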

Review

You don't need an actual loop in order "to right shift RAX to eliminate all null characters". Simply take the number of null bytes and multiply by 8, then use shr rax, cl. You have a very similar loop "to left-adjust RAX" where you can apply the same loop elimination.

 and ecx, 7
 jz .b0
 pop rax ; Pop previous stack push as it contains null characters
 neg ecx 
 add ecx, 8 ; Find number of null characters
 shl ecx, 3 ; From byte to bits
 shr rax, cl ; Right shift RAX to eliminate all null characters
 shr ecx, 3 ; From bits to bytes
 jmp .l1
.b0:
 xor eax, eax
 mov ecx, 8
.l1:
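The left-adjust loop can be replaced in the same way. The sketch below assumes, as in the question's code, that RCX holds the countdown of unused byte slots and RAX holds the packed characters; the standalone program around it and its sample values are made up for illustration.

```nasm
; Hypothetical standalone example (NASM, Linux x86-64).
section .text
global _start
_start:
    mov     eax, 0x636261           ; "abc" packed in the low bytes ('a' lowest)
    mov     ecx, 5                  ; 5 of the 8 byte slots are still unused
    and     ecx, 7                  ; 0..7 unused slots (8 wraps to 0)
    mov     ebx, ecx                ; RBX = offset of the string from RSP
    shl     ecx, 3                  ; from bytes to bits
    shl     rax, cl                 ; left-adjust in one step; a count of 0 is a no-op
    push    rax                     ; memory at RSP: 0,0,0,0,0,'a','b','c'
    mov     eax, 1                  ; sys_write
    mov     edi, 1                  ; stdout
    lea     rsi, [rsp + rbx]        ; start of the string
    mov     edx, 3                  ; length in bytes
    syscall                         ; prints abc
    add     rsp, 8                  ; drop the pushed qword
    mov     eax, 60                 ; sys_exit
    xor     edi, edi
    syscall
```

Because a shift count of zero leaves RAX unchanged, the jz that skipped the original loop is no longer needed either.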

.l1:
 shl rax, 8
 mov r10b, [rbx]
 add al, r10b

Instead of going through the R10B register, you could simply write mov al, [rbx], because the preceding shl rax, 8 has already cleared AL. Of course, if you care about efficiency then write:

.l1:
 shl rax, 8
 movzx r10, byte [rbx]
 or rax, r10

.b1:
 dec rbx
 mov r10, rsi
 dec r10
 cmp rbx, r10
 jnz .l1

Here you could replace the pair mov r10, rsi dec r10 by the single lea r10, [rsi - 1] instruction, but since the intent is to continue the loop for as long as the address in RBX falls within the string at RSI, this code should become:

.b1:
 dec rbx
 cmp rbx, rsi
 jae .l1 ; While RBX is AboveOrEqual to RSI, continue

Thanks for the review! You talked about the choice of labels I used. What would you suggest instead? I agree this kind of labelling is hard to follow (and hard to deal with while editing the code).
@ghost Usually one would suggest choosing meaningful names for labels, e.g. in PUSH_INT32_AS_ASCII you could change .b3: into .IsPositive:. However, once you eliminate many of the branches (requiring fewer labels) and start re-using existing code (making calls), you will notice that many functions that you write will be limited in length, say 25 instructions or so. What I then use are anonymous local labels like .a:, .b:, .c: etc. One trick to make editing a bit easier: I label from the top using .a:, .b:, .c: etc. and I label from the bottom using .z:, .y:, .x: etc.
