Variadic Functions in NASM Win64 Assembly

Question 1

Here is a simple variadic function implementation in NASM assembly, implemented as shown below:

sum.asm

;; bits 64
default rel
extern printf
section .text
 global main
 global sum
main:
 sub rsp, 40
 mov r9d, 3
 mov r8d, 2
 mov edx, 1
 mov ecx, 3
 call sum
 lea rcx, [fmt]
 mov edx, eax
 call printf
 xor eax, eax
 add rsp, 40
 ret
sum:
 sub rsp, 24
 test ecx, ecx
 mov qword [rsp+28H], rdx
 lea rdx, [rsp+28H]
 mov qword [rsp+30H], r8
 mov qword [rsp+38H], r9
 mov qword [rsp+8H], rdx
 jle .end
 lea eax, [rcx-1H]
 lea rcx, [rsp+rax*8+30H]
 xor eax, eax
.recurse:
 add eax, dword [rdx]
 add rdx, 8
 cmp rdx, rcx
 jnz .recurse
 add rsp, 24
 ret
.end:
 xor eax, eax
 add rsp, 24
 ret
%macro str 2
 %2: db %1, 0
%endmacro
section .rdata
 str "%d", fmt

We can change how many variables are being given to the function without having to change the amount of memory being allocated to the function itself.

main:
 sub rsp, 72
 mov dword [rsp+30H], 6
 mov dword [rsp+28H], 5
 mov dword [rsp+20H], 4
 mov r9d, 3
 mov r8d, 2
 mov edx, 1
 mov ecx, 6
 call sum
 lea rcx, [fmt]
 mov edx, eax
 call printf
 xor eax, eax
 add rsp, 72
 ret

There are no comments in the code since what I'm doing should be pretty self-explanatory. Any advice and all topical comments on optimizing the code and its performance, as well as on standard conventions, are appreciated!

Compiled as follows using VS2017 x64 Native Tools Command Prompt:

> nasm -g -f win64 sum.asm
> cl /Zi sum.obj msvcrt.lib legacy_stdio_definitions.lib

Answer

mov qword [rsp+8H], rdx

Why would you store this value at all? You never use the stored value afterwards!

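You can easily dispense with all of the local storage in the sum routine. It is a leaf function, so there is no need for the sub rsp, 24 and add rsp, 24 instructions. Also, why 24 bytes when all you use are the middle 8 bytes? In main the reservation is not waste: the Win64 calling convention asks the caller to provide 32 bytes of shadow space and to keep RSP 16-byte aligned at each call, which is exactly what sub rsp, 40 achieves (32 bytes plus 8 bytes of padding). The alternative main does over-allocate, though: it reserves 72 bytes where 56 (32 bytes of shadow space plus 24 bytes for the three stack arguments) would satisfy both requirements.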

You can avoid the separate .end code by clearing the accumulator very early and jumping to the one RET you really need.

lea eax, [rcx-1H]
lea rcx, [rsp+rax*8+30H]

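You can calculate this upper limit in just one instruction. Absorb the decrement by 1 into the displacement of the LEA, since (rcx - 1)*8 + 30h = rcx*8 + 28h. This relies on the upper half of RCX being zero, which holds here because the caller loads the count with mov ecx, ...: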

lea rcx, [rsp + rcx*8 + 28h]

And since by the time of this calculation the RDX register already points at [rsp + 28h] you could simply write:

lea rcx, [rdx + rcx*8]

Applying these changes (the offsets drop from 28h to 10h because RSP is no longer lowered by 18h):

sum:
 xor eax, eax
 test ecx, ecx
 jle .end
 mov [rsp + 10h], rdx ; 1st arg for sure (ECX=1+)
 mov [rsp + 18h], r8 ; maybe 2nd arg
 mov [rsp + 20h], r9 ; maybe 3rd arg
 lea rdx, [rsp + 10h] ; lower limit
 lea rcx, [rdx + rcx*8] ; upper limit
.more:
 add eax, dword [rdx]
 add rdx, 8
 cmp rdx, rcx
 jb .more
.end:
 ret

When comparing memory addresses, as in your cmp rdx, rcx, it is better to test for 'below' than for 'not equal'. Provided the step is non-zero, the pointer is bound to reach or pass the limit eventually, so a jb loop always terminates. A jnz loop, on the other hand, never ends if the pointer steps over the limit without ever equalling it (which would take an error elsewhere in the program, of course).
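
To make the difference concrete, here is a small sketch of a case where the pointer steps over the limit. The file name, the table label and the idea of summing every other slot are assumptions made up for this illustration; they are not part of your program. It should build the same way as sum.asm.

```nasm
; jb_vs_jnz.asm - hypothetical demo: sum every other slot of a 3-slot table
default rel
extern printf

section .text
global main

main:
    sub     rsp, 40                 ; 32 bytes shadow space + 8 to align RSP
    lea     rdx, [table]            ; lower limit
    lea     rcx, [rdx + 3*8]        ; upper limit, one past the last slot
    xor     eax, eax
.more:
    add     eax, dword [rdx]
    add     rdx, 16                 ; two slots per step: +0, +16, +32
    cmp     rdx, rcx
    jb      .more                   ; +32 is not below +24, so the loop ends
    ; jnz   .more                   ; +32 never equals +24: would run off the table
    lea     rcx, [fmt]
    mov     edx, eax
    call    printf                  ; prints 4 (slots 0 and 2: 1 + 3)
    xor     eax, eax
    add     rsp, 40
    ret

section .rdata
table: dq 1, 2, 3
fmt:   db "%d", 0
```

With jb the loop reads slots 0 and 2 and stops. With the commented-out jnz it would keep reading past the end of the table, because the pointer goes from +16 straight to +32 and never equals +24.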

An even simpler solution uses the ECX register as a counter (like it was on input):

sum:
 xor eax, eax
 test ecx, ecx
 jle .end
 mov [rsp + 10h], rdx ; 1st arg for sure (ECX=1+)
 mov [rsp + 18h], r8 ; maybe 2nd arg
 mov [rsp + 20h], r9 ; maybe 3rd arg
 lea rdx, [rsp + 10h] ; lower limit
.more:
 add eax, dword [rdx]
 add rdx, 8
 sub ecx, 1
 jnz .more
.end:
 ret

"We can change how many variables are being given to the function without having to change the amount of memory being allocated to the function itself."

True, but the caller still has to reserve those 24 bytes that correspond to the 3 arguments passed via registers EDX, R8D, and R9D (they are part of the 32-byte shadow space that the Win64 calling convention asks every caller to provide).
I think it would be simpler to either pass all values on the stack or pass a pointer (in RDX) to a buffer holding all of those values. In both cases the count remains in ECX.
The next code uses the pointer variant and combines it all:

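 sub rsp, 48 ; buffer for six 8-byte slots (a full Win64 caller would add shadow space and alignment padding)
 mov rdx, rsp ; RDX -> buffer
 mov dword [rdx + 40], 6
 mov dword [rdx + 32], 5
 mov dword [rdx + 24], 4
 mov dword [rdx + 16], 3
 mov dword [rdx + 8], 2
 mov dword [rdx], 1
 mov ecx, 6 ; ECX = number of values
 call sum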
 ...
sum:
 xor eax, eax
 test ecx, ecx
 jle .end
.more:
 add eax, dword [rdx]
 add rdx, 8
 sub ecx, 1
 jnz .more
.end:
 ret
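
For reference, here is how the fragment above might look as a complete program. This is only a sketch under a few assumptions of mine: the file name is invented, the buffer is placed above 32 bytes of shadow space, and 8 bytes of padding are added so that RSP stays 16-byte aligned for the printf call. It should build with the same nasm and cl commands as sum.asm.

```nasm
; sum_ptr.asm - sketch: count in ECX, pointer to 8-byte slots in RDX
default rel
extern printf

section .text
global main

main:
    sub     rsp, 88                 ; 32 shadow + 48 buffer + 8 alignment padding
    lea     rdx, [rsp + 32]         ; buffer sits just above the shadow space
    mov     dword [rdx], 1
    mov     dword [rdx + 8], 2
    mov     dword [rdx + 16], 3
    mov     dword [rdx + 24], 4
    mov     dword [rdx + 32], 5
    mov     dword [rdx + 40], 6
    mov     ecx, 6                  ; number of values
    call    sum
    lea     rcx, [fmt]
    mov     edx, eax
    call    printf                  ; prints 21
    xor     eax, eax
    add     rsp, 88
    ret

sum:                                ; ECX = count, RDX -> values, result in EAX
    xor     eax, eax
    test    ecx, ecx
    jle     .end
.more:
    add     eax, dword [rdx]
    add     rdx, 8
    sub     ecx, 1
    jnz     .more
.end:
    ret

section .rdata
fmt: db "%d", 0
```

Keeping the buffer above the shadow space means printf is free to use its 32 bytes at [rsp] without touching the values, although by then sum has already consumed them.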

Sep Roland · Accepted Answer · 2018-10-21 14:47:05Z

