Is there a better way of implementing this, other than using SIMD instructions?
What is the best way of dealing with arrays whose length is not divisible by 8? In the code below, if there are fewer than 8 bytes left to zero, they just get zeroed one by one.
Maybe it is faster to check how many bytes are left and then zero them 2 or 4 bytes at a time?
Or does the cost of that check outweigh the cost of zeroing them one by one?
This is just a test for me to try to learn assembly so any, even small, improvements and tips are greatly appreciated.
Thank you
.code
ZeroArray proc
cmp edx, 0
jle Finished ; Check if count is 0
cmp edx, 8
jl SetupLessThan8Bytes ; Check if counter is less than 8
mov r8d, edx ; Storing the original count
shr edx, 3 ; Bit shifts the counter to the right by 3 (equal to dividing by 8), works because 2^3 is equal to 8
mov r9d, edx ; Stores the divided count to be able to check how many single byte zeros the program has to do
MainLoop:
mov qword ptr [rcx], 0 ; Set the next 8 bytes (qword) to 0
add rcx, 8 ; Move pointer along the array by 8 bytes
dec edx ; Decrement the counter
jnz MainLoop ; If counter is not equal to 0 jump to MainLoop
shl r9d, 3 ; Bit shifts the stored divided counter to the left by 3 (equal to multiplying by 8), 2^3 again
sub r8d, r9d ; Subtracts the counts from each other; if the result is zero all bytes are zeroed, otherwise r8d holds the number of bytes left
je Finished
SetFinalBytesLoop:
mov byte ptr [rcx], 0 ; Sets the last byte of the array to 0
inc rcx
dec r8d
jnz SetFinalBytesLoop
Finished:
ret
SetupLessThan8Bytes:
mov r8d, edx ; Mov the value of edx into r8d so the same code can be used in SetFinalBytesLoop
jmp SetFinalBytesLoop
ZeroArray endp
end
1 Answer
Shave off a byte
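cmp edx, 0
jle Finished ; Check if count is 0
Using cmp is certainly not wrong, but the shorter way to check the counter is the test instruction. test edx, edx sets the sign and zero flags exactly as cmp edx, 0 does (and clears the overflow flag), so the jle that follows behaves the same, while the encoding is 2 bytes instead of 3.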
test edx, edx
jle Finished ; Check if count is 0
Bypassing when the counter is zero is fine, but perhaps a negative counter value should rather be considered an error and handled accordingly?
Don't lose yourself in jumping around
cmp edx, 8
jl SetupLessThan8Bytes ; Check if counter is less than 8
mov r8d, edx ; Storing the original count
...
SetupLessThan8Bytes:
mov r8d, edx
jmp SetFinalBytesLoop
When the counter in EDX is smaller than 8, you jump to SetupLessThan8Bytes, where you just make a convenient copy of the counter and then jump again to SetFinalBytesLoop.
If you move the instruction that copies the original counter to right before the comparison with 8, you save yourself 3 lines of code (a label, a mov, and a jmp). Moreover, the program becomes clearer.
mov r8d, edx ; Storing the original count
cmp edx, 8
jl SetFinalBytesLoop ; Check if counter is less than 8
You don't even have to compare to 8 at all!
When you shift the counter in EDX 3 times to the right to find out how many qwords you have to process, you can look at the zero flag. If ZF is set (meaning no qwords at all), you instantly know that the counter is in the range [1,7], and so the above snippet becomes:
mov r8d, edx ; Storing the original count
shr edx, 3 ; Equal to dividing by 8
jz SetFinalBytesLoop ; Jump if counter is less than 8
Easier calculation of leftovers
mov r9d, edx
...
shl r9d, 3
sub r8d, r9d
je Finished
SetFinalBytesLoop:
The way you find the number of leftover bytes is correct but needlessly involved. All it takes is ANDing the original counter with 7 to extract the lowest 3 bits. This is simpler, shorter, and uses one register less, which will always come in handy in future programs:
and r8d, 7
jz Finished
SetFinalBytesLoop:
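The two shortcuts above (testing ZF after the shift, and ANDing with 7) can be sanity-checked outside the assembler. The following is only a C model of the arithmetic, not part of the routine; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of full qwords in a byte count (models shr edx, 3). */
static uint32_t qword_count(uint32_t n) { return n >> 3; }

/* Hypothetical helper: leftover bytes after the qwords (models and r8d, 7). */
static uint32_t tail_bytes(uint32_t n) { return n & 7u; }

/* The original route: count - ((count >> 3) << 3), i.e. shl r9d, 3 / sub r8d, r9d. */
static uint32_t tail_bytes_by_subtraction(uint32_t n) { return n - ((n >> 3) << 3); }

/* Asserts, for every count below limit, that both remainder routes agree
   and that the shifted count is zero exactly when the count is below 8
   (the condition that jz tests right after the shift). */
static void check_counts_below(uint32_t limit) {
    for (uint32_t n = 0; n < limit; ++n) {
        assert(tail_bytes(n) == tail_bytes_by_subtraction(n));
        assert((qword_count(n) == 0) == (n < 8));
    }
}
```

For example, a count of 13 splits into 1 qword plus 5 leftover bytes by either route.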
Smaller instructions are generally better
With the 32-bit immediate value, the mov instruction in the MainLoop is quite long (7 bytes). You can store the zero in RAX and move that to memory (3 bytes). This also eliminates the need to write "qword ptr", since the register operand already fixes the size:
xor rax, rax ; RAX = 0 (xor eax, eax has the same effect and is one byte shorter)
MainLoop:
mov [rcx], rax ; Set the next 8 bytes (qword) to 0
add rcx, 8 ; Move pointer along the array by 8 bytes
dec edx ; Decrement the counter
jnz MainLoop ; If counter is not equal to 0 jump to MainLoop
Your program with all the above applied
xor rax, rax
test edx, edx
jle Finished ; Check if count is LE 0
mov r8d, edx ; Copy of the original count
shr edx, 3 ; Gives number of qwords
jz SetFinalBytesLoop ; Jump if counter is less than 8
MainLoop:
mov [rcx], rax ; RAX=0 Set the next 8 bytes (qword) to 0
add rcx, 8 ; Step per 8 bytes
dec edx ; Dec the counter
jnz MainLoop
and r8d, 7 ; Remainder from division by 8
jz Finished
SetFinalBytesLoop:
mov [rcx], al ; AL=0 Sets the last bytes of the array to 0
inc rcx ; Step per 1 byte
dec r8d ; Dec counter
jnz SetFinalBytesLoop
Finished:
ret
I've moved the xor rax, rax higher up in the code so that SetFinalBytesLoop can benefit from using the register AL instead of the immediate 0.
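To check the control flow of the listing above without an assembler, it can be mirrored in C. This is only a sketch: zero_array_model and zeroes_exactly are hypothetical names, memcpy of an 8-byte zero stands in for the qword store, and the guard pattern 0xAA is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of ZeroArray(rcx = buf, edx = count). */
static void zero_array_model(unsigned char *buf, int32_t count) {
    const uint64_t zero = 0;            /* xor rax, rax */
    if (count <= 0) return;             /* test edx, edx / jle Finished */
    uint32_t tail = (uint32_t)count;    /* mov r8d, edx */
    uint32_t qwords = tail >> 3;        /* shr edx, 3 */
    if (qwords != 0) {                  /* jz SetFinalBytesLoop skips this part */
        do {
            memcpy(buf, &zero, 8);      /* mov [rcx], rax */
            buf += 8;                   /* add rcx, 8 */
        } while (--qwords != 0);        /* dec edx / jnz MainLoop */
        tail &= 7u;                     /* and r8d, 7 */
        if (tail == 0) return;          /* jz Finished */
    }
    do {                                /* SetFinalBytesLoop */
        *buf++ = 0;                     /* mov [rcx], al / inc rcx */
    } while (--tail != 0);              /* dec r8d / jnz SetFinalBytesLoop */
}

/* Returns 1 if exactly the bytes [8, 8 + count) of a 0xAA-filled
   64-byte area were zeroed and every other byte was left alone. */
static int zeroes_exactly(int32_t count) {
    unsigned char area[64];
    assert(count <= 48);
    memset(area, 0xAA, sizeof area);
    zero_array_model(area + 8, count);
    for (int i = 0; i < 64; ++i) {
        int inside = (i >= 8 && i < 8 + count);
        if (area[i] != (inside ? 0x00 : 0xAA)) return 0;
    }
    return 1;
}
```

Running zeroes_exactly over counts such as 0, 1, 7, 8, 9 and 16 exercises every branch: the early exit, the tail-only path, the qword-only path, and the mixed path.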
The optimization
The most important optimization that you can apply to your program is making sure that the qword value that you write is aligned on a qword boundary, that is, at a memory address that is divisible by 8.
The extra alignment loop will iterate at most 7 times.
xor rax, rax
test edx, edx
jle Finished ; Check if count is LE 0
jmp TestAligned
AlignLoop:
mov [rcx], al
inc rcx
dec edx
jz Finished
TestAligned:
test rcx, 7 ; Is this a qword aligned address?
jnz AlignLoop ; Not yet!
mov r8d, edx ; Copy of the (reduced) original count
shr edx, 3 ; Gives number of qwords
jz SetFinalBytesLoop ; Jump if counter is less than 8
MainLoop:
mov [rcx], rax ; RAX=0 Set the next 8 bytes (qword) to 0
add rcx, 8 ; Step per 8 bytes
dec edx ; Dec the counter
jnz MainLoop
and r8d, 7 ; Remainder from division by 8
jz Finished
SetFinalBytesLoop:
mov [rcx], al ; AL=0 Sets the last bytes of the array to 0
inc rcx ; Step per 1 byte
dec r8d ; Dec counter
jnz SetFinalBytesLoop
Finished:
ret
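The aligned variant can be modelled in C as well, which makes it easy to try every starting alignment. Again this is a sketch under assumptions: the function names are invented for illustration, the loops are written in structured form rather than as a literal translation, and memcpy stands in for the aligned qword store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: single-byte stores needed before addr is qword aligned (0..7). */
static uint32_t prologue_bytes(uintptr_t addr) {
    return (uint32_t)((8u - (addr & 7u)) & 7u);
}

/* Hypothetical C model of the aligned ZeroArray(rcx = buf, edx = count). */
static void zero_array_aligned_model(unsigned char *buf, int32_t count) {
    const uint64_t zero = 0;
    if (count <= 0) return;                 /* test edx, edx / jle Finished */
    uint32_t left = (uint32_t)count;
    while (((uintptr_t)buf & 7u) != 0) {    /* test rcx, 7 / jnz AlignLoop */
        *buf++ = 0;                         /* mov [rcx], al / inc rcx */
        if (--left == 0) return;            /* dec edx / jz Finished */
    }
    uint32_t qwords = left >> 3;            /* shr edx, 3 */
    uint32_t tail = left & 7u;              /* and r8d, 7 */
    while (qwords-- != 0) {                 /* MainLoop */
        memcpy(buf, &zero, 8);
        buf += 8;
    }
    while (tail-- != 0) *buf++ = 0;         /* SetFinalBytesLoop */
}

/* Returns 1 if exactly count bytes starting at area + 16 + offset were
   zeroed in a 0xAA-filled 96-byte area, and nothing else was touched. */
static int aligned_model_ok(int offset, int32_t count) {
    unsigned char area[96];
    assert(offset >= 0 && offset < 8 && count <= 64);
    memset(area, 0xAA, sizeof area);
    zero_array_aligned_model(area + 16 + offset, count);
    for (int i = 0; i < 96; ++i) {
        int inside = (i >= 16 + offset && i < 16 + offset + count);
        if (area[i] != (inside ? 0x00 : 0xAA)) return 0;
    }
    return 1;
}
```

Sweeping offset over 0..7 covers every possible alignment of the start address, whatever the alignment of the test area itself happens to be.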
memset function (which is basically what you are doing), you can see it testing whether the current CPU supports "Enhanced Fast Strings" as it selects which approach to use. As you say this is for educational purposes, how about looking at the stosb/stosw/stosd/stosq instructions? Combined with the rep prefix they can produce small, easy-to-understand code that is a common alternative if you don't want to use SIMD instructions.