x64 Assembly zeroing an array (8 bytes at a time)

Question 1

Is there a better way of implementing this, other than using SIMD instructions?

What is the best way of dealing with arrays whose size is not divisible by 8? In the code below, if there are fewer than 8 bytes left to zero, they just get zeroed 1 by 1.

Maybe it is faster to check how many bytes there are left and then zero them 2 bytes or 4 bytes at a time?

Does the checking outweigh the cost of doing them 1 by 1?

This is just a test for me to try to learn assembly so any, even small, improvements and tips are greatly appreciated.

Thank you

.code
ZeroArray proc
 cmp edx, 0 
 jle Finished ; Check if count is 0
 cmp edx, 8 
 jl SetupLessThan8Bytes ; Check if counter is less than 8
 mov r8d, edx ; Storing the original count
 shr edx, 3 ; Bit shifts the counter to the right by 3 (equal to dividing by 8), works because 2^3 is equal to 8
 mov r9d, edx ; Stores the divided count to be able to check how many single byte zeros the program has to do
MainLoop:
 mov qword ptr [rcx], 0 ; Set the next 8 bytes (qword) to 0
 add rcx, 8 ; Move pointer along the array by 8 bytes
 dec edx ; Decrement the counter
 jnz MainLoop ; If counter is not equal to 0 jump to MainLoop
 shl r9d, 3 ; Bit shifts the stored divided counter to the left by 3 (equal to multiplying by 8), 2^3 again
 sub r8d, r9d ; Subs the counts from eachother, if it equals zero all bytes are zeroed, otherwise r8d equals the amount of bytes left
 je Finished
SetFinalBytesLoop:
 mov byte ptr [rcx], 0 ; Sets the last byte of the array to 0
 inc rcx
 dec r8d
 jnz SetFinalBytesLoop
Finished:
 ret
SetupLessThan8Bytes:
 mov r8d, edx ; Mov the value of edx into r8d so the same code can be used in SetFinalBytesLoop
 jmp SetFinalBytesLoop
ZeroArray endp
end

Question 2

You are using MASM and Visual Studio, is that correct?

Question 3

Yes, I am, and I'm calling the function from C++. @Will

Question 4

Is the second parameter of ZeroArray the number of qwords or number of bytes?

Question 5

There are lots of different ways of going about this. Which one is fastest tends to change from CPU to CPU. For example, looking at the source for the MSVC memset function (which is basically what you are doing), you can see it testing whether the current CPU supports "Enhanced Fast Strings" as it selects which approach to use. As you say this is for educational purposes, how about looking at the stosb/stosw/stosd/stosq instructions? Combined with the rep prefix they can produce small, easy-to-understand code that is a common alternative if you don't want to use SIMD instructions.

Question 6

Shave off a byte

 cmp edx, 0 
 jle Finished ; Check if count is 0

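Using cmp is certainly not wrong, but the idiomatic way to check a register against zero is the test instruction. For the conditional jump that follows, test edx, edx sets the flags exactly as cmp edx, 0 does, and it encodes in 2 bytes instead of 3.

 test edx, edx
 jle Finished ; Skip if count is zero or negative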

Bypassing when the counter is zero is fine, but perhaps a negative counter value should rather be considered an error and handled accordingly?
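One way to treat a negative count as an error is to validate it on the C++ side before the assembly routine is ever called. The following is only a sketch under assumptions: ZeroStatus, ZeroArrayChecked and ZeroArrayStandIn are hypothetical names that do not appear in the original code, and the stand-in uses std::memset so that the example compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical status type; the original routine returns nothing.
enum class ZeroStatus { Ok, NegativeCount, NullPointer };

// Stand-in for the assembly routine so this sketch is self-contained.
// In a real build this would presumably be declared as
//   extern "C" void ZeroArray(void* p, int count);
// (pointer in RCX, byte count in EDX, as in the question).
static void ZeroArrayStandIn(void* p, int count) {
    assert(p != nullptr && count > 0);
    std::memset(p, 0, static_cast<std::size_t>(count));
}

// Rejects bad arguments instead of silently ignoring them.
ZeroStatus ZeroArrayChecked(void* p, int count) {
    if (count < 0) return ZeroStatus::NegativeCount; // error, not a no-op
    if (count == 0) return ZeroStatus::Ok;           // nothing to do
    if (p == nullptr) return ZeroStatus::NullPointer;
    ZeroArrayStandIn(p, count);
    return ZeroStatus::Ok;
}
```

With a wrapper like this, the jle in the assembly routine remains a cheap safety net, while the caller gets to see that something was wrong.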

Don't lose yourself in jumping around

 cmp edx, 8 
 jl SetupLessThan8Bytes ; Check if counter is less than 8
 mov r8d, edx ; Storing the original count
 ...
 ...
SetupLessThan8Bytes:
 mov r8d, edx
 jmp SetFinalBytesLoop

When the counter in EDX is smaller than 8, you jump to SetupLessThan8Bytes where you just make a convenient copy of the counter and then jump again to SetFinalBytesLoop.
If you move the instruction that copies the original counter to right before the comparison with 8, you save yourself from writing 3 lines of code (a label, a mov, and a jmp). Moreover, the program becomes clearer.

 mov r8d, edx ; Storing the original count
 cmp edx, 8 
 jl SetFinalBytesLoop ; Check if counter is less than 8

You don't even have to compare to 8 at all!

When you shift the counter in EDX right by 3 bits in order to find out how many qwords you have to process, shr also sets the zero flag according to its result. If ZF is set (meaning no qwords at all), you instantly know that the counter is in the range [1,7], because zero and negative counts were already rejected. So the above snippet becomes:

 mov r8d, edx ; Storing the original count
 shr edx, 3 ; Equal to dividing by 8
 jz SetFinalBytesLoop ; Jump if counter is less than 8
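To see why the explicit compare with 8 is redundant, the shift can be modelled in C++. This is a minimal sketch; ShiftLeavesZero is a hypothetical helper name, and it only mirrors the condition under which the jz after shr edx, 3 would be taken.

```cpp
#include <cassert>
#include <cstdint>

// Models `shr edx, 3` followed by `jz`: returns true when the shifted
// value is zero, which is the case in which the jump is taken.
bool ShiftLeavesZero(std::uint32_t count) {
    return (count >> 3) == 0;
}
```

For every unsigned count, the result is the same as the condition count < 8, which is what cmp edx, 8 / jl tested for the positive counts that reach this point.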

Easier calculation of leftovers

 mov r9d, edx
 ...
 shl r9d, 3
 sub r8d, r9d
 je Finished
SetFinalBytesLoop:

The way you find the number of leftover bytes is correct but needlessly involved. All it takes is ANDing the original counter with 7 to extract the lowest 3 bits. This is simpler, shorter, and uses one register less, which will always come in handy in future programs:

 and r8d, 7
 jz Finished
SetFinalBytesLoop:
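The equivalence of the two remainder calculations can be checked with a small C++ model. The function names below are hypothetical; the first mirrors the original shr / shl / sub sequence, the second mirrors and r8d, 7.

```cpp
#include <cassert>
#include <cstdint>

// Original approach: count - ((count / 8) * 8)
std::uint32_t TailBySubtract(std::uint32_t count) {
    std::uint32_t qwords = count >> 3; // shr edx, 3 / mov r9d, edx
    return count - (qwords << 3);      // shl r9d, 3 / sub r8d, r9d
}

// Suggested approach: keep only the lowest 3 bits
std::uint32_t TailByMask(std::uint32_t count) {
    return count & 7u;                 // and r8d, 7
}
```

Both functions return the same value for every count, because shifting right and then left by 3 simply clears the lowest 3 bits, and subtracting that from the original leaves exactly those bits.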

Smaller instructions are generally better

With its 32-bit immediate value, the mov instruction in MainLoop is quite long (7 bytes). You can store the zero in RAX and move that to memory instead, which takes only 3 bytes. This also eliminates the need for the "qword ptr" size specifier, because the register operand already tells the assembler the size:

 xor rax, rax ; RAX = 0 (xor eax, eax does the same and is one byte shorter)
MainLoop:
 mov [rcx], rax ; Set the next 8 bytes (qword) to 0
 add rcx, 8 ; Move pointer along the array by 8 bytes
 dec edx ; Decrement the counter
 jnz MainLoop ; If counter is not equal to 0 jump to MainLoop

Your program with all the above applied

 xor rax, rax
 test edx, edx
 jle Finished ; Check if count is LE 0
 mov r8d, edx ; Copy of the original count
 shr edx, 3 ; Gives number of qwords
 jz SetFinalBytesLoop ; Jump if counter is less than 8
MainLoop:
 mov [rcx], rax ; RAX=0 Set the next 8 bytes (qword) to 0
 add rcx, 8 ; Step per 8 bytes
 dec edx ; Dec the counter
 jnz MainLoop
 and r8d, 7 ; Remainder from division by 8
 jz Finished
SetFinalBytesLoop:
 mov [rcx], al ; AL=0 Sets the last bytes of the array to 0
 inc rcx ; Step per 1 byte
 dec r8d ; Dec counter
 jnz SetFinalBytesLoop
Finished:
 ret

I've moved the xor rax, rax higher up in the code so SetFinalBytesLoop can benefit from using the register AL vs the immediate 0.
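For testing the assembly routine from C++, it can help to have a reference model that follows the same control flow. The sketch below is an assumption-laden illustration, not the routine itself: ZeroArrayModel is a hypothetical name, and std::memcpy stands in for the 8-byte store so that the model is valid C++ for any alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// C++ model of the routine above; comments name the matching instructions.
void ZeroArrayModel(unsigned char* p, std::int32_t count) {
    if (count <= 0) return;                                 // test edx, edx / jle Finished
    assert(p != nullptr);
    std::uint32_t tail = static_cast<std::uint32_t>(count); // mov r8d, edx
    std::uint32_t qwords = tail >> 3;                       // shr edx, 3
    if (qwords != 0) {                                      // jz SetFinalBytesLoop skips this part
        const std::uint64_t zero = 0;                       // RAX = 0
        do {
            std::memcpy(p, &zero, sizeof zero);             // mov [rcx], rax
            p += 8;                                         // add rcx, 8
        } while (--qwords != 0);                            // dec edx / jnz MainLoop
        tail &= 7u;                                         // and r8d, 7
        if (tail == 0) return;                              // jz Finished
    }
    do {                                                    // SetFinalBytesLoop
        *p++ = 0;                                           // mov [rcx], al / inc rcx
    } while (--tail != 0);                                  // dec r8d / jnz SetFinalBytesLoop
}
```

Running the model and the assembly routine over the same buffers, with guard bytes on both sides, is a simple way to catch off-by-one mistakes in the tail handling.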

The optimization

The most important optimization that you can apply to your program is making sure that every qword you write is aligned on a qword boundary, that is, on a memory address divisible by 8.
The extra alignment loop iterates at most 7 times.

 xor rax, rax
 test edx, edx
 jle Finished ; Check if count is LE 0
 jmp TestAligned
AlignLoop:
 mov [rcx], al
 inc rcx
 dec edx
 jz Finished
TestAligned:
 test rcx, 7 ; Is this a qword aligned address?
 jnz AlignLoop ; Not yet!
 mov r8d, edx ; Copy of the (reduced) original count
 shr edx, 3 ; Gives number of qwords
 jz SetFinalBytesLoop ; Jump if counter is less than 8
MainLoop:
 mov [rcx], rax ; RAX=0 Set the next 8 bytes (qword) to 0
 add rcx, 8 ; Step per 8 bytes
 dec edx ; Dec the counter
 jnz MainLoop
 and r8d, 7 ; Remainder from division by 8
 jz Finished
SetFinalBytesLoop:
 mov [rcx], al ; AL=0 Sets the last bytes of the array to 0
 inc rcx ; Step per 1 byte
 dec r8d ; Dec counter
 jnz SetFinalBytesLoop
Finished:
 ret
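How the aligned version divides the work can be modelled in C++ as well. This is a sketch under assumptions: Plan and PlanAlignedZero are hypothetical names, and the function only computes how many bytes each of the three loops would write for a given start address and count.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes/qwords written by AlignLoop, MainLoop and SetFinalBytesLoop.
struct Plan {
    std::uint32_t head;   // single bytes written until the address is aligned
    std::uint32_t qwords; // aligned 8-byte stores
    std::uint32_t tail;   // single bytes written after the last qword
};

Plan PlanAlignedZero(std::uintptr_t address, std::uint32_t count) {
    // Distance to the next multiple of 8 (0 if already aligned).
    std::uint32_t head = static_cast<std::uint32_t>((0 - address) & 7u);
    if (head > count) head = count;    // AlignLoop stops when the count runs out
    std::uint32_t rest = count - head; // the "(reduced) original count"
    Plan plan = { head, rest >> 3, rest & 7u };
    assert(plan.head + 8u * plan.qwords + plan.tail == count);
    return plan;
}
```

For example, a 20-byte array starting at an address that ends in 3 would be handled as 5 head bytes, 1 aligned qword, and 7 tail bytes.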

Sep Roland · Accepted Answer · 2017-10-08 18:56:06Z
