CodeWars - Assembly Language - Check List For Value

Question 1

I'm doing assembly language problems on CodeWars, a website with practice problems.

Problem

https://www.codewars.com/kata/545991b4cbae2a5fda000158/train/nasm

Create a method that accepts a list and an item, and returns true if the item belongs to the list, otherwise false.

Solution In C

To give you an idea what the assembly code will be doing.

#include <stdbool.h>
#include <stddef.h>
bool include(const int* arr, size_t size, int item)
{
 int i = 0;
 loop:
 if ( i < size ) {
 if ( arr[i] == item ) {
 return true;
 }
 i++;
 goto loop;
 } 
 return false;
}

Solution In NASM Assembly (Linux x64)

CodeWars provided the 7 lines at the top.

SECTION .text
global include
include:
 ; bool include(const int* arr, size_t size, int item)
 ; sizeof(int) = 4 bytes (32bit)
 ; sizeof(size_t) = 8 bytes (64bit)
 
 ;rdi = &arr pointer, 8 bytes
 ; arr[i] signed int, 4 bytes (dd)
 ;rsi = size size_t, unsigned int, 8 bytes
 ;edx = item signed int, 4 bytes
 
 ; Avoid using registers that we need to preserve (RBX, RBP, R12-R15). Else we'd have to push and pop them onto the stack.
 mov rcx, 0 ; unsigned int i = 0;
loop1:
 cmp rcx, rsi ; if ( i < size ) {
 jae skip_loop
 
 mov r8d, [rdi + 4 * rcx] ; make a temp variable so we can see this in step debugging
 cmp edx, r8d ; if ( arr[i] == item ) {
 jne skip_if
 mov rax, 1 ; return true;
 ret
 
skip_if:
 inc rcx ; i++;
 jmp loop1
 
skip_loop:
 mov rax, 0 ; return false;
 ret

Questions

I'm brand new to assembly. Any feedback on patterns and best practices would be appreciated. For example

Is there a standard pattern to use when writing loops?
Is there a standard pattern to use when writing if/elseif/else?
Are there better word choices and formatting for the labels?

Question 2

"CodeWars provided the 6 lines at the top." Do you skip blank lines when numbering/counting? Is ;rdi = &arr pointer, 8 bytes the first line line that you wrote yourself? It's not readily clear.

Question 3

Yes and yes. I'll edit for clarity.

Question 4

First of all, props for the copious comments, particularly how you included a representation in C. The C representation itself has a signed vs unsigned comparison, which can cause weird bugs when and where you don't expect them, but I'm going to stick to the assembly code itself in this review. I would just recommend declaring the loop counter i as size_t, since that's what the type of the stop condition is.

I assembled your C function using gcc version 10.2.0 with -O3 -march=native, so I'll include the output here so I can walk through it step by step, comparing the two implementations. This is a really good idea, by the way, because working backwards through what the C compiler did helps you see real assembly language, not just practice examples you wrote. Compiler Explorer is a great tool for this.

Anyways, here is my input file.

#include <stdbool.h>
#include <stddef.h>
bool include(const int* arr, size_t size, int item) {
 for (size_t i = 0; i < size; ++i) {
 if (arr[i] == item) {
 return true;
 }
 }
 return false;
}

To assemble it, I use the following command. Note the -masm=intel argument; the default assembly syntax is AT&T for GNU tools.

gcc -S -O3 -march=native -masm=intel -o output.asm input.c

You can filter out auxiliary metadata and its containing labels using the following command.

cat output.asm | sed -E '/^\s+\./d;/^\.L[A-Z]/d'

And here is my output.

include:
 test rsi, rsi
 je .L4
 xor eax, eax
 jmp .L3
.L8:
 inc rax
 cmp rsi, rax
 je .L4
.L3:
 cmp DWORD PTR [rdi+rax*4], edx
 jne .L8
 mov eax, 1
 ret
.L4:
 xor eax, eax
 ret

Notice that the first line is already different. In your version, you started by setting the rcx register to 0, using the mov instruction, whereas the compiler output test rsi, rsi. Why?

Well, as you noted, Intel x86-64 Linux assembly programming calling convention dictates that the rsi register contains the second argument to your function, in this case the size of the array. From the Intel x86-64 documentation (pg. 1866), the test instruction performs a logical AND test on its arguments. If the result is zero, it sets the zero flag ZF equal to 1. The following instruction therefore makes sense, since the "jump near if equal" (je) instruction is carried out when the the zero flag is set (ZF=1).

In other words, the subroutine begins by checking whether or not the input array actually contains any items before actually doing anything with it. Note that you weren't checking for this edge case in your original code (nor did you verify the array pointer wasn't NULL), and it's a great example of compilers being awesome. Matt Godbolt (the guy who made Compiler Explorer) has an awesome talk about this kind of stuff that I highly recommend you check out if you like this sort of thing.

Anyways, if you look at the .L4 label, you'll notice it's semantically equivalent to your skip_loop. However, you literally set the rax register (i.e., the return value of the function) equal to zero by moving a 0 into it, whereas the compiler uses the exclusive-or xor instruction on eax with itself, which will obviously always be zero. You're not semantically wrong for doing it the way you did, but you can read this SO post that describes in significant detail why you should opt for the xor eax, eax method. The short version is that it's more efficient, and the longer version is that it's much more efficient, but there are other benefits, like power consumption. That post goes into a lot more detail, though, and it's a great read.

Your loop itself looks okay to me. The compiler used the rax register for the loop counter, which both you and the compiler then used to get the value of the array at the appropriate index. The only real difference between the two versions is that the compiler used an unconditional jump jmp instruction to skip the first part of its main loop, which contained the loop counter increment, whereas your code had that last.

I genuinely don't think this difference has any real impact, because both implementations contain two conditional jumps, which significantly impact performance because they trigger unconditional instruction fetches and involve more advanced processor features like branch prediction, which itself introduces problems via an optimization called speculative execution. (Long story short, optimization is complicated, you won't really know until you profile it, and you probably shouldn't even care about optimization until you have a working something to optimize, but you're "probably" fine.)

Something I found really interesting (although not particularly impactful or world-view shattering), was that believe it or not, creating that temporary variable and then comparing takes exactly as many bytes to encode as the direct comparison the compiler output in my version.

Here is a snippet from the objdump output for your version. (To generate this on your local machine, the command I used after assembling with nasm was objdump -Mx86-64,intel -D -S -s input.o.)

0000000000000005 <loop1>:
loop1:
 cmp rcx, rsi ; if ( i < size ) {
 5: 48 39 f1 cmp rcx,rsi
 jae skip_loop
 8: 73 14 jae 1e <skip_loop>
 
 mov r8d, [rdi + 4 * rcx] ; make a temp variable so we can see this in step debugging
 a: 44 8b 04 8f mov r8d,DWORD PTR [rdi+rcx*4]
 cmp edx, r8d ; if ( arr[i] == item ) {
 e: 44 39 c2 cmp edx,r8d
 jne skip_if
 11: 75 06 jne 19 <skip_if>
 mov rax, 1 ; return true;
 13: b8 01 00 00 00 mov eax,0x1
 ret
 18: c3 ret

Now here's a snippet from the output for the compiler's version that contains the comparison operation.

0000000000000011 <include.L3>:
.L3:
 cmp [dword rdi+rax*4], edx
 11: 39 94 87 00 00 00 00 cmp DWORD PTR [rdi+rax*4+0x0],edx
 jne .L8
 18: 75 ef jne 9 <include.L8>
 mov eax, 1
 1a: b8 01 00 00 00 mov eax,0x1
 ret
 1f: c3 ret

Notice how in your version, the assignment to a temporary variable takes four bytes. You specified the r8d register as the destination register, so this isn't exactly groundbreaking stuff, but the following comparison instruction required only three bytes to encode:

44 8b 04 8f mov r8d,DWORD PTR [rdi+rcx*4]
44 39 c2 cmp edx,r8d

The compiler's version skipped the intermediate variable assignment, but the resulting instruction required seven bytes to encode.

39 94 87 00 00 00 00 cmp DWORD PTR [rdi+rax*4+0x0],edx

To explain why those extra zeroes at the end matter, I will borrow once again from this great post which you should definitely read.

Smaller machine-code size [...] is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.

To really drive this point home, let us read the conditional jump instruction documentation (pg. 1109 in the combined manual [vols 1-4]):

All conditional jumps are converted to code fetches of one or two cache lines, regardless of jump address or cache-ability.

I now leave this link to the latency numbers every programmer should know for your edification, although it should be noted this document is from 2012. Here's a cool updated version where you can look at the latency numbers by year (including 2020), but I actually just found this myself, so I admit I haven't vetted the source for accuracy. I am including it for completeness nonetheless.

As far as the labels themselves are concerned, since loop1, skip_if, and skip_loop are all logically related to the include subroutine, I would recommend using local labels to more intuitively organize your assembly code. Local labels are particularly useful because the subroutine name serves as a sort of namespace, allowing you to reuse the local label names defined therein. You can see the include version above assembled by gcc uses local labels.

The only recommendation I would make regarding loops is to be wary of using the right conditional jump for your situation. From the documentation:

The terms "less" and "greater" are used for comparisons of signed integers and the terms "above" and "below" are used for unsigned integers.

This isn't pedantry, either. Take for example, the "jump if above or equal" jae instruction in your code. It follows a cmp instruction, which subtracts the second operand from the first and modifies the EFLAGS register accordingly. Specifically, the intermediate sub instruction performs both signed and unsigned subtraction, setting the overflow and carry flags, respectively. However, by using the jae instruction, you are implicitly only checking the carry flag, so hopefully your loop counter and stop conditions are of the same type.

The C standard defines how this should be done, which does help mitigate bugs by both converting as properly and safely as possible, and by providing helpful warnings and even error messages (depending on the compilation strictness settings). Of course, if you're going to be writing assembly language directly, this obviously does not help you.

For reference, the EFLAGS condition codes can be found in Volume 1 Appendix B of the Intel® 64 and IA-32 Architectures Software Developer’s Manuals, and the conditional jumps reference table starts on page 1106 of Volume 2.

Question 5

Thanks for the detailed reply. I am familiar with godbolt and I've taken a look at its output before.Great tool. So you basically recommend that I run all my C code through that to get an idea of best practices in assembly? Are default settings OK?

Question 6

@RedDragonWebDesign The best resource is always going to be an old hand, but if one isn't available, or when you're researching a question before posting on a site like this, compiler output is going to be a great resource because it's literally what would be getting run. I like to compare the output and performance from different optimization levels (particularly size vs. speed), but definitely make sure the C/C++/Rust/etc. code you're compiling is correct (syntactically, semantically, and idiomatically) so you get the best output. In other words, enable all warnings, checks, etc.

score 7 · Answer 1 · 2020-09-30 16:45:13Z