
This is some assembly code I've written to add two base-10 numbers. The input and output are represented as null-terminated ASCII strings, with the most significant digit first (big-endian order).

I have also provided a comment for each line explaining what it does. Is there anywhere I could use SIMD, or anything else that would make this faster?

Since the function is called add_whole, it only adds whole numbers, i.e. non-negative integers. I also use my own strlen, which only modifies rcx.

extern strlen
section .text
global add_whole
add_whole:
 ; Input:
 ; - char *a -> rdi
 ; - char *b -> rsi
 ; - char *res -> rdx
 ; Registers used:
 ; - rax
 ; - rcx
 ; - r8
 ; - r9
 ; - r10
 ; - r11
 ; - r12
 ; - r13
 ; - r14
 
 push r12 ; Save callee-saved registers used.
 push r13 ; In this case,
 push r14 ; r12, r13, and r14.
 call strlen ; Calculate the length of char *a and store it in rax.
 push rax ; Save rax,
 push rdi ; and rdi to make another call to strlen.
 mov rdi, rsi ; Move char *b into rdi.
 call strlen ; Calculate the length of char *b,
 mov r8, rax ; and store it in r8.
 pop rdi ; Now, rax = strlen(a)
 pop rax ; r8 = strlen(b)
 mov r9, r8 ; The next two lines of code
 cmp rax, r8 ; put std::max(rax, r8) into
 cmovnc r9, rax ; the r9 register.
 xor r10b, r10b ; r10b = false := carry_flag
 xor rcx, rcx ; rcx = 0 := loop_counter
.loop_1:
 lea r13, [rcx + 1] ; r13 = loop_counter + 1
 mov r11b, 48 ; r11b = '0', if there are no more digits in the number/s, 
 mov r12b, r11b ; r12b = '0' then these default values will be used.
 cmp rax, r13 ; Compare strlen(a) with loop_counter + 1.
 js .after_if_1 ; if (loop_counter < strlen(a)):
 lea r14, [rdi+rax] ; r11b = a[strlen(a) - loop_counter - 1]
 sub r14, rcx ; The next two lines 
 dec r14 ; execute that one 
 mov r11b, byte [r14] ; line if statement.
.after_if_1:
 cmp r8, r13 ; Compare strlen(b) with loop_counter + 1.
 js .after_if_2 ; if (loop_counter < strlen(b)):
 lea r14, [rsi+r8] ; r12b = b[strlen(b) - loop_counter - 1]
 sub r14, rcx ; The next two lines 
 dec r14 ; execute that one
 mov r12b, byte [r14] ; line if statement.
.after_if_2:
 mov r13b, r11b ; These next two lines add individual digits 
 add r13b, r12b ; of the numbers.
 sub r13b, 48
 test r10b, r10b ; Check for carry.
 jz .after_if_3 ; if (carry_flag):
 xor r10b, r10b ; carry_flag = false
 inc r13b ; r13b++, add the carry
.after_if_3:
 cmp r13b, 57 ; Compare the current addition result with '9'
 jle .after_if_4 ; if (r13b > '9'):
 sub r13b, 10 ; r13b -= 10
 mov r10b, 1 ; carry_flag = true
.after_if_4:
 mov byte[rdx+rcx], r13b ; res[loop_counter] = r13b
 inc rcx ; Increment the loop counter, which now points to the end of res.
 cmp rcx, r9 ; Keep looping,
 js .loop_1 ; while (loop_counter < std::max(strlen(a), strlen(b))).
 test r10b, r10b ; Check for a final carry.
 jz .after_if_5 ; if (carry_flag):
 mov byte[rdx+rcx], 49 ; res[loop_counter] = '1'
 inc r9 ; Increment r9, which stores strlen(res).
.after_if_5:
 xor rcx, rcx ; rcx = 0 := loop_counter
 mov r11, r9 ; Note that r9 = strlen(res).
 shr r11, 1 ; r11 = strlen(res) / 2
.loop_2:
 lea r12, [rdx+r9] ; r12 = &res[strlen(res) - loop_counter - 1]
 sub r12, rcx
 dec r12
 mov r8b, byte [rdx+rcx] ; r8b = res[loop_counter]
 mov r10b, byte [r12] ; r10b = *r12 = res[strlen(res) - loop_counter - 1]
 mov byte [rdx+rcx], r10b ; res[loop_counter] = r10b
 mov byte [r12], r8b ; res[strlen(res) - loop_counter - 1] = r8b
 inc rcx
 cmp rcx, r11 ; Keep looping while (loop_counter < strlen(res) / 2)
 js .loop_2
 mov byte [rdx+r9], 0 ; res should be null terminated.
 pop r14 ; Pop used callee-saved registers.
 pop r13
 pop r12
 ret
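For reference, the digit-by-digit algorithm that the assembly above implements could be sketched in C roughly as follows. This is a hypothetical model of the logic only (the function name add_whole_c and all identifiers are mine, not part of the original code):

```c
#include <string.h>

/* Sketch: add two ASCII decimal strings (most significant digit
 * first) and write the null-terminated sum into res. Like the
 * assembly, it builds the result least significant digit first,
 * then reverses it in place (the asm's loop_2).
 * res must have room for max(strlen(a), strlen(b)) + 2 bytes. */
static void add_whole_c(const char *a, const char *b, char *res)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = la > lb ? la : lb;
    int carry = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        int da = i < la ? a[la - 1 - i] - '0' : 0;
        int db = i < lb ? b[lb - 1 - i] - '0' : 0;
        int s = da + db + carry;
        carry = s > 9;
        res[i] = (char)('0' + s % 10);
    }
    if (carry)
        res[n++] = '1';
    /* Reverse res in place, as loop_2 does in the assembly. */
    for (i = 0; i < n / 2; i++) {
        char t = res[i];
        res[i] = res[n - 1 - i];
        res[n - 1 - i] = t;
    }
    res[n] = '\0';
}
```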
asked Sep 21, 2022 at 6:56
  • \$\begingroup\$ Intel assembly, x86-64 nasm like assembly. And no, it only adds non-negative numbers. \$\endgroup\$ Commented Sep 22, 2022 at 6:08
  • \$\begingroup\$ @greybeard done \$\endgroup\$ Commented Sep 22, 2022 at 10:04
  • \$\begingroup\$ You seem to have sketched out the procedure "in C": Did you try to have that compiled to assembly with code improvement /optimisation for reference? \$\endgroup\$ Commented Sep 22, 2022 at 10:45
  • \$\begingroup\$ You tagged performance, mention SIMD & faster: There used to be a decimal adjust instruction for packed decimals. Can you code something similar for "ASCII decimals"? \$\endgroup\$ Commented Sep 22, 2022 at 10:45
  • \$\begingroup\$ What has been your goal coding this in assembly? \$\endgroup\$ Commented Sep 22, 2022 at 10:48

1 Answer


Following the ABI (you are preserving R12, R13, and R14), the RDX and RSI registers are not amongst the callee-saved registers. Therefore neither is guaranteed to have survived those strlen calls!

Optimizations to the existing code

xor rcx, rcx ; rcx = 0 := loop_counter

Write this as xor ecx, ecx. It'll shave off the REX prefix and zero the whole RCX register just as well.

lea r14, [rdi+rax] 
sub r14, rcx
dec r14

Better avoid that separate dec r14 and fold the 'minus 1' it performs into the lea instruction: lea r14, [rdi+rax-1].

lea r13, [rcx + 1] ; r13 = loop_counter + 1
cmp rax, r13 ; cmp strlen(a) with loop_counter + 1
js .after_if_1 ; if (loop_counter < strlen(a)):

Why don't you code it the way your C comment expresses it? You wouldn't need the detour via the extra R13 register. Since the meaning of '<' for unsigned numbers is 'Below', and you want to jump past the body when the condition fails, the conditional to use is 'JumpIfNotBelow':

cmp rcx, rax ; cmp loop_counter, strlen(a)
jnb .after_if_1
mov r13b, r11b ; These next two lines add individual digits 
add r13b, r12b ; of the numbers.
sub r13b, 48

It is unnecessary to first copy R11B to another register. Just process the sum in the R11B register:

add r11b, r12b ; (*) Using R11B instead of R13B
sub r11b, 48
 test r10b, r10b ; Check for carry.
 jz .after_if_3 ; if (carry_flag):
 xor r10b, r10b ; carry_flag = false
 inc r13b ; r13b++, add the carry
.after_if_3:

The carry variable in the R10B register is limited to the values 0 and 1. If it holds 0, you don't want to do anything; if it holds 1, you want to increment the sum and reset the carry variable. The following code is much simpler and does just that:

add r11b, r10b ; (*) Using R11B instead of R13B
xor r10b, r10b
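In C terms, the whole digit step with this branchless carry handling can be sketched as follows (the helper name add_digit is mine; the point is that the carry, being 0 or 1, is simply added in and then recomputed from a comparison instead of being branched on):

```c
/* Sketch: add two ASCII digits plus an incoming carry, without
 * branching on the carry. *carry is updated to 0 or 1; in C a
 * relational operator conveniently yields exactly 0 or 1. */
static char add_digit(char da, char db, int *carry)
{
    int sum = da + db - '0' + *carry; /* carry is 0 or 1 */
    *carry = sum > '9';
    return (char)(sum - 10 * *carry);
}
```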
mov byte[rdx+rcx], r13b ; res[loop_counter] = r13b

Because you store the result of the addition reversed, you have added a string-reversal routine to the code. This is an inefficient approach: the final result will have the same number of digits as the longer of the two inputs, or at most one extra digit in case of a final carry, so you can write the digits directly into their final positions, back to front.

lea rdx, [rdx+1+r9] ; R9 is longest input
mov byte [rdx], 0 ; Zero-terminated result

Writing an optimized version

Don't do the whole calculation in a single loop. First process the positions that both inputs have in common, then process the positions that remain in the longer of the two inputs. You'll have far fewer conditional branches to execute.
Carefully choosing your registers can shave off a lot of bytes by not requiring a REX prefix, enabling the short accumulator encodings, and avoiding having to preserve callee-saved registers.
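In C terms, that two-pass structure might look like the sketch below. It assumes a is the longer (or equally long) input, mirroring the swap step in the assembly; all identifiers are mine:

```c
#include <string.h>

/* Sketch of the two-pass addition: the first loop handles the
 * positions both inputs share, the second loop the remaining high
 * digits of the longer input. Assumes strlen(a) >= strlen(b).
 * The sum is written right-aligned into res[1..la] so a final
 * carry can go into res[0]; returns a pointer to the first digit. */
static char *add_two_pass(const char *a, const char *b, char *res)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t i;
    int carry = 0;
    res[1 + la] = '\0';
    for (i = 0; i < lb; i++) {      /* LoopA: common positions */
        int s = a[la - 1 - i] + b[lb - 1 - i] - '0' + carry;
        carry = s > '9';
        res[la - i] = (char)(s - 10 * carry);
    }
    for (; i < la; i++) {           /* LoopB: rest of longer input */
        int s = a[la - 1 - i] + carry;
        carry = s > '9';
        res[la - i] = (char)(s - 10 * carry);
    }
    if (carry) {
        res[0] = '1';
        return res;                 /* result starts at res[0] */
    }
    return res + 1;                 /* result starts at res[1] */
}
```

Like the assembly version below, this returns a pointer into the result buffer rather than guaranteeing the result starts at res[0].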

Input:     RDI                 RSI       RDX
            v                   v         v
a, b, res  1158791457863133@  2678139@  ..................

Setup:     RDI             R8  RSI   R9                RDX
            v               v   v     v                  v
a, b, res  1158791457863133@  2678139@  .................@

Loops:        LoopB   LoopA
           /-------\/-----\
a   ->     1158791457863133@
b   ->              2678139@
res ->    .1158791460541272@

Output:    R8                 R9         RDX
            v                  v          v
a, b, res  1158791457863133@  2678139@  .1158791460541272@
                                         ^
                                        RAX

; IN (rdx,rsi,rdi) OUT (rax) MOD (rcx,rdx,rsi,rdi,r8,r9)
 push rbx ; The only callee-saved register to preserve
 push rdx ; (1)
 push rsi ; (2)
 push rdi ; (3)
 call strlen ; -> RAX is strlen(a)
 mov ebx, eax ; For sure less than 4GB
 mov rdi, [rsp+8]
 call strlen ; -> RAX is strlen(b)
 pop rdi ; (3)
 pop rsi ; (2)
 pop rdx ; (1)
 ; At RDI are RBX bytes, at RSI are RAX bytes
 cmp ebx, eax
 jae .NoSwap
 xchg rdi, rsi ; Make (RDI:RBX) refer to the longer input
 xchg ebx, eax
.NoSwap:
 mov ecx, eax ; RAX is length of the shorter input
 lea r8, [rdi+rbx] ; R8 points at the terminating zero
 lea r9, [rsi+rax] ; R9 points at the terminating zero
 lea rdx, [rdx+1+rbx] ; RBX is length of the longer input
 xor ebx, ebx ; BL is carry [0,1]
 mov [rdx], bl ; Zero-terminating the result
 jecxz .NoLoopA
.LoopA: ; Deals with the common positions
 movzx eax, byte [r8-1]
 add al, [r9-1]
 sub al, '0'
 add al, bl ; Plus carry
 xor ebx, ebx ; Clear carry
 cmp al, '9'
 jbe .CharOK
 sub al, 10
 inc ebx ; Set carry
.CharOK:
 mov [rdx-1], al
 dec r9
 dec r8
 dec rdx
 dec rcx
 jnz .LoopA
.NoLoopA:
 cmp r8, rdi
 jna .NoLoopB
.LoopB: ; Deals with remainder of the longer input
 movzx eax, byte [r8-1]
 add al, bl ; Plus carry
 xor ebx, ebx ; Clear carry
 cmp al, '9'
 jbe .CharOK_
 sub al, 10
 inc ebx ; Set carry
.CharOK_:
 mov [rdx-1], al
 dec r8
 dec rdx
 cmp r8, rdi
 ja .LoopB
.NoLoopB:
 test bl, bl ; Check for a final carry.
 jz .Done
 dec rdx
 mov byte [rdx], '1'
.Done:
 mov rax, rdx ; RAX is ptr to zero-terminated result
 pop rbx
 ret

Tip: The code can be optimized with a further loop, because once the addition in LoopB no longer produces a carry, we can exit that loop and copy the remainder of the string unmodified. If your bigints have very different lengths, this can speed things up a lot...
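That early-exit idea might look like this in C (a sketch; the helper name is mine, and the case where a carry survives past the top digit is deliberately left out here, since the surrounding code handles it):

```c
#include <string.h>

/* Sketch of the tip: while propagating the carry through the
 * remaining high digits a[0..n-1] of the longer input (most
 * significant first), stop as soon as the carry dies out, then
 * copy the untouched prefix in one block copy. */
static void propagate_carry(const char *a, size_t n, char *out, int carry)
{
    while (n > 0 && carry) {
        int s = a[--n] + carry;
        carry = s > '9';
        out[n] = (char)(s - 10 * carry);
    }
    memcpy(out, a, n);  /* remaining digits are unchanged */
}
```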



Refinement

Is there any way to make it so you directly write to char *res instead of returning a pointer to a part of it without it affecting the speed?

The little detail that stands in the way of what you want is the possibility of having to prepend a "1" digit in case of a final carry. The probability that a final carry occurs is, in my opinion, extremely small when working with these extremely long numbers. Therefore I suggest we rewrite the algorithm to ignore a potential final carry at first; if in the end we do discover one, we just copy the result one byte up in memory and prepend the "1" digit. Sure, this will cost extra when it runs, but it will almost never have to run.
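Expressed as a C sketch, that fixup amounts to an overlapping one-byte shift plus a prepended digit (the helper name is mine; memmove is the library routine that handles the overlap correctly):

```c
#include <string.h>

/* Sketch of the final-carry fixup: the sum was written to
 * res[0..n-1] (null-terminated) ignoring a possible final carry.
 * If one occurred, shift everything one byte up, including the
 * terminator, and prepend the '1'. res needs n + 2 bytes. */
static void prepend_carry(char *res, size_t n)
{
    memmove(res + 1, res, n + 1); /* n digits plus the trailing '\0' */
    res[0] = '1';
}
```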

; Version that leaves RDX unchanged
; IN (rdx,rsi,rdi) OUT () MOD (rax,rcx,rsi,rdi,r8,r9)
 push rbx ; The only callee-saved register to preserve
 push rdx ; (1)
 push rsi ; (2)
 push rdi ; (3)
 call strlen ; -> RAX is strlen(a)
 mov ebx, eax ; For sure less than 4GB
 mov rdi, [rsp+8]
 call strlen ; -> RAX is strlen(b)
 pop rdi ; (3)
 pop rsi ; (2)
 pop rdx ; (1)
 ; At RDI are RBX bytes, at RSI are RAX bytes
 cmp ebx, eax
 jae .NoSwap
 xchg rdi, rsi ; Make (RDI:RBX) refer to the longer input
 xchg ebx, eax
.NoSwap:
 mov ecx, eax ; RAX is length of the shorter input
 lea r8, [rdi+rbx] ; R8 points at the terminating zero
 lea r9, [rsi+rax] ; R9 points at the terminating zero
 add rdx, rbx ; RBX is length of the longer input
 push rdx ; (4) "in case there's a final carry"
 xor ebx, ebx ; BL is carry [0,1]
 mov [rdx], bl ; Zero-terminating the result
 jecxz .NoLoopA
.LoopA: ; Deals with the common positions
 movzx eax, byte [r8-1]
 add al, [r9-1]
 sub al, '0'
 add al, bl ; Plus carry
 xor ebx, ebx ; Clear carry
 cmp al, '9'
 jbe .CharOK
 sub al, 10
 inc ebx ; Set carry
.CharOK:
 mov [rdx-1], al
 dec r9
 dec r8
 dec rdx
 dec rcx
 jnz .LoopA
.NoLoopA:
 cmp r8, rdi
 jna .NoLoopB
.LoopB: ; Deals with remainder of the longer input
 movzx eax, byte [r8-1]
 add al, bl ; Plus carry
 xor ebx, ebx ; Clear carry
 cmp al, '9'
 jbe .CharOK_
 sub al, 10
 inc ebx ; Set carry
.CharOK_:
 mov [rdx-1], al
 dec r8
 dec rdx
 cmp r8, rdi
 ja .LoopB
.NoLoopB:
 pop rcx ; (4)
 test bl, bl ; Check for a final carry.
 jz .Done
.CopyUp:
 mov al, [rcx]
 mov [rcx+rbx], al ; RBX=1
 dec rcx
 cmp rcx, rdx
 jae .CopyUp
 mov byte [rdx], '1'
.Done:
 pop rbx ; Unchanged RDX is ptr to zero-terminated result
 ret

Not only does the extra .CopyUp code very seldom have to run; since it is a simple byte copy, you can also replace it with any specialized copy routine (or just invoke one) should you deem that useful.

answered Sep 23, 2022 at 18:20
  • \$\begingroup\$ Wow, thank you. I compared my code with the modifications and your optimized version with strings of lengths of 10M. Yours was faster by 0.3 seconds (2.9 and 2.6). Is there any way to make it so you directly write to char *res instead of returning a pointer to a part of it without it affecting the speed? \$\endgroup\$ Commented Sep 26, 2022 at 8:12
  • \$\begingroup\$ @avighnac I have amended my answer with a refinement that addresses the issue from your comment. \$\endgroup\$ Commented Oct 2, 2022 at 12:28
  • \$\begingroup\$ Yes, hello, I noticed. Sorry, I haven't been active on this website over the last week and will not be this week either. My exams are on right now, but I'll definitely check this out after they're over! Again, sorry for the late reply! \$\endgroup\$ Commented Oct 4, 2022 at 13:38
  • \$\begingroup\$ Hey, could you explain why your code's slightly faster? What did you do that optimized it? \$\endgroup\$ Commented Oct 6, 2022 at 13:45
  • \$\begingroup\$ @avighnac I take it that you are referring to the 0.3 s speed increase between your code with optimizations applied (those that I wrote about in the first part of my answer), and then the optimized version that I proposed in the middle part of my answer. The key points are (1) branch reduction, and (2) short and simple instructions. \$\endgroup\$ Commented Oct 6, 2022 at 16:04
