This is some assembly code I've written to add two base-10 numbers. The input and output are represented as null-terminated ASCII strings, with digits in big-endian order (most significant digit first). I have also commented each line to explain what it does. Is there anywhere I can use SIMD, or anything else that would be faster here?
Since the function is called add_whole, it only adds whole numbers, i.e. non-negative integers. I also use my own strlen, which only modifies rcx.
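For reference, the algorithm can be sketched in C. This is only a hypothetical reference model of what the assembly below does (digit-wise addition from the least significant end, then a reversal pass), not the code being benchmarked; the name add_whole_ref is made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Reference model: adds two ASCII decimal strings (most significant
   digit first) and writes the null-terminated sum into res.
   res must hold at least max(strlen(a), strlen(b)) + 2 bytes. */
void add_whole_ref(const char *a, const char *b, char *res)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = la > lb ? la : lb;
    int carry = 0;
    for (size_t i = 0; i < n; i++) {        /* least significant digit first */
        int da = i < la ? a[la - 1 - i] - '0' : 0;
        int db = i < lb ? b[lb - 1 - i] - '0' : 0;
        int sum = da + db + carry;
        carry = sum > 9;
        res[i] = (char)('0' + sum % 10);
    }
    if (carry)
        res[n++] = '1';                     /* final carry adds one digit */
    res[n] = '\0';
    for (size_t i = 0; i < n / 2; i++) {    /* reverse into big-endian order */
        char t = res[i];
        res[i] = res[n - 1 - i];
        res[n - 1 - i] = t;
    }
}
```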
extern strlen
section .text
global add_whole
add_whole:
; Input:
; - char *a -> rdi
; - char *b -> rsi
; - char *res -> rdx
; Registers used:
; - rax
; - rcx
; - r8
; - r9
; - r10
; - r11
; - r12
; - r13
; - r14
push r12 ; Save callee-saved registers used.
push r13 ; In this case,
push r14 ; r12, r13, and r14.
call strlen ; Calculate the length of char *a and store it in rax.
push rax ; Save rax,
push rdi ; and rdi to make another call to strlen.
mov rdi, rsi ; Move char *b into rdi.
call strlen ; Calculate the length of char *b,
mov r8, rax ; and store it in r8.
pop rdi ; Now, rax = strlen(a)
pop rax ; r8 = strlen(b)
mov r9, r8 ; The next two lines of code
cmp rax, r8 ; put std::max(rax, r8) into
cmovnc r9, rax ; the r9 register.
xor r10b, r10b ; r10b = false := carry_flag
xor rcx, rcx ; rcx = 0 := loop_counter
.loop_1:
lea r13, [rcx + 1] ; r13 = loop_counter + 1
mov r11b, 48 ; r11b = '0', if there are no more digits in the number/s,
mov r12b, r11b ; r12b = '0' then these default values will be used.
cmp rax, r13 ; Compare strlen(a) with loop_counter + 1.
js .after_if_1 ; if (loop_counter < strlen(a)):
lea r14, [rdi+rax] ; r11b = a[strlen(a) - loop_counter - 1]
sub r14, rcx ; The next two lines
dec r14 ; execute that one
mov r11b, byte [r14] ; line if statement.
.after_if_1:
cmp r8, r13 ; Compare strlen(b) with loop_counter + 1.
js .after_if_2 ; if (loop_counter < strlen(b)):
lea r14, [rsi+r8] ; r12b = b[strlen(b) - loop_counter - 1]
sub r14, rcx ; The next two lines
dec r14 ; execute that one
mov r12b, byte [r14] ; line if statement.
.after_if_2:
mov r13b, r11b ; These next two lines add individual digits
add r13b, r12b ; of the numbers.
sub r13b, 48
test r10b, r10b ; Check for carry.
jz .after_if_3 ; if (carry_flag):
xor r10b, r10b ; carry_flag = false
inc r13b ; r13b++, add the carry
.after_if_3:
cmp r13b, 57 ; Compare the current addition result with '9'
jle .after_if_4 ; if (r13b > '9'):
sub r13b, 10 ; r13b -= 10
mov r10b, 1 ; carry_flag = true
.after_if_4:
mov byte[rdx+rcx], r13b ; res[loop_counter] = r13b
inc rcx ; Increment the loop counter, which now points to the end of res.
cmp rcx, r9 ; Keep looping,
js .loop_1 ; while (loop_counter < std::max(strlen(a), strlen(b))).
test r10b, r10b ; Check for a final carry.
jz .after_if_5 ; if (carry_flag):
mov byte[rdx+rcx], 49 ; res[loop_counter] = '1'
inc r9 ; Increment r9, which stores strlen(res).
.after_if_5:
xor rcx, rcx ; rcx = 0 := loop_counter
mov r11, r9 ; Note that r9 = strlen(res).
shr r11, 1 ; r11 = strlen(res) / 2
.loop_2:
lea r12, [rdx+r9] ; r12 = &res[strlen(res) - loop_counter - 1]
sub r12, rcx
dec r12
mov r8b, byte [rdx+rcx] ; r8b = res[loop_counter]
mov r10b, byte [r12] ; r10b = *r12 = res[strlen(res) - loop_counter - 1]
mov byte [rdx+rcx], r10b ; res[loop_counter] = r10b
mov byte [r12], r8b ; res[strlen(res) - loop_counter - 1] = r8b
inc rcx
cmp rcx, r11 ; Keep looping while (loop_counter < strlen(res) / 2)
js .loop_2
mov byte [rdx+r9], 0 ; res should be null terminated.
pop r14 ; Pop used callee-saved registers.
pop r13
pop r12
ret
Comments:
- avighnac (Sep 22, 2022): "Intel assembly, x86-64 NASM-like assembly. And no, it only adds non-negative numbers."
- avighnac (Sep 22, 2022): "@greybeard done"
- greybeard (Sep 22, 2022): "You seem to have sketched out the procedure 'in C': did you try to have that compiled to assembly with optimisation, for reference?"
- greybeard (Sep 22, 2022): "You tagged performance and mention SIMD & faster: there used to be a decimal adjust instruction for packed decimals. Can you code something similar for 'ASCII decimals'?"
- greybeard (Sep 22, 2022): "What has been your goal coding this in assembly?"
1 Answer
Following the ABI (which you are clearly aware of, since you preserve R12, R13, and R14), the RDX and RSI registers are not among the callee-saved registers. Therefore neither is guaranteed to survive those strlen calls!
Optimizations to the existing code
xor rcx, rcx ; rcx = 0 := loop_counter
Write this as xor ecx, ecx. It'll shave off the REX prefix and will zero the whole RCX register just as well.
lea r14, [rdi+rax]
sub r14, rcx
dec r14
Better avoid that separate dec r14 and fold the 'minus 1' that it does into the lea instruction: lea r14, [rdi+rax-1].
lea r13, [rcx + 1] ; r13 = loop_counter + 1
cmp rax, r13 ; cmp strlen(a) with loop_counter + 1
js .after_if_1 ; if (loop_counter < strlen(a)):
Why don't you code it the way your C comments express it? You wouldn't need the detour via the extra R13 register. Since the meaning of '<' for unsigned numbers is 'Below', and you want to jump past the body, the conditional to use is 'JumpIfNotBelow':
cmp rcx, rax ; cmp loop_counter, strlen(a)
jnb .after_if_1
mov r13b, r11b ; These next two lines add individual digits
add r13b, r12b ; of the numbers.
sub r13b, 48
It is unnecessary to first copy R11B to another register. Just process the sum in the R11B register:
add r11b, r12b ; (*) Using R11B instead of R13B
sub r11b, 48
test r10b, r10b ; Check for carry.
jz .after_if_3 ; if (carry_flag):
xor r10b, r10b ; carry_flag = false
inc r13b ; r13b++, add the carry
.after_if_3:
The carry variable in the R10B register is limited to the values of 0 and 1. If it holds 0, you don't want to do anything, and if it is 1, you want to increment the sum and reset the carry variable. Next code is much simpler and does just that:
add r11b, r10b ; (*) Using R11B instead of R13B
xor r10b, r10b
mov byte[rdx+rcx], r13b ; res[loop_counter] = r13b
Because you store the result of the addition reversed, you had to add a string-reversal routine to the code. This is an inefficient approach. The final result has the same number of digits as the longer of the two inputs, or at most one extra digit in case of a final carry. So you can set up a pointer just past the last digit of the result and store the digits back to front, with no reversal pass:
lea rdx, [rdx+1+r9] ; R9 is longest input
mov byte [rdx], 0 ; Zero-terminated result
Writing an optimized version
Don't do the whole calculation in a single loop. First process the positions that both inputs have in common, then process the positions that remain in the longer of the two inputs. You'll have far fewer conditional branches to execute.
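In C terms, the two-loop layout could be sketched like this (a hypothetical sketch, not the assembly itself; it assumes a is the longer or equally long input, and that res has room for strlen(a) + 2 bytes). Only the carry survives from the first loop into the second, and digits are written back to front so no reversal is needed:

```c
#include <assert.h>
#include <string.h>

/* Two-loop sketch: returns a pointer to the first digit of the
   zero-terminated result inside res. Precondition: strlen(a) >= strlen(b). */
char *add_two_loops(const char *a, const char *b, char *res)
{
    size_t la = strlen(a), lb = strlen(b);
    char *p = res + la + 1;           /* one spare byte for a final carry */
    int carry = 0;
    *p = '\0';
    for (size_t i = 0; i < lb; i++) { /* LoopA: positions both inputs share */
        int sum = a[la - 1 - i] + b[lb - 1 - i] - '0' + carry;
        carry = sum > '9';
        *--p = (char)(carry ? sum - 10 : sum);
    }
    for (size_t i = lb; i < la; i++) { /* LoopB: remainder of longer input */
        int sum = a[la - 1 - i] + carry;
        carry = sum > '9';
        *--p = (char)(carry ? sum - 10 : sum);
    }
    if (carry)
        *--p = '1';
    return p;
}
```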
Carefully choosing your registers can shave off a lot of bytes: avoiding REX prefixes, enabling the short accumulator encodings, and not having to preserve callee-saved registers.
Input: RDI RSI RDX
v v v
a, b, res -> 1158791457863133@ 2678139@ ..................
Setup: RDI R8 RSI R9 RDX
v v v v v
a, b, res -> 1158791457863133@ 2678139@ .................@
Loops: LoopB LoopA
/-------\/-----\
a -> 1158791457863133@
b -> 2678139@
res -> .1158791460541272@
Output: R8 R9 RDX
v v v
a, b, res -> 1158791457863133@ 2678139@ .1158791460541272@
^
RAX
; IN (rdx,rsi,rdi) OUT (rax) MOD (rcx,rdx,rsi,rdi,r8,r9)
push rbx ; The only callee-saved register to preserve
push rdx ; (1)
push rsi ; (2)
push rdi ; (3)
call strlen ; -> RAX is strlen(a)
mov ebx, eax ; For sure less than 4GB
mov rdi, [rsp+8]
call strlen ; -> RAX is strlen(b)
pop rdi ; (3)
pop rsi ; (2)
pop rdx ; (1)
; At RDI are RBX bytes, at RSI are RAX bytes
cmp ebx, eax
jae .NoSwap
xchg rdi, rsi ; Make (RDI:RBX) refer to the longer input
xchg ebx, eax
.NoSwap:
mov ecx, eax ; RAX is length of the shorter input
lea r8, [rdi+rbx] ; R8 points at the terminating zero
lea r9, [rsi+rax] ; R9 points at the terminating zero
lea rdx, [rdx+1+rbx] ; RBX is length of the longer input
xor ebx, ebx ; BL is carry [0,1]
mov [rdx], bl ; Zero-terminating the result
jecxz .NoLoopA
.LoopA: ; Deals with the common positions
movzx eax, byte [r8-1]
add al, [r9-1]
sub al, '0'
add al, bl ; Plus carry
xor ebx, ebx ; Clear carry
cmp al, '9'
jbe .CharOK
sub al, 10
inc ebx ; Set carry
.CharOK:
mov [rdx-1], al
dec r9
dec r8
dec rdx
dec rcx
jnz .LoopA
.NoLoopA:
cmp r8, rdi
jna .NoLoopB
.LoopB: ; Deals with remainder of the longer input
movzx eax, byte [r8-1]
add al, bl ; Plus carry
xor ebx, ebx ; Clear carry
cmp al, '9'
jbe .CharOK_
sub al, 10
inc ebx ; Set carry
.CharOK_:
mov [rdx-1], al
dec r8
dec rdx
cmp r8, rdi
ja .LoopB
.NoLoopB:
test bl, bl ; Check for a final carry.
jz .Done
dec rdx
mov byte [rdx], '1'
.Done:
mov rax, rdx ; RAX is ptr to zero-terminated result
pop rbx
ret
Tip: The code can be optimized with a further loop: once the addition in LoopB no longer produces a carry, we can exit that loop and copy the remainder of the string unmodified. If your bigints have very different lengths, this can speed things up a lot...
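The tip above can be sketched in C (hypothetical helper, assuming a is the longer input): once the carry dies out after the common positions, the remaining prefix of a is copied in one bulk memcpy instead of digit by digit.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the early-exit tip. Returns a pointer to the first digit
   of the zero-terminated result inside res.
   Precondition: strlen(a) >= strlen(b). */
char *add_early_exit(const char *a, const char *b, char *res)
{
    size_t la = strlen(a), lb = strlen(b);
    char *p = res + la + 1;
    int carry = 0;
    *p = '\0';
    for (size_t i = 0; i < lb; i++) {      /* common positions */
        int sum = a[la - 1 - i] + b[lb - 1 - i] - '0' + carry;
        carry = sum > '9';
        *--p = (char)(carry ? sum - 10 : sum);
    }
    size_t i = lb;
    while (i < la && carry) {              /* propagate the carry only */
        int sum = a[la - 1 - i] + carry;
        carry = sum > '9';
        *--p = (char)(carry ? sum - 10 : sum);
        i++;
    }
    if (carry) {
        *--p = '1';
    } else if (i < la) {                   /* carry is dead: bulk copy */
        p -= la - i;
        memcpy(p, a, la - i);
    }
    return p;
}
```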
Refinement
Is there any way to make it so you directly write to char *res instead of returning a pointer to a part of it, without affecting the speed?
The little detail that stands in the way of what you desire is the possibility of having to prepend a "1" digit in case of a final carry. The probability that a final carry occurs is, in my opinion, extremely small when working with these very long numbers. Therefore I suggest rewriting the algorithm to ignore a potential final carry at first; if in the end we do discover one, we copy the result one byte up in memory and prepend the "1" digit. Sure, this costs extra when it runs, but it will almost never have to run.
; Version that leaves RDX unchanged
; IN (rdx,rsi,rdi) OUT () MOD (rax,rcx,rsi,rdi,r8,r9)
push rbx ; The only callee-saved register to preserve
push rdx ; (1)
push rsi ; (2)
push rdi ; (3)
call strlen ; -> RAX is strlen(a)
mov ebx, eax ; For sure less than 4GB
mov rdi, [rsp+8]
call strlen ; -> RAX is strlen(b)
pop rdi ; (3)
pop rsi ; (2)
pop rdx ; (1)
; At RDI are RBX bytes, at RSI are RAX bytes
cmp ebx, eax
jae .NoSwap
xchg rdi, rsi ; Make (RDI:RBX) refer to the longer input
xchg ebx, eax
.NoSwap:
mov ecx, eax ; RAX is length of the shorter input
lea r8, [rdi+rbx] ; R8 points at the terminating zero
lea r9, [rsi+rax] ; R9 points at the terminating zero
add rdx, rbx ; RBX is length of the longer input
push rdx ; (4) "in case there's a final carry"
xor ebx, ebx ; BL is carry [0,1]
mov [rdx], bl ; Zero-terminating the result
jecxz .NoLoopA
.LoopA: ; Deals with the common positions
movzx eax, byte [r8-1]
add al, [r9-1]
sub al, '0'
add al, bl ; Plus carry
xor ebx, ebx ; Clear carry
cmp al, '9'
jbe .CharOK
sub al, 10
inc ebx ; Set carry
.CharOK:
mov [rdx-1], al
dec r9
dec r8
dec rdx
dec rcx
jnz .LoopA
.NoLoopA:
cmp r8, rdi
jna .NoLoopB
.LoopB: ; Deals with remainder of the longer input
movzx eax, byte [r8-1]
add al, bl ; Plus carry
xor ebx, ebx ; Clear carry
cmp al, '9'
jbe .CharOK_
sub al, 10
inc ebx ; Set carry
.CharOK_:
mov [rdx-1], al
dec r8
dec rdx
cmp r8, rdi
ja .LoopB
.NoLoopB:
pop rcx ; (4)
test bl, bl ; Check for a final carry.
jz .Done
.CopyUp:
mov al, [rcx]
mov [rcx+rbx], al ; RBX=1
dec rcx
cmp rcx, rdx
jae .CopyUp
mov byte [rdx], '1'
.Done:
pop rbx ; Unchanged RDX is ptr to zero-terminated result
ret
Not only does the extra .CopyUp code very seldom have to run; since it is a simple byte copy, you can also replace it with any specialized version (or just call one) should you deem that useful.
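One caveat when replacing .CopyUp with a library call: the source and destination overlap by design, which is exactly why the assembly copies top-down. In C, memcpy on overlapping buffers is undefined behavior, so the equivalent would use memmove (hypothetical helper for illustration):

```c
#include <assert.h>
#include <string.h>

/* C equivalent of the .CopyUp path: shift the n result digits plus
   the terminating zero one byte up, then prepend the carry digit.
   memmove is required because source and destination overlap. */
void prepend_carry_digit(char *res, size_t n)
{
    memmove(res + 1, res, n + 1);  /* n digits plus the '\0' */
    res[0] = '1';
}
```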
Comments:
- avighnac (Sep 26, 2022): "Wow, thank you. I compared my code with the modifications and your optimized version with strings of length 10M. Yours was faster by 0.3 seconds (2.9 vs 2.6). Is there any way to make it so you directly write to char *res instead of returning a pointer to a part of it, without affecting the speed?"
- Sep Roland (Oct 2, 2022): "@avighnac I have amended my answer with a refinement that addresses the issue from your comment."
- avighnac (Oct 4, 2022): "Yes, hello, I noticed. Sorry, I haven't been active on this website over the last week and will not be this week either. My exams are on right now, but I'll definitely check this out after they're over! Again, sorry for the late reply!"
- avighnac (Oct 6, 2022): "Hey, could you explain why your code's slightly faster? What did you do that optimized it?"
- Sep Roland (Oct 6, 2022): "@avighnac I take it you are referring to the 0.3 s speed increase between your code with the optimizations from the first part of my answer applied, and the optimized version I proposed in the middle part. The key points are (1) branch reduction, and (2) short and simple instructions."