I have written two Brainfuck interpreters: one in C++, and another in C++ with an assembly inner loop. My interpreter code can be found here (C++/ASM, requires MSVC and MASM) and here (C++).
Both are benchmarked with the mandelbrot BF program, run 20 times. After tuning for a while, I managed to make the asm version about 10% faster than the C++ version, but I'm wondering if it can be made even faster. The main bottlenecks appear to be branch mispredictions and the uop cache. The benchmark was run on a Skylake-X CPU.
The inner loop looks like this:
interpret proc C
.code
push rbx ; these are callee saved ?
push rsi
push rdi
push rbp
push r12
push r13
push r14
push r15
sub rsp, 32 ; shadow space?
xor rsi, rsi ; "program counter", which instruction we are at
mov rdi, r8 ; "memory pointer", where we are pointing to in the interpreter memory
xor r14, r14 ; store the value of current mem cell, which of course starts at 0
lea r15, [jumptable] ; load the address of the table
mov r13, rcx ; base address of instruction array
lbl_interp_loop: ; beginning of new interpreter cycle
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Loop: ; '[' : if the current cell is zero, skip forward past the matching ']'
cmp byte ptr [rdi], 0
jne lbl_set_loop_ip ; non-zero: fall into the loop body
add rsi, r11 ; zero: skip forward by the stored distance
lbl_set_loop_ip:
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Return: ; ']' : if the current cell is non-zero, jump back into the loop
cmp byte ptr [rdi], 0
jne lbl_set_return_ip ; non-zero: loop again
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
lbl_set_return_ip:
sub rsi, r11 ; jump backward by the stored distance
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Right: ; '>' run: move the memory pointer right by the repeat count
add rdi, r11
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Left: ; '<' run: move the memory pointer left by the repeat count
sub rdi, r11
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Add: ; '+' run: add the repeat count to the current cell
add byte ptr [rdi], r11b
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Minus: ; '-' run: subtract the repeat count from the current cell
sub byte ptr [rdi], r11b
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
lbl_Print: ; '.' : output the current cell
movzx rcx, byte ptr [rdi]
call printChar
jmp lbl_interp_loop
lbl_Read: ; ',' : no-op in this snippet
jmp lbl_interp_loop
lbl_Invalid: ; non-BF characters are skipped
jmp lbl_interp_loop
jumptable: ; MASM cannot emit relative address in .data, so must put it in .code this way (agner)
dq lbl_Loop
dq lbl_Return
dq lbl_Right
dq lbl_Left
dq lbl_Add
dq lbl_Minus
dq lbl_Print
dq lbl_Read
dq lbl_Invalid
dq lbl_End
lbl_End:
add rsp, 32
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rdi
pop rsi
pop rbx
ret
interpret endp
end
Here is the report from V-Tune:
- Before this inner loop begins, the interpreter does a couple of simple transformations on the BF source, turning it into instructions that consist of a 2-byte opcode and a 2-byte operand. Repeated instructions are collapsed into one instruction, with the number of repeats stored in the operand. Loop/Return instructions store the distance to the instruction they jump to. (A rough sketch of this pass is shown after this list.)
- Because of (1), there is no advantage to keeping the value of the current cell in a register. In fact, when I tried that, performance got slightly worse. That is because when the interpreter encounters Left/Right instructions, it must save the current cell to memory and load the next one, probably creating a data dependency on the register.
- The code for the Loop and Return instructions differs because after changing the code for Loop ([ in BF) to put the rare case right after the cmp, I found performance increased considerably, but the same change backfired for the Return (] in BF) instruction. Putting the rare case after the cmp seems counter-intuitive, but it works here; V-Tune indicates a reduction in the branch misprediction rate too.
- Another strange example is right after the label lbl_set_return_ip:, where I originally wrote

mov rax, r11
sub rsi, rax

but then found that rewriting it as below noticeably improved performance:

sub rsi, r11

This is strange because I expected the mov to be pretty much free through register renaming.
- The same data dependency problem is why I rely on array indexing instead of incrementing the pointer. This code

movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction

turned out to be faster than

add rsi, 4 ; rsi contains the address of the current instruction
movzx r10, word ptr [rsi]
movzx r11, word ptr [rsi + 2]
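For illustration, here is a minimal C++ sketch of what such a pre-compilation pass might look like, matching the "2-byte opcode + 2-byte operand" description above; the struct, enum values, and function name are assumptions made for this example (not the OP's actual code), and the patching of loop jump distances is omitted.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical layout: 2-byte opcode (jump-table index) + 2-byte operand.
struct Insn {
    uint16_t op;
    uint16_t operand; // repeat count, or jump distance for Loop/Return
};

// Opcode numbering follows the order of the jump table in the asm above.
enum Op : uint16_t { Loop, Return, Right, Left, Add, Minus, Print, Read, Invalid, End };

std::vector<Insn> compile(const std::string& src) {
    std::vector<Insn> out;
    for (size_t i = 0; i < src.size(); ) {
        char c = src[i];
        auto run = [&](uint16_t op) {            // collapse a run of identical characters
            uint16_t n = 0;
            while (i < src.size() && src[i] == c) { ++n; ++i; }
            out.push_back({op, n});
        };
        switch (c) {
            case '>': run(Right); break;
            case '<': run(Left);  break;
            case '+': run(Add);   break;
            case '-': run(Minus); break;
            case '[': out.push_back({Loop,   0}); ++i; break; // distance patched later
            case ']': out.push_back({Return, 0}); ++i; break; // distance patched later
            case '.': out.push_back({Print,  0}); ++i; break;
            case ',': out.push_back({Read,   0}); ++i; break;
            default:  ++i; break;                             // comment characters are dropped
        }
    }
    out.push_back({End, 0});
    return out;
}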
1 Answer
I'd expect that most of the performance problem comes from the indirect branch (jmp qword ptr [r15 + r10 * 8]) - it's relatively unpredictable and should cause lots of branch mispredictions, bad speculation and pipeline stalls.
There are really only two ways to deal with this:
1) Amortise the cost. For example, if you packed 4 instructions into r10 and had a significantly larger jump table, you could cut the number of indirect branches the CPU has to do to 25% (and therefore reduce the branch mispredictions, etc.). Note that BF only really has 8 instructions, so you only need 3-bit opcodes; packing 4 together would cost 12 bits and give you a 4096-entry jump table.
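As a purely illustrative sketch of that packing (in C++, using the existing jump-table order for the opcode numbers; the function name and the choice to put the first instruction in the low bits are assumptions):

#include <cstdint>

// Pack four 3-bit BF opcodes (values 0..7) into one 12-bit index, so a single
// indirect jump through a 4096-entry table dispatches four BF instructions.
constexpr uint16_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return static_cast<uint16_t>((a & 7) | ((b & 7) << 3) | ((c & 7) << 6) | ((d & 7) << 9));
}

// e.g. the sequence "+ + > -" with Add=4, Right=2, Minus=5 selects table entry 2724:
static_assert(pack4(4, 4, 2, 5) == 2724, "12-bit packed index");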
2) Use JIT techniques. This would probably be an order of magnitude faster, but would involve a significant "redesign and rewrite". [ADDED] For one possibility: instead of pre-compiling the BF into a "16-bit opcode + 16-bit operand" form, you could pre-compile the BF instructions into a series of call ... 80x86 instructions and then execute the resulting series of calls. This wouldn't be "full JIT" (it wouldn't generate native code for the actual work), but it would use "JIT techniques" (generating code to optimise the whole "fetch and decode" part of the interpreter).
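To make the [ADDED] suggestion more concrete, here is a rough, minimal sketch of the "series of calls" idea for Windows/MSVC. The handler names, the emit function, and the use of an absolute mov rax, imm64 / call rax pair (rather than rel32 calls) are my assumptions for this example; a real version would want error handling, stricter W^X page protection, and state passed in registers rather than globals.

#include <windows.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical handlers working on global interpreter state.
unsigned char cells[30000];
unsigned char* ptr = cells;
void op_inc()   { ++*ptr; }   // '+'
void op_right() { ++ptr; }    // '>'

// Emit "sub rsp, 40", then "mov rax, imm64 / call rax" per BF instruction,
// then "add rsp, 40 / ret", into an executable buffer. (Error handling omitted.)
void* emit_call_thread(const std::vector<void (*)()>& ops) {
    size_t size = 4 + ops.size() * 12 + 5;
    auto* buf = static_cast<unsigned char*>(
        VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE));
    unsigned char* p = buf;
    *p++ = 0x48; *p++ = 0x83; *p++ = 0xEC; *p++ = 0x28;  // sub rsp, 40 (shadow space + alignment)
    for (auto fn : ops) {
        uint64_t target = reinterpret_cast<uint64_t>(fn);
        *p++ = 0x48; *p++ = 0xB8;                        // mov rax, imm64
        std::memcpy(p, &target, 8); p += 8;
        *p++ = 0xFF; *p++ = 0xD0;                        // call rax
    }
    *p++ = 0x48; *p++ = 0x83; *p++ = 0xC4; *p++ = 0x28;  // add rsp, 40
    *p = 0xC3;                                           // ret
    return buf;
}

int main() {
    std::vector<void (*)()> ops = { op_inc, op_inc, op_right, op_inc }; // "++>+"
    auto run = reinterpret_cast<void (*)()>(emit_call_thread(ops));
    run(); // one direct call per BF instruction; no interpreter dispatch loop
    return cells[0] + cells[1]; // 2 + 1
}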
- WRT point 1, it sounds like some code generation would be required? I don't quite understand how to amortize it that way. OTOH I'm afraid that a 4096-entry jump table is quite large, and the generated code to accompany it would be way too large, missing L1i cache every 4 BF instructions. Finally, with 3 bits I can encode 8 opcodes, but if I would like to have additional "meta" opcodes that encode multiple BF instructions, there will be no space left. WRT point 2, I will consider JIT at a later time, but this experiment is about pushing the basic interpreter to its limit. – Ethan, Aug 27, 2018 at 9:39
- @Ethan: 1) is a much more aggressive version of my fusion suggestion from this comment. And yeah, I'd be worried about uop-cache misses and BTB misses (more different branch targets = harder to predict). L1i misses not so much. But a given BF program wouldn't always use all of those 4096 entries in a hot loop, so not all of them would have to stay hot in the uop cache. And yeah, you'd probably want to macro it. – Peter Cordes, Aug 27, 2018 at 11:40
- @Ethan: You would need to do some code generation; but you're already doing some code generation (e.g. converting to "2 byte opcode + 2 byte operand"). If you want, you could think of it as a single 12-bit meta opcode (or a pair of 6-bit meta opcodes, or four separate 3-bit opcodes) - there's no practical difference. – Brendan, Aug 27, 2018 at 12:55
- Yeah, I'd draw the line in the sand in the same way as @PeterCordes. There's definitely a spectrum of techniques from totally vanilla interpretation to full-on JIT, but to me it crosses the JIT line when it executes code that was written at runtime by the process. I.e., exactly when it needs to use mmap or whatever to change the protections to W+X as mentioned. For example, executing a generated sequence of call instructions, one per instruction, is JIT, but executing an existing sequence of unrolled call instructions with memory-source targets that you generated is not... – BeeOnRope, Sep 6, 2018 at 23:49
- ... even though the execution pattern and performance characteristics are likely to be very similar! – BeeOnRope, Sep 6, 2018 at 23:50
- movzx r10d, word ptr [rsi+4] / movzx r11d, word ptr [rsi+4 + 2] / add rsi, 4 puts the add after the loads, instead of before, so it's not adding to the load/use latency. (And BTW, compiling to 2-byte opcode / operand bytecode sounds reasonable. Loading with movzx word ptr is good. Writing to a 64-bit dest register is redundant vs. letting implicit zero extension work, but it shouldn't matter. Code-size probably doesn't really matter here, so avoiding REX prefixes normally won't help.)
- _TEXT2 SEGMENT ALIGN(64) 'CODE' instead of just .code
- add/minus are by their nature single instruction, just adjust the byte operand for minus. The loop/return are then almost identical, just the test condition is different, hmm...
- movsx rcx, [rsi] / movzx eax, cl / sar rcx,8 has no extra latency for the opcode, instead of 2c extra latency. And a shorter dep chain for the operand, too. This technique would work with the current 16:16, except that movzx r10d, r11w isn't zero-latency (only byte->dword movzx is zero latency).
- global on your code labels and declare them as extern "C" extern const char my_label[] in C++ and do pointer subtraction, or create a global offsets / offsets: dw 0, label1-base, label2-base, ... array that the C++ can read as extern "C" const short offsets[], indexed by whatever you're using now for opcodes.