I have written two Brainfuck interpreters: one in C++, and another in C++ with an assembly inner loop. My interpreter code can be found here (C++/ASM, requires MSVC and MASM) and here (C++).
Both are benchmarked with the mandelbrot BF program, run 20 times. After tuning for a while, I managed to make the asm version about 10% faster than the C++ version, but I'm wondering if it can be made even faster. The main bottlenecks appear to be branch mispredictions and the uop cache. The benchmark was run on a Skylake-X CPU.
The inner loop looks like this:
interpret proc C
.code
push rbx ; these are callee saved ?
push rsi
push rdi
push rbp
push r12
push r13
push r14
push r15
sub rsp, 32 ; shadow space?
xor rsi, rsi ; "program counter", which instruction we are at
mov rdi, r8 ; "memory pointer", where we are pointing to in the interpreter memory
xor r14, r14 ; store the value of current mem cell, which of course starts at 0
lea r15, [jumptable] ; load the address of the table
mov r13, rcx ; base address of instruction array
lbl_interp_loop: ; beginning of new interpreter cycle
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Loop: ; '[' : if the current cell is zero, skip forward past the matching ']'
cmp byte ptr [rdi], 0
jne lbl_set_loop_ip ; non-zero: fall into the loop body
add rsi, r11 ; zero: skip forward by the stored distance
lbl_set_loop_ip:
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Return: ; ']' : if the current cell is non-zero, jump back into the loop
cmp byte ptr [rdi], 0
jne lbl_set_return_ip ; non-zero: loop again
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
lbl_set_return_ip:
sub rsi, r11 ; jump backward by the stored distance
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Right: ; '>' run: move the memory pointer right by the repeat count
add rdi, r11
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Left: ; '<' run: move the memory pointer left by the repeat count
sub rdi, r11
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Add: ; '+' run: add the repeat count to the current cell
add byte ptr [rdi], r11b
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
ALIGN 4
lbl_Minus: ; '-' run: subtract the repeat count from the current cell
sub byte ptr [rdi], r11b
movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction
jmp qword ptr [r15 + r10 * 8]
lbl_Print: ; '.' : output the current cell
movzx rcx, byte ptr [rdi]
call printChar
jmp lbl_interp_loop
lbl_Read: ; ',' : no-op in this snippet
jmp lbl_interp_loop
lbl_Invalid: ; non-BF characters are skipped
jmp lbl_interp_loop
jumptable: ; MASM cannot emit relative address in .data, so must put it in .code this way (agner)
dq lbl_Loop
dq lbl_Return
dq lbl_Right
dq lbl_Left
dq lbl_Add
dq lbl_Minus
dq lbl_Print
dq lbl_Read
dq lbl_Invalid
dq lbl_End
lbl_End:
add rsp, 32
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rdi
pop rsi
pop rbx
ret
interpret endp
end
Here is the report from V-Tune:
- Before this inner loop begins, the interpreter does a couple of simple transformations on the BF source, turning it into instructions that consist of a 2-byte opcode and a 2-byte operand. Repeated instructions are collapsed into one instruction, with the number of repeats stored in the operand. Loop/Return instructions store the distance to the instruction they jump to. (A rough sketch of this pass is shown after this list.)
- Because of (1), there is no advantage to keeping the value of the current cell in a register. In fact, when I tried that, performance got slightly worse. That is because when the interpreter encounters Left/Right instructions, it must save the current cell to memory and load the next one, probably creating a data dependency on the register.
- The code for the Loop and Return instructions differs because after changing the code for Loop ([ in BF) to put the rare case right after the cmp, I found performance increased considerably, but the same change backfired for the Return (] in BF) instruction. Putting the rare case after the cmp seems counter-intuitive, but it works here; V-Tune indicates a reduction in the branch misprediction rate too.
- Another strange example is right after the label lbl_set_return_ip:, where I originally wrote

mov rax, r11
sub rsi, rax

but then found that rewriting it as below noticeably improved performance:

sub rsi, r11

This is strange because I expected the mov to be pretty much free through register renaming.
- The same data dependency problem is why I rely on array indexing instead of incrementing the pointer. This code

movzx r10, word ptr [r13 + rsi*4]
movzx r11, word ptr [r13 + rsi*4 + 2]
inc rsi ; advance to the next instruction

turned out to be faster than

add rsi, 4 ; rsi contains the address of the current instruction
movzx r10, word ptr [rsi]
movzx r11, word ptr [rsi + 2]
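For illustration, here is a minimal C++ sketch of what such a pre-compilation pass might look like, matching the "2-byte opcode + 2-byte operand" description above; the struct, enum values, and function name are assumptions made for this example (not the OP's actual code), and the patching of loop jump distances is omitted.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical layout: 2-byte opcode (jump-table index) + 2-byte operand.
struct Insn {
    uint16_t op;
    uint16_t operand; // repeat count, or jump distance for Loop/Return
};

// Opcode numbering follows the order of the jump table in the asm above.
enum Op : uint16_t { Loop, Return, Right, Left, Add, Minus, Print, Read, Invalid, End };

std::vector<Insn> compile(const std::string& src) {
    std::vector<Insn> out;
    for (size_t i = 0; i < src.size(); ) {
        char c = src[i];
        auto run = [&](uint16_t op) {            // collapse a run of identical characters
            uint16_t n = 0;
            while (i < src.size() && src[i] == c) { ++n; ++i; }
            out.push_back({op, n});
        };
        switch (c) {
            case '>': run(Right); break;
            case '<': run(Left);  break;
            case '+': run(Add);   break;
            case '-': run(Minus); break;
            case '[': out.push_back({Loop,   0}); ++i; break; // distance patched later
            case ']': out.push_back({Return, 0}); ++i; break; // distance patched later
            case '.': out.push_back({Print,  0}); ++i; break;
            case ',': out.push_back({Read,   0}); ++i; break;
            default:  ++i; break;                             // comment characters are dropped
        }
    }
    out.push_back({End, 0});
    return out;
}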
1 Answer
I'd expect that most of the performance problem comes from the indirect branch (jmp qword ptr [r15 + r10 * 8]) - it's relatively unpredictable and should cause lots of branch mispredictions, bad speculation and pipeline stalls.
There are really only two ways to deal with this:
1) Amortise the cost. For example, if you packed 4 instructions into r10 and had a significantly larger jump table, you could cut the number of indirect branches the CPU has to do to 25% (and therefore reduce the branch mispredictions, etc.). Note that BF only really has 8 instructions, so you only need 3-bit opcodes; packing 4 together would cost 12 bits and give you a 4096-entry jump table.
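As a purely illustrative sketch of that packing (in C++, using the existing jump-table order for the opcode numbers; the function name and the choice to put the first instruction in the low bits are assumptions):

#include <cstdint>

// Pack four 3-bit BF opcodes (values 0..7) into one 12-bit index, so a single
// indirect jump through a 4096-entry table dispatches four BF instructions.
constexpr uint16_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return static_cast<uint16_t>((a & 7) | ((b & 7) << 3) | ((c & 7) << 6) | ((d & 7) << 9));
}

// e.g. the sequence "+ + > -" with Add=4, Right=2, Minus=5 selects table entry 2724:
static_assert(pack4(4, 4, 2, 5) == 2724, "12-bit packed index");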
2) Use JIT techniques. This would probably be an order of magnitude faster, but would involve a significant "redesign and rewrite". [ADDED] For one possibility: instead of pre-compiling the BF into a "16-bit opcode + 16-bit operand" form, you could pre-compile the BF instructions into a series of call ... 80x86 instructions and then execute the resulting series of calls. This wouldn't be "full JIT" (it wouldn't generate native code for the actual work), but it would use "JIT techniques" (generating code to optimise the whole "fetch and decode" part of the interpreter).
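To make the [ADDED] suggestion more concrete, here is a rough, minimal sketch of the "series of calls" idea for Windows/MSVC. The handler names, the emit function, and the use of an absolute mov rax, imm64 / call rax pair (rather than rel32 calls) are my assumptions for this example; a real version would want error handling, stricter W^X page protection, and state passed in registers rather than globals.

#include <windows.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical handlers working on global interpreter state.
unsigned char cells[30000];
unsigned char* ptr = cells;
void op_inc()   { ++*ptr; }   // '+'
void op_right() { ++ptr; }    // '>'

// Emit "sub rsp, 40", then "mov rax, imm64 / call rax" per BF instruction,
// then "add rsp, 40 / ret", into an executable buffer. (Error handling omitted.)
void* emit_call_thread(const std::vector<void (*)()>& ops) {
    size_t size = 4 + ops.size() * 12 + 5;
    auto* buf = static_cast<unsigned char*>(
        VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE));
    unsigned char* p = buf;
    *p++ = 0x48; *p++ = 0x83; *p++ = 0xEC; *p++ = 0x28;  // sub rsp, 40 (shadow space + alignment)
    for (auto fn : ops) {
        uint64_t target = reinterpret_cast<uint64_t>(fn);
        *p++ = 0x48; *p++ = 0xB8;                        // mov rax, imm64
        std::memcpy(p, &target, 8); p += 8;
        *p++ = 0xFF; *p++ = 0xD0;                        // call rax
    }
    *p++ = 0x48; *p++ = 0x83; *p++ = 0xC4; *p++ = 0x28;  // add rsp, 40
    *p = 0xC3;                                           // ret
    return buf;
}

int main() {
    std::vector<void (*)()> ops = { op_inc, op_inc, op_right, op_inc }; // "++>+"
    auto run = reinterpret_cast<void (*)()>(emit_call_thread(ops));
    run(); // one direct call per BF instruction; no interpreter dispatch loop
    return cells[0] + cells[1]; // 2 + 1
}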
- WRT point 1, it sounds like some code generation would be required? I don't quite understand how to amortize it that way. OTOH I'm afraid that a 4096-entry jump table is quite large, and the generated code to accompany it would be way too large, missing L1i cache every 4 BF instructions. Finally, with 3 bits I can encode 8 opcodes, but if I would like to have additional "meta" opcodes that encode multiple BF instructions, there will be no space left. WRT point 2, I will consider JIT at a later time, but this experiment is about pushing the basic interpreter to its limit. – Ethan, Aug 27, 2018 at 9:39
- @Ethan: 1) is a much more aggressive version of my fusion suggestion from this comment. And yeah, I'd be worried about uop-cache misses and BTB misses (more different branch targets = harder to predict). L1i misses not so much. But a given BF program wouldn't always use all of those 4096 entries in a hot loop, so not all of them would have to stay hot in the uop cache. And yeah, you'd probably want to macro it. – Peter Cordes, Aug 27, 2018 at 11:40
- @Ethan: You would need to do some code generation; but you're already doing some code generation (e.g. converting to "2 byte opcode + 2 byte operand"). If you want, you could think of it as a single 12-bit meta opcode (or a pair of 6-bit meta opcodes, or four separate 3-bit opcodes) - there's no practical difference. – Brendan, Aug 27, 2018 at 12:55
- Yeah, I'd draw the line in the sand in the same way as @PeterCordes. There's definitely a spectrum of techniques from totally vanilla interpretation to full-on JIT, but to me it crosses the JIT line when it executes code that was written at runtime by the process. I.e., exactly when it needs to use mmap or whatever to change the protections to W+X as mentioned. For example, executing a generated sequence of call instructions, one per instruction, is JIT, but executing an existing sequence of unrolled call instructions with memory-source targets that you generated is not... – BeeOnRope, Sep 6, 2018 at 23:49
- ... even though the execution pattern and performance characteristics are likely to be very similar! – BeeOnRope, Sep 6, 2018 at 23:50
- movzx r10d, word ptr [rsi+4] / movzx r11d, word ptr [rsi+4 + 2] / add rsi, 4 puts the add after the loads, instead of before, so it's not adding to the load/use latency. (And BTW, compiling to 2-byte opcode / operand bytecode sounds reasonable. Loading with movzx word ptr is good. Writing to a 64-bit dest register is redundant vs. letting implicit zero extension work, but it shouldn't matter. Code-size probably doesn't really matter here, so avoiding REX prefixes normally won't help.)
- _TEXT2 SEGMENT ALIGN(64) 'CODE' instead of just .code
- add/minus are by their nature single instruction, just adjust the byte operand for minus. The loop/return are then almost identical, just the test condition is different, hmm...
- movsx rcx, [rsi] / movzx eax, cl / sar rcx,8 has no extra latency for the opcode, instead of 2c extra latency. And a shorter dep chain for the operand, too. This technique would work with the current 16:16, except that movzx r10d, r11w isn't zero-latency (only byte->dword movzx is zero latency).
- global on your code labels and declare them as extern "C" extern const char my_label[] in C++ and do pointer subtraction, or create a global offsets / offsets: dw 0, label1-base, label2-base, ... array that the C++ can read as extern "C" const short offsets[], indexed by whatever you're using now for opcodes.