
Re: Suggestions on implementing an efficient instruction set simulator in LuaJIT2



On 15 February 2011 01:33, Mike Pall <mikelu-1102@mike.de> wrote:
> Each function ends with a tail call to the next target(s). This
> calls an existing function or triggers a new compilation. LuaJIT
> recognizes tail recursion and turns it into loops, so this ought
> to perform well.
Thank you for the suggestion. For some reason I hadn't considered using
tail calls for control flow, but it makes things very straightforward.
An initial prototype implementation, with a number of obvious
opportunities for further speed improvement, can already beat the C
instruction set simulator on some benchmarks.
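For anyone following along, the shape of the dispatch is roughly as in
the minimal, self-contained sketch below. It only illustrates the
tail-call idea from Mike's suggestion, not my actual simulator; the
names (handlers, dispatch, and the toy cpu table) are made up:

local handlers = {}

local function dispatch(cpu)
  -- in the real simulator a miss here would trigger compilation of the
  -- block at cpu.pc; in this toy there is only one pre-built handler
  return handlers[cpu.pc](cpu)
end

handlers[0] = function (cpu)
  cpu.acc = cpu.acc + 1                 -- simulated work for this block
  if cpu.acc >= 1e7 then return cpu.acc end
  return dispatch(cpu)                  -- tail call to the next target
end

print(dispatch({ pc = 0, acc = 0 }))

Because every handler ends in a proper tail call, the Lua stack never
grows, and LuaJIT can compile the recursion as a loop.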
I've found that one part of the structure of my program completely
throws off the compiler. For memory access, I have functions like the
following:
-- check_mem_range, check_mem_alignment and rlat rely on ffi, bit, C,
-- cpu, warn and int32p_t defined elsewhere in the program.
local function check_mem_range(addr)
  if addr >= cpu.memsize then
    warn("word read/write from 0x%x outside memory range\n", addr)
    cpu.exception = C.SIGSEGV
    return false
  end
  return true
end

local function check_mem_alignment(addr, mask)
  if bit.band(addr, mask) ~= 0 then
    warn("word read/write from unaligned address: 0x%x\n", addr)
    cpu.exception = C.SIGBUS
    return false
  end
  return true
end

function rlat(addr)
  if not check_mem_range(addr) then return 0 end
  if not check_mem_alignment(addr, 3) then return 0 end
  return ffi.cast(int32p_t, cpu.memory + addr)[0]
end

-- and similar for wlat, rhat, rbat, etc.
Obviously, now that I'm doing run-time generation of Lua code, it would
make a lot more sense to inline all of this into the generated opcode
body. When writing the code initially, I had hoped that LuaJIT would
simply inline check_mem_alignment and check_mem_range, but with the
standard heuristics the presence of these calls means it fails to
generate any acceptable traces, as far as I can see because it runs
into 'too many snapshots'.
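To illustrate, the generated body for a word-load opcode might end up
looking roughly like the fragment below once the checks are folded in.
This is purely a sketch: rd, rs and offset are illustrative names, and
cpu, bit, C, ffi and int32p_t are the same objects used by rlat above.

local addr = cpu.regs[rs] + offset
if addr >= cpu.memsize then
  cpu.exception = C.SIGSEGV            -- out-of-range word access
  cpu.regs[rd] = 0
elseif bit.band(addr, 3) ~= 0 then
  cpu.exception = C.SIGBUS             -- unaligned word access
  cpu.regs[rd] = 0
else
  cpu.regs[rd] = ffi.cast(int32p_t, cpu.memory + addr)[0]
end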
To give an idea of what a massive difference this makes, consider the
following numbers from my test program (the simulated code is a naive
Fibonacci implementation). Using my prototype codegen with the
check_mem calls included, it takes ~2 minutes to run. Just adding
-Omaxsnap=200 (up from the default of 100) brings that down
drastically, to a 1.1-1.4 second runtime. Commenting out the calls
gives a ~1 second runtime with the default parameters.
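For reference, the same limit can also be raised from inside the
program rather than on the luajit command line, via the standard
jit.opt module:

-- equivalent to passing -Omaxsnap=200 to luajit on the command line
require("jit.opt").start("maxsnap=200")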
I'm very pleased with the performance numbers I've had so far (though I
recognise fib is something of a best case), and I've hardly begun to
fix the cases where unnecessary work is being done. I'm only posting at
this point because I thought it was interesting just how much
difference the maxsnap parameter can make in this case.
Alex
