
Re: Possible Bug in bitlib under Windows?



David Given wrote:
> Mike Pall wrote:
> > bor/band/bxor 129.5 ns 67.0 ns 13.7 ns 0.0 ns
> 
> That's impressive performance for LuaJIT 2; does that 0.0ns also include
> the overhead needed to call out to a Lua extension function in C, or is
> there some streamlined mechanism to emit the appropriate instructions
> inline with the generated code? (Or, more prosaically, does your
> benchmark not count the C overhead?)

There is no C overhead. The LuaJIT 2.x interpreter just dispatches
to an internal "fast function", written in assembler. That and the
argument setup cause the 13.7ns (mainly the dispatch overhead for 3
bytecodes).

The trace compiler basically ignores all control flow, including
function calls. It uses a natural-loop-first (NLF) region-selection
algorithm and then extends traces from the exits of the loops it
has formed. Loops are also pre-rolled to enable hoisting of
loop-invariant code (e.g. the check for the specialization to the
called function).

It's able to compile this:

    local x=0; for i=1,1e9 do x=x+bit.bor(i,1) end

into this machine code (only inner loop shown):

    [...]
    ->loop:
      mov edi, esi
      or edi, +0x01
      cvtsi2sd xmm6, edi
      addsd xmm7, xmm6
      add esi, +0x01
      cmp esi, 0x3b9aca00
      jle ->loop
      jmp ->EXIT_3

Note that the reduction variable needs to be a double in this case
(the sum is larger than an int32 can hold). So the bottleneck is the
addsd dependency chain with a latency of 3 cycles per instruction.
This is where the basic loop overhead comes from (1 ns = 3 cycles at
3 GHz). The remaining opcodes have plenty of execution bandwidth left
and thus do not contribute to the final result.
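
The two numbers in the paragraph above can be checked with a few lines of Lua. This is only a sketch: the helper names `bor_sum` and `ns_per_iteration` are made up for illustration, and the closed form assumes an even iteration count (odd i is unchanged by `bor(i,1)`, even i becomes i+1).

```lua
-- Closed-form value of the sum of bor(i, 1) for i = 1..n, n even.
-- Odd i contributes i, even i contributes i + 1, so the total is
-- n*(n+1)/2 plus one extra for each of the n/2 even values.
local function bor_sum(n)
  return n * (n + 1) / 2 + n / 2
end

local INT32_MAX = 2^31 - 1
local total = bor_sum(1e9)      -- roughly 5e17
print(total > INT32_MAX)        -- the accumulator cannot stay an int32

-- One addsd per iteration, each waiting for the previous result:
-- time per iteration in ns = latency in cycles / clock rate in GHz.
local function ns_per_iteration(latency_cycles, clock_ghz)
  return latency_cycles / clock_ghz
end

local ns = ns_per_iteration(3, 3)
print(ns)
```

With a 3-cycle latency at 3 GHz this gives 1 ns per iteration, so the full 1e9-iteration loop would take on the order of one second.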

OK, so this is not a useful microbenchmark for measuring the
overhead of individual machine code instructions (*). But the
intention was to show a (coarse) relative comparison of the cost of
bit operations across Lua implementations.

(*) Like most other integer instructions, "or reg, imm" has 1 uop and
    1 cycle latency on a Core 2.
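
For anyone who wants to repeat a coarse comparison of this kind, here is a minimal timing sketch. It is not the harness used for the numbers quoted above. The helper names (`find_bor`, `time_loop`), the reduced iteration count, and the idea of subtracting a plain-addition baseline loop are all assumptions. It also caches `bor` in a local rather than looking up `bit.bor` on every iteration as the one-liner does.

```lua
-- Hypothetical helper: pick whichever bitwise OR this Lua provides.
local function find_bor()
  local ok, lib = pcall(require, "bit")      -- LuaJIT built-in or Lua BitOp
  if ok and type(lib) == "table" and lib.bor then return lib.bor end
  if rawget(_G, "bit32") then return bit32.bor end    -- Lua 5.2
  local chunk = (loadstring or load)("return function(a, b) return a | b end")
  return chunk and chunk()                   -- Lua 5.3+ operator
end

local bor = assert(find_bor(), "no bitwise OR available")

-- Runs the loop n times; returns elapsed CPU seconds and the checksum.
local function time_loop(n, use_bor)
  local x = 0
  local t0 = os.clock()
  if use_bor then
    for i = 1, n do x = x + bor(i, 1) end
  else
    for i = 1, n do x = x + i end            -- baseline without the bit op
  end
  return os.clock() - t0, x
end

local n = 1e6
local t_base, sum_base = time_loop(n, false)
local t_bor, sum_bor = time_loop(n, true)

-- Difference between the two loops, spread over n iterations.
local ns_per_op = (t_bor - t_base) / n * 1e9
print(string.format("baseline %.3f s, bor %.3f s, ~%.1f ns per bor",
                    t_base, t_bor, ns_per_op))
```

The resolution of `os.clock` is limited, so short runs are noisy and the difference can even come out slightly negative. Raise `n` for steadier figures.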
--Mike
