Re: Possible Bug in bitlib under Windows?

Subject: Re: Possible Bug in bitlib under Windows?
From: Mike Pall <mikelu-0812@...>
Date: 13 Dec 2008 20:08:28 +0100

David Given wrote:
> Mike Pall wrote:
> > bor/band/bxor 129.5 ns 67.0 ns 13.7 ns 0.0 ns
> 
> That's impressive performance for LuaJIT 2; does that 0.0ns also include
> the overhead needed to call out to a Lua extension function in C, or is
> there some streamlined mechanism to emit the appropriate instructions
> inline with the generated code? (Or, more prosaically, does your
> benchmark not count the C overhead?)

There is no C overhead. The LuaJIT 2.x interpreter just dispatches
to an internal "fast function", written in assembler. That dispatch
and the argument setup account for the 13.7 ns (mainly the dispatch
overhead for 3 bytecodes).

The trace compiler basically ignores all control flow, including
function calls. It uses a natural-loop-first (NLF) region-selection
algorithm and then extends traces from the exits of the loops it
has formed. Loops are also pre-rolled to enable hoisting of
loop-invariant code (e.g. the check for the specialization to the
called function).
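
As a rough source-level analogy for the hoisting described above (this is not LuaJIT's actual IR or trace format), the check that the called function is still the one the trace was specialized to can move from the loop body to a single check before the loop. The names `or_one`, `lib`, `run_naive` and `run_hoisted` are made up for illustration, and a plain Lua function stands in for `bit.bor` so the sketch runs on a stock interpreter:

```lua
-- Hypothetical stand-in for bit.bor(i, 1): odd i is unchanged, even i becomes i+1.
local function or_one(i)
  return (i % 2 == 0) and (i + 1) or i
end

-- Table lookup that plays the role of resolving `bit.bor` at run time.
local lib = { bor1 = or_one }

-- Naive form: the specialization check runs on every iteration.
local function run_naive(n)
  local x = 0
  for i = 1, n do
    if lib.bor1 ~= or_one then error("specialization guard failed") end
    x = x + or_one(i)  -- the call itself is treated as inlined
  end
  return x
end

-- Hoisted form: nothing in the loop body can change lib.bor1, so the
-- check is loop-invariant and is done once, before the loop.
local function run_hoisted(n)
  if lib.bor1 ~= or_one then error("specialization guard failed") end
  local x = 0
  for i = 1, n do
    x = x + or_one(i)
  end
  return x
end

print(run_naive(10), run_hoisted(10))
```

Both functions compute the same sum; only the placement of the check differs.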

It's able to compile this:

  local x=0; for i=1,1e9 do x=x+bit.bor(i,1) end

into this machine code (only inner loop shown):

  [...]
->loop:
  mov edi, esi
  or edi, +0x01
  cvtsi2sd xmm6, edi
  addsd xmm7, xmm6
  add esi, +0x01
  cmp esi, 0x3b9aca00
  jle ->loop
  jmp ->EXIT_3

Note that the reduction variable needs to be a double in this case
(the sum is larger than an int32 can hold). So the bottleneck is the
addsd dependency chain with a latency of 3 cycles per instruction.
This is where the basic loop overhead comes from (1 ns = 3 cycles at
3 GHz). The remaining opcodes have plenty of execution bandwidth left
and thus do not contribute to the final result.
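
The two numbers in the paragraph above can be checked with a small back-of-the-envelope calculation in Lua. The closed form relies on `bit.bor(i, 1)` leaving odd `i` unchanged and mapping even `i` to `i+1`; the variable names are illustrative only:

```lua
-- Sum of bit.bor(i, 1) for i = 1..iters (iters even):
-- the plain sum iters*(iters+1)/2, plus 1 for each of the iters/2 even values.
local iters = 1e9
local total = iters * (iters + 1) / 2 + iters / 2
local int32_max = 2^31 - 1

print(string.format("total     = %.0f", total))
print(string.format("int32 max = %.0f", int32_max))
print("exceeds int32:", total > int32_max)

-- One addsd per iteration on the critical path: 3 cycles at 3 GHz.
local latency_cycles = 3
local clock_ghz = 3
local ns_per_iter = latency_cycles / clock_ghz
print(string.format("loop-carried latency = %.1f ns/iteration", ns_per_iter))
```

The total is on the order of 5e17, far beyond the int32 range, which is why the accumulator stays a double in the generated code.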

Ok, so this is not a useful microbenchmark for measuring the
overhead of individual machine code instructions (*). But the
intention was to show a (coarse) relative comparison of the cost of
bit operations across Lua implementations.

(*) Like most other integer instructions "or reg, imm" has 1 uop and
    1 cycle latency on a Core 2.
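
For anyone wanting to repeat this kind of coarse comparison on their own interpreter, a minimal timing harness might look like the following. This is a sketch, not the benchmark used for the figures quoted in this thread: timing a plain loop as a baseline is an assumption about methodology, `ns_per_iteration` is a hypothetical helper, and the arithmetic fallback exists only so the script runs when no `bit` module is installed.

```lua
-- Use a real bit library when available (LuaJIT's built-in "bit" or Lua BitOp).
-- Otherwise fall back to a hypothetical stand-in that only handles OR with 1.
local ok, bitlib = pcall(require, "bit")
local bor = ok and bitlib.bor or function(a, b)
  assert(b == 1, "stand-in only supports OR with 1")
  return (a % 2 == 0) and (a + 1) or a
end

-- Loop under test: same shape as the one-liner from the post.
local function bor_loop(n)
  local x = 0
  for i = 1, n do x = x + bor(i, 1) end
  return x
end

-- Baseline loop without the bit operation, timed separately.
local function plain_loop(n)
  local x = 0
  for i = 1, n do x = x + i end
  return x
end

-- CPU time per iteration in nanoseconds, as reported by os.clock().
local function ns_per_iteration(f, n)
  local t0 = os.clock()
  f(n)
  return (os.clock() - t0) * 1e9 / n
end

local n = 1e6
print(string.format("bor loop:   %.1f ns/iter", ns_per_iteration(bor_loop, n)))
print(string.format("plain loop: %.1f ns/iter", ns_per_iteration(plain_loop, n)))
```

The difference between the two figures gives a rough per-call cost of the bit operation. Increase `n` on fast implementations, since `os.clock()` has limited resolution.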

--Mike
