I've attached a patch that adds a direct-threaded dispatch technique to Lua 5.1 (in lvm.c) on every architecture except i386 when building with gcc. The selection is done in luaconf.h and can probably be improved (notably, I made an assumption for PowerPC), but at least it should preserve portability. On i386 it keeps the switch/case.

A quick test suggests it gains up to 5%-10% on SPARC using some benchmarks from http://shootout.alioth.debian.org/, but it performs worse on i386 (because of the replicated code in BREAK, I guess, and fewer registers). So if anybody wants to play with it and see how it performs on their system...

Regarding LuaJIT, and referring to the article I mentioned (more details here: http://www.cs.toronto.edu/~bv/tcl2005/tcl2005-slides.pdf): does LuaJIT use similar techniques to reduce misprediction and/or inline code in branches?

-----Original Message-----
From: lua-bounces@bazar2.conectiva.com.br [mailto:lua-bounces@bazar2.conectiva.com.br] On Behalf Of Mike Pall
Sent: Monday, May 29, 2006 6:33 PM
To: Lua list
Subject: Re: Implementation of Lua and direct/context threaded code

Hi,

Grellier, Thierry wrote:
> I was reading the article: The Implementation of Lua 5.0 and went
> through the usage of switch/case instruction dispatch preferred to
> direct threaded code techniques (bound to gcc usage) for portability
> reasons. I thought that conditional compilation was also key to
> portability more than language... I also guess that a lot of us are
> building our lua interpreter with gcc.
>
> It is hard to fully understand how much it improves a real application
> in the end, so I was wondering if anyone has experimented with using
> these techniques instead of default lua implementation. I wished I could
> have had time to do so, but...

http://lua-users.org/lists/lua-l/2004-09/msg00610.html

Summary: not worth it -- at least not on x86.
There's a reason: Lua uses a one-opcode + three-operand bytecode and operates on a virtual (caller/callee-overlapping) register file. This means the machine code implementing each opcode is much "fatter" (compared to a stack VM) and some of the operand decoding can be moved before the opcode dispatch. This offers more opportunities for out-of-order scheduling and filling the pipeline bubbles caused by the branch mispredictions (note that the direct threaded code technique does not remove all branch mispredictions either).

Shameless plug: if you want faster execution (at the expense of portability) then try LuaJIT: http://luajit.luaforge.net/

Bye,
Mike
Attachment:
directthreaded.patch
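
For readers without the patch at hand, here is a minimal sketch of the two dispatch styles discussed in this thread. It is not the attached patch and not code from lvm.c: the toy instruction format, the opcodes, and every name (TOY_USE_THREADED, toy_run, toy_demo, MK, FETCH, CASE, BREAK) are assumptions made up for illustration. It relies on the "labels as values" extension of gcc for the threaded path and falls back to switch/case elsewhere, roughly mirroring the compile-time selection described above. There is no validation of opcodes; the bytecode is assumed to be well formed.

```c
#include <stdint.h>

/* Hypothetical selection macro: threaded dispatch with gcc, except on i386. */
#if defined(__GNUC__) && !defined(__i386__)
#define TOY_USE_THREADED 1
#else
#define TOY_USE_THREADED 0
#endif

/* Toy format: 8-bit opcode followed by three 8-bit operands A, B, C. */
typedef uint32_t Instr;
enum { OP_LOADK, OP_ADD, OP_MUL, OP_RETURN };

#define MK(op, a, b, c) \
  ((Instr)(op) | ((Instr)(a) << 8) | ((Instr)(b) << 16) | ((Instr)(c) << 24))
#define GET_OP(i) ((i) & 0xFF)
#define GET_A(i)  (((i) >> 8) & 0xFF)
#define GET_B(i)  (((i) >> 16) & 0xFF)
#define GET_C(i)  (((i) >> 24) & 0xFF)

/* Fetch the next instruction and decode operand A before dispatching. */
#define FETCH() (i = *pc++, ra = &reg[GET_A(i)])

#if TOY_USE_THREADED
/* Each opcode ends with its own fetch + indirect jump (replicated code). */
#define CASE(op) L_##op
#define BREAK    do { FETCH(); goto *labels[GET_OP(i)]; } while (0)
#else
/* Portable version: one shared fetch and one shared switch. */
#define CASE(op) case OP_##op
#define BREAK    break
#endif

long toy_run(const Instr *code) {
  long reg[256] = {0};      /* virtual register file */
  const Instr *pc = code;
  Instr i;
  long *ra;
#if TOY_USE_THREADED
  /* Jump table; the order must match the opcode enum. */
  static const void *const labels[] = {
    &&L_LOADK, &&L_ADD, &&L_MUL, &&L_RETURN
  };
  BREAK;                    /* dispatch the first instruction */
#else
  for (;;) {
    FETCH();
    switch (GET_OP(i)) {
#endif
      CASE(LOADK):  *ra = (long)GET_B(i);               BREAK;
      CASE(ADD):    *ra = reg[GET_B(i)] + reg[GET_C(i)]; BREAK;
      CASE(MUL):    *ra = reg[GET_B(i)] * reg[GET_C(i)]; BREAK;
      CASE(RETURN): return *ra;
#if !TOY_USE_THREADED
    }
  }
#endif
}

/* Small program: r0 = 6; r1 = 7; r2 = r0 * r1; r2 = r2 + r0; return r2. */
long toy_demo(void) {
  static const Instr prog[] = {
    MK(OP_LOADK,  0, 6, 0),
    MK(OP_LOADK,  1, 7, 0),
    MK(OP_MUL,    2, 0, 1),
    MK(OP_ADD,    2, 2, 0),
    MK(OP_RETURN, 2, 0, 0)
  };
  return toy_run(prog);
}
```

Calling toy_demo() runs the five-instruction program through whichever dispatch style was selected at compile time. In the threaded build, every BREAK expands to a separate fetch and indirect jump, which is the kind of code replication that may hurt on i386, while decoding operand A ahead of the jump illustrates the point about moving operand decoding before the opcode dispatch.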