
Tuning for large number of Lua threads on FreeBSD

I've been playing around a bit with benchmarking Lua coroutines.
Creating a Lua thread allocates three major data structures: a stack, a callinfo stack, and a lua_State. The memory requested for these, on x86/double, is as follows (in work6):
stack:   45 slots @ 12 bytes/slot   540
cistack:  8 slots @ 24 bytes/slot   192
state:                              192
for a total request of 924 bytes.
However, the FreeBSD system malloc() always allocates blocks whose size is a power of two; consequently, these three mallocs actually consume 1.5K.
In particular, the stack allocation is almost pessimal: 540 bytes is only just over 512, so it is rounded up to a full 1K.
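To make the arithmetic concrete, here is a minimal C sketch that models the power-of-two rounding described above. The function names are my own (not part of Lua or FreeBSD), and the rounding rule is a simplified model of the allocator's behaviour, not its actual code; the sizes are the x86/double figures quoted above.

```c
#include <stddef.h>

/* Round a request up to the next power of two. This models the
   sub-page behaviour described in the post; it is an illustration,
   not the FreeBSD allocator's real code. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes consumed by one freshly created thread under that model:
   value stack + callinfo stack + lua_State. */
size_t thread_footprint(size_t stack_slots, size_t ci_slots)
{
    const size_t slot_bytes  = 12;   /* per stack slot (from the post) */
    const size_t ci_bytes    = 24;   /* per callinfo slot (from the post) */
    const size_t state_bytes = 192;  /* lua_State (from the post) */

    return round_up_pow2(stack_slots * slot_bytes)
         + round_up_pow2(ci_slots * ci_bytes)
         + round_up_pow2(state_bytes);
}
```

Under this model, thread_footprint(45, 8) comes to 1536 bytes (1024 + 256 + 256), against the 924 bytes actually requested.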
A slight adjustment in lua.h makes a notable difference: changing LUA_MINSTACK from 20 to 18, which has virtually no performance impact as far as I can see, reduces the initial stack from 45 slots (540 bytes) to 41 slots (492 bytes). That fits in a 512-byte block, halving the memory actually consumed by the stack. Furthermore, this continues to be beneficial as the stack grows, since it typically doubles in size on every reallocation:
default (MINSTACK = 20):
 initial stack      45 slots   alloc:  540 bytes   used: 1k
 first increment    90 slots   alloc: 1080 bytes   used: 2k
 second increment  180 slots   alloc: 2160 bytes   used: 4k
 third increment   360 slots   alloc: 4320 bytes   used: 8k
 fourth increment  720 slots   alloc: 8640 bytes   used: 12k*
* requests of more than one page (4k on x86) are rounded up
 to a whole number of pages
adjusted (MINSTACK = 18):
 initial stack      41 slots   alloc:  492 bytes   used: 512 bytes
 first increment    82 slots   alloc:  984 bytes   used: 1k
 second increment  164 slots   alloc: 1968 bytes   used: 2k
 third increment   328 slots   alloc: 3936 bytes   used: 4k
 fourth increment  656 slots   alloc: 7872 bytes   used: 8k
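The two tables can be regenerated with a short C sketch. It assumes the rounding rules stated above (power of two up to one 4k page, whole pages beyond that) and that the initial stack is 2*MINSTACK plus 5 extra slots; the 5 is inferred from the numbers in the tables (45 = 2*20 + 5, 41 = 2*18 + 5). All function names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#define SLOT_BYTES  12u    /* bytes per stack slot on x86/double (from the post) */
#define PAGE_BYTES  4096u  /* page size on x86 (from the post) */
#define EXTRA_SLOTS 5u     /* inferred from the tables: 45 = 2*20 + 5 */

/* Memory actually consumed for a request, under the model in the post. */
size_t consumed_bytes(size_t request)
{
    size_t p = 1;
    if (request > PAGE_BYTES)  /* larger than a page: round up to whole pages */
        return (request + PAGE_BYTES - 1) / PAGE_BYTES * PAGE_BYTES;
    while (p < request)        /* otherwise: next power of two */
        p <<= 1;
    return p;
}

/* Initial stack size in slots for a given MINSTACK. */
size_t initial_slots(size_t min_stack)
{
    return 2 * min_stack + EXTRA_SLOTS;
}

/* Print the initial size and each doubling, `steps` rows in total. */
void print_growth(size_t min_stack, int steps)
{
    size_t slots = initial_slots(min_stack);
    int i;
    for (i = 0; i < steps; i++) {
        size_t request = slots * SLOT_BYTES;
        printf("%4zu slots  alloc: %5zu bytes  used: %5zu bytes\n",
               slots, request, consumed_bytes(request));
        slots *= 2;            /* the stack typically doubles on reallocation */
    }
}
```

Calling print_growth(20, 5) and print_growth(18, 5) should give the same slot and byte counts as the two tables, with the "used" column expressed in bytes.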
An alternative is to leave LUA_MINSTACK at 20 but change BASIC_STACK_SIZE in src/lstate.h, which is defined as (LUA_MINSTACK*2). This could be changed to 36 (or even 37); however, that would be a somewhat more fragile change.
Also in src/lstate.h, BASIC_CI_SIZE is defined as 8. On FreeBSD, changing this to 5 halves the initial allocation for the cistack (120 bytes fits in a 128-byte block, where 192 bytes needs 256), but for reasons which are not clear to me some of my benchmarks slow down by up to 7% with this change. (This is partially compensated for by a 20% improvement in the time to create the threads.) My first guess was that the change was leading to repeated reallocations of the cistack in the thread scheduler, but it turned out that the code in lgc.c which might shrink the cistack was never being called.
In any event, this minor exercise in tuning reduced the RSS for 100,000 threads from 150MB to 100MB (with only the change to MINSTACK) or 88MB (with both changes). (These figures also include the allocation for a table containing all the threads, and for the closure created for each thread by coroutine.wrap.) The 50% saving in stack memory resulting from changing MINSTACK from 20 to 18 strikes me as worthwhile, particularly as it also ran slightly faster on all benchmarks.
The Linux malloc() is quite different from the FreeBSD malloc(), and Windows and Mac OS X will have their own tuning considerations. I haven't yet had a chance to try operating systems other than FreeBSD, but I suspect that the difference will be less marked.
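The alternative constants discussed above (36 or 37 for BASIC_STACK_SIZE, 5 for BASIC_CI_SIZE) can be derived by asking how many slots fit in a given power-of-two block. The sketch below does that in C; the helper names are hypothetical, the 5 extra stack slots are inferred from the figures in this post, and the total is only a lower bound since it ignores the thread table and the coroutine.wrap closures.

```c
#include <stddef.h>

#define SLOT_BYTES  12u  /* bytes per stack slot on x86/double (from the post) */
#define CI_BYTES    24u  /* bytes per callinfo slot (from the post) */
#define EXTRA_SLOTS 5u   /* inferred: initial stack = BASIC_STACK_SIZE + 5 */

/* Largest BASIC_STACK_SIZE whose initial stack still fits in block_bytes. */
size_t max_basic_stack_size(size_t block_bytes)
{
    size_t slots = block_bytes / SLOT_BYTES;
    return slots > EXTRA_SLOTS ? slots - EXTRA_SLOTS : 0;
}

/* Largest BASIC_CI_SIZE whose callinfo stack still fits in block_bytes. */
size_t max_basic_ci_size(size_t block_bytes)
{
    return block_bytes / CI_BYTES;
}

/* Lower bound, in bytes, on the memory consumed by `threads` new threads. */
unsigned long long threads_total_bytes(size_t per_thread_bytes,
                                       unsigned long threads)
{
    return (unsigned long long)per_thread_bytes * threads;
}
```

For example, max_basic_stack_size(512) gives 37 and max_basic_ci_size(128) gives 5, matching the values suggested above. With 1536, 1024 and 896 bytes per thread for the three configurations, threads_total_bytes(..., 100000) lands in the same range as the reported RSS figures.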
