user-space threads

Thu Oct 24 13:34:00 GMT 2002

> From: Adam Megacz [mailto:gcj@lists.megacz.com]
> "Boehm, Hans" <hans_boehm@hp.com> writes:
> > I understand it is claimed to solve the 10K thread problem.)
>
> How do they get around the L1/TLB cache thrash problem of having the
> "working set" be 10,000 separate memory pages, each with its own page
> table entry (since the page above the stack must be write-protected)?
Is that really a serious issue? I think the cache isn't much of an issue, since a cache line will be smaller than the active part of each thread stack anyway. You won't fit 10K thread stacks into the L1 cache, no matter what you do. But you can probably get a small section of each into the L2 or L3 cache. (There might be associativity issues, if they are all doing the same thing. Reserving a random amount of space at the beginning of each stack will fix that, though I doubt it matters.)
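
To illustrate the associativity point, here is a minimal sketch. It assumes a simple modulo-indexed cache with 64-byte lines and 64 sets (made-up geometry, not any particular CPU) and stacks laid out at a fixed stride. The names cache_set and distinct_sets are hypothetical. It counts how many different cache sets the hot end of each stack lands in, with and without a per-stack pad.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache geometry, for illustration only. */
#define LINE_SIZE 64u
#define NUM_SETS  64u

/* Set index of an address in a simple modulo-indexed cache. */
static unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/*
 * Count the distinct cache sets touched by the hot address of n stacks.
 * Stack i is assumed to have its hot address at base + i * stride, plus
 * pad[i] bytes if pad is non-NULL.
 */
static unsigned distinct_sets(uintptr_t base, size_t stride,
                              const size_t *pad, size_t n)
{
    unsigned char seen[NUM_SETS] = {0};
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        uintptr_t hot = base + i * stride + (pad ? pad[i] : 0);
        unsigned set = cache_set(hot);
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

With a 64 KiB stride and no padding, every stack's hot line maps to the same set under these assumptions. Giving stack i a pad of i * LINE_SIZE spreads 16 stacks over 16 sets. A real package would presumably pick the pad at random, and a physically indexed cache depends on the page mapping rather than on these virtual addresses.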
I think TLB miss handling times are generally considerably less than context-switch times. I don't think that taking a TLB miss per context-switch is a huge deal, though you've probably seen more performance numbers than I have. I'm also not sure it's avoidable.
> I propose to solve this with stack-sharing.
That would help the TLB issue. But is it worth it?
> > I would be opposed to making a user-level threads package the
> > default on any system that had another, more standard threads
> > implementation,
>
> Agreed. I'm definitely not suggesting this as the default. However,
> it could make multithreaded servers as fast as nonblocking-io,
> singlethread servers. The big advantage is that the multithreaded
> paradigm is far easier to write code for.
I agree that supporting lots of threads cheaply is good.
> > I'm not sure how far you can really reduce maximum thread stack
> > sizes,
>
> I don't propose to decrease the total amount of stack used, but I do
> propose to coagulate it all onto a single, contiguous set of pages in
> order to maximize L1 and TLB cache effectiveness. This is widely
> believed to be the biggest reason why high-performance thread
> implementations are still slower than select()-based single-threaded
> servers.
I would have guessed there are other significant differences. On the negative side for threads, "context switches" are cheaper in the select case, and explicit state is probably much more compact than a thread stack and thread descriptor.
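
As a rough illustration of what I mean by explicit state, a select()-style server might keep something like the following per connection. This is a sketch with made-up names (conn_state, conn_advance), not code from any real server.

```c
#include <stddef.h>

/* Phases of a hypothetical request/reply connection. */
enum conn_phase { READ_REQUEST, WRITE_REPLY, DONE };

/* Everything the server remembers about one connection between events. */
struct conn_state {
    int fd;                 /* socket descriptor */
    enum conn_phase phase;  /* where this connection is in its lifecycle */
    size_t done;            /* bytes handled so far in the current phase */
    size_t total;           /* bytes expected in the current phase */
};

/*
 * Record that an I/O call moved n bytes and move to the next phase once
 * the current one is complete. This is the "context switch" in the
 * select case: a few loads and stores on a small struct.
 */
static enum conn_phase conn_advance(struct conn_state *c, size_t n,
                                    size_t reply_len)
{
    c->done += n;
    if (c->done < c->total)
        return c->phase;            /* still in the same phase */

    if (c->phase == READ_REQUEST) {
        c->phase = WRITE_REPLY;
        c->done = 0;
        c->total = reply_len;
    } else if (c->phase == WRITE_REPLY) {
        c->phase = DONE;
    }
    return c->phase;
}
```

A struct like this takes a few dozen bytes, so many of them fit in one page, whereas each thread needs at least a stack page of its own plus a descriptor.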
Hans