Why is Linux thread locking so slow?

Tue Oct 19 09:04:00 GMT 1999

> However, I think the case we're looking at here may be slightly different.
> In our case, one thread (A) is running on one CPU, and another thread (B) 
> is running on another. Thread A holds a lock, thread B wants to acquire it.
> Once thread A unlocks the lock, thread B could just go ahead and run
> on the other CPU. In this case, *neither* CPU would have to context
> switch. It would simply be a system call executed on each CPU.

Assuming B doesn't do busy-waiting, if it finds the mutex locked by A,
it has no choice but to suspend and presumably the kernel will then
switch context to another runnable process. Later, A unlocks the
mutex, and seeing that B is waiting on it, will restart B.
The situation is different if the thread implementation allows some
limited amount of busy-waiting on locks, e.g. it spins on the mutex
for some number of iterations before suspending itself. This can help
on multiprocessors, at least for mutexes that are locked for very
short durations, although it's never beneficial on a uniprocessor.
> May I ask what you mean by two-level scheduling?
> Do you mean a hybrid threading scheme similar to Solaris, where some
> threads are scheduled as user-level threads at the process level, while
> the kernel only schedules fewer kernel-level threads (aka LWP), on top
> of which user-level threads are scheduled? 

Yes, exactly.
> Let me point out though that your suspend operation right now
> incurs two system calls (sigprocmask and sigsuspend); this seems more
> expensive that a single semaphore_lock(), if it existed in usable form.

It is true that a more efficient suspend/restart mechanism for threads
could lower the cost. However, you'd still have a context switch
going on, and I believe (without empirical evidence) that this cost
dominates the cost of e.g. doing two syscalls instead of one.
E.g. maybe you'll see 50 operations on contended locks per microsecond
rather than the current 25; but you'll still be very far from the
uncontended times.
> Finally, I believe that a certain portion of the cost doesn't necessarily 
> stem from the costs involved in the context switch; it stems from the fact 
> that on top of the context switch you're going through the signal
> delivery path. This adds unnecessary overhead, especially in the case 
> described above where the two threads run on different CPUs. Couldn't
> such overhead be avoid by having a dedicated lock/unlock system call?

Yes, but 1- I don't know how much of a performance gain this would be,
and 2- one needs to design and implement this mechanism in the kernel,
and get it accepted by the kernel developers -- they don't seem that
much concerned about performance nor POSIX compliance for threads.
> > Using SysV semaphores as an alternative to mutexes is even worse:
> > you'd pay the cost of a system call on each mutex operation, not just
> > on those that contend for the mutex.
>> Is this really true?
> Why would an implementation that uses test-and-set at user-level
> and that would fall back on the kernel lock/unlock construct not be 
> applicable in this case as well as in the signal-based implementation?

This is exactly what I described in the paragraph before the one you quote
(SysV semaphores as a suspend/restart mechanism, mutual exclusion
being still ensured by user-level atomic operations).
> Structuring programs to reduce contention is hard and requires effort,
> and in some cases it is very hard. [1] It involves partitioning resources
> which is not always possible. Don't get me wrong, I am aware of 
> implementation techniques that can reduce contention, such as per-thread
> memory allocation arenas to reduce contention for the memory allocator
> lock, for instance. I am also willing to tolerate some penalty if somebody 
> writes an application with a bad locking strategy. On the other hand,
> there's already a lot of code with bad locking strategies out there.
> Take the java class libraries as an example. [2]

That's right, but I'm pessimistic about the chances of such code
running efficiently on a multiprocessor, regardless of the language
and the threading technology.
> Therefore, I don't believe it is unreasonable for us to think about an
> SMP lock implementation that optimizes the handling of contended locks
> in order to make the penalty not as harsh.

Sure, but I doubt you can get what you want within the LinuxThreads
implementation model at least (i.e. 1-to-1 threads being scheduled by
the kernel). Actually, I'd encourage Java implementors to develop
their own two-level scheduling, using LinuxThreads threads as their
"execution engines", and scheduling the Java code in userland on top
of that. Or, lobby the Linux kernel developers until they take
two-level scheduling seriously and provide the necessary kernel
support for writing a general-purpose two-level POSIX thread
implementation. (Good luck.)
- Xavier Leroy