thin locks (was Re: libgcj/117:)

Fri Dec 10 15:12:00 GMT 1999

Tom Tromey and I discussed one variant of this a bit before I got
sidetracked.
My impression is that the right solution is not at all obvious. Very few of
the relevant performance comparisons have been done. I proposed yet another
solution, which is closely related to one I implemented for SGIs JVM, while
I was there. I would be interested in implementing it in gcj. But I
haven't looked at Kaffe's locks. 
(In any case, it would be nice if the locking scheme were configurable at
build time. I doubt that a single scheme dominates in all cases.)
My basic approach would be targetted more at removing the header word than
reducing the number of executed instructions relative to the current
approach, though it should do both to some extent:
1. Use a central table (like the monitor cache in the Sun VM) to hold
heavyweight locks. This table would hold a pthread mutex and condition
variable for at least each monitor that is either being waited on or had
sufficient contention to warrant a heavyweight lock. Accessing this table
is a fairly heavyweight operation, since you need to use lower level
synchronization to deal with concurrent accesses to the table itself.
(There probably is a convenient way to combine this with (2), though the SGI
implementation doesn't do that, for reasons not relevant to gcj.) If all
active monitors were entered in this table, this would give a complete,
though slow, synchronization facility. No per object space is needed for
synchronization.
2. Maintain a direct-mapped cache mapping object addresses to lightweight
locks. The cache can also be thought of as a hash table with no collision
resolution mechanism. Collisions are resolved by using heavyweight locks.
The table size is static, or configurable at JVM startup (probably one or
two thousand entries).
Each lightweight lock entry in the cache is in one of three states:
s0) No thread holds any lock mapping to this cache location. 
s1) Exactly one object mapping to this address is locked. The lock entry
contains the (remaining bits of) the object address or id, the thread id,
and the lock acquisition count.
s2) Any locked objects mapping to this cache address have an associated
heavyweight lock. The cache entry holds a count of the total number of
times all such locks have been entered (including waiters). When the count
reaches zero, the entry can revert to s0.
Ideally each cache entry for (2) is small enough that it can be updated with
a single compare-and-swap or similar primitive. Ideally the thread id is
kept in a register. (I'm personally interested primarily in IA64, where it
will be. I'm told that linuxthreads on IA32 will soon also use a segment
register for the thread id.) This allows a lock acquisition or release fast
path to consiste of something like:
1) Hash computation: two or three in-register boolean ops, possibly a load
of a bit mask.
2) Address computation of the hash location. (Perhaps a load and an add in
the dynamically linked case, but some of this may be loop invariant.)
3) A read of the cache entry (or load-locked).
4) A small number of in-register instructions to compute the new cache entry
from the old one. In the fastest case, a zero test, and perhaps a shift and
"or" to combine the lock address and thread id.
5) Compare-and-swap (or store conditional) to write back the result.
This should be in the 10 to 20 instruction range in most cases, with the
compare-and-swap instruction probably dominating the runtime. In terms of
instructions, it's a little slower than some of the competing schemes, and
probably only a little faster than the current scheme could be made to be in
the best case. But it has the big advantage that it needs no per object
space, which affects the cache hit rate, GC cost, etc. It also only rarely
requires allocation in the synchronization code. (I also don't think it
runs into any intellectual property issues, though that seems to be tough to
verify.)
This scheme doesn't need to impose ordering requirements on monitor entry
and exit. It works with the original JDK 1.1 rules. It plays acceptably
with pthreads code. (Java locks are different from pthread mutexes, of
course. But you can use both in one program. It's probably beneficial to
have Java locks spin for a while on contention. Thus they may not play by
strict priority rules, but that's neither required, nor usually useful.)
Hans
-----Original Message-----
From: Godmar Back [ mailto:gback@cs.utah.edu ]
Sent: Friday, December 10, 1999 11:49 AM
To: hboehm@exch.hpl.hp.com
Cc: joerg.brunsmann@FernUni-Hagen.de; tromey@cygnus.com;
java-discuss@sourceware.cygnus.com
Subject: thin locks (was Re: libgcj/117:)
>> 3) Ideally negative overhead relative to pthreads. Or enough hooks to add
> some flavor of platform-specific thin locks later. This stuff seems to be
a
> lot more performance critical for Java than it is for most other clients,
> and is probably more performance critical for gcj than Mozilla. That may
> eliminate them both (again?).
>
Apropos thin locks.
Has anybody developed or picked a design for those yet?
I was really disappointed to learn that Kaffe's thin locks won't work with
gcj's current _Jv_MonitorEnter/_Jv_MonitorExit. They don't work because
Kaffe's
symmetry assumption demands that a lock is unlocked with a stack pointer
value
that is equal to or greater (on a downward-growing stack) than the stack
pointer
value when the lock was locked. If the stack pointer is lower on the
unlock
than on the lock, Kaffe assumes that this was a non-final recursive unlock.
That's why kaffe's awt currently deadlocks when compiled with gcj :-(
The reason this doesn't work with gcj is that gcj doesn't guarantee that the
sp 
value is the same (or higher) for a matching Jv_MonitorExit as it is for a 
Jv_MonitorEnter. Because of deferred pop optimizations, it's also not
something 
one should ask for.
Anyway, you probably don't care much why it doesn't work in Kaffe: my point
is
that it would be nice to develop a design that does work together, and draw
from
the experiences in previous work. Or at least, that whoever does the design
posts 
it to the list for feedback.
I believe that EF's thin locks may suffer from a similar problem as Kaffe's;
I've contacted one of the authors to confirm that. EF stores information in
the current frame, so it should deal with deferred pop; but I don't believe
it
can deal with locked objects being unlocked in a caller. Keep in mind that
JNI and CNI, for the matter, demands that (Jv_)MonitorEnter/Exit can be
called
from anywhere in native C code, and that the call has the same effect as an
inlined thin lock somewhere in compiled Java code.
IBM's thinlocks (Bacon et al PLDI98) may be one alternative. They keep an 
explicit 8 bit count in the object header for shallow recursive
acquisitions, which 
means they don't rely on the stack pointer position. However, they require
that 
a 16bit thread id is stored with the object to identify recursive
invocations. 
Clearly, getting a thread id by using pthread_self() or the like would
eliminate
any advantage you hope to see from thin locks. They say that they get the
thread 
index with a single load from the so-called "execution environment" ---
whatever 
that is in their JVM. That work was done on IBM/AIX PPC, btw.
There's two possibilities, IMO, as to what that "execution environment" is:
either 
it's a global variable that is adjusted on every context switch, in which
case it 
would be right out for any native threading.
Or it's an additional argument that's carried with every function, such as
JNIEnv*
in JNI. If the latter, it should be possible to combine this with stack
limit checking.
In fact, one may be able to use a thread's stack limit, or bits thereof, as
its id.
Btw, I noted that somebody added some stack limit checking options to gcc;
but
it seems this is only for architectures such as the ARM where a register is
or
can be set aside to hold the limit. But no discussion about this has ever
reached
this list, it seems; so I assume it's for some other language.
On a unrelated note, and not to bug you too much, what is the current
thinking on
getting backtraces in gcj? Do you expect to implement that in libgcj or in
libgcc
or in some combination thereof? Me thinks providing stacktraces for C++
code may
be a nice g++ extension(?)
	- Godmar