locks on X86

Tue Mar 13 10:03:00 GMT 2001

My intuition is that inlining the hash synchronization code is really
worthwhile only if you do it very selectively, which probably means you need
profile information. Thus I'm not sure we should worry a lot about it in
the near term. It would be nice to have some mechanism in mind that can
handle it in the long run.

The problem is that even though the synchronization fast path has gotten a
bit shorter, it's still longer than I would like. I think you would get
significant code bloat with an appreciable cache impact if you inlined it
everywhere. For _Jv_MonitorEnter you need to:
1) Compute some sort of thread identifier. On x86 this is currently done
with a hash lookup on the sp and involves several memory references and
probably at least a dozen instructions. It will eventually be done by
loading a segment register. (But I think that requires the demise of 2.2
and earlier Linux kernels.) On architectures that dedicate a register to
some sort of thread id this is already cheap.
2) Compute the right address in the lock hash table. (Involves loading the
address of a global through the GOT and 3 or 4 instructions to compute the
hash function.)
3) Compare and swap.
4) Store thread id in hash table. (Needed since locks are reentrant.)
5) Currently 3 assertions are checked. Eventually we'll turn those off.
_Jv_MonitorExit is similar.
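
As a rough illustration of steps 1 through 4, here is a minimal C++ sketch
of a hashed lock fast path. It is not the libgcj code: every name in it
(LockEntry, lock_table, self_id, entry_for, monitor_enter_fast,
monitor_exit_fast) is hypothetical, the thread id is simply the address of
a thread-local variable, and the compare-and-swap installs the thread id
directly, so steps 3 and 4 are merged. Objects that hash to the same entry
share one lock here, which a real implementation would have to handle.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lock table entry; not libgcj's actual layout.
struct LockEntry {
    std::atomic<std::uintptr_t> owner{0};  // 0 = unlocked, else owner's thread id
    unsigned depth = 0;                    // reentrancy count, touched only by the owner
};

constexpr std::size_t kTableSize = 1024;   // assumed power of two
LockEntry lock_table[kTableSize];

// Step 1: some sort of thread identifier (assumption: address of a thread-local).
inline std::uintptr_t self_id() {
    static thread_local char marker;
    return reinterpret_cast<std::uintptr_t>(&marker);
}

// Step 2: a few shift/mask instructions to find the entry for an object.
inline LockEntry* entry_for(const void* obj) {
    auto addr = reinterpret_cast<std::uintptr_t>(obj);
    return &lock_table[(addr >> 3) & (kTableSize - 1)];
}

// Returns true if the lock was taken on the fast path.
// A false result means the caller must fall back to a (not shown) slow path.
inline bool monitor_enter_fast(const void* obj) {
    const std::uintptr_t self = self_id();
    LockEntry* e = entry_for(obj);
    std::uintptr_t expected = 0;
    // Steps 3 and 4: the compare-and-swap stores our thread id on success.
    if (e->owner.compare_exchange_strong(expected, self,
                                         std::memory_order_acquire)) {
        e->depth = 1;
        return true;
    }
    if (expected == self) {   // we already hold it: locks are reentrant
        ++e->depth;
        return true;
    }
    return false;             // held by another thread
}

inline void monitor_exit_fast(const void* obj) {
    LockEntry* e = entry_for(obj);
    // Step 5: stand-in for the kind of assertion currently checked.
    assert(e->owner.load(std::memory_order_relaxed) == self_id());
    if (--e->depth == 0)
        e->owner.store(0, std::memory_order_release);
}
```

Even in this stripped-down form, each call site would carry the id lookup,
the hash computation, the locked compare-and-swap and the reentrancy check,
which gives a feel for the code size involved in inlining it everywhere.
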
Inlining all of this really wins if you can move the thread id and hash
address calculation out of a loop. But that won't happen for a synchronized
method call unless the method itself is also inlined.
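
To make the loop-hoisting point concrete, here is a small self-contained
C++ sketch. It assumes the synchronized operation has already been inlined
into the loop, so the compiler (or, here, the programmer) can see that the
thread id and the lock table address are loop-invariant. All names (Slot,
slots, current_thread_id, slot_for, Counter, add_hoisted) are hypothetical,
and the spin on failure is only a placeholder for a real contended path.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical lock slot; 0 means unlocked.
struct Slot { std::atomic<std::uintptr_t> owner{0}; };
Slot slots[256];

// Assumed thread identifier: address of a thread-local variable.
inline std::uintptr_t current_thread_id() {
    static thread_local char marker;
    return reinterpret_cast<std::uintptr_t>(&marker);
}

// Hash an object address to its slot.
inline Slot* slot_for(const void* obj) {
    return &slots[(reinterpret_cast<std::uintptr_t>(obj) >> 3) & 255];
}

struct Counter { long value = 0; };

// Increments c->value n times, locking around each increment.
long add_hoisted(Counter* c, int n) {
    // Both values are loop-invariant, so they are computed once
    // instead of once per lock operation.
    const std::uintptr_t self = current_thread_id();
    Slot* const s = slot_for(c);
    for (int i = 0; i < n; ++i) {
        std::uintptr_t expected = 0;
        // Only the compare-and-swap remains inside the loop.
        while (!s->owner.compare_exchange_weak(expected, self,
                                               std::memory_order_acquire))
            expected = 0;     // placeholder for the contended slow path
        ++c->value;           // the "synchronized" body
        s->owner.store(0, std::memory_order_release);
    }
    return c->value;
}
```

If the loop instead called an out-of-line synchronized method, both
computations would be repeated inside every call.
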

The cost of the lock prefix (nearly 15 cycles) is probably a bit more than
you can save by simply removing a call. My guess is that it's less than you
could save if you could also move the hash address calculation and thread id
calculation out of a loop.

I don't understand the various dynamic library calling conventions well
enough to be able to judge whether indirecting the call to _Jv_MonitorEnter
would be acceptable. If it really replaces a direct call by an indirect
one, my guess would be "no". These are called from many different places.
Thus it would take many branch target buffer slots to predict these
correctly, even assuming the processor has a branch target buffer and can
predict indirect branches. (If it can't, I think a 5-10 cycle penalty is
typical.)

I would be surprised if gcj synchronization performance ever became really
competitive with something like HotSpot. The goal is to get close enough
that we can make up the difference elsewhere. There are several reasons why
we can't match it:
1) We don't get to adjust the ABI to keep a pointer to our thread structure
in a register.
2) It's harder to get profiling information.
3) Objects only have a one word header, with no room for synchronization
information. We can't play the games used by the Sun implementations.
(There may also be patents that would prevent that, but that's a secondary
issue.)

On the other hand, each of those three buys us a corresponding advantage
elsewhere:
1) Calls to native code (including that in the runtime) are cheaper. That
makes many Java library calls cheaper.
2) We don't have to pay for compilation at run-time.
3) Everything is smaller, and the application touches less memory.

Hans

> -----Original Message-----
> From: Bryce McKinlay [ mailto:bryce@albatross.co.nz ]
> Sent: Monday, March 12, 2001 5:52 PM
> To: Boehm, Hans
> Cc: 'green@redhat.com'; Jeff Sturm; java@gcc.gnu.org; drepper@redhat.com
> Subject: Re: locks on X86
>
> "Boehm, Hans" wrote:
>
> > I'm assuming that on X86 we'll usually use hash synchronization. I believe
> > there will be exactly two time critical instances of cmpxchg in libgcj
>
> But don't we (eventually) want to have the compiler be able to inline the
> "lightweight" part of the synchronization mechanism? I was imagining that the
> compiler will generate an inline compare-and-exchange which only calls out to
> an external function if contention is detected. The external function would do
> the spinning and deal with heavyweight (contended) locks.
>
> Of course, the disadvantage here, as Anthony suggests, is that we'd have to
> multilib if we wanted faster synchronization on uniprocessors. Obviously if we
> don't do any inlining then it's much easier because we just generate two
> implementations of the lock function, and figure out at runtime which one to
> use. Would the benefit of being able to drop the lock prefix outweigh the
> benefit of inlining?
>
> regards
>
> [ bryce ]