Race condition in JV_HASH_SYNCHRONIZATION in libgcj??
David Daney
ddaney@avtrex.com
Tue Sep 18 15:04:00 GMT 2007
Hans Boehm wrote:
>>> On 2007年9月14日, David Daney wrote:
>>> Hans Boehm wrote:
>>> The compare_and_swap just succeeded in overwriting the address field,
>>> which makes us the holder of the lightweight lock on the given
>>> address, and gives us exclusive access to light_thr_id and light_count
>>> for the corresponding hash entry. (The LOCKED bit in the address
>>> field protects other parts of the hash entry, but not those two
>>> fields.)
>>>>>>> I am talking about the part where the compare_and_swap fails. The
>> problem I see is all the things that happen after the 'if' block I
>> quoted. Specifically, how can you do anything sane with the value of
>> retrieved from he->address (the last line I quoted)? It could be
>> zero (the thread the owned the lock released it), or it could be an
>> object that hashes to the same value, or the same object we are
>> synchronizing on, there does not seem to be anything that keeps
>> multiple threads (where compare_and_swap fail) out of this area after
>> a first thread succeeds at compare_and_swap.
> Right. The code is counting on being able to read the (volatile)
> address field concurrently with a compare_and_swap and seeing a value
> that was there at some point. The only case in which it does much
> with the data without relying on the success of another
> compare_and_swap is the immediately next one ((address & ~(HEAVY |
> REQUEST_CONVERSION)) == addr && he -> light_thr_id == self). In that
> case, it saw that that specific hash entry was in use for a
> lightweight lock held by the current thread. Since it was held by the
> current thread, it could not have been concurrently released. Thus
> the address field also could not have been changed asynchronously in
> the meantime. Thus those field values in fact cannot change
> asynchronously, and I hold the lightweight lock. Hence I'm allowed to
> update light_count.
>
Right, I figured that out myself. I am just starting to dig into it.
> This is admittedly a little dubious, since I'm treating the fact that
> address and light_thr_id are volatile as a guarantee that I can read it
> atomically. That isn't really true in general. But it should be true
> on workstation or server-class MIPS machines.
>> (There is a proposal before the C++ committee to add real atomics to
> the language, so this will probably get fixed. Until then, this
> problem is unavoidable.)
>>>>> This code is extremely tricky. It's implementing locks, and hence
>>> doesn't itself follow a standard locking discipline. It's unlikely
>>> there's anything wrong with this particular piece of code, since
>>> it's the most common path.
>>>>>>> As I said above, I am worried about the less common path. I am
>> seeing occasional failures of the SyncTest.java testcase on a mips
>> system with 2 CPUs. This is a test where four threads acquire and
>> release a single monitor 1,000,000 times each. There are a couple of
>> possibilities I am investigating:
>>>> 1) Races in the HASH_SYNCHRONIZATION code.
>> 2) Modifications I made to the compare_and_swap primitive have bugs.
>> 3) A hardware error.
>>>> Since I cannot convince myself that the HASH_SYNCHRONIZATION code is
>> correct, I thought I would ask about how it was supposed to work.
> This is on a MIPS system?
Yes. It is a SiByte SB1 CPU, which is a dual core mips64 device.
> The failure symptom is that the final value is
> one or two too low?
The failure is what seems to be a deadlock. Normally the test completes
in about 10 seconds. Occasionally however it seems never to complete.
If I attach gdb, I can see that there are several threads in
wait_unnlocked().
> You can't get it to fail with assertions enabled
> here? Or it fails the same way with assertions enabled?
>> This code was originally tested on Itanium and X86 multiprocessors, so
> it seems unlikely, but not impossible, that it's generically broken
> across architectures. I would normally be suspicious about memory
> ordering issues. But MIPS is normally either sequentially consistent
> or at least close.
>I am likewise suspicious about memory ordering. I think that perhaps
adding memory barriers at the appropriate places might fix it.
Really I have been thinking that there should be no volatile variables
here. Instead memory barriers should be used at all synchronization points.
> It might be worth making sure that the compiler isn't moving
> memory references across a compare_and_swap.
>This could be the problem also. I have completely re-written all the
atomic memory primitives for MIPS and it is possible that I made a
mistake. I should note that the code I replaced had bugs. The
compare_and_swap was returning the old value instead of the result of
the compare.
I will keep working on this.
Thanks,
David Daney.
> Hans
More information about the Java
mailing list