GC failure w/ THREAD_LOCAL_ALLOC ?

Wed Mar 20 10:19:00 GMT 2002

Bryce McKinlay wrote:
> While testing thread local allocation on PowerPC, I ran into a problem 
> which is also reproducible on x86. The attached stress-test-case 
> GCTest.java will lock up with ~100% reproducibility with 
> THREAD_LOCAL_ALLOC enabled. It runs fine without THREAD_LOCAL_ALLOC.
>
> What I am seeing in the debugger is most threads waiting in 
> GC_suspend_handler, but one thread segfaulting in GC_mark_read. 
> libjava's segv handler gets called and the collector is re-entered 
> during the stack trace, causing the freeze.
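
For readers without the attachment, here is a rough sketch of the kind of
multi-threaded allocation stress test described above. It is NOT Bryce's
GCTest.java; the class name AllocStress, the thread count, the iteration
count and the object sizes are all made up for illustration. The idea is
only that several threads allocate small, mostly short-lived objects at
once, so that collections are triggered while the other threads are still
allocating.

```java
// Hypothetical stress test; not the GCTest.java attachment from the thread.
class AllocStress {
    // Runs `threads` workers that each allocate `iterations` small arrays.
    // Returns the total number of allocations performed by all workers.
    static long run(int threads, final int iterations) {
        final long[] total = new long[1];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    // Small ring of live references, so a few objects survive
                    // a collection while most become garbage right away.
                    Object[] keep = new Object[64];
                    long count = 0;
                    for (int i = 0; i < iterations; i++) {
                        keep[i % keep.length] = new byte[16 + (i % 128)];
                        count++;
                    }
                    synchronized (total) {
                        total[0] += count;
                    }
                }
            });
            workers[t].start();
        }
        // Wait for every worker; a hang here is the symptom being looked for.
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                throw new RuntimeException("interrupted while joining worker");
            }
        }
        synchronized (total) {
            return total[0];
        }
    }

    public static void main(String[] args) {
        System.out.println("allocations: " + run(8, 200000));
    }
}
```

If a test along these lines behaves like the original, it would complete
normally without THREAD_LOCAL_ALLOC and hang in the final join loop with
it enabled, though whether this particular sketch triggers the bug is
untested.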

I actually ran into this problem in my application 2 months ago (using 
gcc version 3.1 20010911 (experimental)), and reported it to Hans. I 
couldn't water down my application to create such a simple test case, so 
tracking it down was somewhat difficult.

From the stack trace I provided back in January, Hans initially 
responded with:

Hans Boehm wrote:
 > I'm not terribly worried about the SIGSEGV getting turned into a
 > deadlock. Such things seem to be largely unavoidable.
 >
 > I would like to understand where the SIGSEGV is coming from. Typically
 > a failure here is caused by a bogus object descriptor. This may
 > happen because something was overwritten by client code, or because
 > there's an undiscovered bug in the GC, or in the gcj generated
 > descriptor.

With some further pointers, it turns out there _was_ a bogus object 
descriptor. At my last contact with Hans, he suspected the problem was 
related to THREAD_LOCAL_ALLOC, but was unable to find any likely 
problems when reviewing the code. Here's an excerpt:

Hans Boehm wrote:
 > I spent a bit of time:
 >
 > - Staring at the thread-specific-storage implementation, and
 >
 > - adding some tests for thread-local allocation to gctest.
 >
 > The new tests failed to make the problem reproducible here.
 >
 > I cleaned up a few things. The only thing substantive I found was
 > that specific.c could fail if one of the thread stacks ended up at the
 > extreme high end of the address space, i.e. if 0xfffff000 is the
 > address of a valid stack page. Are you configuring your kernel in
 > some nonstandard way, e.g. to maximize virtual address space?
 > Otherwise this seems unlikely to account for the problem, since that's
 > normally kernel address space on Linux/X86, as I recall. (I vaguely
 > recall that Mandrake Linux might do something strange in this area.)

Hans sent me new versions of specific.c and specific.h to fix the 
above-mentioned problem (thread stacks at the high end of the address 
space), but I never had the chance to try them out. I had a workaround 
that made the problem go away for me, and other work priorities are 
preventing me from continuing to dig into the issue.

My workarounds were to increase the initial heap size of my application 
(reducing the number of garbage collections required) and to turn on 
GC_IGNORE_GCJ_INFO (which I had to add to gcj's version of the collector, 
since it was introduced after the version I am using). Neither really 
"fixes" the problem, though. They just make it much less likely that 
I'll hit it (I haven't since then).

regards,
michael