gcj: mem leaks & speed.

Mon Feb 2 17:37:00 GMT 2004

GC tradeoffs are fairly complex and intertwined. For GCBench
(yet another way-too-small-and-too-contrived) benchmark, gcj performance
ranges from 4x slower to 2x faster than HotSpot, depending on the
size of the data structures being allocated and dropped. (For the graph,
see http://h21007.www2.hp.com/dspp/files/unprotected/gcj2.pdf, around
slide 12.)
I think there is an emerging consensus that for long-lived objects, some
flavor of mark-lazy-sweep collector usually wins. That's what we do, so
we're fine there. (We also do some other things, e.g. not touching
pointer-free objects such as most strings during GC, which I think are
also important even though most of the commercial VMs haven't gotten there
yet. Most of the commercial VMs can occasionally compact the Mark-Sweep
space, which is an advantage for them. But they need an extra header
word, which isn't.)
I think there is also a consensus that, especially for client-side
apps, it helps to have a copied young generation, so that you can very cheaply
deal with short-lived objects. (Actually, in many circles this was believed
to be essential. It's actually not always an advantage, but on average it is.)
We currently don't so this. It is sort-of possible to turn on generational
GC in gcj, but it's rarely much of a performance advantage. It might be possible
to improve this a bit, but ...)
Adding a conventional copied young generation to gcj is nontrivial. It
would have to really be a "mostly copied" young generation, since we have to
deal with ambiguous roots. We would have to teach gcc to generate write
barriers. And we would have to somehow shoehorn this into the standard ABI.
Particularly the latter seems hard, since we can't really reserve two registers
for allocation as most dynamic compilers probably do. Accessing thread-local
data probably makes the allocation sequence too long for in-lining, and will
end up discarding a large fraction of the bump-the-pointer allocation advantage.
(If we could get even one register, we could speed up the current allocation sequence
appreciably, perhaps to the point of inlining the frequent cases.)
I'm personally less interested here in copying everyone else, than in seeing
how close we can get (or for how many benchmarks we can win) by extending the
existing techniques while preserving gcj's current advantages. I think there's
still a fair amount of headroom:
- The allocation sequence is still very suboptimal. We should be generating
calls directly into the GC in most cases. And the allocation routine should
perhaps be cloned in each dynamic library. The C version of GCBench was still
a lot faster than the Java version, the last time I tried. It shouldn't be.
- We're currently still scanning way too many roots, though I think that's being
worked on. (That probably has something to do with the mediocre Java GCBench performance.)
- We haven't really looked at making more objects pointerfree. That often doesn't matter
much for toy benchmarks, but I think it matters a lot for real apps, since it reduces
the amount of stuff that needs to flow through the cache during a GC.
- The incremental/generational collector should be a real feature, instead of
something that only sort of works with a lot of tweaking. And it needs to be
tuned a bit more. Reducing the root size probably makes that much more viable.
(This should help pause times a lot, if nothing else.)
- Given that we don't have a terribly fast allocator for short-lived objects,
a good escape analysis in the compiler is more important than for commercial JVMs.
It may go a long way towards eliminating the need for a copied young generation.
None of this will cause us to win on allocate-drop loops (or even real programs
that behave like an allocate-drop loop). But I think it will make us compare
much better on a bunch of other benchmarks.
Hans
P.S. There are some interesting GC algorithm comparisons at
http://www.research.ibm.com/people/d/dfb/papers/Attanasio01Comparative.pdf
and
http://cs.anu.edu.au/techreports/2003/TR-CS-03-02.pdf
Remember that every such comparison unavoidably omits some potentially important
optimization from at least one of the algorithms, so they should all be taken
with a grain of salt.
> -----Original Message-----
> From: Andrew Haley [mailto:aph@redhat.com]
> Sent: Monday, February 02, 2004 2:49 AM
> To: Hans Boehm
> Cc: Rutger Ovidius; java@gcc.gnu.org
> Subject: Re: gcj: mem leaks & speed.
>>> Hans Boehm writes:
> > [Comments added below.]
> > 
> > > Also, while trying to track down memory use, I wrote the 
> following
> > > class:
> > >
> > > public class Leak {
> > > public static void main(String[] argv) {
> > > long start = System.currentTimeMillis();
> > > for (int i = 0;; i++) {
> > > Leak l = new Leak();
> > > if (i % 100000000 == 0) {
> > > long curr = System.currentTimeMillis();
> > > System.out.println(i / 100000000 + ": " + (curr 
> - start));
> > > }
> > > }
> > > }
> > > }
> > >
> > > This runs 10x as slow in gcj as it does in java on 
> windows (~6x on
> > > linux) If you remove the "Leak l = new Leak();" line, the speed
> > > becomes rather similar to java. Why is class creation so 
> slow? (or is
> > > it GC cleanup that is slow?)
>> > The garbage collector used by gcj does not handle very short-lived
> > objects as efficiently as most JVMs, though it sometimes performs
> > better for applications that alternately allocate and drop 
> very large
> > linked structures. This example exercises short-lived 
> allocation to
> > the extreme.
>> That's interesting. Could you briefly tell use why the gc has this
> property, and if anything can or should be done about that?
>> > A factor of 10 is a larger difference than I would have expected.
> > It may well be that some JVMs recognize that the allocation is dead
> > and remove it (probably not very useful in practice) or recognize
> > that l doesn't escape the loop body, and hence the Leak object can
> > be stack allocated (may be useful in practice). 
>> A compiler can trivially determine that the allocation is dead and
> that the constructor does nothing. For that reason, this code isn't
> much use as a comparison: most Java code uses objects in between
> creation and destruction.
>> I'm eager to add Java-specific optimizations to gcj, but I'm not
> convinced that this one would be a big bonus. However, when combined
> with escape analysis and stack allocation of short-lived objects it
> could well be useful.
>> > AFAIK, gcj currently does neither.
>> That's right.
>> Andrew.
>