Initial profiling results (was: Re: -O2 loop problems)

Thu Jan 4 15:23:00 GMT 2001

Tom Tromey wrote:
> If you've got any profiling info I'd love to see it. I don't really
> have a feeling for what bottlenecks might exist for real applications.

Data is now trickling in. I had hoped to run some fairly comprehensive programs
by this week, but I didn't plan on spending the holidays fixing my alpha-linux
build... *sigh*
(Anybody else seeing weird memory corruption from iconv? It could be a platform
bug, or a 64-bit thing.)
My first experiment, a simple interpreter test, exercises various parts of the
runtime a bit, particularly collections and object allocation. Here is what I
have so far for libgcj:
 Name Count Pct
 ---- ------ ----
 _Jv_FindArrayClass__FPQ34java4lang5Cla 5620 2.5
 _Jv_AllocObject 5487 2.4
 _Jv_CheckArrayStore 5062 2.2
 table_search__FPA1_Cwiw 4513 2.0
 _Jv_InitClass 3661 1.6
 _Jv_FindClassInCache__FP13_Jv_Utf8Cons 2847 1.2
 _Jv_AllocObj__FiPQ34java4lang5Class 2809 1.2 
 get__Q34java4util9HashtablePQ34java4la 2774 1.2
 _Jv_MonitorEnter 2765 1.2
 _Jv_equalUtf8Consts__FP13_Jv_Utf8Const 2690 1.2
 hashUtf8String__FPci 2567 1.1
2nd column is a count of samples, 3rd is the percentage of total samples for the
entire process. Samples below 1% are omitted.
I wouldn't have guessed that _Jv_FindArrayClass is hit so hard. It also
surprises me that so many cycles are spent in _Jv_AllocObject, which doesn't do
much besides call the GC allocator.
I have similar counts for the GC:
 GC_malloc 12232 5.3
 GC_mark_from_mark_stack 6759 3.0
 GC_malloc_atomic 3559 1.6
 GC_reclaim_clear 2425 1.1
and libc:
 memset_loop 11041 4.8
 memcpy 5963 2.6
 __divqu 5217 2.3
 __pthread_lock 3488 1.5
 __pthread_getspecific 2523 1.1
 __pthread_unlock 2512 1.1
(__divqu is an Alpha thing... no integer division in h/w.) Overall, everything
seems well-balanced, no single function consumes more than about 5% overall
runtime.
In theory I can measure relative times spent in
application/libgcj/libgcjgc/libc/kernel mode, once I have shared libs working
again. *double sigh*
I tried some event counters to compare gcj to the Compaq JDK:
 VM Wall time Instructions Cycles CPI
 ------------------ --------- ------------ ---------- -----
 gcj, -O2 -fno-defer-pop 30.7 s 5.9x10^9 16.3x10^9 2.76
 JDK 1.2.2, symcjit 101.8 s 16.1x10^9 55.7x10^9 3.45
Not a fair comparison, really. The JDK is fast on single method execution, but
far inferior at real programs, because in spite of the JIT it is based on Sun's
original "Classic" VM without the runtime speedups.
The CPI is interesting, since it gives a relative measure of pipeline
performance. This machine is capable of issuing two integer insns per cycle, in
theory. 2.76 is not a particularly good result (lower is better) though it
fares far better than the JDK. Reasons for pipeline stalls are cache misses,
branch mispredictions and poorly scheduled code (read-after-write conflicts,
etc.). It should improve once we can do adequate method inlining. 
I'd love to know how to get at similar hardware event counters on x86. I'm not
familiar with any free tools for that arch. Since I can't do reasonable-cost
multithreaded profiling in software, I'm sticking with Iprobe on Alpha.
(Sorry about the long post. This is perhaps far more detail than you expected,
and I've just hit the tip of the iceberg. The real fun is figuring out how to
interpret the data. I'll keep you posted on any unusual results.)
--
Jeff Sturm
jeff.sturm@commerceone.com