string hash performance

Wed Aug 7 16:45:00 GMT 2002

Is it surprising that the IBM JDK would cache the hashCode? If Object.hashCode() computes hash values based on address, and the garbage collector compacts, then you need to have someplace to store previously computed hashCodes when an object is moved. (That could be in an external address-indexed table, I guess.) It seems at least possible that String.hashCode could use the same space.
Since we don't move objects, gcj has a choice about whether to store String hashCodes. I would guess the current choice is the right one.
Hans
> -----Original Message-----
> From: Jeff Sturm [mailto:jsturm@one-point.com]
> Sent: Wednesday, August 07, 2002 3:36 PM
> To: java@gcc.gnu.org
> Subject: string hash performance
>>> Continuing the recent performance and benchmarking theme, 
> here's a puzzle
> I found while experimenting with String.hashCode().
>> With gcj, the elapsed time of hashCode() is nearly linear with string
> size, as expected. With the IBM JIT however, hashCode() 
> seemed to run in
> constant time.
>> Unsure if the JIT was fooling my benchmark, I modified the 
> benchmark to
> use the x86 TSC counter so I could measure the first few hashCode()
> computations. Below are results for gcj 3.1 and IBM JDK 1.3. The
> columns are results (TSC delta) for each of 10 iterations. 
> Input is 11
> strings of increasing (power of two) length:
>> -- gcj --
>> length 1: 428 241 169 158 162 160 162 160 162 158
> length 2: 179 191 163 177 151 179 167 177 151 177
> length 4: 183 203 199 189 181 191 174 189 174 189
> length 8: 203 211 188 199 188 199 186 199 238 193
> length 16: 242 244 234 234 234 232 234 232 234 234
> length 32: 317 315 311 303 309 303 309 305 311 303
> length 64: 495 447 465 445 458 447 465 445 470 445
> length 128: 795 754 759 752 765 754 759 752 767 752
> length 256: 1426 1359 1358 1343 1351 1345 1358 1343 1420 1343
> length 512: 2628 2576 2573 2566 2574 2568 2571 2566 2574 2566
> length 1024: 5035 4971 4972 4952 4960 4954 4970 4952 4962 4961
>> -- JDK --
>> length 1: 10542 1124 974 1131 1109 1109 1109 1109 1109 1109
> length 2: 3485 1132 1191 1139 1214 1131 1132 1132 1131 1274
> length 4: 3576 1289 1349 1289 1289 1289 1394 1342 1334 1334
> length 8: 3435 1132 1109 1266 1199 1274 1296 1364 1146 1414
> length 16: 4311 1117 1117 1117 1154 1116 1117 1117 1116 1117
> length 32: 6179 1116 1116 1117 1139 1117 1139 1184 1169 1139
> length 64: 8984 1192 1139 1139 1146 1139 1139 1139 1139 1139
> length 128: 15525 1064 1056 1071 1184 1057 1064 1064 1169 1116
> length 256: 27995 1117 1116 1116 1117 1117 1117 1116 1116 1026
> length 512: 55099 1109 1109 1109 1109 1109 1282 1206 1161 1251
> length 1024: 105568 1775 1264 1272 1242 1242 1242 1242 1242 1242
>> For the JDK, the first result is high, presumably due to JIT 
> compilation.
> On successive inputs the first iteration grows expectedly, 
> but the 2nd and
> later iterations are almost identical throughout.
>> I'd wondered if the JIT wasn't fooling my benchmark again, somehow
> inlining hashCode() and hoisting the result. Further tests 
> suggest that
> isn't the case. Could it be the JDK stores hash codes somewhere in a
> java.lang.String? I would've thought the implementation 
> would favor size
> over speed.
>> For gcj, I used CNI for the native rdtsc hook. The JDK test must use
> JNI. Indeed the initial gcj numbers are far lower than the 
> JDK's. But
> I can't explain the high first interval for longer inputs on the JDK
> test, unless the JDK's hashCode() algorithm really is far 
> less efficient
> than that in libgcj.
>> (Either that, or my benchmark is flawed and I haven't 
> realized it yet...)
>> Test sources and input are attached.
>> Jeff
>