string hash performance

Wed Aug 7 15:36:00 GMT 2002

Continuing the recent performance and benchmarking theme, here's a puzzle
I found while experimenting with String.hashCode().
With gcj, the elapsed time of hashCode() is nearly linear with string
size, as expected. With the IBM JIT however, hashCode() seemed to run in
constant time.
Unsure if the JIT was fooling my benchmark, I modified the benchmark to
use the x86 TSC counter so I could measure the first few hashCode()
computations. Below are results for gcj 3.1 and IBM JDK 1.3. The
columns are results (TSC delta) for each of 10 iterations. Input is 11
strings of increasing (power of two) length:
-- gcj --
length 1: 428 241 169 158 162 160 162 160 162 158
length 2: 179 191 163 177 151 179 167 177 151 177
length 4: 183 203 199 189 181 191 174 189 174 189
length 8: 203 211 188 199 188 199 186 199 238 193
length 16: 242 244 234 234 234 232 234 232 234 234
length 32: 317 315 311 303 309 303 309 305 311 303
length 64: 495 447 465 445 458 447 465 445 470 445
length 128: 795 754 759 752 765 754 759 752 767 752
length 256: 1426 1359 1358 1343 1351 1345 1358 1343 1420 1343
length 512: 2628 2576 2573 2566 2574 2568 2571 2566 2574 2566
length 1024: 5035 4971 4972 4952 4960 4954 4970 4952 4962 4961
-- JDK --
length 1: 10542 1124 974 1131 1109 1109 1109 1109 1109 1109
length 2: 3485 1132 1191 1139 1214 1131 1132 1132 1131 1274
length 4: 3576 1289 1349 1289 1289 1289 1394 1342 1334 1334
length 8: 3435 1132 1109 1266 1199 1274 1296 1364 1146 1414
length 16: 4311 1117 1117 1117 1154 1116 1117 1117 1116 1117
length 32: 6179 1116 1116 1117 1139 1117 1139 1184 1169 1139
length 64: 8984 1192 1139 1139 1146 1139 1139 1139 1139 1139
length 128: 15525 1064 1056 1071 1184 1057 1064 1064 1169 1116
length 256: 27995 1117 1116 1116 1117 1117 1117 1116 1116 1026
length 512: 55099 1109 1109 1109 1109 1109 1282 1206 1161 1251
length 1024: 105568 1775 1264 1272 1242 1242 1242 1242 1242 1242
For the JDK, the first result is high, presumably due to JIT compilation.
On successive inputs the first iteration grows expectedly, but the 2nd and
later iterations are almost identical throughout.
I'd wondered if the JIT wasn't fooling my benchmark again, somehow
inlining hashCode() and hoisting the result. Further tests suggest that
isn't the case. Could it be the JDK stores hash codes somewhere in a
java.lang.String? I would've thought the implementation would favor size
over speed.
For gcj, I used CNI for the native rdtsc hook. The JDK test must use
JNI. Indeed the initial gcj numbers are far lower than the JDK's. But
I can't explain the high first interval for longer inputs on the JDK
test, unless the JDK's hashCode() algorithm really is far less efficient
than that in libgcj.
(Either that, or my benchmark is flawed and I haven't realized it yet...)
Test sources and input are attached.
Jeff
-------------- next part --------------
import java.io.*;
public class Hash {
 static {
 System.loadLibrary("rdtsc");
 }
 native long rdtsc();
 int test(String line) {
 int h = 0;
 System.out.print("length " + line.length() + ": ");
 for (int n = 0; n < 10; n++) {
 long s = rdtsc();
 h = line.hashCode();
 long f = rdtsc();
 System.out.print(f - s);
 System.out.print(' ');
 }
 System.out.println();
 return h;
 }
 public static void main(String[] args) {
 Hash hash = new Hash();
 try {
 BufferedReader reader = new BufferedReader(
 new InputStreamReader(System.in));
 while (true) {
 String line = reader.readLine();
 if (line == null) break;
 hash.test(line);
 }
 } catch (Throwable t) {
 t.printStackTrace();
 }
 }
}
-------------- next part --------------
#if CNI
#include "Hash.h"
#include <gcj/cni.h>
jlong Hash::rdtsc() {
 asm("rdtsc");
}
#else
#include <jni.h>
jlong Java_Hash_rdtsc(JNIEnv *env, jobject obj) {
 asm("rdtsc");
}
#endif
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: strings.txt
URL: <http://gcc.gnu.org/pipermail/java/attachments/20020807/90504aa0/attachment.txt>