
Maybe I'm being misled by my profiler (NetBeans), but I'm seeing some odd behavior, and I'm hoping someone here can help me understand it.

I am working on an application that makes heavy use of rather large hash tables (keys are longs, values are objects). Performance with the built-in Java hash table (HashMap specifically) was very poor, and after trying some alternatives -- Trove, Fastutils, Colt, Carrot -- I started working on my own.

The code is very basic, using a double-hashing strategy. It works well and shows the best performance of all the options I've tried so far.

The catch is that, according to the profiler, lookups into the hash table are the single most expensive method in the entire application -- despite the fact that other methods are called many more times and/or do a lot more logic.

What really confuses me is that the lookups are called by only one class; the calling method does the lookup and processes the results. Both are called nearly the same number of times, and the method that calls the lookup has a lot of logic in it to handle the result, yet it is about 100x faster.

Below is the code for the hash lookup. It's basically just two accesses into an array (the functions that compute the hash codes are, according to profiling, virtually free). I don't understand how this bit of code can be so slow, since it is just array access, and I don't see any way of making it faster.

Note that the code simply returns the bucket matching the key; the caller is expected to process the bucket. size is hash.length / 2; hash1 does lookups in the first half of the hash table, hash2 in the second half. key_index is a final int field on the hash table, passed into the constructor, and the values array on the Entry objects is a small array of longs, usually of length 10 or less.

Any thoughts people have on this are much appreciated.

Thanks.

public final Entry get(final long theKey) {
    // Probe the first half of the table.
    Entry aEntry = hash[hash1(theKey, size)];
    if (aEntry != null && aEntry.values[key_index] != theKey) {
        // First probe missed; try the second half.
        aEntry = hash[hash2(theKey, size)];
        if (aEntry != null && aEntry.values[key_index] != theKey) {
            return null;
        }
    }
    return aEntry;
}
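
For context, a caller might use it along these lines (a hypothetical sketch; the question doesn't show the real calling code, and table and someKey are made-up names):

// Hypothetical caller: fetch the bucket, then process it on a hit.
Entry bucket = table.get(someKey);
if (bucket != null) {
    // The key lives at values[key_index]; the remaining slots
    // hold the associated data.
    long v = bucket.values[0];
    // ... process v ...
}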

Edit: the code for hash1 and hash2 (note that both rely on the table size being a power of two, since they mask with hashTableSize - 1):

private static int hash1(final long key, final int hashTableSize) {
    // Mask the low bits: an index in [0, hashTableSize).
    return (int) (key & (hashTableSize - 1));
}

private static int hash2(final long key, final int hashTableSize) {
    // Mix the key, then offset into the second half:
    // an index in [hashTableSize, 2 * hashTableSize).
    return (int) (hashTableSize + ((key ^ (key >> 3)) & (hashTableSize - 1)));
}
asked Nov 5, 2010 at 14:47
  • Can we see the code in hash1/hash2? Array access in Java is intrinsically slower than in some other languages because of the bounds checking. But for it to be a bottleneck in any application is surprising. Commented Nov 5, 2010 at 15:02
  • Please post the code for methods hash1 and hash2. Commented Nov 5, 2010 at 15:05
  • By the way, this doesn't really seem to be double hashing... you use chaining to avoid collisions, don't you? Double hashing is when you have one hash function to find the base index and a second to determine the next address to check while there is a collision. Commented Nov 5, 2010 at 15:06
  • @Mark, yes, this is true, I misspoke, thanks for the clarification. Commented Nov 5, 2010 at 15:18
  • @Michael, When I perform get() 10 million times, it takes about 2.8 ns per lookup. How fast do you need it to be? Commented Nov 5, 2010 at 17:53

4 Answers


Nothing in your implementation strikes me as particularly inefficient. I'll admit I don't really follow your hashing/lookup strategy, but if you say it's performant in your circumstances, I'll believe you.

The only thing that I would expect might make some difference is to move the key out of the values array of Entry.

Instead of having this:

class Entry {
    long[] values;
}
//...
if ( entry.values[key_index] == key ) { //...

Try this:

class Entry {
    long key;
    long[] values;
}
//...
if ( entry.key == key ) { //...

Instead of incurring the cost of accessing a member, plus doing a bounds check, and then getting a value out of the array, you should incur just the cost of accessing the member.
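
Applied to the get() method from the question, the check would become something like this (a sketch, assuming Entry gains the key field shown above):

public final Entry get(final long theKey) {
    Entry aEntry = hash[hash1(theKey, size)];
    // One plain field read per comparison, no array bounds check.
    if (aEntry != null && aEntry.key != theKey) {
        aEntry = hash[hash2(theKey, size)];
        if (aEntry != null && aEntry.key != theKey) {
            return null;
        }
    }
    return aEntry;
}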

Is there a random-access data type faster than an array?

I was interested in the answer to this question, so I set up a test environment. This is my Array interface:

interface Array {
    long get(int i);
    void set(int i, long v);
}

This "Array" has undefined behaviour when indices are out of bounds. I threw together the obvious implementation:

class NormalArray implements Array {
    private long[] data;

    public NormalArray(int size) {
        data = new long[size];
    }

    @Override
    public long get(int i) {
        return data[i];
    }

    @Override
    public void set(int i, long v) {
        data[i] = v;
    }
}

And then a control:

class NoOpArray implements Array {
    @Override
    public long get(int i) {
        return 0;
    }

    @Override
    public void set(int i, long v) {
    }
}

Finally, I designed an "array" where the first 10 indices are hardcoded members. The members are set/selected through a switch:

class TenArray implements Array {
    private long v0, v1, v2, v3, v4, v5, v6, v7, v8, v9;
    private long[] extras;

    public TenArray(int size) {
        if (size > 10) {
            extras = new long[size - 10];
        }
    }

    @Override
    public long get(final int i) {
        switch (i) {
            case 0: return v0;
            case 1: return v1;
            case 2: return v2;
            case 3: return v3;
            case 4: return v4;
            case 5: return v5;
            case 6: return v6;
            case 7: return v7;
            case 8: return v8;
            case 9: return v9;
            default: return extras[i - 10];
        }
    }

    @Override
    public void set(final int i, final long v) {
        switch (i) {
            case 0: v0 = v; break;
            case 1: v1 = v; break;
            case 2: v2 = v; break;
            case 3: v3 = v; break;
            case 4: v4 = v; break;
            case 5: v5 = v; break;
            case 6: v6 = v; break;
            case 7: v7 = v; break;
            case 8: v8 = v; break;
            case 9: v9 = v; break;
            default: extras[i - 10] = v;
        }
    }
}

I tested it with this harness:

import java.util.Random;

public class ArrayOptimization {
    public static void main(String[] args) {
        int size = 10;
        long[] data = new long[size];
        Random r = new Random();
        for (int i = 0; i < data.length; i++) {
            data[i] = r.nextLong();
        }
        Array[] a = new Array[] {
            new NoOpArray(),
            new NormalArray(size),
            new TenArray(size)
        };
        for (;;) {
            for (int i = 0; i < a.length; i++) {
                testSet(a[i], data, 10000000);
                testGet(a[i], data, 10000000);
            }
        }
    }

    private static void testGet(Array a, long[] data, int iterations) {
        long nanos = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            for (int j = 0; j < data.length; j++) {
                data[j] = a.get(j);
            }
        }
        long stop = System.nanoTime();
        System.out.printf("%s/get took %fms%n", a.getClass().getName(),
                (stop - nanos) / 1000000.0);
    }

    private static void testSet(Array a, long[] data, int iterations) {
        long nanos = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            for (int j = 0; j < data.length; j++) {
                a.set(j, data[j]);
            }
        }
        long stop = System.nanoTime();
        System.out.printf("%s/set took %fms%n", a.getClass().getName(),
                (stop - nanos) / 1000000.0);
    }
}

The results were somewhat surprising. TenArray performs non-trivially faster than NormalArray (for sizes <= 10). Subtracting the overhead (using the NoOpArray numbers as a baseline), TenArray takes ~65% of the time of the normal array. So if you know the likely maximum size of your array, I suppose it is possible to exceed the speed of a plain array. I would imagine the switch involves either fewer bounds checks or more efficient ones than an array access does.

NoOpArray/set took 953.272654ms
NoOpArray/get took 891.514622ms
NormalArray/set took 1235.694953ms
NormalArray/get took 1148.091061ms
TenArray/set took 1149.833109ms
TenArray/get took 1054.040459ms
NoOpArray/set took 948.458667ms
NoOpArray/get took 888.618223ms
NormalArray/set took 1232.554749ms
NormalArray/get took 1120.333771ms
TenArray/set took 1153.505578ms
TenArray/get took 1056.665337ms
NoOpArray/set took 955.812843ms
NoOpArray/get took 893.398847ms
NormalArray/set took 1237.358472ms
NormalArray/get took 1125.100537ms
TenArray/set took 1150.901231ms
TenArray/get took 1057.867936ms
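
To see where the ~65% figure comes from, take the first round of numbers and subtract the NoOpArray baseline: for get, (1054.0 - 891.5) / (1148.1 - 891.5) ≈ 0.63; for set, (1149.8 - 953.3) / (1235.7 - 953.3) ≈ 0.70.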

Now, whether you can get speeds faster than an array in practice, I'm not sure; obviously this approach incurs whatever overhead is associated with the interface/class/method dispatch.

answered Nov 5, 2010 at 15:34
  • This was something I thought about doing initially, but the Entry objects are shared; they can be in different hash tables at the same time, keyed on different values in their values array. Commented Nov 5, 2010 at 16:36
  • fwiw, I changed the sharing of the Entry objects since that was easy, and implemented this idea. Total time was reduced from 23% of the overall execution to a little over 12%. I would not have expected the bounds check to be that expensive. Any other ways to avoid bounds checks? I don't suppose they can be compiled out? =) Commented Nov 5, 2010 at 17:01
  • @Michael, the compiler does next to no optimising; the JVM does all the optimisation. What is the average lookup time that you see? Commented Nov 5, 2010 at 17:30
  • @Michael: I did some testing/analysis and added it to my answer. Nothing I'd suggest actually using, but I did find the results interesting. Commented Nov 5, 2010 at 17:46
  • @Peter, doing the math from the profiler results, knowing that it takes x% of y execution time over z iterations, it's ~4.5ns. Writing a test program that actually times the accesses (hash table of 500k elements, 500k random accesses), it's ~160ns, or ~140ns with Mark's suggested change. The test harness only tests hits, so that's the access time for a hit; the lower bound for a miss is probably only a ns or two, and there are probably lots of misses during a normal invocation, which is why the overall calculated time is somewhat small compared to the average time for a hit. Commented Nov 5, 2010 at 17:54

Most likely you are being partially misled in your interpretation of the profiler's results. Profilers are notorious for overinflating the performance impact of small, frequently called methods. In your case, the profiling overhead for the get() method is probably larger than the actual processing done in the method itself. The situation is worsened further because the instrumentation also interferes with the JIT's ability to inline methods.

As a rule of thumb for this situation: if the total processing time for a piece of work of known length increases more than two- to threefold when running under the profiler, the profiling overhead will give you skewed results.

To verify that your changes actually have an impact, always measure performance improvements without the profiler, too. The profiler can point you toward bottlenecks, but it can also deceive you into looking at places where nothing is wrong.

Array bounds checking can have a surprisingly large impact on performance (if you do comparatively little else), but it can also be hard to separate cleanly from general memory-access penalties. In some trivial cases, the JIT may be able to eliminate the checks (there have been efforts toward bounds-check elimination in Java 6), but this is AFAIK mostly limited to simple loop constructs like for (x = 0; x < array.length; x++). Under some circumstances you may be able to replace array access with plain member access, completely avoiding the bounds checks, but that is limited to the rare cases where you access your array exclusively through constant indices. I see no way to apply it to your problem.
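
To illustrate the loop shapes in question (my sketch, not from the original answer): in the first method the JIT can prove the index is always in range and drop the per-access check; in the second it generally cannot.

static long sumDirect(long[] a) {
    long sum = 0;
    // Canonical form: 0 <= i < a.length is provable, so the JIT
    // can eliminate the bounds check inside the loop.
    for (int i = 0; i < a.length; i++) {
        sum += a[i];
    }
    return sum;
}

static long sumIndirect(long[] a, int[] idx) {
    long sum = 0;
    // Indirect indexing: a[idx[i]] cannot be proven in range,
    // so each access keeps its bounds check.
    for (int i = 0; i < idx.length; i++) {
        sum += a[idx[i]];
    }
    return sum;
}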

The change suggested by Mark Peters is most likely faster not solely because it eliminates a bounds check, but also because it alters the locality properties of your data structures in a more cache-friendly way.

answered Nov 5, 2010 at 17:28
  • A lot of the codebase is little more than array manipulations; I've been wondering if I was getting bitten by the overhead of the array bounds checks, but you make a good point about data locality. Commented Nov 5, 2010 at 19:07

Many profilers tell you very confusing things, partly because of how they work and partly because people have funny ideas about performance to begin with. For example, you're wondering how many times functions are called, and you're looking at code and thinking that if it contains a lot of logic it must be slow.

There's a very simple way to think about this stuff, that makes it very easy to understand what's going on.

  • First of all, think in terms of the percentage of time a routine or statement is active, rather than the number of times it is called or its average execution time. The reason is that percentage is relatively unaffected by irrelevant issues like competing processes or I/O, and it saves you having to multiply call counts by average execution times and divide by the total time just to see whether something is big enough to even care about. Also, percentage tells you, bottom line, how much fixing it could potentially reduce the overall execution time.

  • Second, what I mean by "active" is "on the stack", where the stack includes the currently running instruction and all the calls "above" it back to "call main". If a routine is responsible for 10% of the time, including routines that it calls, then during that time it is on the stack. The same is true of individual statements or even instructions. (Ignore "self time" or "exclusive time". It's a distraction.)

  • Profilers that put timers and counters on functions can only give you some of this information. Profilers that only sample the program counter tell you even less. What you need is something that samples the call stack and reports to you by line (not just by function) the percent of stack samples containing that line. It's also important that they sample the stack a) during I/O or other blockage, but b) not while waiting for user input.

There are profilers that can do this. I'm not sure about Java.

If you're still with me, let me throw out another ringer. You're looking for things you can optimize, right? And only things that have a large enough percentage to be worth the trouble, like 10% or more? A line of code costing 10% is on the stack 10% of the time. That means if 20,000 samples are taken, it is on about 2,000 of them; if 20 samples are taken, it is on about 2 of them, on average. Now, you're trying to find the line, right? Does it really matter if the percentage is off a little, as long as you find it? That's another of those happy myths of profilers: that precision of timing matters. For finding problems worth fixing, 20,000 samples won't tell you much more than 20 samples will. So what do I do? Just take the samples by hand and study them. Code worth optimizing will simply jump out at me.
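
For what it's worth, here is one crude way to take such samples by hand inside a JVM (my sketch, not part of the original answer; in practice a thread dump via Ctrl-Break or the jstack tool does the same job):

import java.util.Map;

// Illustrative only: start this as a daemon thread inside the JVM you
// want to inspect. It dumps every thread's stack a few times; lines
// that recur across samples are where the time is going.
public final class PoorMansSampler implements Runnable {
    public static void start() {
        Thread t = new Thread(new PoorMansSampler(), "sampler");
        t.setDaemon(true);
        t.start();
    }

    @Override
    public void run() {
        try {
            for (int sample = 0; sample < 20; sample++) {
                Thread.sleep(500); // crude sampling interval
                System.out.println("=== sample " + sample + " ===");
                for (Map.Entry<Thread, StackTraceElement[]> e
                        : Thread.getAllStackTraces().entrySet()) {
                    System.out.println(e.getKey().getName());
                    for (StackTraceElement frame : e.getValue()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        } catch (InterruptedException ignored) {
        }
    }
}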

Finally, there's a big gob of good news. There are probably multiple things you could optimize. Suppose you fix a 20% problem and make it go away. Overall time shrinks to 4/5 of what it was, but the other problems aren't taking any less time, so their percentages are now 5/4 of what they were, because the denominator got smaller. Percentage-wise they got bigger and easier to find. This effect snowballs, allowing you to really squeeze the code.

answered Nov 5, 2010 at 21:24

You could try using a memoizing or caching strategy to reduce the number of actual lookups. Another thing you could try, if you're very desperate, is a native array, since indexing those is unbelievably fast, and JNI shouldn't add too much overhead if you're using parameters like longs that don't require marshalling.
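
A minimal sketch of the caching idea (hypothetical: MyHashTable stands in for the question's table, and this assumes keys repeat often and the table isn't modified between lookups, since no invalidation is shown):

// Hypothetical direct-mapped cache in front of the table's get().
public final class CachedTable {
    private static final int CACHE_SIZE = 256; // power of two
    private final long[] cachedKeys = new long[CACHE_SIZE];
    private final Entry[] cachedEntries = new Entry[CACHE_SIZE];
    private final MyHashTable table; // the question's hash table

    public CachedTable(MyHashTable table) {
        this.table = table;
    }

    public Entry get(final long key) {
        int slot = (int) (key & (CACHE_SIZE - 1));
        if (cachedEntries[slot] != null && cachedKeys[slot] == key) {
            return cachedEntries[slot]; // cache hit
        }
        Entry e = table.get(key); // fall back to the real lookup
        cachedKeys[slot] = key;
        cachedEntries[slot] = e;
        return e;
    }
}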

answered Nov 6, 2010 at 15:13
