I've isolated this function, cons, for hash-consing in C. It is the bottleneck of my program. How could its performance be improved?
var32 cons(var32 x, var32 y) {
    var32 ptr;
    uid id, hashedId;
    map newMap, currentMap;
    u32 colls;
#ifdef __DEBUG__
    ++consCount;
    colls = 0;
#endif
    // Before anything, compute the unique id of that node.
    id = (((uid)(x.ptr)) << 32) | (uid)y.ptr;
    // Try to find the node. If it already exists, return its ptr.
    hashedId = id % MAP_HASH_SIZE;
    currentMap = mapByHashedId[hashedId];
    if (currentMap) do {
        if (mapId[currentMap] == id)
            return mapPtr[currentMap];
#ifdef __DEBUG__
        ++colls;
#endif
    } while ((currentMap = mapNext[currentMap]));
#ifdef __DEBUG__
    if (colls < 16) ++collisionCount[colls];
    ++allocCount;
#endif
    // If it does not exist, then alloc a new ptr for it.
    ptr = nodeNewPtr[nodeNewPtrIndex--];
    // Add that new ptr to the map.
    newMap = mapNewPtr[mapNewPtrIndex--];
    if (mapByHashedId[hashedId])
        mapNext[newMap] = mapByHashedId[hashedId];
    mapId[newMap] = id;
    mapPtr[newMap] = ptr;
    mapByHashedId[hashedId] = newMap;
    // Fill that new ptr.
    nodeLeft[GETPTR(ptr)] = x;
    nodeRight[GETPTR(ptr)] = y;
    return ptr;
}
Your code link has gone dead, so I cannot see the complete source. (rolfl, Oct 25, 2014)

I'm sorry, @rolfl. Fixed it now. (MaiaVictor, Oct 26, 2014)
1 Answer
Your code here is hard to follow because its style and formatting are inconsistent.
Structures like this:
currentMap = mapByHashedId[hashedId];
if (currentMap) do {
    if (mapId[currentMap] == id)
        return mapPtr[currentMap];
#ifdef __DEBUG__
    ++colls;
#endif
} while ((currentMap = mapNext[currentMap]));
are irregular. A better format would be:
currentMap = mapByHashedId[hashedId];
while (currentMap)
{
    if (mapId[currentMap] == id)
        return mapPtr[currentMap];
#ifdef __DEBUG__
    ++colls;
#endif
    currentMap = mapNext[currentMap];
}
Additionally, this is the only loop in your method. All the other code is simple array and pointer manipulation. This method can only be slow for one of three reasons:
- it is called many, many times
- the while loop iterates many many times
- there's something else you're not showing.
Assuming the number of calls is appropriate, and you're not missing anything, the only alternative is that the while loop iterates often. This indicates there are many collisions. Collisions happen for two reasons:
- your MAP_HASH_SIZE is too small
- your hashing function is irregular
I am going to go with both of those. In fact, I am going to guess that MAP_HASH_SIZE is something like 8192, or some other power of 2. The reason this is my (educated) guess is that your hash function is vulnerable to inconsistencies.
Consider how you build your function, using two 32-bit values with the x value shifted and or'd with the y value. Let's also use some test data:
x y
--- ---
0x1 0x1
0x2 0x1
0x3 0x1
0x4 0x1
Now, in your function you would produce the id by shifting the x value, and the hashed id by taking the modulo with the table size. Let's assume the table size is 32768 (a nice big value that should eliminate collisions, right?):
x y id hashedId
--- --- ---------- --------
0x1 0x1 0x100000001 0x1
0x2 0x1 0x200000001 0x1
0x3 0x1 0x300000001 0x1
0x4 0x1 0x400000001 0x1
Because your MAP_HASH_SIZE is likely a power of 2, the % MAP_HASH_SIZE is essentially the same as a bitmask of the low-order bits (0x7fff for size 32768).
If all your y values are similar, you will end up with many collisions in just a few buckets of the hash table.
For the hash function I would consider guaranteeing a minimum size (say 16 bits, or 64K), and then having a function which at least shifts some of the x value into that range and does an xor:
hashedId = (((uid)x.ptr << 7) ^ ((uid)y.ptr * 31)) % MAP_HASH_SIZE;
The above function will give you better distribution (perhaps not perfect) even if the hash size is a power of 2 above about 16 bits.
Great answer. Since this question is kinda old, I've learned some of those issues in the meanwhile, and the code has improved in many (some obvious) ways. I'll be using your hash function. Thanks! (MaiaVictor, Oct 26, 2014)