Perfect Hashing Implementation

Question 1

I'm building a hash table for this problem:

Implement a class for storing a set of integers:
class FixedSet {
public:
 FixedSet();
 void Initialize(const vector<int>& numbers);
 bool Contains(int number) const;
};
After the call of the Initialize() function, FixedSet contains integers from the input vector. The set of numbers stored by FixedSet does not change (until the Initialize() function is called again).

The Contains() function returns true if number is in the set, or false otherwise. The Initialize() function should run in expected O(n), where n is the number of elements in numbers. The memory consumption should not exceed O(n). The Contains() function should run in guaranteed O(1).

Using this class, solve the following problem. The input has a set of different numbers followed by a set of requests, each of which is represented by an integer number. For each request, you need to determine whether or not the number is in the set.

The sizw of set is not bigger than 100000 elements; the absolute values of elements |val| <= 10⁹.

I've implemented perfect hashing from Cornmen:

#include <time.h>
#include <algorithm>
#include <iostream>
#include <vector>
#include <list>
#include <stdio.h>
using std::cout;
using std::cin;
using std::endl;
using std::vector;
using std::list;
typedef long long int long_int;
const int max_int = 1000000001; // value, that could't be in the table. Analog of NULL.
// function for calculation of hash 
inline int hash(long_int a_prime, long_int b_prime, int p_prime, int table_size, int key)
{
 return (((a_prime * key + b_prime) % p_prime) % table_size);
}
// class for mini-hash table in cells of main hash-table 
class Bucket
{
 vector<int> _cells;
 int size; // the size of mini-table should be greater then 4 
 long_int hash_a;
 long_int hash_b;
 int prime;
public:
 Bucket() {}
 void Initialize()
 {
 prime = 17;
 hash_a = std::rand() % prime;
 hash_b = 1 + std::rand() % (prime - 1);
 }
 // construct hash table from list of elements
 void Construct(list<int>& input)
 {
 if (input.empty())
 {
 size = 0;
 return;
 }
 size = input.size() * input.size();
 bool flag = true;
 // while there is no collisions in table 
 while (flag)
 {
 _cells.assign(size, max_int);
 Initialize();
 list<int>::iterator elem = input.begin();
 while (elem != input.end() && flag)
 {
 int hashKey = hash(hash_a, hash_b, prime, size, *elem);
 if (hashKey < 0) 
 hashKey = - hashKey;
 // if collision then construct hash table from the begining!
 if (_cells[hashKey] != max_int) 
 {
 flag = false;
 break;
 }
 _cells[hashKey] = *elem;
 ++elem;
 }
 if (!flag)
 flag = true;
 else 
 flag = false;
 }
 }
 bool Contains(int elem)
 {
 if (size == 0)
 return false;
 int hashKey = hash(hash_a, hash_b, prime, size, elem);
 if (hashKey < 0) 
 hashKey = - hashKey;
 if (_cells[hashKey] == elem)
 return true;
 return false;
 }
};
// class for main hash table
class FixedSet
{
 int _tableSize;
 long_int _hashFuncA;
 long_int _hashFuncB;
 int _primeNumber;
 vector<list<int> > _elementsInCells;
 vector<Bucket> _buckets;
public:
 FixedSet()
 {
 _primeNumber = 100013; // the maximum prime number
 _hashFuncA = std::rand() % _primeNumber;
 _hashFuncB = 1 + std::rand() % (_primeNumber - 1);
 }
 void setTableSize(int size)
 {
 _tableSize = size;
 _buckets.resize(size);
 }
 void Initialize(const vector<int>& numbers)
 { 
 _tableSize = numbers.size();
 _buckets.resize(numbers.size());
 _elementsInCells.resize(numbers.size());
 for (int i = 0; i < numbers.size(); ++i)
 {
 int hashKey = hash(_hashFuncA, _hashFuncB, _primeNumber, _tableSize, numbers[i]);
 if (hashKey < 0) 
 hashKey = - hashKey;
 _elementsInCells[hashKey].push_back(numbers[i]);
 }
 for (int i = 0; i < numbers.size(); ++i)
 {
 _buckets[i].Construct(_elementsInCells[i]);
 }
 }
 bool Contains(int number)
 {
 int hashKey = hash(_hashFuncA, _hashFuncB, _primeNumber, _tableSize, number);
 if (hashKey < 0) 
 hashKey = - hashKey;
 return _buckets[hashKey].Contains(number);
 }
};
int main(int argc, char* argv[])
{
 clock_t begin, end;
 double time_spent;
 std::srand (time(NULL));
 int numberOfElements;
 scanf("%i", &numberOfElements);
 FixedSet fs;
 begin = clock();
 vector<int> inputVector;
 fs.setTableSize(numberOfElements);
 for (int i = 0; i < numberOfElements; ++i)
 {
 int elemValue;
 scanf("%d", &elemValue);
 inputVector.push_back(elemValue);
 }
 fs.Initialize(inputVector);
 end = clock();
 int numberOfElementsForSearch;
 scanf("%i", &numberOfElementsForSearch);
 for (int i = 0; i < numberOfElementsForSearch; ++i)
 {
 int elem;
 scanf("%d", &elem);
 if (fs.Contains(elem))
 {
 cout << "Yes" << endl;
 }
 else
 {
 cout << "No" << endl;
 }
 }
 time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
 cout << time_spent << endl;
 return 0;
}

It works rather fast, but on a vector of 100000 elements, the function Initialize() works for 0,7 ms in Release. But on a vector of 50000, this function works for 1.8 ms.

Could you explain why this is so? How can I improve my code to make it work faster?

Question 2

Ann, I don't understand why the code of Bucket is more or less repeated in FixedSet. I guess I am being stupid, but can you please explain.

Question 3

These 2 classes have the same interface, but implementations are different. So I've decided to create 2 classes, for hash-table of first level and hash-table of second level. May be it would be better to make abstact class HashTable with virtual methods Initialize() and Contains() and inherit my classes from it.

Question 4

But why do you need a second level at all?

Question 5

This is the idea of perfect hashing - to use hash table of second level for elements that have the same hash value (in average, if I use good hash function it won't be greater than 2 elements with the same hash). In this way I can check if an element in the table in O(1) time. But if I use linked list for collisions in the cells it won't be O(1).

Question 6

I tried running your implementation on my machine but with the following input (input is 25 random integers within the interval [0; 10^9]) Initialize never completes (it's stuck in the bucket while (flag) loop):

Note: Never in this context means I killed it after 10 or 15 seconds.

So it's not surprising that a list with 50,000 random elements can easily result in a higher setup time than 100,000 depending on the distribution of the values (the algorithm attempts to re-create the table every time there is a collision). As per given example it's easy to generate inputs which will make it run for a long time for very few values.

Given the problem restrictions I'd say you can get away with the fastest perfect hash function there is: Identity.

You could simply use a bitset with the values as indices and flip the bits on if the value is present. This requires a bitset with 10^9 bits - approx. 125MB of memory which is not all that much these days (at least on desktops).

Resulting implementation:

class FixedSet
{
 vector<bool> _set;
public:
 FixedSet() : _set(1000000000) { }
 void Initialize(const vector<int>& numbers)
 { 
 for (int i = 0; i < numbers.size(); ++i)
 {
 _set[numbers[i]] = true;
 }
 }
 bool Contains(int number)
 {
 return _set[number];
 }
};

Initialize takes approx 45ms on my machine - it doesn't matter if it's 1 or 100,000 input elements. Apparently allocating the memory for the bitset takes the bulk of the time. Flipping the bits on seems to hardly take any time at all.

Searching for 10,000 random elements takes about 3ms

Some notes:

In case you are wondering: std::vector<bool> is a specialization of a vector packing bools as bits. This container is considered somewhat as a failure in the C++ standard but happens to do exactly what we need here.
This implementation of FixedSet has space complexity O(1) (albeit a fairly big constant)
Initializing it has time complexity O(n) although the difference between 1 and 100,000 input elements is not measurable with the provided timing method
Looking up an element has time complexity of O(1) (indexed container access)
Disadvantages:
- Fairly big initial upfront cost for setup (memory allocation)
- Might not have been in the spirit of the assignment

Update

I think the main problem with your implementation is that the bucket size is fairly small (from testing it looks like you rarely get more than 6 or 7 collisions). Especially in buckets with more than 4 collisions your hash function actually throws away all hashes larger than 17 (which is p_prime you pass in from bucket) so you reduce your available bucket space. Also my suspicion is that in your hash function a_prime and b_prime should actually be, well, prime but rand() % prime does not yield a prime number.

While reading up on hashing, a popular version seems to be Cuckoo Hashing which has a constant look up time (if you use two hash functions it will perform at most two lookups) and also a constant worst case insert time for an element (even if you have to rehash).

I threw together a quick hack implementation which does not do re-hashing and just relies on the sets being big enough:

class FixedSet
{
 // min 36500, sweetspot 73000
 static const int _size = 73000;
 int _noentry;
 std::vector<int> _set1;
 std::vector<int> _set2;
 std::vector<int> _set3;
public:
 FixedSet() : _noentry(std::numeric_limits<int>::min()), _set1(_size, _noentry), _set2(_size, _noentry), _set3(_size, _noentry) { }
 void Initialize(const vector<int>& numbers)
 { 
 for (int i = 0; i < numbers.size(); ++i)
 {
 if (!add(numbers[i]))
 {
 std::ostringstream o;
 o << "Failed to insert after " << i << " elements, rehashing not implemented";
 throw new std::exception(o.str().c_str());
 }
 }
 }
 bool Contains(int number)
 {
 for (int round = 0; round < 3; ++round)
 {
 std::vector<int>& _set = (round % 3 == 0) ? _set1 : ((round % 3 == 1) ? _set2 : _set3);
 int h = hash(round + 1, number);
 if (number == _set[h])
 return true;
 }
 return false;
 }
private:
 int hash(int rounds, int number)
 {
 int withOffset = number;
 srand(withOffset);
 int h = bigRand() % _size;
 while (--rounds)
 {
 h = bigRand() % _size;
 }
 return h;
 }
 inline int bigRand()
 {
 return (rand() << 15) | rand(); // RAND_MAX is 0x7FFF in VS2010
 }
 bool add(int number)
 {
 int toInsert = number;
 for (int i = 0; i < _size; ++i)
 {
 int h = hash(i % 3 + 1, toInsert);
 std::vector<int>& _set = (i % 3 == 0) ? _set1 : ((i % 3 == 1) ? _set2 : _set3);
 int current = _set[h];
 if (current == _noentry)
 {
 _set[h] = toInsert;
 return true;
 }
 _set[h] = toInsert;
 toInsert = current;
 }
 return false;
 }
};

Notes:

I'm using 3 hash functions instead of the classic 2. Can't get the load factor high enough with just 2 functions for some reason.
The above is tweaked for no more than 100,000 input elements
36500 seems to be the minimum set size which yields a 91% load factor of the sets which is pretty much the limit for Cuckoo with 3 hash functions according to wikipedia.
With 36500 set size it takes approx. 74ms on my machine to insert 100,000 elements.
Doubling to 73000 yields a 75% reduction in run time to 17ms.
Further increases in the set size do not seem to have much effect.

Question 7

Unfortunatly, this algorithm fails with "Memory Limit" error in our check system on the first test. And my variant failed on 84 test with "Time Limit" error. So I need to find the way to speed up my algorithm, or reduce space complexity in your algorithm.

Question 8

The bitset idea would not be O(n) memory. Rather, the memory needed depends on the range of values.

Question 9

@200_success: Yes but the range is defined fixed as part of the problem description and therefor it is considered constant. As general solution you are right.

Question 10

Haha! Loophole!

Question 11

@AnnOrlova: Updated my answer

Question 12

Chris has given a great answer on the hashing so here are some comments on the implementation.

My main observation is that you should avoid duplicating so much code, however you implement it.

Your hash function should return the absolute (unsigned) value of the hash, as each use of hash is followed by an if:

int hashKey = hash(hash_a, hash_b, prime, size, *elem);
if (hashKey < 0)
 hashKey = - hashKey;

And perhaps modify the return type to that expected of vector indices (and noting that the parameters don't seem to need to be long long):

typedef std::vector<int>::size_type Hashkey;
static Hashkey hash(int a, int b, int p, int size, int key)
{
 return (Hashkey) abs((((a * key + b) % p) % size));
}

Also on types, where you have a defined range for a variable, the types in stdint.h can be useful - eg int64_t is guaranteed to be 64 bits which suits your input range while your long long int might be longer (also the int is usually omitted, ie. just long long).

Your use of flag to continue the outer loop in Bucket::Construct is very strange. I'd prefer to see the inner loop extracted to a function which should use a for loop:

for (auto e = input.begin(); e != input.end(); ++e) {
 ...
 if (cells[hashKey] != max_int) {
 return false;
 }
 ...
}
return true;

and called, as in:

while (createHashTable(...) == false) {
 // nothing
}

I don't understand why you (and many other people posting here) use leading underscores on variable names. For me these are just noise and detract from otherwise nice looking code (for example, your _cells). Also on variable names, some of your names are too long. Where the scope of a variable is small, its name can (and I think should) also be small.

Finally, your prime numbers (17 and 100013) would be better not embedded in the code (perhaps use a #define or const at the top). Also a comment on how these numbers were determined would be useful (and the origin of the hash function?).

ChrisWue ChrisWue 20.6k4 gold badges43 silver badges107 bronze badges · Accepted Answer · 2013-11-13 09:51:28Z

I tried running your implementation on my machine but with the following input (input is 25 random integers within the interval [0; 10^9]) Initialize never completes (it's stuck in the bucket while (flag) loop):

Note: Never in this context means I killed it after 10 or 15 seconds.

So it's not surprising that a list with 50,000 random elements can easily result in a higher setup time than 100,000 depending on the distribution of the values (the algorithm attempts to re-create the table every time there is a collision). As per given example it's easy to generate inputs which will make it run for a long time for very few values.

Given the problem restrictions I'd say you can get away with the fastest perfect hash function there is: Identity.

You could simply use a bitset with the values as indices and flip the bits on if the value is present. This requires a bitset with 10^9 bits - approx. 125MB of memory which is not all that much these days (at least on desktops).

Resulting implementation:

class FixedSet
{
 vector<bool> _set;
public:
 FixedSet() : _set(1000000000) { }
 void Initialize(const vector<int>& numbers)
 { 
 for (int i = 0; i < numbers.size(); ++i)
 {
 _set[numbers[i]] = true;
 }
 }
 bool Contains(int number)
 {
 return _set[number];
 }
};

Initialize takes approx 45ms on my machine - it doesn't matter if it's 1 or 100,000 input elements. Apparently allocating the memory for the bitset takes the bulk of the time. Flipping the bits on seems to hardly take any time at all.

Searching for 10,000 random elements takes about 3ms

Some notes:

In case you are wondering: std::vector<bool> is a specialization of a vector packing bools as bits. This container is considered somewhat as a failure in the C++ standard but happens to do exactly what we need here.
This implementation of FixedSet has space complexity O(1) (albeit a fairly big constant)
Initializing it has time complexity O(n) although the difference between 1 and 100,000 input elements is not measurable with the provided timing method
Looking up an element has time complexity of O(1) (indexed container access)
Disadvantages:
- Fairly big initial upfront cost for setup (memory allocation)
- Might not have been in the spirit of the assignment

Update

I think the main problem with your implementation is that the bucket size is fairly small (from testing it looks like you rarely get more than 6 or 7 collisions). Especially in buckets with more than 4 collisions your hash function actually throws away all hashes larger than 17 (which is p_prime you pass in from bucket) so you reduce your available bucket space. Also my suspicion is that in your hash function a_prime and b_prime should actually be, well, prime but rand() % prime does not yield a prime number.

While reading up on hashing, a popular version seems to be Cuckoo Hashing which has a constant look up time (if you use two hash functions it will perform at most two lookups) and also a constant worst case insert time for an element (even if you have to rehash).

I threw together a quick hack implementation which does not do re-hashing and just relies on the sets being big enough:

class FixedSet
{
 // min 36500, sweetspot 73000
 static const int _size = 73000;
 int _noentry;
 std::vector<int> _set1;
 std::vector<int> _set2;
 std::vector<int> _set3;
public:
 FixedSet() : _noentry(std::numeric_limits<int>::min()), _set1(_size, _noentry), _set2(_size, _noentry), _set3(_size, _noentry) { }
 void Initialize(const vector<int>& numbers)
 { 
 for (int i = 0; i < numbers.size(); ++i)
 {
 if (!add(numbers[i]))
 {
 std::ostringstream o;
 o << "Failed to insert after " << i << " elements, rehashing not implemented";
 throw new std::exception(o.str().c_str());
 }
 }
 }
 bool Contains(int number)
 {
 for (int round = 0; round < 3; ++round)
 {
 std::vector<int>& _set = (round % 3 == 0) ? _set1 : ((round % 3 == 1) ? _set2 : _set3);
 int h = hash(round + 1, number);
 if (number == _set[h])
 return true;
 }
 return false;
 }
private:
 int hash(int rounds, int number)
 {
 int withOffset = number;
 srand(withOffset);
 int h = bigRand() % _size;
 while (--rounds)
 {
 h = bigRand() % _size;
 }
 return h;
 }
 inline int bigRand()
 {
 return (rand() << 15) | rand(); // RAND_MAX is 0x7FFF in VS2010
 }
 bool add(int number)
 {
 int toInsert = number;
 for (int i = 0; i < _size; ++i)
 {
 int h = hash(i % 3 + 1, toInsert);
 std::vector<int>& _set = (i % 3 == 0) ? _set1 : ((i % 3 == 1) ? _set2 : _set3);
 int current = _set[h];
 if (current == _noentry)
 {
 _set[h] = toInsert;
 return true;
 }
 _set[h] = toInsert;
 toInsert = current;
 }
 return false;
 }
};

Notes:

I'm using 3 hash functions instead of the classic 2. Can't get the load factor high enough with just 2 functions for some reason.
The above is tweaked for no more than 100,000 input elements
36500 seems to be the minimum set size which yields a 91% load factor of the sets which is pretty much the limit for Cuckoo with 3 hash functions according to wikipedia.
With 36500 set size it takes approx. 74ms on my machine to insert 100,000 elements.
Doubling to 73000 yields a 75% reduction in run time to 17ms.
Further increases in the set size do not seem to have much effect.

Unfortunatly, this algorithm fails with "Memory Limit" error in our check system on the first test. And my variant failed on 84 test with "Time Limit" error. So I need to find the way to speed up my algorithm, or reduce space complexity in your algorithm.
The bitset idea would not be O(n) memory. Rather, the memory needed depends on the range of values.
@200_success: Yes but the range is defined fixed as part of the problem description and therefor it is considered constant. As general solution you are right.

Stack Exchange Network

Perfect Hashing Implementation

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Perfect Hashing Implementation

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions