I'm building a hash table for this problem:
Implement a class for storing a set of integers:
class FixedSet { public: FixedSet(); void Initialize(const vector<int>& numbers); bool Contains(int number) const; };
After the call of the
Initialize()
function,FixedSet
contains integers from the input vector. The set of numbers stored byFixedSet
does not change (until theInitialize()
function is called again).The
Contains()
function returns true if number is in the set, or false otherwise. TheInitialize()
function should run in expected O(n), where n is the number of elements in numbers. The memory consumption should not exceed O(n). TheContains()
function should run in guaranteed O(1).Using this class, solve the following problem. The input has a set of different numbers followed by a set of requests, each of which is represented by an integer number. For each request, you need to determine whether or not the number is in the set.
The sizw of set is not bigger than 100000 elements; the absolute values of elements |val| <= 109.
I've implemented perfect hashing from Cornmen:
#include <time.h>
#include <algorithm>
#include <iostream>
#include <vector>
#include <list>
#include <stdio.h>
using std::cout;
using std::cin;
using std::endl;
using std::vector;
using std::list;
typedef long long int long_int;
const int max_int = 1000000001; // value, that could't be in the table. Analog of NULL.
// function for calculation of hash
inline int hash(long_int a_prime, long_int b_prime, int p_prime, int table_size, int key)
{
return (((a_prime * key + b_prime) % p_prime) % table_size);
}
// class for mini-hash table in cells of main hash-table
class Bucket
{
vector<int> _cells;
int size; // the size of mini-table should be greater then 4
long_int hash_a;
long_int hash_b;
int prime;
public:
Bucket() {}
void Initialize()
{
prime = 17;
hash_a = std::rand() % prime;
hash_b = 1 + std::rand() % (prime - 1);
}
// construct hash table from list of elements
void Construct(list<int>& input)
{
if (input.empty())
{
size = 0;
return;
}
size = input.size() * input.size();
bool flag = true;
// while there is no collisions in table
while (flag)
{
_cells.assign(size, max_int);
Initialize();
list<int>::iterator elem = input.begin();
while (elem != input.end() && flag)
{
int hashKey = hash(hash_a, hash_b, prime, size, *elem);
if (hashKey < 0)
hashKey = - hashKey;
// if collision then construct hash table from the begining!
if (_cells[hashKey] != max_int)
{
flag = false;
break;
}
_cells[hashKey] = *elem;
++elem;
}
if (!flag)
flag = true;
else
flag = false;
}
}
bool Contains(int elem)
{
if (size == 0)
return false;
int hashKey = hash(hash_a, hash_b, prime, size, elem);
if (hashKey < 0)
hashKey = - hashKey;
if (_cells[hashKey] == elem)
return true;
return false;
}
};
// class for main hash table
class FixedSet
{
int _tableSize;
long_int _hashFuncA;
long_int _hashFuncB;
int _primeNumber;
vector<list<int> > _elementsInCells;
vector<Bucket> _buckets;
public:
FixedSet()
{
_primeNumber = 100013; // the maximum prime number
_hashFuncA = std::rand() % _primeNumber;
_hashFuncB = 1 + std::rand() % (_primeNumber - 1);
}
void setTableSize(int size)
{
_tableSize = size;
_buckets.resize(size);
}
void Initialize(const vector<int>& numbers)
{
_tableSize = numbers.size();
_buckets.resize(numbers.size());
_elementsInCells.resize(numbers.size());
for (int i = 0; i < numbers.size(); ++i)
{
int hashKey = hash(_hashFuncA, _hashFuncB, _primeNumber, _tableSize, numbers[i]);
if (hashKey < 0)
hashKey = - hashKey;
_elementsInCells[hashKey].push_back(numbers[i]);
}
for (int i = 0; i < numbers.size(); ++i)
{
_buckets[i].Construct(_elementsInCells[i]);
}
}
bool Contains(int number)
{
int hashKey = hash(_hashFuncA, _hashFuncB, _primeNumber, _tableSize, number);
if (hashKey < 0)
hashKey = - hashKey;
return _buckets[hashKey].Contains(number);
}
};
int main(int argc, char* argv[])
{
clock_t begin, end;
double time_spent;
std::srand (time(NULL));
int numberOfElements;
scanf("%i", &numberOfElements);
FixedSet fs;
begin = clock();
vector<int> inputVector;
fs.setTableSize(numberOfElements);
for (int i = 0; i < numberOfElements; ++i)
{
int elemValue;
scanf("%d", &elemValue);
inputVector.push_back(elemValue);
}
fs.Initialize(inputVector);
end = clock();
int numberOfElementsForSearch;
scanf("%i", &numberOfElementsForSearch);
for (int i = 0; i < numberOfElementsForSearch; ++i)
{
int elem;
scanf("%d", &elem);
if (fs.Contains(elem))
{
cout << "Yes" << endl;
}
else
{
cout << "No" << endl;
}
}
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
cout << time_spent << endl;
return 0;
}
It works rather fast, but on a vector of 100000 elements, the function Initialize()
works for 0,7 ms in Release. But on a vector of 50000, this function works for 1.8 ms.
Could you explain why this is so? How can I improve my code to make it work faster?
-
\$\begingroup\$ Ann, I don't understand why the code of Bucket is more or less repeated in FixedSet. I guess I am being stupid, but can you please explain. \$\endgroup\$William Morris– William Morris2013年11月13日 20:39:12 +00:00Commented Nov 13, 2013 at 20:39
-
\$\begingroup\$ These 2 classes have the same interface, but implementations are different. So I've decided to create 2 classes, for hash-table of first level and hash-table of second level. May be it would be better to make abstact class HashTable with virtual methods Initialize() and Contains() and inherit my classes from it. \$\endgroup\$Ann Orlova– Ann Orlova2013年11月14日 08:28:36 +00:00Commented Nov 14, 2013 at 8:28
-
\$\begingroup\$ But why do you need a second level at all? \$\endgroup\$William Morris– William Morris2013年11月14日 16:45:23 +00:00Commented Nov 14, 2013 at 16:45
-
\$\begingroup\$ This is the idea of perfect hashing - to use hash table of second level for elements that have the same hash value (in average, if I use good hash function it won't be greater than 2 elements with the same hash). In this way I can check if an element in the table in O(1) time. But if I use linked list for collisions in the cells it won't be O(1). \$\endgroup\$Ann Orlova– Ann Orlova2013年11月15日 05:16:31 +00:00Commented Nov 15, 2013 at 5:16
2 Answers 2
I tried running your implementation on my machine but with the following input (input is 25 random integers within the interval [0; 10^9]
) Initialize
never completes (it's stuck in the bucket while (flag)
loop):
25
882868245
264589055
955665379
570725902
186426836
425509062
780811177
528755197
921593609
210302061
162860187
237314629
771563954
716724339
500613765
749586096
118952462
708453275
530816792
697958285
841037949
796725013
123270367
470484394
578476359
1
252678354
Note: Never in this context means I killed it after 10 or 15 seconds.
So it's not surprising that a list with 50,000 random elements can easily result in a higher setup time than 100,000 depending on the distribution of the values (the algorithm attempts to re-create the table every time there is a collision). As per given example it's easy to generate inputs which will make it run for a long time for very few values.
Given the problem restrictions I'd say you can get away with the fastest perfect hash function there is: Identity.
You could simply use a bitset with the values as indices and flip the bits on if the value is present. This requires a bitset with 10^9 bits - approx. 125MB of memory which is not all that much these days (at least on desktops).
Resulting implementation:
class FixedSet
{
vector<bool> _set;
public:
FixedSet() : _set(1000000000) { }
void Initialize(const vector<int>& numbers)
{
for (int i = 0; i < numbers.size(); ++i)
{
_set[numbers[i]] = true;
}
}
bool Contains(int number)
{
return _set[number];
}
};
Initialize
takes approx 45ms on my machine - it doesn't matter if it's 1 or 100,000 input elements. Apparently allocating the memory for the bitset takes the bulk of the time. Flipping the bits on seems to hardly take any time at all.
Searching for 10,000 random elements takes about 3ms
Some notes:
- In case you are wondering:
std::vector<bool>
is a specialization of a vector packing bools as bits. This container is considered somewhat as a failure in the C++ standard but happens to do exactly what we need here. - This implementation of
FixedSet
has space complexityO(1)
(albeit a fairly big constant) - Initializing it has time complexity
O(n)
although the difference between 1 and 100,000 input elements is not measurable with the provided timing method - Looking up an element has time complexity of
O(1)
(indexed container access) - Disadvantages:
- Fairly big initial upfront cost for setup (memory allocation)
- Might not have been in the spirit of the assignment
Update
I think the main problem with your implementation is that the bucket size is fairly small (from testing it looks like you rarely get more than 6 or 7 collisions). Especially in buckets with more than 4 collisions your hash function actually throws away all hashes larger than 17 (which is p_prime
you pass in from bucket) so you reduce your available bucket space. Also my suspicion is that in your hash function a_prime
and b_prime
should actually be, well, prime but rand() % prime
does not yield a prime number.
While reading up on hashing, a popular version seems to be Cuckoo Hashing which has a constant look up time (if you use two hash functions it will perform at most two lookups) and also a constant worst case insert time for an element (even if you have to rehash).
I threw together a quick hack implementation which does not do re-hashing and just relies on the sets being big enough:
class FixedSet
{
// min 36500, sweetspot 73000
static const int _size = 73000;
int _noentry;
std::vector<int> _set1;
std::vector<int> _set2;
std::vector<int> _set3;
public:
FixedSet() : _noentry(std::numeric_limits<int>::min()), _set1(_size, _noentry), _set2(_size, _noentry), _set3(_size, _noentry) { }
void Initialize(const vector<int>& numbers)
{
for (int i = 0; i < numbers.size(); ++i)
{
if (!add(numbers[i]))
{
std::ostringstream o;
o << "Failed to insert after " << i << " elements, rehashing not implemented";
throw new std::exception(o.str().c_str());
}
}
}
bool Contains(int number)
{
for (int round = 0; round < 3; ++round)
{
std::vector<int>& _set = (round % 3 == 0) ? _set1 : ((round % 3 == 1) ? _set2 : _set3);
int h = hash(round + 1, number);
if (number == _set[h])
return true;
}
return false;
}
private:
int hash(int rounds, int number)
{
int withOffset = number;
srand(withOffset);
int h = bigRand() % _size;
while (--rounds)
{
h = bigRand() % _size;
}
return h;
}
inline int bigRand()
{
return (rand() << 15) | rand(); // RAND_MAX is 0x7FFF in VS2010
}
bool add(int number)
{
int toInsert = number;
for (int i = 0; i < _size; ++i)
{
int h = hash(i % 3 + 1, toInsert);
std::vector<int>& _set = (i % 3 == 0) ? _set1 : ((i % 3 == 1) ? _set2 : _set3);
int current = _set[h];
if (current == _noentry)
{
_set[h] = toInsert;
return true;
}
_set[h] = toInsert;
toInsert = current;
}
return false;
}
};
Notes:
- I'm using 3 hash functions instead of the classic 2. Can't get the load factor high enough with just 2 functions for some reason.
- The above is tweaked for no more than 100,000 input elements
36500
seems to be the minimum set size which yields a 91% load factor of the sets which is pretty much the limit for Cuckoo with 3 hash functions according to wikipedia.- With
36500
set size it takes approx. 74ms on my machine to insert 100,000 elements. - Doubling to
73000
yields a 75% reduction in run time to 17ms. - Further increases in the set size do not seem to have much effect.
-
1\$\begingroup\$ Unfortunatly, this algorithm fails with "Memory Limit" error in our check system on the first test. And my variant failed on 84 test with "Time Limit" error. So I need to find the way to speed up my algorithm, or reduce space complexity in your algorithm. \$\endgroup\$Ann Orlova– Ann Orlova2013年11月13日 18:05:25 +00:00Commented Nov 13, 2013 at 18:05
-
\$\begingroup\$ The bitset idea would not be O(n) memory. Rather, the memory needed depends on the range of values. \$\endgroup\$200_success– 200_success2013年11月14日 11:31:10 +00:00Commented Nov 14, 2013 at 11:31
-
2\$\begingroup\$ @200_success: Yes but the range is defined fixed as part of the problem description and therefor it is considered constant. As general solution you are right. \$\endgroup\$ChrisWue– ChrisWue2013年11月14日 19:05:25 +00:00Commented Nov 14, 2013 at 19:05
-
\$\begingroup\$ Haha! Loophole! \$\endgroup\$200_success– 200_success2013年11月14日 19:30:44 +00:00Commented Nov 14, 2013 at 19:30
-
\$\begingroup\$ @AnnOrlova: Updated my answer \$\endgroup\$ChrisWue– ChrisWue2013年11月16日 01:48:54 +00:00Commented Nov 16, 2013 at 1:48
Chris has given a great answer on the hashing so here are some comments on the implementation.
My main observation is that you should avoid duplicating so much code, however you implement it.
Your hash
function should return the absolute (unsigned) value of the hash,
as each use of hash
is followed by an if
:
int hashKey = hash(hash_a, hash_b, prime, size, *elem);
if (hashKey < 0)
hashKey = - hashKey;
And perhaps modify the return type to that expected of vector indices (and
noting that the parameters don't seem to need to be long long
):
typedef std::vector<int>::size_type Hashkey;
static Hashkey hash(int a, int b, int p, int size, int key)
{
return (Hashkey) abs((((a * key + b) % p) % size));
}
Also on types, where you have a defined range for a variable, the types in
stdint.h
can be useful - eg int64_t
is guaranteed to be 64 bits which
suits your input range while your long long int
might be longer (also the
int
is usually omitted, ie. just long long
).
Your use of flag
to continue the outer loop in Bucket::Construct
is very
strange. I'd prefer to see the inner loop extracted to a function which
should use a for
loop:
for (auto e = input.begin(); e != input.end(); ++e) {
...
if (cells[hashKey] != max_int) {
return false;
}
...
}
return true;
and called, as in:
while (createHashTable(...) == false) {
// nothing
}
I don't understand why you (and many other people posting here) use leading
underscores on variable names. For me these are just noise and detract from
otherwise nice looking code (for example, your _cells
). Also on variable
names, some of your names are too long. Where the scope of a variable is
small, its name can (and I think should) also be small.
Finally, your prime numbers (17 and 100013) would be better not embedded in
the code (perhaps use a #define
or const
at the top). Also a comment on
how these numbers were determined would be useful (and the origin of the hash
function?).