HashTable implementation
The Specifics:
It uses a simple hash function, it recieves a string and adds the ASCII values of each caracter, this can cause problems if the string is big, but for this implementation the string passed to the hash function will have no more then 5 caracters.
It has a load factor of 0.5, if reached i it will duplicate the hashtable size to the nearest prime number. To reduce collisions the hashtable sizes will only be prime numbers.
The hashtable is made of an array of "buckets" where each "bucket" is a struct storing the user information, in this case, name and nick.
The hashtable keeps track of its size and number of elements.
The hashtable uses linear probing to deal with collisions.
Implemented Methods
- Create
- Insert
- Delete a specific item
- Contains
- get_Item
- Display
This is my first attemp at making a hashtable from scratch, here is the code:
#define initial_size 5
typedef struct user{
char nick[6];
char name[26];
bool occupied;
}user;
typedef struct hashtable{
int size;
int elements;
user **buckets;
}hashtable;
hashtable * create() {
hashtable *htable = (hashtable*)malloc(sizeof(hashtable));
htable->size = initial_size;
htable->elements = 0;
htable->buckets = malloc(initial_size * sizeof(htable->buckets));
for(int i=0; i <initial_size; i++){
htable->buckets[i] = malloc(sizeof(user));
htable->buckets[i]->occupied = false;
}
return htable;
}
int hash(char *string) {
int hashVal = 0;
for( int i = 0; i < strlen(string);i++){
hashVal += (int)string[i];
}
return hashVal;
}
bool load_factor(hashtable *ht){
if(ht->elements >= ht->size/2){
return true;
}
return false;
}
bool isPrime(int num){
if(num==2)
return true;
if(num % 2==0)
return false;
for(int i=3; i*i<num; i+=2){
if(num%i==0)
return false;
}
return true;
}
int Prime_num(int old_size){
for(int i = old_size; i < old_size * 2 +10; i++){
if(isPrime(i) && i >= 2*old_size){
return i;
}
}
}
void insert(hashtable *HashTable, char *name, char *nick){
int hash_value = hash(nick);
int new_position = hash_value % HashTable->size;
if (new_position < 0) new_position += HashTable->size;
int position = new_position;
while (HashTable->buckets[position]->occupied && position != new_position - 1) {
position++;
position %= HashTable->size;
}
strcpy(HashTable->buckets[position]->name, name);
strcpy(HashTable->buckets[position]->nick, nick);
HashTable->buckets[position]->occupied = true;
HashTable->elements++;
}
hashtable *resize_HashTable(hashtable *HashTable){
user *tmp_array = malloc(sizeof(user) * HashTable->size);
int tmp_array_pos = 0;
for(int i = 0; i < HashTable->size; i++){
if(HashTable->buckets[i]->occupied){
strcpy(tmp_array[tmp_array_pos].nick, HashTable->buckets[i]->nick);
strcpy(tmp_array[tmp_array_pos].name, HashTable->buckets[i]->name);
tmp_array_pos++;
}
}
int new_size = Prime_num(HashTable->size);
HashTable = realloc(HashTable, new_size * sizeof(user));
for(int i = 0; i < new_size; i++){
HashTable->buckets[i] = malloc(sizeof(user));
HashTable->buckets[i]->occupied = false;
}
HashTable->size = new_size;
HashTable->elements = 0;
for(int i = 0; i < tmp_array_pos; i++){
insert(HashTable, tmp_array[i].name, tmp_array[i].nick);
}
return HashTable;
}
int contains(hashtable *HashTable, char *nick){
int hash_value = hash(nick);
int position = (hash_value % HashTable->size);
int fixed_p = position;
while(position < HashTable->size){
if(strcmp(HashTable->buckets[position]->nick, nick) == 0)
return position;
else
position++;
if(position == fixed_p)
return -1;
if(position == HashTable->size)
position = 0;
}
}
user *get_item(hashtable *HashTable, char *nick){
user *item = malloc(sizeof(user));
int position = contains(HashTable, nick);
if(position != -1){
item = HashTable->buckets[position];
}
return item;
}
void delete(hashtable *HashTable, char *nick){
int position = contains(HashTable, nick);
if(position != -1){
HashTable->buckets[position]->occupied = false;
}
}
void display(hashtable *HashTable){
for(int i = 0; i<HashTable->size; i++){
printf("i: %d ", i);
if(HashTable->buckets[i]->occupied)
printf("nick: %s, name: %s\n",HashTable->buckets[i]->nick, HashTable->buckets[i]->name);
else
printf("nick: -, name: -\n");
}
}
Note: I posted this some hours ago, but then i noticed the implementation had some big flaws in the resize method so i deleted the post. The problem should be fixed now.
Lastest Edit: Since my implementation got "popular" and the presented one has flaws, i will share a link to my github containning the updated version
Updated code version: https://github.com/MiguelDordio/Data-Structures-Implementations/blob/master/hashtable.c
2 Answers 2
I spotted some bugs:
The functions
malloc
andrealloc
might fail if there is not enough memory, and in this case they would returnNULL
, and since the code does not check the result, this would lead to undefined behaviour. It's necessary to check the results of these functions.Update: a comment was skeptical about this recommendation, noting that on some operating systems, trying to write through a null pointer will result in a segmentation fault, and for some programs this is acceptable behaviour. This might indeed be fine in some cases, but when writing a reusable component like a hash table library, it is important to do better than that. The problems are: (i) Writing through a null pointer does not result in a segmentation fault in all situations, for example the code might be running on an embedded controller using an operating system that does not protect memory at address zero, or on hardware that lacks memory protection at all. The world does not consist only of Windows and Linux. (ii) In the presence of an optimizing compiler, there is no guarantee that you will merely get a segmentation fault. Writing through a null pointer is undefined behaviour, and might lead to data corruption or a security vulnerability instead of (or as well as) a crash. See this article by Harry Lee for some examples. (iii) In many cases it's not acceptable to crash, for example in safety-critical software. There are techniques for coping with memory exhaustion such as refusing requests until memory pressure abates, or garbage collection.
If the hash table is full, and you call
insert
with a key that hashes to zero (modulo the size of the table) then thewhile
loop will never terminate and the program will hang.If the hash table is full, and you call
insert
with a key that hashes to something other than zero (modulo the size of the table) then the new key will overwrite an old key without any kind of warning or error.In
Prime_num
, ifold_size
is 57, then the function does not find a suitable new size, and so control runs off the end of the function and nothing is returned. This is because the next prime number after2*old_size
(114) is 127, which is more thanold_size * 2 + 10
(124). (If you compiled with warnings turned on then your compiler would have complained about this.)In
get_item
, there is no way for the caller to tell if the key was found or not. If it was found, then a bucket structure from the hash table is returned. But if it was not found, then a newly allocated bucket structure is returned. There is no way for the caller to tell these apart—the newly allocated bucket structure could contain any data whatsoever.In
get_item
, if the key is found, then a bucket structure is allocated and neither freed nor returned to the caller. This is a memory leak.In
resize_HashTable
, thetmp_array
is allocated but never freed. This is a memory leak.In
resize_HashTable
, the original bucket structures are not freed. This is a memory leak.In
contains
, there is no check against theoccupied
flag. This means that if a key is not present in the table then it will be compared against all the keys in the table, including keys that were deleted or never initialized.The algorithm in
delete
is incorrect: if the deleted key is in the middle of a chain of keys with identical hashes, then deleting it leaves all the following keys in the chain unfindable.This is a common mistake! In The Art of Computer Programming, Knuth comments, "Many computer programmers have great faith in algorithms, and they are surprised to find that the obvious way to delete records from a hash table doesn't work." (Vol. III, p. 533.)
If you want to get this right, Knuth gives an algorithm (pp. 533–4) for deletion in a open hash table with linear probing.
But a simpler approach is to mark the deleted key with a flag meaning "there was a key here but it was deleted" which you later remove when growing or shrinking the table.
-
1\$\begingroup\$ thank you for spotting this bugs and your detailed explanation on them, i will work on solving them. \$\endgroup\$MiguelD– MiguelD2018年05月20日 13:54:56 +00:00Commented May 20, 2018 at 13:54
-
\$\begingroup\$ Wrt #1: If
malloc
fails and returnsNULL
, any modern OS will generate a segfault, and the program will fail fast. In the general case, I don’t see what this program could be expected to do better than failing fast in an OOM scenario. What would you have the program do, OP? \$\endgroup\$gntskn– gntskn2018年05月20日 18:11:32 +00:00Commented May 20, 2018 at 18:11 -
1\$\begingroup\$ @gntskn well it was made to use in a project of my data-structures class, it should handle about 1000000 users \$\endgroup\$MiguelD– MiguelD2018年05月20日 18:44:55 +00:00Commented May 20, 2018 at 18:44
-
\$\begingroup\$ Note also: there's no need to cast the result of
malloc()
. \$\endgroup\$detly– detly2018年05月20日 19:41:44 +00:00Commented May 20, 2018 at 19:41 -
1\$\begingroup\$ Thanks for updating the answer; I think it’s a lot more instructive for the complexity of the issue and the possible solutions. \$\endgroup\$gntskn– gntskn2018年05月21日 00:13:07 +00:00Commented May 21, 2018 at 0:13
for(int i=3; i*i<num; i+=2){
may overflow.
for(int i=3; i<num/i; i+=2){
will not.
The cost of the division is negligible. A good compiler will compute num%i
and num/i
together.
Either way, it is off by one. Use <=
.
for(int i=3; i <= num/i; i+=2){
The Prime_num()
is unnecessarily slow. Start at old_size*2+1
. The +10
is strange.
int Prime_num(int old_size){
// for(int i = old_size; i < old_size * 2 +10; i++){
for(int i = old_size*2+1; i < INT_MAX; i++){
// if(isPrime(i) && i >= 2*old_size){
if(isPrime(i)) {
return i;
}
}
// Somehow handle "too big"
assert(0);
// Should not return "nothing".
}
-
\$\begingroup\$ thank you for your contribution, in fact this function had problems in which I should have devoted more attention. \$\endgroup\$MiguelD– MiguelD2018年05月21日 10:04:10 +00:00Commented May 21, 2018 at 10:04
hash("ac") == hash("ca") == hash("bb")
and many more. \$\endgroup\$