Hashtable Implementation - C

Question 1

HashTable implementation

The Specifics:

It uses a simple hash function, it recieves a string and adds the ASCII values of each caracter, this can cause problems if the string is big, but for this implementation the string passed to the hash function will have no more then 5 caracters.
It has a load factor of 0.5, if reached i it will duplicate the hashtable size to the nearest prime number. To reduce collisions the hashtable sizes will only be prime numbers.
The hashtable is made of an array of "buckets" where each "bucket" is a struct storing the user information, in this case, name and nick.
The hashtable keeps track of its size and number of elements.
The hashtable uses linear probing to deal with collisions.

Implemented Methods

Create
Insert
Delete a specific item
Contains
get_Item
Display

This is my first attemp at making a hashtable from scratch, here is the code:

#define initial_size 5
typedef struct user{
 char nick[6];
 char name[26];
 bool occupied;
}user;
typedef struct hashtable{
 int size;
 int elements;
 user **buckets;
}hashtable;
hashtable * create() {
 hashtable *htable = (hashtable*)malloc(sizeof(hashtable));
 htable->size = initial_size;
 htable->elements = 0;
 htable->buckets = malloc(initial_size * sizeof(htable->buckets));
 for(int i=0; i <initial_size; i++){
 htable->buckets[i] = malloc(sizeof(user));
 htable->buckets[i]->occupied = false;
 }
 return htable;
}
int hash(char *string) {
 int hashVal = 0;
 for( int i = 0; i < strlen(string);i++){
 hashVal += (int)string[i];
 }
 return hashVal;
}
bool load_factor(hashtable *ht){
 if(ht->elements >= ht->size/2){
 return true;
 }
 return false;
}
bool isPrime(int num){
 if(num==2)
 return true;
 if(num % 2==0)
 return false;
 for(int i=3; i*i<num; i+=2){
 if(num%i==0)
 return false;
 }
 return true;
}
int Prime_num(int old_size){
 for(int i = old_size; i < old_size * 2 +10; i++){
 if(isPrime(i) && i >= 2*old_size){
 return i;
 }
 }
}
void insert(hashtable *HashTable, char *name, char *nick){
 int hash_value = hash(nick);
 int new_position = hash_value % HashTable->size;
 if (new_position < 0) new_position += HashTable->size;
 int position = new_position;
 while (HashTable->buckets[position]->occupied && position != new_position - 1) {
 position++;
 position %= HashTable->size;
 }
 strcpy(HashTable->buckets[position]->name, name);
 strcpy(HashTable->buckets[position]->nick, nick);
 HashTable->buckets[position]->occupied = true;
 HashTable->elements++;
}
hashtable *resize_HashTable(hashtable *HashTable){
 user *tmp_array = malloc(sizeof(user) * HashTable->size);
 int tmp_array_pos = 0;
 for(int i = 0; i < HashTable->size; i++){
 if(HashTable->buckets[i]->occupied){
 strcpy(tmp_array[tmp_array_pos].nick, HashTable->buckets[i]->nick);
 strcpy(tmp_array[tmp_array_pos].name, HashTable->buckets[i]->name);
 tmp_array_pos++;
 }
 }
 int new_size = Prime_num(HashTable->size);
 HashTable = realloc(HashTable, new_size * sizeof(user));
 for(int i = 0; i < new_size; i++){
 HashTable->buckets[i] = malloc(sizeof(user));
 HashTable->buckets[i]->occupied = false;
 }
 HashTable->size = new_size;
 HashTable->elements = 0;
 for(int i = 0; i < tmp_array_pos; i++){
 insert(HashTable, tmp_array[i].name, tmp_array[i].nick);
 }
 return HashTable;
}
int contains(hashtable *HashTable, char *nick){
 int hash_value = hash(nick);
 int position = (hash_value % HashTable->size);
 int fixed_p = position;
 while(position < HashTable->size){
 if(strcmp(HashTable->buckets[position]->nick, nick) == 0)
 return position;
 else
 position++;
 if(position == fixed_p)
 return -1;
 if(position == HashTable->size)
 position = 0;
 }
}
user *get_item(hashtable *HashTable, char *nick){
 user *item = malloc(sizeof(user));
 int position = contains(HashTable, nick);
 if(position != -1){
 item = HashTable->buckets[position];
 }
 return item;
}
void delete(hashtable *HashTable, char *nick){
 int position = contains(HashTable, nick);
 if(position != -1){
 HashTable->buckets[position]->occupied = false;
 }
}
void display(hashtable *HashTable){
 for(int i = 0; i<HashTable->size; i++){
 printf("i: %d ", i);
 if(HashTable->buckets[i]->occupied)
 printf("nick: %s, name: %s\n",HashTable->buckets[i]->nick, HashTable->buckets[i]->name);
 else
 printf("nick: -, name: -\n");
 }
}

Note: I posted this some hours ago, but then i noticed the implementation had some big flaws in the resize method so i deleted the post. The problem should be fixed now.

Lastest Edit: Since my implementation got "popular" and the presented one has flaws, i will share a link to my github containning the updated version

Updated code version: https://github.com/MiguelDordio/Data-Structures-Implementations/blob/master/hashtable.c

Question 2

"this can cause problems if the string is big, but for this implementation the string passed to the hash function will have no more then 5 caracters." hash("ac") == hash("ca") == hash("bb") and many more.

Question 3

@tkausl i know the hash function isnt good, but given the way the hashtable was implemented it can be replaced fairly easily and that is the objetive

Question 4

I spotted some bugs:

The functions malloc and realloc might fail if there is not enough memory, and in this case they would return NULL, and since the code does not check the result, this would lead to undefined behaviour. It's necessary to check the results of these functions.

Update: a comment was skeptical about this recommendation, noting that on some operating systems, trying to write through a null pointer will result in a segmentation fault, and for some programs this is acceptable behaviour. This might indeed be fine in some cases, but when writing a reusable component like a hash table library, it is important to do better than that. The problems are: (i) Writing through a null pointer does not result in a segmentation fault in all situations, for example the code might be running on an embedded controller using an operating system that does not protect memory at address zero, or on hardware that lacks memory protection at all. The world does not consist only of Windows and Linux. (ii) In the presence of an optimizing compiler, there is no guarantee that you will merely get a segmentation fault. Writing through a null pointer is undefined behaviour, and might lead to data corruption or a security vulnerability instead of (or as well as) a crash. See this article by Harry Lee for some examples. (iii) In many cases it's not acceptable to crash, for example in safety-critical software. There are techniques for coping with memory exhaustion such as refusing requests until memory pressure abates, or garbage collection.
If the hash table is full, and you call insert with a key that hashes to zero (modulo the size of the table) then the while loop will never terminate and the program will hang.
If the hash table is full, and you call insert with a key that hashes to something other than zero (modulo the size of the table) then the new key will overwrite an old key without any kind of warning or error.
In Prime_num, if old_size is 57, then the function does not find a suitable new size, and so control runs off the end of the function and nothing is returned. This is because the next prime number after 2*old_size (114) is 127, which is more than old_size * 2 + 10 (124). (If you compiled with warnings turned on then your compiler would have complained about this.)
In get_item, there is no way for the caller to tell if the key was found or not. If it was found, then a bucket structure from the hash table is returned. But if it was not found, then a newly allocated bucket structure is returned. There is no way for the caller to tell these apart—the newly allocated bucket structure could contain any data whatsoever.
In get_item, if the key is found, then a bucket structure is allocated and neither freed nor returned to the caller. This is a memory leak.
In resize_HashTable, the tmp_array is allocated but never freed. This is a memory leak.
In resize_HashTable, the original bucket structures are not freed. This is a memory leak.
In contains, there is no check against the occupied flag. This means that if a key is not present in the table then it will be compared against all the keys in the table, including keys that were deleted or never initialized.
The algorithm in delete is incorrect: if the deleted key is in the middle of a chain of keys with identical hashes, then deleting it leaves all the following keys in the chain unfindable.

This is a common mistake! In The Art of Computer Programming, Knuth comments, "Many computer programmers have great faith in algorithms, and they are surprised to find that the obvious way to delete records from a hash table doesn't work." (Vol. III, p. 533.)

If you want to get this right, Knuth gives an algorithm (pp. 533–4) for deletion in a open hash table with linear probing.

But a simpler approach is to mark the deleted key with a flag meaning "there was a key here but it was deleted" which you later remove when growing or shrinking the table.

Question 5

thank you for spotting this bugs and your detailed explanation on them, i will work on solving them.

Question 6

Wrt #1: If malloc fails and returns NULL, any modern OS will generate a segfault, and the program will fail fast. In the general case, I don’t see what this program could be expected to do better than failing fast in an OOM scenario. What would you have the program do, OP?

Question 7

@gntskn well it was made to use in a project of my data-structures class, it should handle about 1000000 users

Question 8

Note also: there's no need to cast the result of malloc().

Question 9

Thanks for updating the answer; I think it’s a lot more instructive for the complexity of the issue and the possible solutions.

Question 10

for(int i=3; i*i<num; i+=2){ may overflow.

for(int i=3; i<num/i; i+=2){ will not.

The cost of the division is negligible. A good compiler will compute num%i and num/i together.

Either way, it is off by one. Use <=.

for(int i=3; i <= num/i; i+=2){

The Prime_num() is unnecessarily slow. Start at old_size*2+1. The +10 is strange.

int Prime_num(int old_size){
 // for(int i = old_size; i < old_size * 2 +10; i++){
 for(int i = old_size*2+1; i < INT_MAX; i++){
 // if(isPrime(i) && i >= 2*old_size){
 if(isPrime(i)) {
 return i;
 }
 }
 // Somehow handle "too big"
 assert(0);
 // Should not return "nothing".
}

Question 11

thank you for your contribution, in fact this function had problems in which I should have devoted more attention.

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2018-05-20 13:45:13Z

I spotted some bugs:

The functions malloc and realloc might fail if there is not enough memory, and in this case they would return NULL, and since the code does not check the result, this would lead to undefined behaviour. It's necessary to check the results of these functions.

Update: a comment was skeptical about this recommendation, noting that on some operating systems, trying to write through a null pointer will result in a segmentation fault, and for some programs this is acceptable behaviour. This might indeed be fine in some cases, but when writing a reusable component like a hash table library, it is important to do better than that. The problems are: (i) Writing through a null pointer does not result in a segmentation fault in all situations, for example the code might be running on an embedded controller using an operating system that does not protect memory at address zero, or on hardware that lacks memory protection at all. The world does not consist only of Windows and Linux. (ii) In the presence of an optimizing compiler, there is no guarantee that you will merely get a segmentation fault. Writing through a null pointer is undefined behaviour, and might lead to data corruption or a security vulnerability instead of (or as well as) a crash. See this article by Harry Lee for some examples. (iii) In many cases it's not acceptable to crash, for example in safety-critical software. There are techniques for coping with memory exhaustion such as refusing requests until memory pressure abates, or garbage collection.
If the hash table is full, and you call insert with a key that hashes to zero (modulo the size of the table) then the while loop will never terminate and the program will hang.
If the hash table is full, and you call insert with a key that hashes to something other than zero (modulo the size of the table) then the new key will overwrite an old key without any kind of warning or error.
In Prime_num, if old_size is 57, then the function does not find a suitable new size, and so control runs off the end of the function and nothing is returned. This is because the next prime number after 2*old_size (114) is 127, which is more than old_size * 2 + 10 (124). (If you compiled with warnings turned on then your compiler would have complained about this.)
In get_item, there is no way for the caller to tell if the key was found or not. If it was found, then a bucket structure from the hash table is returned. But if it was not found, then a newly allocated bucket structure is returned. There is no way for the caller to tell these apart—the newly allocated bucket structure could contain any data whatsoever.
In get_item, if the key is found, then a bucket structure is allocated and neither freed nor returned to the caller. This is a memory leak.
In resize_HashTable, the tmp_array is allocated but never freed. This is a memory leak.
In resize_HashTable, the original bucket structures are not freed. This is a memory leak.
In contains, there is no check against the occupied flag. This means that if a key is not present in the table then it will be compared against all the keys in the table, including keys that were deleted or never initialized.
The algorithm in delete is incorrect: if the deleted key is in the middle of a chain of keys with identical hashes, then deleting it leaves all the following keys in the chain unfindable.

This is a common mistake! In The Art of Computer Programming, Knuth comments, "Many computer programmers have great faith in algorithms, and they are surprised to find that the obvious way to delete records from a hash table doesn't work." (Vol. III, p. 533.)

If you want to get this right, Knuth gives an algorithm (pp. 533–4) for deletion in a open hash table with linear probing.

But a simpler approach is to mark the deleted key with a flag meaning "there was a key here but it was deleted" which you later remove when growing or shrinking the table.

thank you for spotting this bugs and your detailed explanation on them, i will work on solving them.
Wrt #1: If malloc fails and returns NULL, any modern OS will generate a segfault, and the program will fail fast. In the general case, I don’t see what this program could be expected to do better than failing fast in an OOM scenario. What would you have the program do, OP?
@gntskn well it was made to use in a project of my data-structures class, it should handle about 1000000 users
Note also: there's no need to cast the result of malloc().
Thanks for updating the answer; I think it’s a lot more instructive for the complexity of the issue and the possible solutions.

Stack Exchange Network

Hashtable Implementation - C

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Hashtable Implementation - C

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions