quadratic probing hashmap and its time efficiency

Question 1

I am writing small a hashmap library (not full featured yet) but I have some questions about the algorithm I wrote, the formula I hope I got right from wikipedia and general things.
Measured ~14.3 million inserts in ~5.3 seconds with a load factor of 0.75, 2.1 s at 0.5.

From the sparsehash readme:

RESOURCE USAGE

sparse_hash_map has memory overhead of about 4 to 10 bits per hash-map entry, assuming a typical average occupancy of 50%.

dense_hash_map has a factor of 2-3 memory overhead: if your hashtable data takes X bytes, dense_hash_map will use 3X-4X memory total.

Why are they talking about bits here? Don't we work with bytes?

First of all I tested against rockyou.txt as it's known to me and has a lot entries. And that's my output:

Function	Time
hashdb_test_append_rockyou	5.309431 sec (Appending 14.344.392 keys to hashdb)
hashdb_test_get_key	0.000005 sec (Get one key)
hashdb_test_get_1000_keys	0.000221 sec (Get 1000 keys)
hashdb_test_get_rockyou	3.330763 sec (Get all keys)

My buckets only hold a const char *key pointer and a void *value pointer to store data. Is this common practice or do I need to allocate extra memory for the keys to keep a copy?

EDIT:

I added the source code below. The timebot is a header I've written because I wanted to compare functions that do the same operation to keep track of the time, compare them against each other and now I added some normal timing function to it for hashdb testing.

Default capacity is 7 for testing reasons. Any call on hashdb_add calls hashdb_grow which tests if it should grow (also checking member hashdb::constant). Same for hashdb_del and hashdb_shrink.

Default load factor of 0.75 used by hashdb_grow and hashdb_shrink to check if should grow or shrink.

This is my first hash table implmentation. I now use double hashing + quadratic probing.

I've tested if my formula hit every index and I think it does because I don't get any segmentation fault if I calculate the hash in range of the capacity.

Hope this makes things more clear.

misc.h (included by hashdb_internal)

#ifndef _HASHDB_MISC_H
#define _HASHDB_MISC_H
#include <stdio.h>
#include <stdbool.h>
/*! @function
 @param n number to test if prime
 @return true if prime, false if not
*/
bool IsPrime(size_t n);
/*! @function
 @param n number to get next bigger prime
 @return next bigger prime number
*/
size_t GetHigherPrime(size_t n);
/*! @function
 @param n number to get next bigger prime
 @return next lower prime number
*/
size_t GetLowerPrime(size_t n);
/*! @function
 @param str string to hash
 @return fnv hash
*/
size_t fnv_hash_string(const char *str);
/*! @function
 @param data data to hash
 @param size size of data to hash
 @return fnv hash
*/
size_t fnv_hash(unsigned char *data, size_t size);
/*! @function
 @param str string to hash
 @return elf hash
*/
size_t elfhash(const unsigned char *str);
#endif

misc.c

#include "misc.h"
#include <stdbool.h>
#include <limits.h>
#define FNV_OFFSET 14695981039346656037UL
#define FNV_PRIME 1099511628211UL
/* Definition of prime number functions */
/*! @function
 @param n number to test
 @return ture if is prime, false if is not prime
*/
bool IsPrime(size_t n)
{
 if((n & 1) != 0) {
 for(size_t divisor = 3; divisor <= n / divisor; divisor += 2) {
 if((n % divisor) == 0) {
 return 0;
 }
 }
 return 1;
 }
 return (n == 2);
}
/*! @function
 @param n number to get next higher prime from
 @return next higher prime from n
*/
size_t GetHigherPrime(size_t n) {
 for(size_t i = (n | 1); i < LONG_MAX; i += 2) {
 if(IsPrime(i)) {
 return i;
 }
 }
 return n;
}
/*! @function
 @param n number to get next lower prime from
 @return next lower prime from n
*/
size_t GetLowerPrime(size_t n) {
 for(size_t i = (n | 1); i < LONG_MAX; i -= 2) {
 if(IsPrime(i)) {
 return i;
 }
 }
 return n;
}
/* hashing functions */
/*! @function
 @param str null terminated character string to hash
 @return hash in form of size_t
*/
size_t fnv_hash_string(const char *str)
{
 size_t hash;
 int i;
 for(i = 0, hash = FNV_OFFSET; *str != '0円'; str++, i++) {
 hash = hash ^ *str;
 hash = hash * FNV_PRIME;
 }
 return hash;
}
/*! @function
 @param data unsigned byte array
 @param size size of byte array
 @return hash in form of size_t
*/
size_t fnv_hash(unsigned char *data, size_t size)
{
 size_t hash;
 int i;
 for(i = 0, hash = FNV_OFFSET; size != 0; size--, i++) {
 hash = hash ^ *data;
 hash = hash * FNV_PRIME;
 }
 return hash;
}
/*! @function
 @param str null terminated character string to hash
 @return hash in form of size_t
*/
unsigned long elfhash(const unsigned char *str)
{
 unsigned long h = 0, high;
 while (*str)
 {
 h = (h << 4) + *str++;
 if ((high = h & 0xF0000000))
 h ^= high >> 24;
 h &= ~high;
 }
 return h;
}

hashdb_internal.h (included by hashdb)

#ifndef _HASHDB_INTERNAL_H
#define _HASHBD_INTERNAL_H
#include "misc.h"
/* definitions */
#define HASHDB_DEFAULT_CAPACITY 7 /* prime number */
#define HASHDB_DEFAULT_LOADFACTOR 0.90 /* */
/* get index by this ... formular */
#define hash_probe(hash, i, cap) ((hash + (-1) * (i * i + 1) * (i / 2) * (i / 2)) % cap)
/* simple bucket that holds an pointer key as string and pointer to value */
typedef struct bucket {
 const char *key;
 void *value;
} bucket;
/* hashdb holding capacity (number of elements), n (number of elements inside), map (memory location of entries) */
typedef struct hashdb {
 size_t cap;
 size_t n;
 bucket *map;
 bool constant;
} hashdb;
/*! @function
 @param list pointer to bucket list
 @param capacity capacity of bucket list
 @param key key to store
 @param value value to store by key
 @return 0 if succes, -1 if failure 
*/
int hashdb_insert(bucket *list, size_t capacity, const char *key, void *value);
/*! @function
 @param list pointer to list to return from
 @param capacity capacity of list
 @param key key to get data from
 @return location of bucket 
*/
bucket* hashdb_return(bucket *list, size_t capacity, const char *key);
/*! @function
 @param old_list old list with data
 @param old_cap old_list's capacity
 @param new_list new_list to insert to
 @param new_cap new_list's capacity
*/
void hashdb_rehash(bucket *old_list, size_t old_cap, bucket *new_list, size_t new_cap);
/*! @function
 @param db hashdb pointer
 @param new_cap new capacity
 @return 0 on succes, ENOMEM on failure
*/
int hashdb_realloc(hashdb *db, size_t new_cap);
/*! @function
 @param db hashdb pointer
 @return 0 if nothing to allocate or reallocated successfully, ENOMEM on failure
*/
int hashdb_grow(hashdb *db);
/*! @function
 @param db hashdb pointer
 @return 0 if nothing to allocate or reallocated successfully, ENOMEM on failure
*/
int hashdb_shrink(hashdb *db);
/*! @function
 @param str null terminated string to hash
*/
size_t double_hash(const char *str);
#endif

hashdb_internal.c

#include "hashdb_internal.h"
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <errno.h>
/*! @function
 @param list pointer to bucket list
 @param capacity capacity of bucket list
 @param key key to store
 @param value value to store by key
 @return 0 if succes, -1 if failure 
*/
int hashdb_insert(bucket *list, size_t capacity, const char *key, void *value)
{
 size_t i = 0;
 size_t hash = double_hash(key);
 size_t probe = 0;
 /* starting at 1 because 1 and 0 would get the same probe so we can safe that step */
 for(i = 1; i <= capacity; i++) {
 probe = hash_probe(hash, i, capacity);
 if(list[probe].key == NULL)
 {
 list[probe].key = key;
 list[probe].value = value;
 return 0;
 }
 }
 return -1; 
}
/*! @function
 @param list pointer to list to return from
 @param capacity capacity of list
 @param key key to get data from
 @return location of bucket 
*/
bucket* hashdb_return(bucket *list, size_t capacity, const char *key)
{
 size_t i = 0;
 size_t probe = 0;
 size_t hash = double_hash(key);
 for(i = 1; i <= capacity; i++) {
 probe = hash_probe(hash, i, capacity);
 if(list[probe].key != NULL)
 {
 if(strcmp(key, list[probe].key) == 0) {
 return &list[probe];
 }
 }
 }
 
 return NULL;
}
/*! @function
 @param old_list old list with data
 @param old_cap old_list's capacity
 @param new_list new_list to insert to
 @param new_cap new_list's capacity
*/
void hashdb_rehash(bucket *old_list, size_t old_cap, bucket *new_list, size_t new_cap)
{
 for(size_t i = 0; i < old_cap; i++) {
 if(old_list[i].key != NULL) {
 hashdb_insert(new_list, new_cap, old_list[i].key, old_list[i].value);
 }
 }
}
/*! @function
 @param db hashdb pointer
 @param new_cap new capacity
 @return 0 on succes, ENOMEM on failure
*/
int hashdb_realloc(hashdb *db, size_t new_cap)
{
 bucket *new_list = calloc(new_cap, sizeof *db->map);
 if(new_list == NULL) {
 return ENOMEM;
 }
 hashdb_rehash(db->map, db->cap, new_list, new_cap);
 free(db->map);
 db->map = new_list;
 db->cap = new_cap;
 return 0;
}
void hashdb_enable_constant(hashdb *db, bool newval)
{
 db->constant = newval;
}
/*! @function
 @param db hashdb pointer
 @return 0 if nothing to allocate or reallocated successfully, ENOMEM on failure
*/
int hashdb_grow(hashdb *db)
{
 if((db->n != (size_t) (db->cap * HASHDB_DEFAULT_LOADFACTOR)) || db->constant == true) {
 return 0;
 }
 size_t new_cap = GetHigherPrime(db->cap * 2);
 if(hashdb_realloc(db, new_cap) != 0) {
 return ENOMEM;
 }
 return 0;
}
/*! @function
 @param db hashdb pointer
 @return 0 if nothing to allocate or reallocated successfully, ENOMEM on failure
*/
int hashdb_shrink(hashdb *db)
{
 if(db->cap <= HASHDB_DEFAULT_CAPACITY){
 db->cap = HASHDB_DEFAULT_CAPACITY;
 return 0;
 }
 if((db->n != (size_t) (db->cap * (1 - HASHDB_DEFAULT_LOADFACTOR))) || db->constant == true) {
 return 0;
 }
 size_t new_cap = GetLowerPrime(db->cap / 2);
 if(hashdb_realloc(db, new_cap) != 0) {
 return ENOMEM;
 }
 return 0;
}
size_t double_hash(const char *str)
{
 return elfhash((const unsigned char*) str) + fnv_hash_string(str);
}

hashdb.h

#ifndef _HMAP_H
#define _HMAP_H
#include <stdio.h>
#include <stdbool.h>
typedef void hashdb;
hashdb* hashdb_create(void);
void hashdb_destroy(hashdb *db);
void hashdb_enable_constant(hashdb *db, bool newval);
size_t hashdb_cap(hashdb *db);
size_t hashdb_len(hashdb *db);
double hashdb_filled_percent(hashdb *db);
int hashdb_add(hashdb *db, const char *key, const void *data);
void* hashdb_get(hashdb *db, const char *key);
void* hashdb_del(hashdb *db, const char *key);
int hashdb_ensure_capacity(hashdb *db, size_t new_capacity);
/* int hashdb_clear(hashdb *db);
int hashdb_reset(hashdb *db);
int hashdb_hardreset(hashdb *db);
int hashdb_to_json(hashdb *db, const char *path);
int hashdb_from_json(hashdb *db, const char *path);
int hashdb_to_http(hashdb *db, char *buffer, size_t bufsiz);
int hashdb_from_http(hashdb *db, char *buffer, size_t bufsiz);
*/
void hashdb_printall(hashdb *db);
#endif

hashdb.c


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>
#include <ctype.h>
#include <time.h>
#include "hashdb_internal.h"
#define HASHDB_DEFAULT_CAP 7
/*! @function
 @return hashdb pointer
*/
hashdb* hashdb_create()
{
 hashdb *db = calloc(1, sizeof *db);
 if(db == NULL) {
 return NULL;
 }
 db->cap = HASHDB_DEFAULT_CAP;
 db->map = calloc(db->cap, sizeof *db->map);
 if(db->map == NULL) {
 free(db);
 return NULL;
 }
 return db;
}
/*! @function
 @param db hashdb pointer to destroy
*/
void hashdb_destroy(hashdb *db)
{
 free(db->map);
 free(db);
}
/*! @function
 @param db hashdb pointer
 @param key key
 @param value value associated with key
 @return 0 on success, ENOMEM on failure
*/
int hashdb_add(hashdb *db, const char *key, void *value)
{
 if(db->n == db->cap) {
 return -1; /* full, should not happen */
 }
 /* check if it should grow */
 if(hashdb_grow(db) != 0) {
 return ENOMEM;
 }
 if(hashdb_insert(db->map, db->cap, key, value)) {
 return -1; /* internal error */
 }
 db->n++;
 return 0;
}
/*! @function
 @param db hashdb pointer
 @param key key to retrieve date from
 @return data associated with key
*/
void* hashdb_get(hashdb *db, const char *key)
{
 bucket *item = hashdb_return(db->map, db->cap, key);
 if(item == NULL) {
 return item; /* not found */
 }
 return item->value;
}
/*! @function
 @param db hashdb pointer
 @param key key to retrieve date from
 @return data associated with key
*/
void* hashdb_del(hashdb *db, const char *key)
{
 bucket *item = hashdb_return(db->map, db->cap, key);
 if(item == NULL) {
 return item;
 }
 void *value = item->value;
 item->key = NULL;
 item->value = NULL;
 db->n--;
 if(hashdb_shrink(db) != 0) {
 return NULL;
 }
 return value;
}
/*! @function
 @param db hashdb pointer
 @param new_capacity new number of elements
 @return 0 on succes, ENOMEM on failure
*/
int hashdb_ensure_capacity(hashdb *db, size_t new_capacity)
{
 size_t new_cap = GetHigherPrime(new_capacity);
 if(hashdb_realloc(db, new_cap) != 0) {
 return ENOMEM;
 }
 return 0;
}
size_t hashdb_cap(hashdb *db)
{
 return db->cap;
}
size_t hashdb_len(hashdb *db)
{
 return db->n;
}
double hashdb_filled_percent(hashdb *db)
{
 return (double) db->n / db->cap * 100;
}
void hashdb_printall(hashdb *db)
{
 for(size_t i = 0; i < db->cap; i++) {
 printf("%s\n", db->map[i].key);
 }
}

hashdb_test.c

#include "../../include/hashdb.h"
#include "timebot.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ROCKYOU_PATH "../src/test/data/rockyou.txt"
#define ROCKYOU_LINES 14344392 /* number of lines in rockyou.txt */
#define HASHDB_START_GET_FROM 121512 /* start point for hashdb_test_get_1000_keys and hashdb_test_get_key*/
char **test_keys = NULL;
/* initialize keys from rockyou.txt */
int init_test_keys(const char *path)
{
 FILE *fp = fopen(path, "r");
 if(fp == NULL) {
 perror("open file");
 return -1;
 }
 test_keys = calloc(ROCKYOU_LINES, sizeof *test_keys);
 if(test_keys == NULL) {
 perror("init_test_keys");
 return -1;
 }
 char key_buf[128];
 for(size_t i = 0; i < ROCKYOU_LINES; i++) {
 fgets(key_buf, sizeof(key_buf), fp);
 key_buf[strlen(key_buf)-1] = '0円';
 test_keys[i] = strdup(key_buf);
 }
 fclose(fp);
 return 0;
}
void destroy_keys()
{
 for(size_t i = 0; i < ROCKYOU_LINES; i++) {
 free(test_keys[i]);
 }
 free(test_keys);
}
void hashdb_test_fullfill(hashdb *db)
{
 for(size_t i = 0; i < hashdb_cap(db); i++) {
 if(hashdb_add(db, test_keys[i], test_keys[i]) != 0) {
 fprintf(stderr, "error: hashdb_add failed\n");
 abort();
 }
 //printf("adding Key: %s - %s\n", test_keys[i], test_keys[i]);
 }
 char *res = NULL;
 for(size_t i = 0; i < hashdb_cap(db); i++) {
 res = hashdb_get(db, test_keys[i]);
 if(res == NULL) {
 fprintf(stderr, "error: hashdb_get failed\n");
 abort();
 }
 //printf("getting Key: %s - %s\n", test_keys[i], res);
 }
}
void test_not_found_cap_rockyou(hashdb *db) {
 char *res = hashdb_get(db, "definitely not inside");
 if(res != NULL) {
 fprintf(stderr, "definitely not inside - found (failure/must not happen)\n");
 abort();
 }
 //printf("definitely not inside - not found (success) rockyou\n");
}
void test_not_found_default_cap(hashdb *db)
{
 char *res = hashdb_get(db, "definitely not inside");
 if(res != NULL) {
 fprintf(stderr, "definitely not inside - found (failure/must not happen)\n");
 abort();
 }
 //printf("definitely not inside - not found (success) default\n");
}
/* test append 14344392 keys */
void hashdb_test_append_rockyou(hashdb *db)
{
 size_t x = ROCKYOU_LINES-1;
 size_t i;
 for(i = 0; i < ROCKYOU_LINES; i++, x--) {
 if(hashdb_add(db, test_keys[i], test_keys[x]) != 0) {
 fprintf(stderr, "error: hashdb_test_append_rockyou\n");
 abort();
 }
 }
}
/* test get 1000 keys */
void hashdb_test_get_1000_keys(hashdb *db) {
 for(size_t i = HASHDB_START_GET_FROM; i < HASHDB_START_GET_FROM + 1000; i++) {
 const char *res = hashdb_get(db, test_keys[i]);
 if(res == NULL) {
 fprintf(stderr, "error: hashdb_test_get_1000_keys");
 abort();
 }
 }
}
/* test get single keys */
void hashdb_test_get_key(hashdb *db)
{
 const char *res = hashdb_get(db, test_keys[HASHDB_START_GET_FROM]);
 if(res == NULL) {
 fprintf(stderr, "error: hashdb_test_get_key");
 abort();
 }
}
/* test get all keys */
void hashdb_test_get_rockyou(hashdb *db)
{
 for(size_t i = 0; i < ROCKYOU_LINES; i++) {
 const char *res = hashdb_get(db, test_keys[i]);
 if(res == NULL) {
 fprintf(stderr, "error: hashdb_test_get_rockyou");
 abort();
 }
 }
}
void hashdb_test_del_rockyou(hashdb *db)
{
 for(size_t i = 0; i < ROCKYOU_LINES; i++) {
 const char *res = hashdb_del(db, test_keys[i]);
 if(res == NULL) {
 fprintf(stderr, "error: hashdb_test_del_rockyou");
 abort();
 }
 }
}
void hashdb_test_del_default(hashdb *db)
{
 for(size_t i = 0; i < hashdb_cap(db); i++) {
 const char *res = hashdb_del(db, test_keys[i]);
 if(res == NULL) {
 fprintf(stderr, "error: hashdb_test_del_rockyou");
 abort();
 }
 }
}
int main(void)
{
 timebot *bot = timebot_create(20);
 if(init_test_keys(ROCKYOU_PATH) != 0) {
 return EXIT_SUCCESS;
 }
 hashdb *db = hashdb_create();
 /* enable constant so it don't execute the grow function */
 hashdb_enable_constant(db, true);
 timebot_standalone(bot, hashdb_test_fullfill, db, "test if hashdb is filled completely if growing is disabled");
 timebot_standalone(bot, test_not_found_default_cap, db, "test if not found works with default capacity");
 timebot_standalone(bot, hashdb_test_del_default, db, "delete HASHDB_DEFAULT_CAP entries");
 hashdb_enable_constant(db, false);
 /* disable constant so it executes the grow function */
 printf("after creation and adding 7 (HASHDB_DEFAULT_CAPACITY)\n");
 printf("\ncap: %1lu (should be 7)\n", hashdb_cap(db));
 printf("len: %1lu (should be 0)\n", hashdb_len(db));
 printf("filled: %2.1f (should be 0.0)\n", hashdb_filled_percent(db));
 // ensure capacity fits with lines in rockyou.txt
 hashdb_ensure_capacity(db, ROCKYOU_LINES * 2);
 
 timebot_append(bot, hashdb_test_append_rockyou, db, "adding 14.344.392 keys to hashdb");
 timebot_append(bot, hashdb_test_get_key, db, "get one key");
 timebot_append(bot, hashdb_test_get_1000_keys, db, "get 1000 keys");
 timebot_append(bot, hashdb_test_get_rockyou, db, "get all keys");
 timebot_append(bot, test_not_found_cap_rockyou, db, "key is not found with cap of 14.3 million entries");
 
 timebot_run(bot);
 printf("\nafter appending rockyou.txt\n");
 printf("cap: %lu\n", hashdb_cap(db));
 printf("len: %lu\n", hashdb_len(db));
 printf("filled: %f\n", hashdb_filled_percent(db));
 timebot_standalone(bot, hashdb_test_del_rockyou, db, "deleting all rockyou keys");
 hashdb_add(db, "testkey", "testdata");
 printf("\nafter deleting all and adding one\n");
 printf("cap: %lu\n", hashdb_cap(db));
 printf("len: %lu\n", hashdb_len(db));
 printf("filled: %f\n", hashdb_filled_percent(db));
 
 timebot_print_summery(bot);
 hashdb_destroy(db);
 destroy_keys();
 timebot_destroy(bot);
}

timebot.h (used by hashdb_test.c)

#ifndef _TIMEBOT_H
#define _TIMEBOT_H
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <errno.h>
typedef struct excecute_bundle {
 const char *description;
 const char *function_name;
 void (*run)(void*);
 void *arg;
 double runtime;
} excecute_bundle;
typedef struct timebot {
 excecute_bundle *exec_pool;
 int n;
 double capacity;
 double total_time;
 double longest_rtime;
 const char *longest_rtime_fname;
 int longest_function_name_length;
} timebot;
/*! @function
 @param n number of preallocated executions
 @return timebot pointer
*/
static timebot *timebot_create(size_t n)
{
 timebot *bot = (timebot *) calloc(1, sizeof *bot);
 if(bot == NULL) {
 perror("timebot_create"); /* this must not happen */
 abort();
 }
 bot->exec_pool = (excecute_bundle *) calloc(n, sizeof *bot->exec_pool);
 if(bot->exec_pool == NULL) {
 perror("timebot_create execution pool");
 abort();
 }
 return bot;
}
static void timebot_destroy(timebot *tb)
{
 free(tb->exec_pool);
 free(tb);
}
/* runs function with a argument and logs its runtime */
static void _get_runtime(excecute_bundle *bundle)
{
 clock_t start = clock();
 bundle->run(bundle->arg);
 bundle->runtime = (double) (clock() - start) / CLOCKS_PER_SEC;
}
static void _tb_append(timebot *tb, void (*func)(void *), void *arg, const char *name, const char *description)
{
 excecute_bundle *current = &tb->exec_pool[tb->n];
 size_t fname_len = strlen(name);
 if(fname_len > tb->longest_function_name_length) {
 tb->longest_function_name_length = fname_len;
 }
 current->function_name = name;
 current->run = func;
 current->arg = arg;
 current->description = description;
 tb->n++;
}
static void _tb_standalone(timebot *tb, void (*func)(void *), void *arg, const char *name, const char *description)
{
 _tb_append(tb, func, arg, name, description);
 excecute_bundle *current = &tb->exec_pool[tb->n-1];
 _get_runtime(current);
 current->run = NULL;
 current->arg = NULL; 
}
/* append to timebot's execution list */
#define timebot_append(tb, func, arg, desc) _tb_append(tb, func, arg, #func, desc)
#define timebot_standalone(tb, func, arg, desc) _tb_standalone(tb, func, arg, #func, desc)
/* runs all appended functions */
static void timebot_run(timebot *tb) {
 excecute_bundle *current;
 for(size_t i = 0; i < tb->n; i++) {
 current = &tb->exec_pool[i];
 if(current->run != NULL) {
 _get_runtime(current);
 if(current->runtime > tb->longest_rtime) {
 tb->longest_rtime = current->runtime;
 tb->longest_rtime_fname = current->function_name;
 }
 }
 }
}
/* compare functions execution time */
static void timebot_compare(timebot *tb)
{
 excecute_bundle *current;
 printf("\nTimebot Summery:\n%*s\tTime\n", -tb->longest_function_name_length, "Function");
 for(size_t i = 0; i < tb->n; i++) {
 current = &tb->exec_pool[i];
 printf("%*s\t%f sec (faster than %s by %f%%)\n", -tb->longest_function_name_length, current->function_name, current->runtime, tb->longest_rtime_fname, 100 - (current->runtime / tb->longest_rtime * 100));
 }
}
/* prints a summery of all executions */
static void timebot_print_summery(timebot *tb)
{
 excecute_bundle *current;
 printf("\nTimebot Summery:\n %*s | Time\n", -tb->longest_function_name_length, "Function");
 printf(" --------------------------------------------------------------------\n");
 for(size_t i = 0; i < tb->n; i++) {
 current = &tb->exec_pool[i];
 if(current->run == NULL) {
 printf(" %*s | %f sec STANDALONE (%s)\n", -tb->longest_function_name_length, current->function_name, current->runtime, current->description);
 } else {
 printf(" %*s | %f sec (%s)\n", -tb->longest_function_name_length, current->function_name, current->runtime, current->description);
 }
 }
}
#endif

Question 2

What has been your point coding this, benchmarking this? Shouldn't hashdb_get()(map, hashdb_return()) stop probing on NULL? What about deletes?

Question 3

"overhead of about 4 to ...", "why bits?" The point of the sparse library is it can get away with allocating less memory for a "boring" ("sparse") datastructure that mostly has empty values. Also, you probably want to track occupancy, and either signal fatal error or do a reallocation when there's more than ~ 80% occupancy. Otherwise probing is going to take a very, very long time to report that we no longer obey the sparsity assumption and nearly all cells are occupied.

Question 4

@J_H well that makes sense now from that perspective. I use a loadfactor of 0.75 and every call to hashdb_add checks if n entries are 0.75 of capacity and if so then it reallocates and rehashes the map. hashdb_del call checks if it is 1 - 0.75 and then shrinks the map.

Question 5

There is a probability distribution on probing. Sometimes you win after a single probe, or after two probes, and rarely it takes ten probes. But as load factor gets tighter, that whole distribution gets pushed out to the right, so "in expectation" the average number of probes might jump from 2.3 to 12.4 or something similarly ugly. Keeping the load factor sensible prevents probing overhead from getting out of hand.

Question 6

Thank you for the edit, I have deleted my other comment. I am analyzing the code now.

Question 7

General Observations

Interesting question! As the comments show, a lot of room for learning!

Consistency! Some header files are using Doxygen Comments, other headers are not, some header files don't even have comments. Pick a style and stay with it. When you use Doxygen for one function in a file use it for all functions in the file.

Header files with commented out function prototypes (hashdb.h) are not ready for code review. I recommend using a configuration management system such as Git. You can store your files on GitHub. This allows you to delete code and be able to restore it if you want to.

When I was starting out as a software engineer it was common to run lint on the code after it compiled, this found a lot of errors in the code. This isn't really necessary now if you use the proper compiler flags (for gcc -pedantic, -Wall, -Wextra, and -Wconversion for Visual Studio Warning Level3 or Level4). This allowed me to find some problems in the code such as a type mismatch between the prototype declarations of elfhash(const unsigned char *str) and the implementation of elfhash(const unsigned char *str).

misc.h:

size_t elfhash(const unsigned char *str);

misc.c

unsigned long elfhash(const unsigned char *str)

This particular mismatch problem is important because size_t is implementation dependent and may not use a unsigned long on all systems.

Other type mismatches:

timebot.h(71,44): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
hashdb_test.c.c(187,12): warning C4477: 'printf' : format string '%1lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(187,12): message : consider using '%zu' in the format string
hashdb_test.c(188,12): warning C4477: 'printf' : format string '%1lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(188,12): message : consider using '%zu' in the format string
hashdb_test.c(204,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(204,12): message : consider using '%zu' in the format string
hashdb_test.c(205,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(205,12): message : consider using '%zu' in the format string
hashdb_test.c(214,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(214,12): message : consider using '%zu' in the format string
hashdb_test.c(215,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(215,12): message : consider using '%zu' in the format string

In the function main() should this really be the code?

 if(init_test_keys(ROCKYOU_PATH) != 0) {
 return EXIT_SUCCESS;
 }

or should it be

 if(init_test_keys(ROCKYOU_PATH) != 0) {
 return EXIT_FAILURE;
 }

Doxygen Comments

This is a great system for documenting the code, I reccomend that you either use it in the header files or the source files but not both. It is very hard to maintain 2 different sets of comments and keep them in sync.

timebot.h

I recommend putting the actual C source code in timebot.c and only putting the data structure declarations and prototype functions in in timebot.h. While the code does a good job of protecting the functions with static declarations C source code generally isn't put into header files.

Magic Numbers

The code already has a lot of symbolic constants and this is really good, but there is code where there are still raw numbers that need explanation:

misc.c

bool IsPrime(size_t n)
{
 if((n & 1) != 0) {
 for(size_t divisor = 3; divisor <= n / divisor; divisor += 2) {
 if((n % divisor) == 0) {
 return 0;
 }
 }
 return 1;
 }
 return (n == 2);
}
size_t GetHigherPrime(size_t n) {
 for(size_t i = (n | 1); i < LONG_MAX; i += 2) {
 if(IsPrime(i)) {
 return i;
 }
 }
 return n;
}
size_t GetLowerPrime(size_t n) {
 for(size_t i = (n | 1); i < LONG_MAX; i -= 2) {
 if(IsPrime(i)) {
 return i;
 }
 }
 return n;
}
size_t elfhash(const unsigned char *str)
{
 size_t h = 0, high;
 while (*str)
 {
 h = (h << 4) + *str++;
 if ((high = h & 0xF0000000))
 h ^= high >> 24;
 h &= ~high;
 }
 return h;
}

Variable Declarations

A best practice in most programming languages is to declare a single variable per line. In the C and C++ programming languages this best practice also suggests that each variable be initiated on that line as well since neither language provides default values for variables. It turns out that the lack of initialization is a source of many bugs (I have had to trace this error through someone else's code when I was merging libraries into an executable).

In the elfhash function above the code:

 size_t h = 0, high;

would be better written if it was:

 size_t h = 0;
 size_t high = 0;

One of the reasons for putting each variable on its own line is that it improves readability and maintainability.

While Loops Versus For Loops

The for loops in fnv_hash_string(const char *str) and fnv_hash(unsigned char *data, size_t size) should be while loops, at no point is the loop counter variable i used. Loop invariants should be moved out of the loop, and i is a loop invariant. An optimizing compiler might fix the performance in this code and it might not, there are also times that you can't use the optimization of the compiler such as when you are working on embedded systems.

size_t fnv_hash_string(const char *str)
{
 size_t hash;
 int i;
 for(i = 0, hash = FNV_OFFSET; *str != '0円'; str++, i++) {
 hash = hash ^ *str;
 hash = hash * FNV_PRIME;
 }
 return hash;
}

size_t fnv_hash_string(const char *str)
{
 size_t hash = FNV_OFFSET;
 while (*str)
 {
 hash = hash ^ *str;
 hash = hash * FNV_PRIME;
 str++;
 }
 return hash;
}

DRY Code

There is a programming principle called the Don't Repeat Yourself Principle sometimes referred to as DRY code. If you find yourself repeating the same code multiple times it is better to encapsulate it in a function. If it is possible to loop through the code that can reduce repetition as well.

The 2 hash functions in misc.c, size_t fnv_hash_string(const char *str) and size_t fnv_hash(unsigned char *data, size_t size) appear to be a duplication of the code, this may be a problem in the future, however, fnv_hash() is never called and can probably be removed. There doesn't seem to be a reason to ever use fnv_hash() since a size shouldn't ever be needed.

The functions size_t GetHigherPrime(size_t n) and GetLowerPrime(size_t n) could also be merged if a second parameter, direction is added.

#define HIGHERPRIME 2
#define LOWERPRIME -2
size_t GetOtherPrime(size_t n, int direction)
{
 for (size_t i = (n | 1); i < LONG_MAX; i += direction) {
 if (IsPrime(i)) {
 return i;
 }
 }
 return n;
}

Question 8

So I decided to put the comments inside the header files for the users opportunity to read it if they dont use doxygen. I did't not know that these are doxygen QT style comments. I saw them in some other code and noticed they are nicely shown by hovering over them in VS Code. I documented the magic numbers now, changed every the variable declarations to single lines and also change the hash code to use the while loop. Nice example for the DRY Code as I did not think about that i.e. 10 += -2 = 8. I also changed the return value from the hash functions to uint64.

Question 9

Can you explain when to use which type? I.e. size_t? Can I keep size_t for the prime functions?

Question 10

@mortytheshorty The recommended uses of size_t are loop counters, indexes in arrays and memory allocation sizes. Since the hash functions return an index into a dynamic array that is a good use. I don't see any reason to not to use size_t for the prime functions as well. Depending on the implementation size_t may be unsigned long or unsigned int.

Question 11

okay so I changed the hash value type to uint64_t and keep the the size_t for those. Do you know a better approach for getting the index by quadratic probing? As @greybeard said this hash_probe macro has nothing to do with the formular but I hit everything with it as long as capacity is a prime number. But these are may mulitpliactions and divisions. I'm currently not familiar with those math orders and formulas. Any help with that?

Question 12

1) As I use your function above I get a warning for i += direction because of sign-conversion. If I cast the direction to size_t it would be ULONG_MAX-2 and that's not good. What should I do in this case? 2) Shouldn't i < ULONG_MAX be better as I use size_t as loop counter?

Question 13

Quadratic probing uses as the \$i\$th offset from the first probe index \$\pm 1^ic_2i^2+c_1i+c_0, c_2\ne 0\$ with parentheses inserted as deemed useful.
(-1) * (i * i + 1) * (i / 2) * (i / 2) does not implement this.
In C integer arithmetic disregarding overflows, this should be \$-i^2\lfloor\frac i2\rfloor^2\$, growing like \$-i^4\$.
Inline functions are less error prone than intricate macros - "non-branching":

/*! Return the probe index for the ith probe 
 * for an item hashing to hash in a table of size capacity. */
size_t hash_probe(size_t hash, size_t i, size_t cap) {
 const int 
 c2[] = {-1, 1}, c1[] = {1, 1}, c0 = 0,
 odd = i & 1;
 return (hash + (c2[odd] * i + c1[odd]) * i + c0) % cap;
}

Neither the formula quoted nor what I tinkered with in c2, c1 and c0 in size_t hash_probe() gave me 4001 different results for i from 1 to 4001, inclusive, not even close.

Question 14

It works only until capacity of 11. After that I get many results twice. Why is it not working as expected?

Question 15

Please tell what you refer to using it in above comment. Details on not working welcome.

Question 16

I wrote my own hash_probe function and thought mine an yours do the same thing but somehow yours returns doublicate entries on increasing size_t i until capacity and mine does this but wth cap >= 13. Capacity i.e. ` 7 or 11` work as expected that every result is in range 0-6 (for cap == 7) and 0-10 (for cap == 11)` if I loop from `i == 0 to i == cap). How can I show my hash_probe function? Edit my question or edit your answer?

Question 17

How about an SO question?

Question 18

I created an question on StackOverflow:

Question 19

Minor stuff:

IsPrime(1) is usually considered false

bool IsPrime(size_t n) {
 ...
 // return 1;
 return n > 1;

Unneeded L

The L to insure that the constant is at least long is not needed.

//#define FNV_OFFSET 14695981039346656037UL
//#define FNV_PRIME 1099511628211UL
#define FNV_OFFSET 14695981039346656037U
#define FNV_PRIME 1099511628211U

IMO, such magic numbers are a tad more informative when written in hex as it readily shows 14695981039346656037 is a 64-bit number.

Iffy code

~~(削除) calloc(1, sizeof *db); is like calloc(1, sizeof void);. Sure you want that? (削除ここまで)~~

Code typedefs hashdb 2 different ways. This leads to undefined behavior as the pointer types are not specified as being compatible.

Instead of:

typedef void hashdb; // hashdb.h
typedef struct hashdb { ... } hashdb; // hashdb_internal.h

Better and simpler to use typedef struct hashdb hashdb; and have hashdb_internal.h include hashdb.h.

typedef struct hashdb hashdb;

Note that the definition of struct hashdb is only needed in hashdb_internal.h.

This makes for correct code and better type checking in public user usage.

Question 20

I fixed IsPrime and changes the macros. But calloc(1, sizeof *db) is correct since it does only include ` hashdb_internal.h` where hashdb is defined as struct hashdb.

Question 21

@mortytheshorty Answer updated. Do not define hashdb 2 different ways. You can avoid UB and hide struct hashdb from the public by using typedef struct hashdb hashdb; in hashdb.h.

Question 22

Ah didn‘t know that's allowed.

pacmaninbw ♦pacmaninbw 26.2k13 gold badges47 silver badges113 bronze badges · Accepted Answer · 2023-03-08 18:28:31Z

General Observations

Interesting question! As the comments show, a lot of room for learning!

Consistency! Some header files are using Doxygen Comments, other headers are not, some header files don't even have comments. Pick a style and stay with it. When you use Doxygen for one function in a file use it for all functions in the file.

Header files with commented out function prototypes (hashdb.h) are not ready for code review. I recommend using a configuration management system such as Git. You can store your files on GitHub. This allows you to delete code and be able to restore it if you want to.

When I was starting out as a software engineer it was common to run lint on the code after it compiled, this found a lot of errors in the code. This isn't really necessary now if you use the proper compiler flags (for gcc -pedantic, -Wall, -Wextra, and -Wconversion for Visual Studio Warning Level3 or Level4). This allowed me to find some problems in the code such as a type mismatch between the prototype declarations of elfhash(const unsigned char *str) and the implementation of elfhash(const unsigned char *str).

misc.h:

size_t elfhash(const unsigned char *str);

misc.c

unsigned long elfhash(const unsigned char *str)

This particular mismatch problem is important because size_t is implementation dependent and may not use a unsigned long on all systems.

Other type mismatches:

timebot.h(71,44): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
hashdb_test.c.c(187,12): warning C4477: 'printf' : format string '%1lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(187,12): message : consider using '%zu' in the format string
hashdb_test.c(188,12): warning C4477: 'printf' : format string '%1lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(188,12): message : consider using '%zu' in the format string
hashdb_test.c(204,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(204,12): message : consider using '%zu' in the format string
hashdb_test.c(205,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(205,12): message : consider using '%zu' in the format string
hashdb_test.c(214,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(214,12): message : consider using '%zu' in the format string
hashdb_test.c(215,12): warning C4477: 'printf' : format string '%lu' requires an argument of type 'unsigned long', but variadic argument 1 has type 'size_t'
hashdb_test.c(215,12): message : consider using '%zu' in the format string

In the function main() should this really be the code?

 if(init_test_keys(ROCKYOU_PATH) != 0) {
 return EXIT_SUCCESS;
 }

or should it be

 if(init_test_keys(ROCKYOU_PATH) != 0) {
 return EXIT_FAILURE;
 }

Doxygen Comments

This is a great system for documenting the code, I reccomend that you either use it in the header files or the source files but not both. It is very hard to maintain 2 different sets of comments and keep them in sync.

timebot.h

I recommend putting the actual C source code in timebot.c and only putting the data structure declarations and prototype functions in in timebot.h. While the code does a good job of protecting the functions with static declarations C source code generally isn't put into header files.

Magic Numbers

The code already has a lot of symbolic constants and this is really good, but there is code where there are still raw numbers that need explanation:

misc.c

bool IsPrime(size_t n)
{
 if((n & 1) != 0) {
 for(size_t divisor = 3; divisor <= n / divisor; divisor += 2) {
 if((n % divisor) == 0) {
 return 0;
 }
 }
 return 1;
 }
 return (n == 2);
}
size_t GetHigherPrime(size_t n) {
 for(size_t i = (n | 1); i < LONG_MAX; i += 2) {
 if(IsPrime(i)) {
 return i;
 }
 }
 return n;
}
size_t GetLowerPrime(size_t n) {
 for(size_t i = (n | 1); i < LONG_MAX; i -= 2) {
 if(IsPrime(i)) {
 return i;
 }
 }
 return n;
}
size_t elfhash(const unsigned char *str)
{
 size_t h = 0, high;
 while (*str)
 {
 h = (h << 4) + *str++;
 if ((high = h & 0xF0000000))
 h ^= high >> 24;
 h &= ~high;
 }
 return h;
}

Variable Declarations

A best practice in most programming languages is to declare a single variable per line. In the C and C++ programming languages this best practice also suggests that each variable be initiated on that line as well since neither language provides default values for variables. It turns out that the lack of initialization is a source of many bugs (I have had to trace this error through someone else's code when I was merging libraries into an executable).

In the elfhash function above the code:

 size_t h = 0, high;

would be better written if it was:

 size_t h = 0;
 size_t high = 0;

One of the reasons for putting each variable on its own line is that it improves readability and maintainability.

While Loops Versus For Loops

The for loops in fnv_hash_string(const char *str) and fnv_hash(unsigned char *data, size_t size) should be while loops, at no point is the loop counter variable i used. Loop invariants should be moved out of the loop, and i is a loop invariant. An optimizing compiler might fix the performance in this code and it might not, there are also times that you can't use the optimization of the compiler such as when you are working on embedded systems.

size_t fnv_hash_string(const char *str)
{
 size_t hash;
 int i;
 for(i = 0, hash = FNV_OFFSET; *str != '0円'; str++, i++) {
 hash = hash ^ *str;
 hash = hash * FNV_PRIME;
 }
 return hash;
}

size_t fnv_hash_string(const char *str)
{
 size_t hash = FNV_OFFSET;
 while (*str)
 {
 hash = hash ^ *str;
 hash = hash * FNV_PRIME;
 str++;
 }
 return hash;
}

DRY Code

There is a programming principle called the Don't Repeat Yourself Principle sometimes referred to as DRY code. If you find yourself repeating the same code multiple times it is better to encapsulate it in a function. If it is possible to loop through the code that can reduce repetition as well.

The 2 hash functions in misc.c, size_t fnv_hash_string(const char *str) and size_t fnv_hash(unsigned char *data, size_t size) appear to be a duplication of the code, this may be a problem in the future, however, fnv_hash() is never called and can probably be removed. There doesn't seem to be a reason to ever use fnv_hash() since a size shouldn't ever be needed.

The functions size_t GetHigherPrime(size_t n) and GetLowerPrime(size_t n) could also be merged if a second parameter, direction is added.

#define HIGHERPRIME 2
#define LOWERPRIME -2
size_t GetOtherPrime(size_t n, int direction)
{
 for (size_t i = (n | 1); i < LONG_MAX; i += direction) {
 if (IsPrime(i)) {
 return i;
 }
 }
 return n;
}

So I decided to put the comments inside the header files for the users opportunity to read it if they dont use doxygen. I did't not know that these are doxygen QT style comments. I saw them in some other code and noticed they are nicely shown by hovering over them in VS Code. I documented the magic numbers now, changed every the variable declarations to single lines and also change the hash code to use the while loop. Nice example for the DRY Code as I did not think about that i.e. 10 += -2 = 8. I also changed the return value from the hash functions to uint64.
Can you explain when to use which type? I.e. size_t? Can I keep size_t for the prime functions?
@mortytheshorty The recommended uses of size_t are loop counters, indexes in arrays and memory allocation sizes. Since the hash functions return an index into a dynamic array that is a good use. I don't see any reason to not to use size_t for the prime functions as well. Depending on the implementation size_t may be unsigned long or unsigned int.
okay so I changed the hash value type to uint64_t and keep the the size_t for those. Do you know a better approach for getting the index by quadratic probing? As @greybeard said this hash_probe macro has nothing to do with the formular but I hit everything with it as long as capacity is a prime number. But these are may mulitpliactions and divisions. I'm currently not familiar with those math orders and formulas. Any help with that?
1) As I use your function above I get a warning for i += direction because of sign-conversion. If I cast the direction to size_t it would be ULONG_MAX-2 and that's not good. What should I do in this case? 2) Shouldn't i < ULONG_MAX be better as I use size_t as loop counter?

Stack Exchange Network

quadratic probing hashmap and its time efficiency

3 Answers 3

General Observations

Doxygen Comments

timebot.h

Magic Numbers

Variable Declarations

While Loops Versus For Loops

DRY Code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

quadratic probing hashmap and its time efficiency

3 Answers 3

General Observations

Doxygen Comments

timebot.h

Magic Numbers

Variable Declarations

While Loops Versus For Loops

DRY Code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions