I'm looking to make my code faster at loading a hash table; any pointers would be appreciated. Below are my includes, the node definition, and the loading code.
The dictionary loaded could be any *.txt file. It's preprocessed to contain just lowercase words, sans punctuation.
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include "dictionary.h"
// define hash table structure
typedef struct node
{
    char word[LENGTH + 1];
    struct node* next;
}
node;
bool load(const char* dictionary)
{
    // open dictionary and check for NULL
    FILE* fpdict = fopen(dictionary, "r");
    if (fpdict == NULL)
    {
        printf("Could not open %s.\n", dictionary);
        unload();
        return false;
    }
    // read in txt file
    else
    {
        node* new_node = malloc(sizeof(node));
        for ( ; fscanf(fpdict, "%s", new_node->word) != EOF;
                new_node = malloc(sizeof(node)) )
        {
            // call hash function to get key
            unsigned int key = hash(new_node->word, strlen(new_node->word), TABLESIZE);
            // check if hashtable key is NULL
            if (hashtable[key] == NULL)
            {
                hashtable[key] = new_node;
                new_node->next = NULL;
            }
            // if linked list exists, insert it in order
            else if (hashtable[key] != NULL)
            {
                // point to first hashtable entry
                node* head = hashtable[key];
                // set crawler to head
                node* crawler = head;
                // check where to insert
                // insert word at end and set pointer to NULL
                while (crawler->next != NULL)
                {
                    crawler = crawler->next;
                }
                new_node->next = NULL;
                crawler->next = new_node;
            }
            wordCount += 1;
        }
        fclose(fpdict);
        free(new_node);
        return true;
    }
    return false;
}
The ultimate speedup is achieved by preprocessing the dictionary one step further, into a binary format. Group the words with the same hash code together, and prepare a TOC, as an array of the file offsets of each such group, at the beginning of the file. This way the dictionary loads in a single fread()
(you may also want to account for endianness with a slightly more sophisticated format).
I also find it strange that you add nodes at the end of the list. Unless you have a very strong reason for doing so, crawling lists is a pure waste of time. Inserting the node at the head unconditionally may, or may not, gain some performance, depending on the quality of the hash.
That said, I can only reiterate @Edward's comment: profile the code.
Insert at head
Currently, when you get a hash collision, you append the new word to the tail of the hash bucket. It would be faster and the code would be shorter to prepend to the head instead.
Current code:
// check if hashtable key is NULL
if (hashtable[key] == NULL)
{
    hashtable[key] = new_node;
    new_node->next = NULL;
}
// if linked list exists, insert it in order
else if (hashtable[key] != NULL)
{
    // point to first hashtable entry
    node* head = hashtable[key];
    // set crawler to head
    node* crawler = head;
    // check where to insert
    // insert word at end and set pointer to NULL
    while (crawler->next != NULL)
    {
        crawler = crawler->next;
    }
    new_node->next = NULL;
    crawler->next = new_node;
}
New code:
new_node->next = hashtable[key];
hashtable[key] = new_node;
If there is a particular reason that you need to keep your buckets in alphabetical order, then you could just preprocess the input file to list the words in reverse and then you'd end up with the same ordering in the buckets.
Preprocess hash
Most likely, your hash function is the slowest part of the process (although you should profile to make sure). What you could do is precompute the hash of each word and save it in the input file next to the word. For example:
apple 12376
banana 9384
Now you can just read the hash in directly instead of computing it. Without knowing your hash function, it's not certain that this will be of any benefit, though.
You can probably increase the I/O performance of your code by reading from disk in chunks. The macro BUFSIZ, defined in stdio.h, is the size of the internal I/O buffers and will probably be the most efficient read size. Use fread to fill the whole buffer in one call. Handling the seams can be a bit tricky if a word spans two buffers, but it can be done with a bit of care.
char buf[BUFSIZ];
size_t count = 0;
do {
    count = fread(buf, 1, BUFSIZ, fpdict);
    if (ferror(fpdict)) {
        // Error - check errno
        break;
    }
    // Read words from buf
} while (count > 0);
Without seeing hash(), it's not possible to give a useful answer.