I'm looking to make my code faster at loading a hash table; any pointers would be appreciated. Below are my includes, the node definition, and the loading code.
The dictionary loaded could be any *.txt file. It's preprocessed to contain just lowercase words, sans punctuation.
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include "dictionary.h"
// define hash table structure
typedef struct node
{
    char word[LENGTH + 1];
    struct node* next;
}
node;
bool load(const char* dictionary)
{
    // open dictionary and check for NULL
    FILE* fpdict = fopen(dictionary, "r");
    if (fpdict == NULL)
    {
        printf("Could not open %s.\n", dictionary);
        unload();
        return false;
    }
    // read in txt file
    else
    {
        node* new_node = malloc(sizeof(node));
        for ( ; fscanf(fpdict, "%s", new_node->word) != EOF;
                new_node = malloc(sizeof(node)) )
        {
            // call hash function to get key
            unsigned int key = hash(new_node->word, strlen(new_node->word), TABLESIZE);
            // check if hashtable key is NULL
            if (hashtable[key] == NULL)
            {
                hashtable[key] = new_node;
                new_node->next = NULL;
            }
            // if linked list exists, insert it in order
            else if (hashtable[key] != NULL)
            {
                // point to first hashtable entry
                node* head = hashtable[key];
                // set crawler to head
                node* crawler = head;
                // check where to insert
                // insert word at end and set pointer to NULL
                while (crawler->next != NULL)
                {
                    crawler = crawler->next;
                }
                new_node->next = NULL;
                crawler->next = new_node;
            }
            wordCount += 1;
        }
        fclose(fpdict);
        free(new_node);
        return true;
    }
    return false;
}
The ultimate speedup is achieved by preprocessing the dictionary one step further, into a binary format. Group the words with the same hash code together, and prepare a TOC, as an array of the file offsets of each such group, at the beginning of the file. This way the dictionary loads in a single fread()
(you may also want to account for endianness with a slightly more sophisticated format).
I also find it strange that you add nodes at the end of the list. Unless you have a very strong reason for doing so, crawling lists is a pure waste of time. Inserting the node at the head unconditionally may, or may not, gain some performance, depending on the quality of the hash.
That said, I can only reiterate @Edward's comment: profile the code.
Insert at head
Currently, when you get a hash collision, you append the new word to the tail of the hash bucket. It would be faster and the code would be shorter to prepend to the head instead.
Current code:
// check if hashtable key is NULL
if (hashtable[key] == NULL)
{
    hashtable[key] = new_node;
    new_node->next = NULL;
}
// if linked list exists, insert it in order
else if (hashtable[key] != NULL)
{
    // point to first hashtable entry
    node* head = hashtable[key];
    // set crawler to head
    node* crawler = head;
    // check where to insert
    // insert word at end and set pointer to NULL
    while (crawler->next != NULL)
    {
        crawler = crawler->next;
    }
    new_node->next = NULL;
    crawler->next = new_node;
}
New code:
new_node->next = hashtable[key];
hashtable[key] = new_node;
If there is a particular reason that you need to keep your buckets in alphabetical order, then you could just preprocess the input file to list the words in reverse and then you'd end up with the same ordering in the buckets.
Preprocess hash
Most likely, your hash function is the slowest part of the process (although you should profile to make sure). What you could do is precompute the hash of each word and save it in the input file next to the word. For example:
apple 12376
banana 9384
Now you can just read the hash in directly instead of computing it. Without knowing your hash function, it's not certain that this will be of any benefit, though.
You can probably increase the I/O performance of your code by reading from disk in chunks. The macro BUFSIZ, defined in stdio.h, is the size of the internal I/O buffers and will probably be the most efficient read size. Use fread to fill the whole buffer in one call. Handling the seams can be a bit tricky if a word spans two buffers, but it can be done with a bit of care.
char buf[BUFSIZ];
size_t count = 0;
do {
    count = fread(buf, 1, BUFSIZ, fpdict);
    if (ferror(fpdict)) {
        // Error - check errno
        break;
    }
    // Read words from buf
} while (count > 0);
Without seeing hash(), it's not possible to give a useful answer.