
I'm looking to make my code faster for loading a hash table. Any pointers would be appreciated. Below are my includes, the node definition, and the loading code.

The dictionary loaded could be any *.txt file. It's preprocessed to contain just lowercase words, sans punctuation.

#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#include "dictionary.h"

// define hash table structure
typedef struct node
{
    char word[LENGTH + 1];
    struct node* next;
}
node;

bool load(const char* dictionary)
{
    // open dictionary and check for NULL
    FILE* fpdict = fopen(dictionary, "r");
    if (fpdict == NULL)
    {
        printf("Could not open %s.\n", dictionary);
        unload();
        return false;
    }
    // read in txt file
    else
    {
        node* new_node = malloc(sizeof(node));
        for ( ; fscanf(fpdict, "%s", new_node->word) != EOF;
                new_node = malloc(sizeof(node)) )
        {
            // call hash function to get key
            key = hash(new_node->word, strlen(new_node->word), TABLESIZE);
            // check if hashtable key is NULL
            if (hashtable[key] == NULL)
            {
                hashtable[key] = new_node;
                new_node->next = NULL;
            }
            // if linked list exists, insert it in order
            else if (hashtable[key] != NULL)
            {
                // point to first hashtable entry
                node* head = hashtable[key];
                // set crawler to head
                node* crawler = head;
                // check where to insert
                // insert word at end and set pointer to NULL
                while (crawler->next != NULL)
                {
                    crawler = crawler->next;
                }
                new_node->next = NULL;
                crawler->next = new_node;
            }
            wordCount += 1;
        }
        fclose(fpdict);
        free(new_node);
        return true;
    }
    return false;
}
asked Jul 10, 2015 at 9:35
  • Welcome to Code Review! Could you please include some more information in the question? How is the dictionary file laid out? How do you use it? You can edit your post by pressing the small "edit" button under the post. Commented Jul 10, 2015 at 9:53
  • Without the code for hash() it's not possible to give a useful answer. Commented Jul 10, 2015 at 13:28
  • I'm trying to isolate the speed issue related to loading from the particular algorithm I'm implementing to populate the hash table. I'm well aware that different algorithms will affect the speed of the load, but that is not what I'm asking. I'm asking whether the code for loading the hash table itself is optimal, or are you saying it should be tailored to the type of algorithm I'm using? Commented Jul 10, 2015 at 13:35
  • Generally, such code is either I/O bound or compute bound. If it's compute bound (meaning that the computation is limiting the speed), it's likely to be the hash function implementation that's using most of the time. If it's I/O bound, significant speed improvements are unlikely to be obtained by rearranging the code. Commented Jul 10, 2015 at 14:02

3 Answers


The ultimate speedup is achieved by preprocessing the dictionary one step further, into a binary format. Group the words with the same hash code together, and prepare a TOC, an array of file offsets of each such group, at the beginning of the file. This way the dictionary loads in a single fread() (you may also want to account for endianness with a slightly more sophisticated format).
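
As a rough sketch only (not code from the original answer), a loader along these lines could look like the following. The file layout here is hypothetical: a table of TABLESIZE + 1 uint32_t offsets from the start of the file, marking where each bucket's group of NUL-terminated words begins, followed by the word data itself. node, hashtable[] and TABLESIZE are assumed to be the ones from the question, and endianness handling is omitted.

#include <stdint.h>

bool load_binary(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
    {
        return false;
    }

    // Find the file size, then slurp the whole file in a single fread().
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    char *blob = malloc(size);
    if (blob == NULL || fread(blob, 1, size, fp) != (size_t)size)
    {
        free(blob);
        fclose(fp);
        return false;
    }
    fclose(fp);

    // TOC: TABLESIZE + 1 offsets, measured from the start of the file,
    // each marking where one bucket's group of NUL-terminated words begins.
    uint32_t *toc = (uint32_t *)blob;

    for (int key = 0; key < TABLESIZE; key++)
    {
        for (uint32_t off = toc[key]; off < toc[key + 1]; )
        {
            node *n = malloc(sizeof(node));
            strcpy(n->word, blob + off);
            n->next = hashtable[key];       // prepend; within-bucket order is irrelevant
            hashtable[key] = n;
            off += strlen(blob + off) + 1;  // skip the word and its terminating NUL
        }
    }

    free(blob);
    return true;
}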

I also find it strange that you add nodes at the end of the list. Unless you have a very strong reason for doing so, crawling the list is a pure waste of time. Inserting nodes at the head unconditionally may - or may not - gain some performance, depending on the quality of the hash.

That said, I can only reiterate @Edward's comment: profile the code.

answered Jul 10, 2015 at 17:14

Insert at head

Currently, when you get a hash collision, you append the new word to the tail of the hash bucket. It would be faster, and the code would be shorter, to prepend to the head instead.

Current code:

// check if hashtable key is NULL
if (hashtable[key] == NULL)
{
    hashtable[key] = new_node;
    new_node->next = NULL;
}
// if linked list exists, insert it in order
else if (hashtable[key] != NULL)
{
    // point to first hashtable entry
    node* head = hashtable[key];
    // set crawler to head
    node* crawler = head;
    // check where to insert
    // insert word at end and set pointer to NULL
    while (crawler->next != NULL)
    {
        crawler = crawler->next;
    }
    new_node->next = NULL;
    crawler->next = new_node;
}

New code:

 new_node->next = hashtable[key];
 hashtable[key] = new_node;

If there is a particular reason that you need to keep your buckets in alphabetical order, then you could just preprocess the input file to list the words in reverse and then you'd end up with the same ordering in the buckets.

Preprocess hash

Most likely, your hash function is the slowest part of the process (although you should profile to make sure). What you could do is to precompute the hash of each word and save the hash to the input file next to each word. For example:

apple 12376
banana 9384

Now you can just read the hash in directly instead of computing it. Without knowing your hash function, it's not certain that this will be of any benefit, though.
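
For illustration only (this is not code from the original post), the load loop might then look something like the sketch below. It assumes each line of the preprocessed file holds a word followed by its precomputed bucket index (already reduced modulo the same TABLESIZE), and it reuses node, hashtable[] and wordCount from the question, inserting at the head as suggested above.

char word[LENGTH + 1];
unsigned int key;

// Each line is now "word bucket-index"; read both at once instead of hashing.
while (fscanf(fpdict, "%s %u", word, &key) == 2)
{
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
    {
        return false;
    }
    strcpy(new_node->word, word);
    new_node->next = hashtable[key];   // insert at head, as suggested above
    hashtable[key] = new_node;
    wordCount += 1;
}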

answered Jul 10, 2015 at 17:23

You can probably increase the I/O performance of your code by reading from disk in chunks. The macro BUFSIZ, declared in stdio.h, is the size of the standard I/O library's internal buffers and will probably be the most efficient read size. Use fread to fill the whole buffer in one call. Handling the seams can be a bit tricky if a word spans two buffers, but it can be done with a bit of care.

char buf[BUFSIZ];
size_t count;
do {
    count = fread(buf, 1, sizeof buf, fpdict);
    if (ferror(fpdict)) {
        // Error - check errno
        break;
    }
    // Read words from buf
} while (count > 0);
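
For what it's worth, here is one rough way the seam could be handled: characters are copied into a small word buffer that survives across fread() calls, so a word that straddles two chunks is completed by the next read. insert_word() is a hypothetical stand-in for the hashing and list insertion from the question, and LENGTH is the limit from dictionary.h.

char buf[BUFSIZ];
char word[LENGTH + 1];
size_t wlen = 0;                      // length of the word built up so far
size_t count;

while ((count = fread(buf, 1, sizeof buf, fpdict)) > 0)
{
    for (size_t i = 0; i < count; i++)
    {
        if (buf[i] == '\n' || buf[i] == ' ')
        {
            if (wlen > 0)             // a complete word ends here
            {
                word[wlen] = '\0';
                insert_word(word);    // hypothetical: hash + list insertion from the question
                wlen = 0;
            }
        }
        else if (wlen < LENGTH)
        {
            word[wlen++] = buf[i];    // accumulate, possibly across chunk boundaries
        }
    }
}
if (wlen > 0)                         // the file may not end with a newline
{
    word[wlen] = '\0';
    insert_word(word);
}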
answered Jul 10, 2015 at 14:44