Spell Check and Trie implementation

Question 1

I have written this code for an Edx course called CS50x (a beginner course).

This problem set required me to write a program which:

loaded a dictionary into some sort of data structure (I choose to implement a trie) and which could take an input text and search for misspellings
uses a file with an already implemented spell checker that lacked certain functions,
one skeleton file to use as the base to create the functions that the first file required
and a problem spec

It's hard to include info from the spec in this post (since a lot of it doesn't make since out of context), but I will try.

Also, a library is included with the virtual machine that is used for the course, so any function that you see which is not defined in the base libraries and you don't see defined in my two code files, then it is probably in the library.

Finally, this code works. This code probably won't compile for you since it is specifically made for one development environment (The CS50 "Appliance").

TLDR Here is the important thing to know: This is a code probably won't work for you since it is staged in a highly customized environment and comes with a custom library (Not sure if I used it in this program)

dictionary.c

The first file is called "dictionary.c" and the goal of it, per the spec, is to implement check, load, size, and unload which will, respectively, check for a word in the data structure, load a word in the data structure, check the size of the data structure, and unload the data structure (free the memory used for it). Dictionary.c contains no main function and is just composed of functions which are used in the second file, "speller.c".

speller.c

Speller.c's purpose in life is to check the spelling of a provided file against the spelling of words in a dictionary. In order to make this file work (it was provided as I stated in the second paragraph) I had to implement dictionary.c. I will include it as it may help your understanding of my code, but it is not important.

TLDR Dictionary is where I implement the trie and speller isn't important (don't review it) because it is pre-provided and is only here to show an implementation of Dictionary.

I should also say that I want to learn how to make this code faster, better formatted, and just general better.

dictionary.c

/****************************************************************************
 * dictionary.c
 *
 * Computer Science 50
 * Problem Set 6
 *
 * Implements a dictionary's functionality.
 ***************************************************************************/
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include "dictionary.h"
#include <ctype.h>
//Defines the trie data structure which is used to store the words
struct trieNode{
 char letter;
 bool wordComp;
 bool isRoot;
 struct trieNode* previous;
 struct trieNode* childNodes[27];
 }trie;
 struct trieNode* root;
 //The trie that will be used for the rest of the program
 //Stores wordCount after intialization of dictionary
 int sizeOfDic = 0;
/**
* Returns true if word is in dictionary else false.
*/
bool check(const char* word)
{
 char currChar = ' ';
 currChar = word[0];
 struct trieNode* currNode = root;
 for(int i = 0; currChar != '0円'; i++)
 {
 currChar = word[i];
 if(currChar == '0円')
 {
 break;
 }
 else if(!isalpha(currChar)&&currChar != 39)
 {
 return false;
 }
 else if(currChar != 39)
 {
 currChar = tolower(currChar);
 if(currNode->childNodes[currChar - 'a'] != NULL)
 {
 currNode = currNode->childNodes[currChar - 'a'];
 }
 else
 {
 return false;
 }
 
 }
 else if(currChar == 39)
 {
 if(currNode->childNodes[26] != NULL)
 {
 currNode = currNode->childNodes[26];
 }
 else
 {
 return false;
 }
 }
 }
 if(currChar == '0円' && currNode->wordComp){
 return true;
 }else{
 return false;
 }
}
/**
 * Loads dictionary into memory. Returns true if successful else false.
 */
bool load(const char* dictionary)
{
 root = calloc(1,sizeof(struct trieNode));
 //Counts to see whether the word was just an \n or it was really a word
 int cLettCount = 0;
 //counts the number of total words used
 int tWordCount = 0;
 //basic setup stuff
 root->isRoot = true;
 //Pointer to the current node (current node starts as root node).
 struct trieNode* currNode = root;
 struct trieNode* prevNode;
 FILE* dic = fopen(dictionary, "r");
 char currChar = 1;
 //main loop (Goes till end of file)
 while(currChar != EOF){
 //resets letter count
 cLettCount = 0;
 //resets currNode to the root node
 currNode = root;
 currChar = 1;
 //Loop gathers current word and pushes the letters to the trie
 for(int j = 0; currChar != '\n'; j++){
 currChar = fgetc(dic);
 //Makes sure we're not at the end of a word or at EOF
 if(currChar == '\n'|| currChar == EOF)
 {
 break;
 }
 //Makes sure we only get alpha chars
 else if(!isalpha(currChar) && currChar != 39)
 {
 printf("Nonalpha char %c detected. Quiting...\n", currChar);
 return false;
 }
 //Meat and bones. Adds letters onto the trie
 else if(isalpha(currChar) || currChar== 39)
 {
 bool isNull = false;
 if(currChar == 39){
 if(currNode->childNodes[26] == NULL)
 {
 isNull = true;
 currNode->childNodes[26] = calloc (1,sizeof (trie));
 }
 prevNode = currNode;
 currNode = currNode->childNodes[26];
 }
 else
 {
 //Finds the node that corrosponds with the letter
 if(currNode->childNodes[currChar - 'a'] == NULL)
 {
 isNull = true;
 currNode->childNodes[currChar - 'a'] = calloc (1,sizeof (trie));
 }
 prevNode = currNode;
 currNode = currNode->childNodes[currChar - 'a'];
 }
 if(isNull)
 {
 currNode->previous = prevNode;
 currNode->isRoot = false;
 currNode->letter = currChar;
 currNode-> wordComp = false;
 }else if(!isNull){
 }
 }
 }
 if(currChar == EOF){
 break;
 }
 else if(currChar == '\n'){
 currNode -> wordComp = true;
 tWordCount++;
 }
 }
 fclose(dic);
 sizeOfDic = tWordCount;
 return true;
}
/**
 * Returns number of words in dictionary if loaded else 0 if not yet loaded.
 */
unsigned int size(void)
{
 // TODO
 return sizeOfDic;
}
void unloadHelper(struct trieNode* currNode)
{
 for(int i = 0; i < 27; i++)
 {
 if(currNode->childNodes[i] !=NULL)
 {
 unloadHelper(currNode->childNodes[i]);
 }
 }
 free(currNode);
}
/**
 * Unloads dictionary from memory. Returns true if successful else false.
 */
bool unload(void)
{
 unloadHelper(root);
 return true;
}

speller.c

/****************************************************************************
 * speller.c
 *
 * Computer Science 50
 * Problem Set 6
 *
 * Implements a spell-checker.
 ***************************************************************************/
#include <ctype.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include "dictionary.h"
#undef calculate
#undef getrusage
// default dictionary
#define DICTIONARY "/home/cs50/pset6/dictionaries/large"
// prototype
double calculate(const struct rusage* b, const struct rusage* a);
int main(int argc, char* argv[])
{
 // check for correct number of args
 if (argc != 2 && argc != 3)
 {
 printf("Usage: speller [dictionary] text\n");
 return 1;
 }
 // structs for timing data
 struct rusage before, after;
 // benchmarks
 double ti_load = 0.0, ti_check = 0.0, ti_size = 0.0, ti_unload = 0.0;
 // determine dictionary to use
 char* dictionary = (argc == 3) ? argv[1] : DICTIONARY;
 // load dictionary
 getrusage(RUSAGE_SELF, &before);
 bool loaded = load(dictionary);
 getrusage(RUSAGE_SELF, &after);
 // abort if dictionary not loaded
 if (!loaded)
 {
 printf("Could not load %s.\n", dictionary);
 return 1;
 }
 // calculate time to load dictionary
 ti_load = calculate(&before, &after);
 // try to open text
 char* text = (argc == 3) ? argv[2] : argv[1];
 FILE* fp = fopen(text, "r");
 if (fp == NULL)
 {
 printf("Could not open %s.\n", text);
 unload();
 return 1;
 }
 // prepare to report misspellings
 printf("\nMISSPELLED WORDS\n\n");
 // prepare to spell-check
 int index = 0, misspellings = 0, words = 0;
 char word[LENGTH+1];
 // spell-check each word in text
 for (int c = fgetc(fp); c != EOF; c = fgetc(fp))
 {
 // allow only alphabetical characters and apostrophes
 if (isalpha(c) || (c == '\'' && index > 0))
 {
 // append character to word
 tolower(c);
 word[index] = c;
 index++;
 // ignore alphabetical strings too long to be words
 if (index > LENGTH)
 {
 // consume remainder of alphabetical string
 while ((c = fgetc(fp)) != EOF && isalpha(c));
 // prepare for new word
 index = 0;
 }
 }
 // ignore words with numbers (like MS Word can)
 else if (isdigit(c))
 {
 // consume remainder of alphanumeric string
 while ((c = fgetc(fp)) != EOF && isalnum(c));
 // prepare for new word
 index = 0;
 }
 // we must have found a whole word
 else if (index > 0)
 {
 // terminate current word
 word[index] = '0円';
 // update counter
 words++;
 // check word's spelling
 getrusage(RUSAGE_SELF, &before);
 bool misspelled = !check(word);
 getrusage(RUSAGE_SELF, &after);
 // update benchmark
 ti_check += calculate(&before, &after);
 // print word if misspelled
 if (misspelled)
 {
 printf("%s\n", word);
 misspellings++;
 }
 // prepare for next word
 index = 0;
 }
 }
 // check whether there was an error
 if (ferror(fp))
 {
 fclose(fp);
 printf("Error reading %s.\n", text);
 unload();
 return 1;
 }
 // close text
 fclose(fp);
 // determine dictionary's size
 getrusage(RUSAGE_SELF, &before);
 unsigned int n = size();
 getrusage(RUSAGE_SELF, &after);
 // calculate time to determine dictionary's size
 ti_size = calculate(&before, &after);
 // unload dictionary
 getrusage(RUSAGE_SELF, &before);
 bool unloaded = unload();
 getrusage(RUSAGE_SELF, &after);
 // abort if dictionary not unloaded
 if (!unloaded)
 {
 printf("Could not unload %s.\n", dictionary);
 return 1;
 }
 // calculate time to unload dictionary
 ti_unload = calculate(&before, &after);
 // report benchmarks
 printf("\nWORDS MISSPELLED: %d\n", misspellings);
 printf("WORDS IN DICTIONARY: %d\n", n);
 printf("WORDS IN TEXT: %d\n", words);
 printf("TIME IN load: %.2f\n", ti_load);
 printf("TIME IN check: %.2f\n", ti_check);
 printf("TIME IN size: %.2f\n", ti_size);
 printf("TIME IN unload: %.2f\n", ti_unload);
 printf("TIME IN TOTAL: %.2f\n\n", 
 ti_load + ti_check + ti_size + ti_unload);
 // that's all folks
 return 0;
}
/**
 * Returns number of seconds between b and a.
 */
double calculate(const struct rusage* b, const struct rusage* a)
{
 if (b == NULL || a == NULL)
 {
 return 0.0;
 }
 else
 {
 return ((((a->ru_utime.tv_sec * 1000000 + a->ru_utime.tv_usec) -
 (b->ru_utime.tv_sec * 1000000 + b->ru_utime.tv_usec)) +
 ((a->ru_stime.tv_sec * 1000000 + a->ru_stime.tv_usec) -
 (b->ru_stime.tv_sec * 1000000 + b->ru_stime.tv_usec)))
 / 1000000.0);
 }
}

Question 2

Naming

Many names in your program are not very meaningful. check can mean anything. In your context it means that the word is known, valid, correct (pick one), so the function shall be named along the lines of isWordKnown. Pretty much the same can be said about calculate.
main

It does too much. As a rule of thumb, main shall parse a command line and call appropriate functions. A heavy duty for loop must be replaced by a higher-level actions. As much as I can tell, what it's supposed to be doing is
```
read_a_word;
validate_the_word;
```
Try to accommodate this pattern.
Global variables

Try to avoid them at all costs. The wider is the scope, the harder is reasoning about the variable. At least restrict them to a file scope:
```
static struct trieNode * root;
```
check() is unnecessary complicated.

Consider instead
```
bool check(struct trieNode * root, const char* word)
{
 char currChar;
 for(int i = 0; (currChar = word[i]) != '0円'; i++)
 {
 if (currChar == '\'') {
 index = 26;
 } else if (isalpha(currChar)) {
 index = tolower(currChar) - 'a';
 } else {
 return false;
 }
 if ((node = node->childNodes[index]) == NULL)
 return false;
 }
 return (currChar == '0円' && node->wordComp);
}
```
You may notice that the loop body naturally disintegrates into 2 distinct operations: calculating the index and navigating the tree. I would seriously consider to factor out index calculation into a separate function. This would help immensely once you'd decide to change the alphabet (like, accept accents, or switch to Croatian).

PS: you may also notice that I decided to pass a root node as a parameter. This enables a complete separation of code from data.
Data; performance; etc.

Nothing is certain until it is measured; however the gut feeling is that too much space is wasted. Consider how many English words end with ing - and each ing takes 80+ pointers just to encode less than 3 bytes (ed and s are equally wasteful). Side note: English spelling is very regular; look at rispell for hoops a Russian speller has to jump through!

vnp vnp 58.7k4 gold badges55 silver badges144 bronze badges · Accepted Answer · 2014-07-01 05:51:25Z

Naming

Many names in your program are not very meaningful. check can mean anything. In your context it means that the word is known, valid, correct (pick one), so the function shall be named along the lines of isWordKnown. Pretty much the same can be said about calculate.
main

It does too much. As a rule of thumb, main shall parse a command line and call appropriate functions. A heavy duty for loop must be replaced by a higher-level actions. As much as I can tell, what it's supposed to be doing is
```
read_a_word;
validate_the_word;
```
Try to accommodate this pattern.
Global variables

Try to avoid them at all costs. The wider is the scope, the harder is reasoning about the variable. At least restrict them to a file scope:
```
static struct trieNode * root;
```
check() is unnecessary complicated.

Consider instead
```
bool check(struct trieNode * root, const char* word)
{
 char currChar;
 for(int i = 0; (currChar = word[i]) != '0円'; i++)
 {
 if (currChar == '\'') {
 index = 26;
 } else if (isalpha(currChar)) {
 index = tolower(currChar) - 'a';
 } else {
 return false;
 }
 if ((node = node->childNodes[index]) == NULL)
 return false;
 }
 return (currChar == '0円' && node->wordComp);
}
```
You may notice that the loop body naturally disintegrates into 2 distinct operations: calculating the index and navigating the tree. I would seriously consider to factor out index calculation into a separate function. This would help immensely once you'd decide to change the alphabet (like, accept accents, or switch to Croatian).

PS: you may also notice that I decided to pass a root node as a parameter. This enables a complete separation of code from data.
Data; performance; etc.

Nothing is certain until it is measured; however the gut feeling is that too much space is wasted. Consider how many English words end with ing - and each ing takes 80+ pointers just to encode less than 3 bytes (ed and s are equally wasteful). Side note: English spelling is very regular; look at rispell for hoops a Russian speller has to jump through!

Stack Exchange Network

Spell Check and Trie implementation

dictionary.c

speller.c

dictionary.c

speller.c

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Spell Check and Trie implementation

dictionary.c

speller.c

dictionary.c

speller.c

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions