Here is the code for a two-letter-word dictionary (example: "aa") using array indexing:
/* dictionary.h */
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define WORDSIZE 2
#define LETTERS 26
#define WORDS (LETTERS * LETTERS)
#define EOS '\0'
#define MAXDEFINITIONSIZE 512

typedef struct {
    char word[WORDSIZE + 1];
} Word;

typedef struct {
    char definition[MAXDEFINITIONSIZE];
} Definition;

typedef struct {
    Definition **array;
} Dictionary;

Word* createWord(char *);
Definition* createDefinition(char *);
Dictionary* createDictionary(void);
void insert(Word*, Definition*, Dictionary*);
Definition* find(Word*, Dictionary*);
int hashCode(Word*);
/* dictionary.c */
#include "dictionary.h"

int hashCode(Word *w) {
    return LETTERS * (w->word[0] - 'a') + (w->word[1] - 'a');
}

Word* createWord(char *str) {
    if (strlen(str) == 2) {
        Word *w = malloc(sizeof(Word));
        strcpy(w->word, str);
        return w;
    }
    return (Word *)NULL;
}

Definition* createDefinition(char *def) {
    Definition *d = malloc(sizeof(Definition));
    strcpy(d->definition, def);
    return d;
}

Dictionary* createDictionary(void) {
    Dictionary *d = malloc(sizeof(Dictionary));
    d->array = malloc(WORDS * sizeof(Definition*));
    return d;
}

void insert(Word *w, Definition *def, Dictionary *d) {
    d->array[hashCode(w)] = def;
}

Definition* find(Word *w, Dictionary *d) {
    return d->array[hashCode(w)];
}
/* main.c */
#include "dictionary.h"
#include <stdio.h>

int main(void) {
    Dictionary *d = createDictionary();
    Word *w = createWord("aa");
    if (w) {
        Definition *def = createDefinition("Lava");
        insert(w, def, d);
    } else {
        printf("Invalid input");
    }
}
In this code, the number of buckets equals the number of possible two-letter words (\26ドル \times 26\$), and each Word's definition is placed, on a one-to-one basis, at its own array index. Because of the huge space complexity, this program is the motivation for thinking about hash tables, which map a huge set of up to \26ドル^{45}\$ possible words in a dictionary to \$N\$ buckets.
- Is it necessary to split the files further for better maintainability?
- Can this code be refactored further?
2 Answers
In this code, the number of buckets equivalent to all possible two letter words(26*26), where each Word definition is placed on 1-1 basis in array index. Due to huge space complexity, this program is the motivation for thinking about hash table, that maps huge set of 26^45 possible words in dictionary to N number of buckets.
Your English is a bit obscure, but I think you're saying:
Here I have some code that implements a dictionary (that is, a mapping from strings of length \$k\$ to strings of arbitrary length) as a huge, mostly empty, array of size \26ドル^k\$. This is a very inefficient way to implement a dictionary, even when \$k=2\$ as in this case! Clearly, a hashtable-based implementation would be better. Therefore, this code here might be a good starting point to understand why hashtables are such useful data structures.
You're all correct except for that last sentence. Yes, a hashtable would be better than this code. But practically anything would be better than this code! This code itself is so bad (big-O-wise) that it's just not a plausible motivation for anything.
Not to mention, what use is a dictionary that can only store words of length k? Surely what the user really wants is a dictionary whose keys are strings of arbitrary length. Your array-based implementation can never provide that functionality.
As it happens, about 12 years ago I needed a dictionary that could store words of (up to) length \$k\$, where \$k=9\$, because I was doing a lot of crossword construction. The data structure that I used was a very simple one: an array!
struct WordAndDefinition {
    char *word; // malloc'ed string
    char *defn; // malloc'ed string
};
struct WordAndDefinition *dict; // malloc'ed array
Inserting a word into the dictionary meant a realloc. Searching for a word in the dictionary meant a linear scan: O(n). However, since n is small (on the order of 200,000), this was blazingly fast for my own interactive-crossword-dictionary and crossword-grid-filling purposes, and so there was no point in any further optimization.
$ wc -l /usr/share/dict/words
235886 /usr/share/dict/words
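For concreteness, here is a minimal sketch of that approach; the names addWord and findWord (and the one-realloc-per-insert growth policy) are my additions for illustration, not the original program:

#include <stdlib.h>
#include <string.h>

struct WordAndDefinition {
    char *word; /* malloc'ed string */
    char *defn; /* malloc'ed string */
};

static struct WordAndDefinition *dict = NULL;
static size_t dictSize = 0;

/* Append one entry, growing the array with realloc. */
static int addWord(const char *word, const char *defn) {
    struct WordAndDefinition *grown =
        realloc(dict, (dictSize + 1) * sizeof *dict);
    if (grown == NULL)
        return -1;
    dict = grown;
    dict[dictSize].word = malloc(strlen(word) + 1);
    dict[dictSize].defn = malloc(strlen(defn) + 1);
    if (dict[dictSize].word == NULL || dict[dictSize].defn == NULL)
        return -1; /* sketch only: a real version would clean up here */
    strcpy(dict[dictSize].word, word);
    strcpy(dict[dictSize].defn, defn);
    dictSize++;
    return 0;
}

/* Linear scan: O(n), but fast enough when n is around 200,000. */
static const char *findWord(const char *word) {
    for (size_t i = 0; i < dictSize; i++)
        if (strcmp(dict[i].word, word) == 0)
            return dict[i].defn;
    return NULL;
}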
If you want to motivate the use of technology X (for any X), your first job is to think of the best technology that is not X, and then come up with ways that X is even better than that. With the code you posted, it seems like you deliberately came up with the worst possible technology; so saying "look, X is better than this thing!" isn't impressive or interesting, it's just a tautology.
- So, by definition, doesn't an array-index implementation of a dictionary just mean having word definitions in an array that can be searched in O(1) time? – overexchange, Nov 29, 2016 at 1:53
- For words of up to 9 letters I would go for N buckets with a compressing hashCode(), which is nothing but a hash table. Your approach would use memory efficiently, but not time, which is not the idea of a hash table. – overexchange, Nov 29, 2016 at 2:00
Putting aside the relative merits of your approach, there are several things about your code that could do with improvement.
Function Prototypes
Your header file is essentially the public interface to your code. One of the things that bugs me about header files is function prototypes that contain no parameter names. They tell you nothing about what the parameters are used for or what they mean. Compare:
void insert(Word*, Definition*, Dictionary*);
With:
void insert(Word *wordToInsert, Definition *wordDefinition, Dictionary *destination);
Giving the parameters names helps to document the expectations of the functions and saves the developer from having to read documentation comments (which your header doesn't contain), refer to external documentation (which you may or may not have written), or read the code of the function they're calling and figure it out. The lower the barrier to entry, the more likely people are to use your code rather than finding an alternative implementation or writing their own.
malloc can fail
You're not checking the return values of any of your malloc calls. It's perfectly possible for malloc to return NULL. You should be checking its value and behaving sensibly where possible, particularly if you're writing code that might end up in a library.
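For example, createWord could report failure to the caller rather than assuming the allocation succeeded (a sketch, reusing the types from dictionary.h):

Word* createWord(char *str) {
    if (strlen(str) != WORDSIZE)
        return NULL;
    Word *w = malloc(sizeof(Word));
    if (w == NULL)
        return NULL; /* allocation failed; let the caller decide what to do */
    strcpy(w->word, str);
    return w;
}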
malloc isn't calloc
createDictionary allocates memory for pointers to the word definitions:
d->array = malloc(WORDS * sizeof(Definition*));
malloc doesn't automatically zero the memory. Since you're allocating pointers, there is a good chance that clients will assume that any non-null pointer refers to a definition that is actually present. You should zero the pointers, either by using calloc, or by calling memset afterwards.
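A sketch of the calloc version (the missing allocation checks are left as they were in the original):

Dictionary* createDictionary(void) {
    Dictionary *d = malloc(sizeof(Dictionary));
    /* calloc zeroes the block, so every bucket starts out as NULL */
    d->array = calloc(WORDS, sizeof(Definition*));
    return d;
}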
Be safe with pointers
insert doesn't check whether a definition already exists; it simply assigns into the array. This means that any previous definition/pointer is lost, which may lead to memory leaks if insert is called twice for the same word.
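One possible guard, assuming the dictionary owns its definitions (see the next section):

void insert(Word *w, Definition *def, Dictionary *d) {
    int index = hashCode(w);
    free(d->array[index]); /* release any previous definition; free(NULL) is a no-op */
    d->array[index] = def;
}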
Be clear about pointer ownership
Your createXXX functions all use malloc to allocate memory for the objects they return. When you call insert you pass in three pointers (word, definition, dictionary). As a client of the function, I think it would be reasonable to assume that by calling insert I was passing responsibility for the pointers to the dictionary. This is the case for the definition, which is inserted into the array member of the dictionary. The word, however, is only used to calculate the hash code, and is consequently not stored anywhere. This is confusing and is liable to result in memory leaks.
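Until the interface makes this ownership explicit, the caller has to release the word themselves, e.g.:

Word *w = createWord("aa");
if (w) {
    insert(w, createDefinition("Lava"), d); /* the dictionary keeps the definition... */
    free(w); /* ...but not the word, so the caller must free it */
}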
Provide a cleanup mechanism
You don't have any cleanup functionality. This may be acceptable for you, but I'd consider adding a function destroyDictionary that cleans up the dictionary and any pointers that it owns.
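A possible shape for it, assuming the dictionary owns its array and all of the definitions in it:

void destroyDictionary(Dictionary *d) {
    if (d == NULL)
        return;
    for (int i = 0; i < WORDS; i++)
        free(d->array[i]); /* freeing NULL buckets is harmless */
    free(d->array);
    free(d);
}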
Header Guards
It's standard practice to use header guards to prevent multiple-definition errors (and pointless include cycles). Typically you'd wrap the entire contents of the header in something like the following (guard name styles vary, but will usually be related to the name of the file):
#ifndef __Dictionary_h__
#define __Dictionary_h__
/* Rest of header code */
#endif
Alternately, depending on your compiler support you may want to use:
#pragma once
N.B. Out of historic habit, I tend to use header guards that have leading and trailing underscores. As pointed out by @Emily, this isn't an approved practice, so you should exclude them from your own guards. Technically, names with underscore prefixes are reserved for the implementation and hence could cause naming conflicts. That said, across many different compilers and environments, I've never encountered any conflicts caused by header guards, and the risk can be largely mitigated through include order.
Include order
This is somewhat subjective; however, I tend to include files in a particular order:
#include <standard headers>
#include <external library headers>
#include <internal library headers>
#include <project headers>
This tends to mean that if you do end up with any kind of conflict with existing headers, the errors are more likely to be flagged up in your project's headers, where the issue is probably caused, rather than in the standard library. This is the reverse of the order you've used in your main.c.
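Applied to your main.c, that ordering would look like this (the blank line separating the groups is just a convention):

/* main.c */
#include <stdio.h>

#include "dictionary.h"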