Here is the code for a two-letter-word dictionary (example: "aa") using array indexing:
/* dictionary.h */
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define WORDSIZE 2
#define LETTERS 26
#define WORDS (LETTERS * LETTERS)
#define EOS '\0'
#define MAXDEFINITIONSIZE 512

typedef struct {
    char word[WORDSIZE + 1];
} Word;

typedef struct {
    char definition[MAXDEFINITIONSIZE];
} Definition;

typedef struct {
    Definition **array;
} Dictionary;

Word* createWord(char *);
Definition* createDefinition(char *);
Dictionary* createDictionary(void);
void insert(Word*, Definition*, Dictionary*);
Definition* find(Word*, Dictionary*);
int hashCode(Word*);
/* dictionary.c */
#include "dictionary.h"

int hashCode(Word *w) {
    return LETTERS * (w->word[0] - 'a') + (w->word[1] - 'a');
}

Word* createWord(char *str) {
    if (strlen(str) == 2) {
        Word *w = malloc(sizeof(Word));
        strcpy(w->word, str);
        return w;
    }
    return (Word *)NULL;
}

Definition* createDefinition(char *def) {
    Definition *d = malloc(sizeof(Definition));
    strcpy(d->definition, def);
    return d;
}

Dictionary* createDictionary(void) {
    Dictionary *d = malloc(sizeof(Dictionary));
    d->array = malloc(WORDS * sizeof(Definition*));
    return d;
}

void insert(Word *w, Definition *def, Dictionary *d) {
    d->array[hashCode(w)] = def;
}

Definition* find(Word *w, Dictionary *d) {
    return d->array[hashCode(w)];
}
/* main.c */
#include "dictionary.h"
#include <stdio.h>

int main(void) {
    Dictionary *d = createDictionary();
    Word *w = createWord("aa");
    if (w) {
        Definition *def = createDefinition("Lava");
        insert(w, def, d);
    } else {
        printf("Invalid input");
    }
}
In this code, the number of buckets equals the number of possible two-letter words (\26ドル \times 26\$), and each Word's definition is placed, on a one-to-one basis, at its own array index. Because of the huge space complexity, this program is the motivation for thinking about hash tables, which map a huge set of up to \26ドル^{45}\$ possible words in a dictionary to \$N\$ buckets.
- Is it necessary to split the files further for better maintainability?
- Can this code be refactored further?
2 Answers
In this code, the number of buckets equivalent to all possible two letter words(26*26), where each Word definition is placed on 1-1 basis in array index. Due to huge space complexity, this program is the motivation for thinking about hash table, that maps huge set of 26^45 possible words in dictionary to N number of buckets.
Your English is a bit obscure, but I think you're saying:
Here I have some code that implements a dictionary (that is, a mapping from strings of length \$k\$ to strings of arbitrary length) as a huge, mostly empty, array of size \26ドル^k\$. This is a very inefficient way to implement a dictionary, even when \$k=2\$ as in this case! Clearly, a hashtable-based implementation would be better. Therefore, this code here might be a good starting point to understand why hashtables are such useful data structures.
You're all correct except for that last sentence. Yes, a hashtable would be better than this code. But practically anything would be better than this code! This code itself is so bad (big-O-wise) that it's just not a plausible motivation for anything.
Not to mention, what use is a dictionary that can only store words of length k? Surely what the user really wants is a dictionary whose keys are strings of arbitrary length. Your array-based implementation can never provide that functionality.
As it happens, about 12 years ago I needed a dictionary that could store words of (up to) length \$k\$, where \$k=9\$, because I was doing a lot of crossword construction. The data structure that I used was a very simple one: an array!
struct WordAndDefinition {
    char *word; // malloc'ed string
    char *defn; // malloc'ed string
};
struct WordAndDefinition *dict; // malloc'ed array
Inserting a word into the dictionary meant a realloc. Searching for a word in the dictionary meant a linear scan: O(n). However, since n is small (on the order of 200,000), this was blazingly fast for my own interactive-crossword-dictionary and crossword-grid-filling purposes, and so there was no point in any further optimization.
$ wc -l /usr/share/dict/words
235886 /usr/share/dict/words
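For concreteness, here is a minimal sketch of that approach; the names addWord and findWord (and the one-realloc-per-insert growth policy) are my additions for illustration, not the original program:

#include <stdlib.h>
#include <string.h>

struct WordAndDefinition {
    char *word; /* malloc'ed string */
    char *defn; /* malloc'ed string */
};

static struct WordAndDefinition *dict = NULL;
static size_t dictSize = 0;

/* Append one entry, growing the array with realloc. */
static int addWord(const char *word, const char *defn) {
    struct WordAndDefinition *grown =
        realloc(dict, (dictSize + 1) * sizeof *dict);
    if (grown == NULL)
        return -1;
    dict = grown;
    dict[dictSize].word = malloc(strlen(word) + 1);
    dict[dictSize].defn = malloc(strlen(defn) + 1);
    if (dict[dictSize].word == NULL || dict[dictSize].defn == NULL)
        return -1; /* sketch only: a real version would clean up here */
    strcpy(dict[dictSize].word, word);
    strcpy(dict[dictSize].defn, defn);
    dictSize++;
    return 0;
}

/* Linear scan: O(n), but fast enough when n is around 200,000. */
static const char *findWord(const char *word) {
    for (size_t i = 0; i < dictSize; i++)
        if (strcmp(dict[i].word, word) == 0)
            return dict[i].defn;
    return NULL;
}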
If you want to motivate the use of technology X (for any X), your first job is to think of the best technology that is not X, and then come up with ways that X is even better than that. With the code you posted, it seems like you deliberately came up with the worst possible technology; so saying "look, X is better than this thing!" isn't impressive or interesting, it's just a tautology.
- So, by definition, doesn't an array-index implementation of a dictionary just mean having word definitions in an array that can be searched in O(1) time? – overexchange, Nov 29, 2016 at 1:53
- For words of up to 9 letters I would go for N buckets with a compressing hashCode(), which is nothing but a hash table. Your approach would use memory efficiently, but not time, which is not the idea of a hash table. – overexchange, Nov 29, 2016 at 2:00
Putting aside the relative merits of your approach, there are several things about your code that could do with improvement.
Function Prototypes
Your header file is essentially the public interface to your code. One of the things that bugs me about header files is function prototypes that contain no parameter names. They tell you nothing about what the parameters are used for or what they mean. Compare:
void insert(Word*, Definition*, Dictionary*);
With:
void insert(Word *wordToInsert, Definition *wordDefinition, Dictionary *destination);
Giving the parameters names helps to document the expectations of the functions and saves the developer from having to read documentation comments (which your header doesn't contain), refer to external documentation (which you may or may not have written), or read the code of the function they're calling and figure it out. The lower the barrier to entry, the more likely people are to use your code rather than finding an alternative implementation or writing their own.
malloc can fail
You're not checking the return values of any of your malloc calls. It's perfectly possible for malloc to return NULL. You should be checking its value and behaving sensibly where possible, particularly if you're writing code that might end up in a library.
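For example, createWord could report failure to the caller rather than assuming the allocation succeeded (a sketch, reusing the types from dictionary.h):

Word* createWord(char *str) {
    if (strlen(str) != WORDSIZE)
        return NULL;
    Word *w = malloc(sizeof(Word));
    if (w == NULL)
        return NULL; /* allocation failed; let the caller decide what to do */
    strcpy(w->word, str);
    return w;
}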
malloc isn't calloc
createDictionary allocates memory for pointers to the word definitions:
d->array = malloc(WORDS * sizeof(Definition*));
malloc doesn't automatically zero the memory. Since you're allocating pointers, there is a good chance that clients will assume that any non-null pointer refers to a definition that is actually present. You should zero the pointers, either by using calloc, or by calling memset afterwards.
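A sketch of the calloc version (the missing allocation checks are left as they were in the original):

Dictionary* createDictionary(void) {
    Dictionary *d = malloc(sizeof(Dictionary));
    /* calloc zeroes the block, so every bucket starts out as NULL */
    d->array = calloc(WORDS, sizeof(Definition*));
    return d;
}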
Be safe with pointers
insert doesn't check whether a definition already exists; it simply assigns into the array. This means that any previous definition/pointer is lost, which may lead to memory leaks if insert is called twice for the same word.
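One possible guard, assuming the dictionary owns its definitions (see the next section):

void insert(Word *w, Definition *def, Dictionary *d) {
    int index = hashCode(w);
    free(d->array[index]); /* release any previous definition; free(NULL) is a no-op */
    d->array[index] = def;
}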
Be clear about pointer ownership
Your createXXX functions all use malloc to allocate memory for the objects they return. When you call insert you pass in three pointers (word, definition, dictionary). As a client of the function, I think it would be reasonable to assume that by calling insert I was passing responsibility for the pointers to the dictionary. This is the case for the definition, which is inserted into the array member of the dictionary. The word, however, is only used to calculate the hash code, and is consequently not stored anywhere. This is confusing and is liable to result in memory leaks.
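Until the interface makes this ownership explicit, the caller has to release the word themselves, e.g.:

Word *w = createWord("aa");
if (w) {
    insert(w, createDefinition("Lava"), d); /* the dictionary keeps the definition... */
    free(w); /* ...but not the word, so the caller must free it */
}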
Provide a cleanup mechanism
You don't have any cleanup functionality. This may be acceptable for you, but I'd consider adding a function destroyDictionary that cleans up the dictionary and any pointers that it owns.
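A possible shape for it, assuming the dictionary owns its array and all of the definitions in it:

void destroyDictionary(Dictionary *d) {
    if (d == NULL)
        return;
    for (int i = 0; i < WORDS; i++)
        free(d->array[i]); /* freeing NULL buckets is harmless */
    free(d->array);
    free(d);
}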
Header Guards
It's standard practice to use header guards to prevent multiple-definition errors (and pointless include cycles). Typically you'd wrap the entire contents of the header in something like the following (guard name styles vary, but will usually be related to the name of the file):
#ifndef __Dictionary_h__
#define __Dictionary_h__
/* Rest of header code */
#endif
Alternately, depending on your compiler support you may want to use:
#pragma once
N.B. Out of historic habit, I tend to use header guards that have leading and trailing underscores. As pointed out by @Emily, this isn't an approved practice, so you should exclude them from your own guards. Technically, names with underscore prefixes are reserved for the implementation and hence could cause naming conflicts. That said, across many different compilers and environments, I've never encountered any conflicts caused by header guards, and the risk can be largely mitigated through include order.
Include order
This is somewhat subjective; however, I tend to include files in a particular order:
#include <standard headers>
#include <external library headers>
#include <internal library headers>
#include <project headers>
This tends to mean that if you do end up with any kind of conflict with existing headers, the errors are more likely to be flagged up in your project's headers, where the issue is probably caused, rather than in the standard library. This is the reverse of the order you've used in your main.c.
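Applied to your main.c, that ordering would look like this (the blank line separating the groups is just a convention):

/* main.c */
#include <stdio.h>

#include "dictionary.h"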