Searching a dictionary of words using difflib

Question 1

I am trying to write a spellchecker and I wanted to use difflib to implement it. Basically I have a list of technical terms that I added to the standard Unix dictionary (/usr/share/dict/words) that I'm storing in a file I call dictionaryFile.py.

I have another script just called stringSim.py where I import the dictionary and test sample strings against it:

import os, sys
import difflib
import time
from dictionaryFile import wordList
inputString = "dictiunary" 
print "Search query: "+inputString 
startTime = time.time()
inputStringSplit = inputString.split()
for term in inputStringSplit:
 termL = term.lower()
 print "Search term: "+term
 closeMatches = difflib.get_close_matches(termL,wordList)
 if closeMatches[0] == termL: 
 print "Perfect Match"
 else:
 print "Possible Matches"
 print "\n".join(closeMatches)
print time.time() - startTime, "seconds"

It returns the following:

$ python stringSim.py 
Search query: dictiunary
Search term: dictiunary
Possible Matches
dictionary
dictionary's
discretionary
0.492614984512 seconds

I'm wondering if there are better strategies I could be using for looking up similar matches (assuming a word is misspelled). This is for a web application so I am trying to optimize this part of the code to be a little snappier. Is there a better way I could structure the wordList variable (right now it is just a list of words)?

Question 2

I had a similar issue with efficiency and used the Levenshtein module on pypi: stackoverflow.com/a/26894959/4085463

Question 3

Using difflib is probably the best choice. So the problem comes down to the size of your word list. If we can pare down the number of words difflib needs to compare, then we could get a faster time.

One idea to implement, is take only words that are near-enough in length:

# Returns the min and max thresholds for word length
def get_thresholds(word, min=3, max=3):
 length = len(word)
 return max(1, length-min), length+max 
# Returns only words whose length is within a certain range
def filter_word_list(min, max):
 return [word for word in words_list if min <= len(word) <= max]

So the call to get_close_matches() would be:

closeMatches = difflib.get_close_matches(termL,
 dictionaryFile.filter_word_list(*get_thresholds(termL)))

Another idea would be to filter words that begin with a letter that is spatially related to the word's first letter on the keyboard. However, this suggestion is not as plausible due to different keyboard layouts.

As some general comments:

Take a look at your variable names. They are descriptive (for the most part) but the official Python style guide recommends using underscores_between_words instead of camelCaseNames.
Same basic idea for module names. Pythonic module names are written as MyModule instead of myModule. Also, each different module import gets its own line:
```
# Not `import os, sys`
import os
import sys
```

BeetDemGuise BeetDemGuise 4,21612 silver badges29 bronze badges · Answer 1 · 2014-05-16 19:05:15Z

Using difflib is probably the best choice. So the problem comes down to the size of your word list. If we can pare down the number of words difflib needs to compare, then we could get a faster time.

One idea to implement, is take only words that are near-enough in length:

# Returns the min and max thresholds for word length
def get_thresholds(word, min=3, max=3):
 length = len(word)
 return max(1, length-min), length+max 
# Returns only words whose length is within a certain range
def filter_word_list(min, max):
 return [word for word in words_list if min <= len(word) <= max]

So the call to get_close_matches() would be:

closeMatches = difflib.get_close_matches(termL,
 dictionaryFile.filter_word_list(*get_thresholds(termL)))

Another idea would be to filter words that begin with a letter that is spatially related to the word's first letter on the keyboard. However, this suggestion is not as plausible due to different keyboard layouts.

As some general comments:

Take a look at your variable names. They are descriptive (for the most part) but the official Python style guide recommends using underscores_between_words instead of camelCaseNames.
Same basic idea for module names. Pythonic module names are written as MyModule instead of myModule. Also, each different module import gets its own line:
```
# Not `import os, sys`
import os
import sys
```

Stack Exchange Network

Searching a dictionary of words using difflib

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Searching a dictionary of words using difflib

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions