I am trying to write a spellchecker and I wanted to use difflib to implement it. Basically I have a list of technical terms that I added to the standard Unix dictionary (/usr/share/dict/words
) that I'm storing in a file I call dictionaryFile.py.
I have another script just called stringSim.py where I import the dictionary and test sample strings against it:
import os, sys
import difflib
import time
from dictionaryFile import wordList
inputString = "dictiunary"
print "Search query: "+inputString
startTime = time.time()
inputStringSplit = inputString.split()
for term in inputStringSplit:
termL = term.lower()
print "Search term: "+term
closeMatches = difflib.get_close_matches(termL,wordList)
if closeMatches[0] == termL:
print "Perfect Match"
else:
print "Possible Matches"
print "\n".join(closeMatches)
print time.time() - startTime, "seconds"
It returns the following:
$ python stringSim.py Search query: dictiunary Search term: dictiunary Possible Matches dictionary dictionary's discretionary 0.492614984512 seconds
I'm wondering if there are better strategies I could be using for looking up similar matches (assuming a word is misspelled). This is for a web application so I am trying to optimize this part of the code to be a little snappier. Is there a better way I could structure the wordList
variable (right now it is just a list of words)?
-
\$\begingroup\$ I had a similar issue with efficiency and used the Levenshtein module on pypi: stackoverflow.com/a/26894959/4085463 \$\endgroup\$John– John2014年11月12日 19:32:13 +00:00Commented Nov 12, 2014 at 19:32
1 Answer 1
Using difflib
is probably the best choice. So the problem comes down to the size of your word list. If we can pare down the number of words difflib
needs to compare, then we could get a faster time.
One idea to implement, is take only words that are near-enough in length:
# Returns the min and max thresholds for word length
def get_thresholds(word, min=3, max=3):
length = len(word)
return max(1, length-min), length+max
# Returns only words whose length is within a certain range
def filter_word_list(min, max):
return [word for word in words_list if min <= len(word) <= max]
So the call to get_close_matches()
would be:
closeMatches = difflib.get_close_matches(termL,
dictionaryFile.filter_word_list(*get_thresholds(termL)))
Another idea would be to filter words that begin with a letter that is spatially related to the word's first letter on the keyboard. However, this suggestion is not as plausible due to different keyboard layouts.
As some general comments:
- Take a look at your variable names. They are descriptive (for the most part) but the official Python style guide recommends using
underscores_between_words
instead ofcamelCaseNames
. Same basic idea for module names. Pythonic module names are written as
MyModule
instead ofmyModule
. Also, each different module import gets its own line:# Not `import os, sys` import os import sys