4
\$\begingroup\$

I am trying to write a spellchecker and I wanted to use difflib to implement it. Basically I have a list of technical terms that I added to the standard Unix dictionary (/usr/share/dict/words) that I'm storing in a file I call dictionaryFile.py.

I have another script just called stringSim.py where I import the dictionary and test sample strings against it:

import os, sys
import difflib
import time
from dictionaryFile import wordList
inputString = "dictiunary" 
print "Search query: "+inputString 
startTime = time.time()
inputStringSplit = inputString.split()
for term in inputStringSplit:
 termL = term.lower()
 print "Search term: "+term
 closeMatches = difflib.get_close_matches(termL,wordList)
 if closeMatches[0] == termL: 
 print "Perfect Match"
 else:
 print "Possible Matches"
 print "\n".join(closeMatches)
print time.time() - startTime, "seconds"

It returns the following:

$ python stringSim.py 
Search query: dictiunary
Search term: dictiunary
Possible Matches
dictionary
dictionary's
discretionary
0.492614984512 seconds

I'm wondering if there are better strategies I could be using for looking up similar matches (assuming a word is misspelled). This is for a web application so I am trying to optimize this part of the code to be a little snappier. Is there a better way I could structure the wordList variable (right now it is just a list of words)?

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked May 16, 2014 at 18:04
\$\endgroup\$
1
  • \$\begingroup\$ I had a similar issue with efficiency and used the Levenshtein module on pypi: stackoverflow.com/a/26894959/4085463 \$\endgroup\$ Commented Nov 12, 2014 at 19:32

1 Answer 1

3
\$\begingroup\$

Using difflib is probably the best choice. So the problem comes down to the size of your word list. If we can pare down the number of words difflib needs to compare, then we could get a faster time.

One idea to implement, is take only words that are near-enough in length:

# Returns the min and max thresholds for word length
def get_thresholds(word, min=3, max=3):
 length = len(word)
 return max(1, length-min), length+max 
# Returns only words whose length is within a certain range
def filter_word_list(min, max):
 return [word for word in words_list if min <= len(word) <= max]

So the call to get_close_matches() would be:

closeMatches = difflib.get_close_matches(termL,
 dictionaryFile.filter_word_list(*get_thresholds(termL)))

Another idea would be to filter words that begin with a letter that is spatially related to the word's first letter on the keyboard. However, this suggestion is not as plausible due to different keyboard layouts.

As some general comments:

  1. Take a look at your variable names. They are descriptive (for the most part) but the official Python style guide recommends using underscores_between_words instead of camelCaseNames.
  2. Same basic idea for module names. Pythonic module names are written as MyModule instead of myModule. Also, each different module import gets its own line:

    # Not `import os, sys`
    import os
    import sys
    
answered May 16, 2014 at 19:05
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.