3
\$\begingroup\$

I wrote a program that should check a dictionary that contains about 50000 other dictionaries. In those dictionaries, one of the keys has a list of words as value. Now, I iterate over those words, finding the words most similar to a queried input. However, it takes quite a while for it to finish. How can I speed up this process?

import pickle,sys
def levenshtein_distance(first, second):
 if first == second: return 0
 elif len(first) == 0: return len(second)
 elif len(second) == 0: return len(first)
 v0 = [None] * (len(second) + 1)
 v1 = [None] * (len(second) + 1)
 for i in range(len(v0)):
 v0[i] = i
 for i in range(len(first)):
 v1[0] = i + 1
 for j in range(len(second)):
 cost = 0 if first[i] == second[j] else 1
 v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
 for j in range(len(v0)):
 v0[j] = v1[j]
 return v1[len(second)]
def remove_duplicates(seq):
 seen = set()
 seen_add = seen.add
 return [ x for x in seq if not (x in seen or seen_add(x))]
def main():
 dict_pkld = pickle.load(open('woordjes2.pkl', 'rb'))
 postlist = pickle.load(open('postlist.pkl', 'rb'))
 lowest_distance = []
 words = []
 f = sys.stdin.readlines()
 for line in f:
 woorden = line.rstrip()
 query1 = woorden
 print ('Your query is: "' + query1 + '"')
 print ('A word that is similar to your query is: \n')
 for sub in dict_pkld.values():
 woorden = sub['termen']
 if not query1 in woorden:
 for woord in woorden:
 x = levenshtein_distance(query1, woord)
 temp_list_number = []
 temp_list_word = []
 if lowest_distance:
 for number, word in zip(lowest_distance, words):
 if number > x:
 #print ('x is kleiner')
 loc = lowest_distance.index(number)
 lowest_distance[loc] = x
 words[loc] = woord
 elif number == x:
 #print ('x is gelijk')
 temp_list_number.append(x)
 temp_list_word.append(woord)
 else:
 #print ('Niks')
 pass
 else:
 #print ('lijst is leeg')
 lowest_distance.append(x)
 words.append(woord)
 for item, woordje in zip(temp_list_number, temp_list_word):
 if not woordje in words:
 lowest_distance.append(item)
 words.append(woordje)
 else:
 pass
 words_new = remove_duplicates(words)
 for woordje in words_new:
 print ('-' + woordje)
 print ('\nThese words have a Levenshtein Distance of ' + str(lowest_distance[0]))
 print ('\nWe are now looking for Tweets that contain the first word of this list: \n')
 result = postlist[words_new[0]]
 for r in result:
 print (dict_pkld[r]['tekst'] + '\n')
main()

Input: Random word as query, like "Ajax"

Output: Is all Tweets in the dictionary that contain a word that is most similar to the query "Ajax"

Quill
12k5 gold badges41 silver badges93 bronze badges
asked Oct 20, 2015 at 22:30
\$\endgroup\$
2
  • 1
    \$\begingroup\$ Could you please add some sample input, and sample output, to help us understand what the code is actually doing? It's kind of hard to read Dutch(?), for someone like me... Even though it's somewhat similar to a combination of German (and English) that some of it is understandable... \$\endgroup\$ Commented Oct 20, 2015 at 22:49
  • \$\begingroup\$ The open calls leak without surrounding with, although in a one-off script it won't matter. Also, Python 2 or 3? \$\endgroup\$ Commented Oct 21, 2015 at 23:21

2 Answers 2

3
\$\begingroup\$
  • The largest performance problem I see is in the for woord in woorden loop. The code iterates over zip(lowest_distance, words) - which is growing - over and over again, effectively getting a quadratic performance. I recommend to calculate distances to each woord, and sort the resulting list:

    woord_distance_list = [ (levenstein_distance(woord), woord) for woord in woorden ].sorted()
    
  • The words list undergoes remove_duplicates, which is a strong indication that it should be set in the first place.

  • I would consider delegating levenstein_distance to a C function. Python is not sell-suited for such kind of computations.

  • Not related to performance:

    • postlist is declared way early. Move it down to where it is used:

       postlist = pickle.load(open('postlist.pkl', 'rb'))
       result = postlist[words_new[0]]
      
    • A massive main with 4 levels of loop nesting is a strong indication that something is not right. Try to make each loop in a function, just to give it a meaningful name.

answered Oct 20, 2015 at 23:50
\$\endgroup\$
1
\$\begingroup\$
  • Stay consistent on your indentations levels. Mixing 4 and 8 spaces indent in the same file is not really a good idea. Worse is to do it in the same function.
  • Do not call your main function freely at the end of the file, guard it with:

    if __name__ == "__main__":
     main()
    
  • Make better use of higher-level python functions to reduce cluttering of your code: copying a list is done with new_list = old_list[:], iterating over elements and getting their index is done with enumerate and so on.
  • Do not write else: pass there is no added value, it’s just noise.

Combine that with @vnp words of wisdom to use set and order your woorden before computing anything, you get:

import pickle,sys
def levenshtein_distance(reference, study):
 if not reference: return len(study)
 if not study: return len(reference)
 if first == second: return 0
 v0 = list(range(len(second) + 1))
 v1 = v0[:]
 for i, ref_elem in enumerate(reference):
 v1[0] = i + 1
 for j, study_elem in enumerate(study):
 v1[j+1] = min(v1[j] + 1,
 v0[j+1] + 1,
 v0[j] + int(ref_elem != study_elem))
 v0[:] = v1[:]
 return v1[-1]
def levenshtein_query(query):
 def wrapped(word):
 return levenshtein_distance(query, word), word
 return wrapped
def main(words_filename='woordjes2.pkl', tweets_filename='postlist.pkl'):
 dict_pkld = pickle.load(open(words_filename, 'rb'))
 postlist = pickle.load(open(tweets_filename, 'rb'))
 for line in sys.stdin:
 woorden = line.rstrip()
 print('Your query is: "' + woorden + '"')
 words = sorted({
 sorted(map(levenshtein_query(woorden), sub['termen']))[0]
 for sub in dict_pkld.values()
 if woorden not in sub['termen']}) # I wonder why you put this condition in the first place
 print('A word that is similar to your query is:')
 min_distance, most_similar_word = words[0]
 for distance, woordje in words:
 if distance != min_distance:
 break
 print('-', woordje)
 print('These words have a Levenshtein Distance of', min_distance)
 print('Tweets that contain the first word of this list:')
 result = postlist[most_similar_word]
 for r in result:
 print(dict_pkld[r]['tekst'])
if __name__ == "__main__":
 main()
answered Oct 21, 2015 at 22:58
\$\endgroup\$
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.