I wrote a program that should check a dictionary containing about 50000 other dictionaries. In each of those dictionaries, one of the keys has a list of words as its value. I iterate over those words to find the ones most similar to a queried input. However, it takes quite a while to finish. How can I speed up this process?
import pickle, sys

def levenshtein_distance(first, second):
    if first == second: return 0
    elif len(first) == 0: return len(second)
    elif len(second) == 0: return len(first)
    v0 = [None] * (len(second) + 1)
    v1 = [None] * (len(second) + 1)
    for i in range(len(v0)):
        v0[i] = i
    for i in range(len(first)):
        v1[0] = i + 1
        for j in range(len(second)):
            cost = 0 if first[i] == second[j] else 1
            v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
        for j in range(len(v0)):
            v0[j] = v1[j]
    return v1[len(second)]
def remove_duplicates(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
def main():
    dict_pkld = pickle.load(open('woordjes2.pkl', 'rb'))
    postlist = pickle.load(open('postlist.pkl', 'rb'))
    lowest_distance = []
    words = []
    f = sys.stdin.readlines()
    for line in f:
        woorden = line.rstrip()
        query1 = woorden
        print('Your query is: "' + query1 + '"')
        print('A word that is similar to your query is: \n')
        for sub in dict_pkld.values():
            woorden = sub['termen']
            if not query1 in woorden:
                for woord in woorden:
                    x = levenshtein_distance(query1, woord)
                    temp_list_number = []
                    temp_list_word = []
                    if lowest_distance:
                        for number, word in zip(lowest_distance, words):
                            if number > x:
                                #print('x is kleiner')
                                loc = lowest_distance.index(number)
                                lowest_distance[loc] = x
                                words[loc] = woord
                            elif number == x:
                                #print('x is gelijk')
                                temp_list_number.append(x)
                                temp_list_word.append(woord)
                            else:
                                #print('Niks')
                                pass
                    else:
                        #print('lijst is leeg')
                        lowest_distance.append(x)
                        words.append(woord)
                    for item, woordje in zip(temp_list_number, temp_list_word):
                        if not woordje in words:
                            lowest_distance.append(item)
                            words.append(woordje)
                        else:
                            pass
        words_new = remove_duplicates(words)
        for woordje in words_new:
            print('-' + woordje)
        print('\nThese words have a Levenshtein Distance of ' + str(lowest_distance[0]))
        print('\nWe are now looking for Tweets that contain the first word of this list: \n')
        result = postlist[words_new[0]]
        for r in result:
            print(dict_pkld[r]['tekst'] + '\n')

main()
Input: a random word as a query, like "Ajax".
Output: all Tweets in the dictionary that contain a word most similar to the query "Ajax".
2 Answers
The largest performance problem I see is in the for woord in woorden loop. The code iterates over zip(lowest_distance, words) - which keeps growing - over and over again, effectively giving quadratic performance. I recommend computing the distance to each woord, then sorting the resulting list:

    woord_distance_list = sorted((levenshtein_distance(query1, woord), woord) for woord in woorden)
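That suggestion can be sketched end to end like so (toy data; the query and word list are made up for illustration):

```python
def levenshtein_distance(first, second):
    # Standard two-row dynamic-programming Levenshtein distance.
    if first == second:
        return 0
    v0 = list(range(len(second) + 1))
    v1 = v0[:]
    for i, a in enumerate(first):
        v1[0] = i + 1
        for j, b in enumerate(second):
            v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + (a != b))
        v0, v1 = v1, v0  # swap rows instead of copying element by element
    return v0[-1]

query1 = 'ajax'
woorden = ['ajar', 'apex', 'ajax!', 'banana']

# One pass to compute all distances, then a single O(n log n) sort;
# the best matches are the leading entries with the minimal distance.
woord_distance_list = sorted(
    (levenshtein_distance(query1, woord), woord) for woord in woorden)
print(woord_distance_list[0])  # -> (1, 'ajar')
```

Sorting (distance, word) tuples works because tuples compare element-wise, so the smallest distances come first and ties are broken alphabetically.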
The words list undergoes remove_duplicates, which is a strong indication that it should be a set in the first place.

I would consider delegating levenshtein_distance to a C function. Python is not well-suited for this kind of computation.

Not related to performance: postlist is declared way too early. Move it down to where it is used:

    postlist = pickle.load(open('postlist.pkl', 'rb'))
    result = postlist[words_new[0]]
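Putting the sorting and set suggestions together gives a much shorter inner loop. A minimal sketch with made-up data (the dictionary shape mimics the question; the distance function here is a cheap stand-in, not real Levenshtein):

```python
# Toy stand-in for the pickled structure from the question.
dict_pkld = {
    1: {'termen': ['ajar', 'apex']},
    2: {'termen': ['ajax!', 'ajar']},  # duplicates across entries are fine
}

def distance(a, b):
    # Placeholder metric so the sketch runs on its own;
    # swap in a real Levenshtein implementation in practice.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

query = 'ajax'

# A set of (distance, word) pairs removes duplicates as they are
# produced, instead of calling remove_duplicates afterwards; min()
# picks the closest word of each entry without any manual bookkeeping.
best_per_entry = {
    min((distance(query, woord), woord) for woord in sub['termen'])
    for sub in dict_pkld.values()
}
print(sorted(best_per_entry))  # -> [(1, 'ajar')]
```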
A massive main with 4 levels of loop nesting is a strong indication that something is not right. Try to turn each loop into a function, just to give it a meaningful name.
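One way to apply that advice: the innermost loops of main collapse into a single named helper. This is a sketch, not the author's code, and naive_distance is a stand-in metric so the example runs on its own:

```python
def best_matches(query, woorden, distance):
    # One former loop of main(), now a small function with a name
    # that says what it does: find the words closest to the query.
    scored = sorted((distance(query, w), w) for w in woorden)
    best_distance = scored[0][0]
    return best_distance, [w for d, w in scored if d == best_distance]

def naive_distance(a, b):
    # Stand-in metric; use a real Levenshtein function in practice.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

print(best_matches('ajax', ['ajar', 'apex', 'ajaz'], naive_distance))
# -> (1, ['ajar', 'ajaz'])
```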
- Stay consistent with your indentation levels. Mixing 4- and 8-space indents in the same file is not a good idea; doing it in the same function is worse.
- Do not call your main function directly at the end of the file; guard it with:

      if __name__ == "__main__":
          main()
- Make better use of higher-level Python functions to reduce clutter in your code: copying a list is done with new_list = old_list[:], iterating over elements together with their indices is done with enumerate, and so on.
and so on. - Do not write
else: pass
there is no added value, it’s just noise.
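For instance, with toy data, slicing replaces a manual copy loop and enumerate replaces the list.index lookups used in the question:

```python
words = ['ajar', 'ajax', 'apex']

# Copy a list with a slice instead of a manual loop.
backup = words[:]

# enumerate yields (index, element) pairs, so no list.index call
# (which rescans the list from the start) is needed.
for loc, word in enumerate(words):
    if word == 'ajax':
        words[loc] = 'AJAX'

print(words)   # -> ['ajar', 'AJAX', 'apex']
print(backup)  # -> ['ajar', 'ajax', 'apex']
```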
Combine that with @vnp's words of wisdom to use a set and to order your woorden before computing anything, and you get:
import pickle, sys

def levenshtein_distance(reference, study):
    if not reference: return len(study)
    if not study: return len(reference)
    if reference == study: return 0
    v0 = list(range(len(study) + 1))
    v1 = v0[:]
    for i, ref_elem in enumerate(reference):
        v1[0] = i + 1
        for j, study_elem in enumerate(study):
            v1[j+1] = min(v1[j] + 1,
                          v0[j+1] + 1,
                          v0[j] + int(ref_elem != study_elem))
        v0[:] = v1[:]
    return v1[-1]

def levenshtein_query(query):
    def wrapped(word):
        return levenshtein_distance(query, word), word
    return wrapped

def main(words_filename='woordjes2.pkl', tweets_filename='postlist.pkl'):
    dict_pkld = pickle.load(open(words_filename, 'rb'))
    postlist = pickle.load(open(tweets_filename, 'rb'))
    for line in sys.stdin:
        woorden = line.rstrip()
        print('Your query is: "' + woorden + '"')
        words = sorted({
            sorted(map(levenshtein_query(woorden), sub['termen']))[0]
            for sub in dict_pkld.values()
            if woorden not in sub['termen']})  # I wonder why you put this condition in the first place
        print('A word that is similar to your query is:')
        min_distance, most_similar_word = words[0]
        for distance, woordje in words:
            if distance != min_distance:
                break
            print('-', woordje)
        print('These words have a Levenshtein Distance of', min_distance)
        print('Tweets that contain the first word of this list:')
        result = postlist[most_similar_word]
        for r in result:
            print(dict_pkld[r]['tekst'])

if __name__ == "__main__":
    main()
The open calls leak without a surrounding with, although in a one-off script it won't matter. Also, Python 2 or 3?
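The with form the comment refers to closes the file automatically, even on errors. A self-contained round-trip (using a temporary file rather than the question's filenames):

```python
import os
import pickle
import tempfile

data = {'termen': ['ajax', 'ajar']}

# mkstemp gives us a scratch file path; the with blocks guarantee
# both file handles are closed, exception or not.
fd, path = tempfile.mkstemp(suffix='.pkl')
os.close(fd)
try:
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)
finally:
    os.remove(path)

print(loaded == data)  # -> True
```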