I wrote a program that should check a dictionary containing about 50000 other dictionaries. In each of those dictionaries, one of the keys has a list of words as its value. I iterate over those words to find the ones most similar to a queried input. However, it takes quite a while to finish. How can I speed up this process?
import pickle, sys

def levenshtein_distance(first, second):
    if first == second: return 0
    elif len(first) == 0: return len(second)
    elif len(second) == 0: return len(first)
    v0 = [None] * (len(second) + 1)
    v1 = [None] * (len(second) + 1)
    for i in range(len(v0)):
        v0[i] = i
    for i in range(len(first)):
        v1[0] = i + 1
        for j in range(len(second)):
            cost = 0 if first[i] == second[j] else 1
            v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + cost)
        for j in range(len(v0)):
            v0[j] = v1[j]
    return v1[len(second)]
def remove_duplicates(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
def main():
    dict_pkld = pickle.load(open('woordjes2.pkl', 'rb'))
    postlist = pickle.load(open('postlist.pkl', 'rb'))
    lowest_distance = []
    words = []
    f = sys.stdin.readlines()
    for line in f:
        woorden = line.rstrip()
        query1 = woorden
        print('Your query is: "' + query1 + '"')
        print('A word that is similar to your query is: \n')
        for sub in dict_pkld.values():
            woorden = sub['termen']
            if not query1 in woorden:
                for woord in woorden:
                    x = levenshtein_distance(query1, woord)
                    temp_list_number = []
                    temp_list_word = []
                    if lowest_distance:
                        for number, word in zip(lowest_distance, words):
                            if number > x:
                                #print('x is kleiner')
                                loc = lowest_distance.index(number)
                                lowest_distance[loc] = x
                                words[loc] = woord
                            elif number == x:
                                #print('x is gelijk')
                                temp_list_number.append(x)
                                temp_list_word.append(woord)
                            else:
                                #print('Niks')
                                pass
                    else:
                        #print('lijst is leeg')
                        lowest_distance.append(x)
                        words.append(woord)
                    for item, woordje in zip(temp_list_number, temp_list_word):
                        if not woordje in words:
                            lowest_distance.append(item)
                            words.append(woordje)
                        else:
                            pass
        words_new = remove_duplicates(words)
        for woordje in words_new:
            print('-' + woordje)
        print('\nThese words have a Levenshtein Distance of ' + str(lowest_distance[0]))
        print('\nWe are now looking for Tweets that contain the first word of this list: \n')
        result = postlist[words_new[0]]
        for r in result:
            print(dict_pkld[r]['tekst'] + '\n')

main()
Input: a random word as a query, like "Ajax".
Output: all Tweets in the dictionary that contain a word most similar to the query "Ajax".
2 Answers
The largest performance problem I see is in the for woord in woorden loop. The code iterates over zip(lowest_distance, words) - which keeps growing - over and over again, effectively giving quadratic performance. I recommend computing the distance to each woord, then sorting the resulting list:

    woord_distance_list = sorted((levenshtein_distance(query1, woord), woord) for woord in woorden)
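That suggestion can be sketched end to end like so (toy data; the query and word list are made up for illustration):

```python
def levenshtein_distance(first, second):
    # Standard two-row dynamic-programming Levenshtein distance.
    if first == second:
        return 0
    v0 = list(range(len(second) + 1))
    v1 = v0[:]
    for i, a in enumerate(first):
        v1[0] = i + 1
        for j, b in enumerate(second):
            v1[j + 1] = min(v1[j] + 1, v0[j + 1] + 1, v0[j] + (a != b))
        v0, v1 = v1, v0  # swap rows instead of copying element by element
    return v0[-1]

query1 = 'ajax'
woorden = ['ajar', 'apex', 'ajax!', 'banana']

# One pass to compute all distances, then a single O(n log n) sort;
# the best matches are the leading entries with the minimal distance.
woord_distance_list = sorted(
    (levenshtein_distance(query1, woord), woord) for woord in woorden)
print(woord_distance_list[0])  # -> (1, 'ajar')
```

Sorting (distance, word) tuples works because tuples compare element-wise, so the smallest distances come first and ties are broken alphabetically.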
The words list undergoes remove_duplicates, which is a strong indication that it should be a set in the first place.

I would consider delegating levenshtein_distance to a C function. Python is not well-suited for this kind of computation.

Not related to performance: postlist is declared way too early. Move it down to where it is used:

    postlist = pickle.load(open('postlist.pkl', 'rb'))
    result = postlist[words_new[0]]
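Putting the sorting and set suggestions together gives a much shorter inner loop. A minimal sketch with made-up data (the dictionary shape mimics the question; the distance function here is a cheap stand-in, not real Levenshtein):

```python
# Toy stand-in for the pickled structure from the question.
dict_pkld = {
    1: {'termen': ['ajar', 'apex']},
    2: {'termen': ['ajax!', 'ajar']},  # duplicates across entries are fine
}

def distance(a, b):
    # Placeholder metric so the sketch runs on its own;
    # swap in a real Levenshtein implementation in practice.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

query = 'ajax'

# A set of (distance, word) pairs removes duplicates as they are
# produced, instead of calling remove_duplicates afterwards; min()
# picks the closest word of each entry without any manual bookkeeping.
best_per_entry = {
    min((distance(query, woord), woord) for woord in sub['termen'])
    for sub in dict_pkld.values()
}
print(sorted(best_per_entry))  # -> [(1, 'ajar')]
```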
A massive main with 4 levels of loop nesting is a strong indication that something is not right. Try to turn each loop into a function, just to give it a meaningful name.
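One way to apply that advice: the innermost loops of main collapse into a single named helper. This is a sketch, not the author's code, and naive_distance is a stand-in metric so the example runs on its own:

```python
def best_matches(query, woorden, distance):
    # One former loop of main(), now a small function with a name
    # that says what it does: find the words closest to the query.
    scored = sorted((distance(query, w), w) for w in woorden)
    best_distance = scored[0][0]
    return best_distance, [w for d, w in scored if d == best_distance]

def naive_distance(a, b):
    # Stand-in metric; use a real Levenshtein function in practice.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

print(best_matches('ajax', ['ajar', 'apex', 'ajaz'], naive_distance))
# -> (1, ['ajar', 'ajaz'])
```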
- Stay consistent with your indentation levels. Mixing 4- and 8-space indents in the same file is not a good idea; doing it in the same function is worse.
- Do not call your main function directly at the end of the file; guard it with:

      if __name__ == "__main__":
          main()
- Make better use of higher-level Python functions to reduce clutter in your code: copying a list is done with new_list = old_list[:], iterating over elements together with their indices is done with enumerate, and so on.
and so on. - Do not write
else: pass
there is no added value, it’s just noise.
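For instance, with toy data, slicing replaces a manual copy loop and enumerate replaces the list.index lookups used in the question:

```python
words = ['ajar', 'ajax', 'apex']

# Copy a list with a slice instead of a manual loop.
backup = words[:]

# enumerate yields (index, element) pairs, so no list.index call
# (which rescans the list from the start) is needed.
for loc, word in enumerate(words):
    if word == 'ajax':
        words[loc] = 'AJAX'

print(words)   # -> ['ajar', 'AJAX', 'apex']
print(backup)  # -> ['ajar', 'ajax', 'apex']
```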
Combine that with @vnp's words of wisdom to use a set and to order your woorden before computing anything, and you get:
import pickle, sys

def levenshtein_distance(reference, study):
    if not reference: return len(study)
    if not study: return len(reference)
    if reference == study: return 0
    v0 = list(range(len(study) + 1))
    v1 = v0[:]
    for i, ref_elem in enumerate(reference):
        v1[0] = i + 1
        for j, study_elem in enumerate(study):
            v1[j+1] = min(v1[j] + 1,
                          v0[j+1] + 1,
                          v0[j] + int(ref_elem != study_elem))
        v0[:] = v1[:]
    return v1[-1]

def levenshtein_query(query):
    def wrapped(word):
        return levenshtein_distance(query, word), word
    return wrapped

def main(words_filename='woordjes2.pkl', tweets_filename='postlist.pkl'):
    dict_pkld = pickle.load(open(words_filename, 'rb'))
    postlist = pickle.load(open(tweets_filename, 'rb'))
    for line in sys.stdin:
        woorden = line.rstrip()
        print('Your query is: "' + woorden + '"')
        words = sorted({
            sorted(map(levenshtein_query(woorden), sub['termen']))[0]
            for sub in dict_pkld.values()
            if woorden not in sub['termen']})  # I wonder why you put this condition in the first place
        print('A word that is similar to your query is:')
        min_distance, most_similar_word = words[0]
        for distance, woordje in words:
            if distance != min_distance:
                break
            print('-', woordje)
        print('These words have a Levenshtein Distance of', min_distance)
        print('Tweets that contain the first word of this list:')
        result = postlist[most_similar_word]
        for r in result:
            print(dict_pkld[r]['tekst'])

if __name__ == "__main__":
    main()
The open calls leak without a surrounding with, although in a one-off script it won't matter. Also, Python 2 or 3?
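The with form the comment refers to closes the file automatically, even on errors. A self-contained round-trip (using a temporary file rather than the question's filenames):

```python
import os
import pickle
import tempfile

data = {'termen': ['ajax', 'ajar']}

# mkstemp gives us a scratch file path; the with blocks guarantee
# both file handles are closed, exception or not.
fd, path = tempfile.mkstemp(suffix='.pkl')
os.close(fd)
try:
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)
finally:
    os.remove(path)

print(loaded == data)  # -> True
```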