How to optimize the for loop for finding the best-matching string using fuzzywuzzy

Question 1

I am computing a similarity score between one string and another in Python using the fuzzywuzzy library.

Currently, I am doing this with a for loop, and the search is time-consuming.

Below is the working code:

import csv
from fuzzywuzzy import fuzz
# `name` is the search string, defined elsewhere (e.g. "Test" or "John")
with open('all_nut_data.csv', newline='') as csvfile:
    spamwriter = csv.DictReader(csvfile)
    mostsimilarcs = 0
    mostsimilarns = 0
    for rowdata in spamwriter:
        mostsimilarns = fuzz.ratio(rowdata["Food Item"].lower(), name.lower())
        if mostsimilarns > mostsimilarcs:
            mostsimilarcs = mostsimilarns
            row1 = rowdata

How can I optimize this code without a for loop?

Note: the CSV file contains 600,000 rows and 17 columns.

Sample CSV file

Comment 1

Does the food-item column contain duplicates? Caching might help.

Comment 2

Could you include a snippet of the CSV file for testing?

Comment 3

Also, what is name?

Comment 4

@AustinHastings The food-item column does not contain duplicate values.

Comment 5

@Graipher It's any arbitrary value, like "Test" or "John".

Answer

This will not be much faster, but it is more readable (IMO) and more extensible. You are looking for the maximum (in similarity), so use the built-in max function. You can also define a function that does the file reading (so you can swap it out for a list of dictionaries, or whatever, for testing) and a function to be used as the key. I made it slightly more complicated than needed here to allow some customization: the word being compared against is fixed, so it is passed to the outer function, and so is the column name (you could also hard-code that).

import csv
from fuzzywuzzy.fuzz import ratio as fuzz_ratio
def get_rows(file_name):
    with open(file_name, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        yield from reader
def similarity_to(x, column_name):
    x = x.lower()
    def similarity(row):
        return fuzz_ratio(row[column_name].lower(), x)
    return similarity
if __name__ == "__main__":
    items = get_rows('all_nut_data.csv')
    name = "Hayelnut"
    best_match = max(items, key=similarity_to(name, "Food Item"))
    match_quality = similarity_to(name, "Food Item")(best_match)

max ensures that the key function is called only once per element (so no unnecessary calculations). However, since the similarity is not part of the row, you have to calculate it again at the end. On the other hand, name.lower() is no longer called on every loop iteration. Note that get_rows is a generator. This is very nice because, just like in your code, the whole file never has to be loaded into memory, but if you want to run it multiple times, you need to recreate the generator each time.
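The recalculation at the end could be avoided by letting max compare (score, row) pairs instead of bare rows. The following is a minimal sketch, not a drop-in replacement: it assumes difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio (its scores can differ from fuzzywuzzy's), and best_match_with_score is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in scorer (assumption): difflib instead of fuzz.ratio, scaled to 0-100.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_match_with_score(rows, name, column_name):
    """Return (score, row) for the most similar row, scoring each row once."""
    name = name.lower()
    # Lazily pair every row with its score, so nothing is scored twice
    # and the rows can still come from a generator.
    scored = ((ratio(row[column_name].lower(), name), row) for row in rows)
    # Compare on the score only; rows (dicts) are never compared to each other.
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    rows = [{"Food Item": "Walnut"}, {"Food Item": "Hazelnut"}]
    match_quality, best_match = best_match_with_score(rows, "Hayelnut", "Food Item")
    print(best_match, match_quality)
```

If several rows share the top score, max returns the first one it encounters, which matches the behaviour of the original loop with its strict > comparison.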

In the end, the code as currently written cannot avoid calling the function on each row, one at a time. With max, at least the iteration is partially done in C and is therefore potentially faster, but not by much. In some naive tests, the built-in max was about 30% faster than a simple for loop like yours.

The only way to get a significant speed increase would be to use a vectorized version of that function. After some digging I found out that internally fuzzywuzzy just returns the Levenshtein ratio of the two words from the Levenshtein module (after type and bounds checking, followed by some casting and rounding). So you could look for other modules that implement this, or check whether using the underlying function directly is faster. Unfortunately, I have not managed to find a vectorized version of the Levenshtein ratio (or distance) where one word is fixed and the other is not.
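The idea of calling the underlying ratio directly could be sketched as below. This assumes the third-party Levenshtein package is installed and falls back to difflib.SequenceMatcher from the standard library when it is not; the two can give slightly different scores for some inputs. The names raw_ratio and best_match are hypothetical, and whether this is actually faster would have to be measured on the real data.

```python
from difflib import SequenceMatcher

try:
    # Optional third-party dependency; returns a float between 0 and 1.
    from Levenshtein import ratio as raw_ratio
except ImportError:
    def raw_ratio(a, b):
        # Standard-library fallback (assumption: close enough for ranking).
        return SequenceMatcher(None, a, b).ratio()


def best_match(words, name):
    """Return the word most similar to name, skipping the casting and rounding."""
    name = name.lower()
    return max(words, key=lambda word: raw_ratio(word.lower(), name))


if __name__ == "__main__":
    print(best_match(["walnut", "hazelnut", "peanut"], "Hayelnut"))
```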

However, there is fuzzywuzzy.process.extractOne, which lets you customize the scoring and processing. It might be even faster than the loop run by max:

from fuzzywuzzy import process, fuzz
# get_rows is the generator function defined in the previous snippet
def processor(x):
    return x["Food Item"].lower()
def get_best_match(name, rows):
    name = {"Food Item": name}
    return process.extractOne(name, rows,
                              scorer=fuzz.ratio, processor=processor)
if __name__ == "__main__":
    rows = get_rows('all_nut_data.csv')
    name = "Hayelnut"
    best_match, match_quality = get_best_match(name, rows)
    print(best_match, match_quality)

Packing the name into a dictionary is necessary because the processor is also called on the query.
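Why the query needs the same shape as a row can be illustrated with a small model. The function extract_one_model below is a hypothetical, simplified stand-in written for this illustration, not fuzzywuzzy's implementation; it only mirrors the behaviour described above (the processor runs on the query as well as on every choice), and the scorer is again a difflib-based assumption.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in scorer (assumption): difflib instead of fuzz.ratio.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def processor(row):
    return row["Food Item"].lower()


def extract_one_model(query, choices, scorer, processor):
    """Simplified model: the processor is applied to the query and to each choice."""
    processed_query = processor(query)
    scored = ((choice, scorer(processed_query, processor(choice)))
              for choice in choices)
    return max(scored, key=lambda pair: pair[1])


rows = [{"Food Item": "Walnut"}, {"Food Item": "Hazelnut"}]

# A bare string fails: processor("Hayelnut") tries to index a str with a column name.
try:
    extract_one_model("Hayelnut", rows, ratio, processor)
except TypeError as error:
    print("bare string query failed:", error)

# Wrapping the query in a dictionary with the same key works.
best_row, score = extract_one_model({"Food Item": "Hayelnut"}, rows, ratio, processor)
print(best_row["Food Item"], score)
```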

Using a local dictionary (from the hunspell package), which contains 65867 words, I get the following timings for finding the closest match for "Hayelnut":

OP: 207 ms ± 4.05 ms
max: 206 ms ± 8.33 ms
get_best_match: 221 ms ± 3.77 ms

So there is no real improvement; in fact, the last function is even slightly slower! But at least all three determine that "hazelnut" is the correct choice in this case.
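A comparison of this kind could be reproduced with the standard-library timeit module. The sketch below is only an illustration under assumptions: it uses a tiny synthetic word list instead of the hunspell dictionary, a difflib-based scorer instead of fuzz.ratio, and hypothetical function names, so its numbers will not match the timings above.

```python
import timeit
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in scorer (assumption): difflib instead of fuzz.ratio.
    return SequenceMatcher(None, a, b).ratio()


def best_with_loop(words, name):
    """Explicit for loop, structured like the code in the question."""
    name = name.lower()
    best_word, best_score = None, -1.0
    for word in words:
        score = ratio(word.lower(), name)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def best_with_max(words, name):
    """Same search, expressed with the built-in max and a key function."""
    name = name.lower()
    return max(words, key=lambda word: ratio(word.lower(), name))


# Synthetic data (hypothetical); repeated so each call takes a measurable time.
WORDS = ["walnut", "hazelnut", "peanut", "chestnut", "coconut"] * 100

if __name__ == "__main__":
    for func in (best_with_loop, best_with_max):
        # Take the best of three runs of five calls each.
        seconds = min(timeit.repeat(lambda: func(WORDS, "Hayelnut"),
                                    number=5, repeat=3))
        print(f"{func.__name__}: {seconds / 5 * 1000:.2f} ms per call")
```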

Graipher · Accepted Answer · 2019-10-01 12:32:15Z

