I am getting the probability of a string being similar to another string in Python using fuzzywuzzy
lib.
Currently, I am doing this using a for loop and the search is time consuming.
Below is working code :
from fuzzywuzzy import fuzz
with open('all_nut_data.csv', newline='') as csvfile:
spamwriter = csv.DictReader(csvfile)
mostsimilarcs = 0
mostsimilarns = 0
for rowdata in spamwriter:
mostsimilarns = fuzz.ratio(rowdata["Food Item"].lower(), name.lower())
if mostsimilarns > mostsimilarcs:
mostsimilarcs = mostsimilarns
row1 = rowdata
How I can optimize this code without for loop?
Note* CSV file contain 600,000 rows and 17 column
1 Answer 1
This will not be much faster, but more readable (IMO) and extendable. You are looking for the maximum (in similarity). So, use the built-in max
function. You can also define a function that does the file reading (so you can swap it out for a list of dictionaries, or whatever, for testing) and a function to be use as key
. I made it slightly more complicated than needed here to give some customizability. The word it is compared to is fixed, so it is passed to the outer function, but so is the column name (you could also hard-code that).
import csv
from fuzzywuzzy.fuzz import ratio as fuzz_ratio
def get_rows(file_name):
with open(file_name, newline='') as csvfile:
reader = csv.DictReader(csvfile)
yield from reader
def similarity_to(x, column_name):
x = x.lower()
def similarity(row):
return fuzz_ratio(row[column_name].lower(), x)
return similarity
if __name__ == "__main__":
items = get_rows('all_nut_data.csv')
name = "Hayelnut"
best_match = max(items, key=similarity_to(name, "Food Item"))
match_quality = similarity_to(name, "Food Item")(best_match)
max
ensures that the key
function is only called once per element (so no unnecessary calculations). However, since the similarity is not part of the row, you have to calculate it again at the end. On the other hand, I don't call name.lower()
every loop iteration.
Note that get_rows
is a generator. This is very nice because you don't need to load the whole file into memory (just like in your code), but if you want to run it multiple times, you need to recreate the generator each time.
In the end the code as currently written can not avoid having to call the function on each row, one at a time. With max
at least the iteration is partially done in C and therefore potentially faster, but not by much. For some naive tests, the built-in max
is about 30% faster than a simple for
loop, like you have.
The only way to get a significant speed increase would be to use a vectorized version of that function. After some digging I found out that internally the fuzzywuzzy
just returns the Levenshtein ratio for the two words (after type and bound checking, and then applies some casting and rounding) from the Levenshtein
module. So you could look for different modules that implemented this or try if directly using the underlying method is faster. Unfortunately I have not managed to find a vectorized version of the Levenshtein ratio (or distance) where one word is fixed and the other is not.
However, there is fuzzywuzzy.process.extractOne
, which lets you customize the scoring and processing. It might be even faster than the loop run by max
:
from fuzzywuzzy import process, fuzz
def processor(x):
return x["Food Item"].lower()
def get_best_match(name, rows):
name = {"Food Item": name}
return process.extractOne(name, rows,
scorer=fuzz.ratio, processor=processor)
if __name__ == "__main__":
rows = get_rows('all_nut_data.csv')
name = "Hayelnut"
best_match, match_quality = get_best_match(name, rows)
print(best_match, match_quality)
The packing of the name
in the dictionary is necessary, because the processor
is called also on the query.
Using a local dictionary (from the hunspell package), which contains 65867 words, I get the following timings for finding the closest match for "Hayelnut"
:
OP: 207 ms ± 4.05 ms
max: 206 ms ± 8.33 ms
get_best_match: 221 ms ± 3.77 ms
So no real improvement, in fact the last function is even slightly slower! But at least all three determine that "hazelnut"
is the correct choice in this case.
name
? \$\endgroup\$