I got a function that has as input 2 strings, computes the string similarity of all the combinations and give back and output the highest similarity. For example, "you are" and "are you" will have 1 as similarity value.
import itertools
from difflib import SequenceMatcher
import numpy as np
def _compare_names(a, b):
return SequenceMatcher(None, a, b).ratio()
def _compare_names_with_combinations(name_1,name_2):
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
combinations_name_1 = list(itertools.permutations(name_1))
combinations_name_2 = list(itertools.permutations(name_2))
combinations_name_joined_1 = [' '.join(list(name)) for name in combinations_name_1]
combinations_name_joined_2 = [' '.join(list(name)) for name in combinations_name_2]
distances = []
for name1 in combinations_name_joined_1:
for name2 in combinations_name_joined_2:
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)
examples:
_compare_names_with_combinations('you are','are you')
>> 1
_compare_names_with_combinations('you are','are yos')
>> 0.85714285
My concerns come when I have to compare a lot of texts and it seems that there should be around a more efficient way of computing this value. Do you think there is space in this function to decrease the computational time?
2 Answers 2
This will not reduce time complexity, just space complexity; as well as spending less time in the interpreter, most of the work being done in C.
The key idea is to make use of generators to avoid building lists that you only iterate once; reducing your need for memory management, thus increasing the overall speed.
Since you are already importing itertools
, I’ll just import the relevant parts to save on typing a bit, feel free to use the long names if you see fit:
from itertools import product, permutations
in case you are using Python 2, I'd use:
from itertools import product, permutations, imap as map
Now we transform each list into a generator expression:
def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
distances = []
for name1 in map(join, permutations(name_1)):
for name2 in map(join, permutations(name_2)):
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)
Then product
lets us merge the two loops:
def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
distances = []
for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)
An other improvement is to not build the list of all distances since we are only interested in the maximum value: let's put this max
call out of the function and turn it into a generator:
def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
yield _compare_names(name1,name2)
def score_names(name_1, name_2):
return max(_compare_names_with_combinations(name_1, name_2))
Well, we’re just mapping _compare_names
over the product
. Maybe we could have that max
call in the function after all:
def _compare_names_with_combinations(name_1, name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
return max(
map(
_compare_names,
product(
map(join, permutations(name_1)),
map(join, permutations(name_2))
)
)
)
But this will require that we modify _compare_names
to accept a tuple of two names as parameter instead of two names as two separate parameters:
def _compare_names(names):
a, b = names
return SequenceMatcher(None, a, b).ratio()
You're doing the same thing to both lists three times, just use a loop instead:
args = []
for name in (name_1, name_2):
name = name.lower().split()
combinations = itertools.permutations(name)
args.append([' '.join(list(name)) for name in combinations])
Notice I also removed list
from the combinations
line. If you only need to iterate over combinations
there's no need to turn it into a list first. This also has the side effect of cleaning up your names to be shorter an easier:
for name1 in args[0]:
for name2 in args[1]:
distances.append(_compare_names(name1, name2))
You could also just write that as a list comprehension:
distances = [_compare_names(name1, name2)
for name1 in args[0] for name2 in args[1]]
Having two for x in y
s in a list comprehension just serves to be a nested for loop, same as you had before.
Explore related questions
See similar questions with these tags.
john smith
,billy bob thornton
.. . \$\endgroup\$_compare_names
look? Also, no need fordistances
, just store the max value. \$\endgroup\$