Comparison function for strings

Question 1

I got a function that has as input 2 strings, computes the string similarity of all the combinations and give back and output the highest similarity. For example, "you are" and "are you" will have 1 as similarity value.

import itertools
from difflib import SequenceMatcher
import numpy as np
def _compare_names(a, b):
 return SequenceMatcher(None, a, b).ratio()
def _compare_names_with_combinations(name_1,name_2):
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 combinations_name_1 = list(itertools.permutations(name_1))
 combinations_name_2 = list(itertools.permutations(name_2))
 combinations_name_joined_1 = [' '.join(list(name)) for name in combinations_name_1]
 combinations_name_joined_2 = [' '.join(list(name)) for name in combinations_name_2]
 distances = []
 for name1 in combinations_name_joined_1:
 for name2 in combinations_name_joined_2:
 distance = _compare_names(name1,name2)
 distances.append(distance)
 return max(distances)

examples:

_compare_names_with_combinations('you are','are you')
>> 1
_compare_names_with_combinations('you are','are yos')
>> 0.85714285

My concerns come when I have to compare a lot of texts and it seems that there should be around a more efficient way of computing this value. Do you think there is space in this function to decrease the computational time?

Question 2

Is it intended that you are only testing different orders of words, not of all characters?

Question 3

only words and more precise names such as : john smith,billy bob thornton.. .

Question 4

Do you really have to compare all the permutations of name 1 with all the permutations of name 2? Wouldn't it suffice to compare the permutations of name 1 with the original name 2, or just to compare the different word counts, or the words in sorted order? How does _compare_names look? Also, no need for distances, just store the max value.

Question 5

This will not reduce time complexity, just space complexity; as well as spending less time in the interpreter, most of the work being done in C.

The key idea is to make use of generators to avoid building lists that you only iterate once; reducing your need for memory management, thus increasing the overall speed.

Since you are already importing itertools, I’ll just import the relevant parts to save on typing a bit, feel free to use the long names if you see fit:

from itertools import product, permutations

in case you are using Python 2, I'd use:

from itertools import product, permutations, imap as map

Now we transform each list into a generator expression:

def _compare_names_with_combinations(name_1,name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 distances = []
 for name1 in map(join, permutations(name_1)):
 for name2 in map(join, permutations(name_2)):
 distance = _compare_names(name1,name2)
 distances.append(distance)
 return max(distances)

Then product lets us merge the two loops:

def _compare_names_with_combinations(name_1,name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 distances = []
 for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
 distance = _compare_names(name1,name2)
 distances.append(distance)
 return max(distances)

An other improvement is to not build the list of all distances since we are only interested in the maximum value: let's put this max call out of the function and turn it into a generator:

def _compare_names_with_combinations(name_1,name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
 yield _compare_names(name1,name2)
def score_names(name_1, name_2):
 return max(_compare_names_with_combinations(name_1, name_2))

Well, we’re just mapping _compare_names over the product. Maybe we could have that max call in the function after all:

def _compare_names_with_combinations(name_1, name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 return max(
 map(
 _compare_names,
 product(
 map(join, permutations(name_1)),
 map(join, permutations(name_2))
 )
 )
 )

But this will require that we modify _compare_names to accept a tuple of two names as parameter instead of two names as two separate parameters:

def _compare_names(names):
 a, b = names
 return SequenceMatcher(None, a, b).ratio()

Question 6

You're doing the same thing to both lists three times, just use a loop instead:

args = []
for name in (name_1, name_2):
 name = name.lower().split()
 combinations = itertools.permutations(name)
 args.append([' '.join(list(name)) for name in combinations])

Notice I also removed list from the combinations line. If you only need to iterate over combinations there's no need to turn it into a list first. This also has the side effect of cleaning up your names to be shorter an easier:

for name1 in args[0]:
 for name2 in args[1]:
 distances.append(_compare_names(name1, name2))

You could also just write that as a list comprehension:

distances = [_compare_names(name1, name2)
 for name1 in args[0] for name2 in args[1]]

Having two for x in ys in a list comprehension just serves to be a nested for loop, same as you had before.

score 3 · Answer 1 · 2016-02-12 12:53:41Z

This will not reduce time complexity, just space complexity; as well as spending less time in the interpreter, most of the work being done in C.

The key idea is to make use of generators to avoid building lists that you only iterate once; reducing your need for memory management, thus increasing the overall speed.

Since you are already importing itertools, I’ll just import the relevant parts to save on typing a bit, feel free to use the long names if you see fit:

from itertools import product, permutations

in case you are using Python 2, I'd use:

from itertools import product, permutations, imap as map

Now we transform each list into a generator expression:

def _compare_names_with_combinations(name_1,name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 distances = []
 for name1 in map(join, permutations(name_1)):
 for name2 in map(join, permutations(name_2)):
 distance = _compare_names(name1,name2)
 distances.append(distance)
 return max(distances)

Then product lets us merge the two loops:

def _compare_names_with_combinations(name_1,name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 distances = []
 for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
 distance = _compare_names(name1,name2)
 distances.append(distance)
 return max(distances)

An other improvement is to not build the list of all distances since we are only interested in the maximum value: let's put this max call out of the function and turn it into a generator:

def _compare_names_with_combinations(name_1,name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
 yield _compare_names(name1,name2)
def score_names(name_1, name_2):
 return max(_compare_names_with_combinations(name_1, name_2))

Well, we’re just mapping _compare_names over the product. Maybe we could have that max call in the function after all:

def _compare_names_with_combinations(name_1, name_2):
 join = ' '.join
 name_1 = name_1.lower().split()
 name_2 = name_2.lower().split()
 return max(
 map(
 _compare_names,
 product(
 map(join, permutations(name_1)),
 map(join, permutations(name_2))
 )
 )
 )

But this will require that we modify _compare_names to accept a tuple of two names as parameter instead of two names as two separate parameters:

def _compare_names(names):
 a, b = names
 return SequenceMatcher(None, a, b).ratio()

score 1 · Answer 2 · 2016-02-12 17:40:23Z

You're doing the same thing to both lists three times, just use a loop instead:

args = []
for name in (name_1, name_2):
 name = name.lower().split()
 combinations = itertools.permutations(name)
 args.append([' '.join(list(name)) for name in combinations])

Notice I also removed list from the combinations line. If you only need to iterate over combinations there's no need to turn it into a list first. This also has the side effect of cleaning up your names to be shorter an easier:

for name1 in args[0]:
 for name2 in args[1]:
 distances.append(_compare_names(name1, name2))

You could also just write that as a list comprehension:

distances = [_compare_names(name1, name2)
 for name1 in args[0] for name2 in args[1]]

Having two for x in ys in a list comprehension just serves to be a nested for loop, same as you had before.

Stack Exchange Network

Comparison function for strings

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Comparison function for strings

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions