
I was wondering what the fastest method is to compare the similarity between two lists of strings (e.g. two dataframes, documents, etc.) using Levenshtein distance or similar procedures.

I am currently using:

import sys
from difflib import SequenceMatcher

def wuzzyfuzzy(df1, df2):
    myList = []
    total = len(df1)
    for idx1, df1_str in enumerate(df1.col1):
        myDict = {}
        my_str = 'Progress : ' + str(round((idx1 / total) * 100, 3)) + '%'
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        for idx2, df2_str in enumerate(df2.col1):
            s = SequenceMatcher(None, df1_str, df2_str)
            r = s.ratio()
            myDict.update({df2_str: r})
        best_match = max(myDict, key=myDict.get)
        myList.append([df1_str, best_match, myDict[best_match]])
    return myList

As the dataframes passed to the function both have more than 30,000 values, it currently takes around 6 hours to compare each value in df1 with all values in df2 to find the best match.

Of course I cleaned the strings as well as possible beforehand (converted everything to lowercase, removed punctuation, etc.).

What's the most efficient way to perform such a task?

200_success
asked Jul 25, 2019 at 8:39
  • It might help to know what the strings look like. Can you provide sample data? (Commented Jul 26, 2019 at 2:38)

1 Answer

Comparing 30,000 strings against 30,000 other strings is 900 million comparisons. It's going to take a while.

Run the Python profiler on a small data set to see where it is spending the most time, so you can focus your efforts.
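As an illustrative sketch, the hot spots can be located with cProfile on a reduced sample (the function name and sample strings here are made up for demonstration):

```python
import cProfile
import io
import pstats
from difflib import SequenceMatcher

def compare_sample(strings_a, strings_b):
    # toy stand-in for the nested comparison loop in wuzzyfuzzy
    for a in strings_a:
        for b in strings_b:
            SequenceMatcher(None, a, b).ratio()

profiler = cProfile.Profile()
profiler.enable()
compare_sample(['apple', 'banana'] * 25, ['appel', 'bananna'] * 25)
profiler.disable()

# sort by cumulative time and show the top entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```

The report will show how much of the runtime sits inside `ratio()` and its helpers, which tells you whether the filtering tricks below are worth pursuing.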

difflib

The documentation for SequenceMatcher says it caches information about the second sequence. So to compare one string against a bunch of other strings, use .set_seq2() to set the one string once, and then use .set_seq1() to check it against each of the other strings.

It also says that `ratio()` is expensive to compute, so you might want to try `quick_ratio()` or `real_quick_ratio()` first; both are cheap upper bounds on `ratio()`.
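A minimal sketch of both ideas on made-up data — cache one side with set_seq2(), vary the other with set_seq1(), and use the cheap upper bounds to skip most calls to ratio():

```python
from difflib import SequenceMatcher

query = "apple pie"
candidates = ["apple pies", "cherry tart", "aple pie"]

s = SequenceMatcher(None, autojunk=False)
s.set_seq2(query)  # cache the expensive-to-index side once

best_match, best_ratio = None, 0.0
for cand in candidates:
    s.set_seq1(cand)  # only the cheap side changes per candidate
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    # so they can rule out candidates without the expensive computation
    if s.real_quick_ratio() > best_ratio and s.quick_ratio() > best_ratio:
        r = s.ratio()
        if r > best_ratio:
            best_match, best_ratio = cand, r

print(best_match, round(best_ratio, 3))
```

On this toy input only two of the three candidates ever reach the full ratio() computation; on 30,000 candidates the filter pays off far more once a good best_ratio has been found.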

import sys
from difflib import SequenceMatcher

def wuzzyfuzzy(df1, df2):
    myList = []
    total = len(df1)
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1.col1):
        s.set_seq2(df1_str)  # cache df1_str; only seq1 changes below
        my_str = 'Progress : ' + str(round((idx1 / total) * 100, 3)) + '%'
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        best_match = ''
        best_ratio = 0
        for df2_str in df2.col1:
            s.set_seq1(df2_str)
            # cheap upper bounds avoid most calls to the expensive ratio()
            if s.real_quick_ratio() > best_ratio and s.quick_ratio() > best_ratio:
                r = s.ratio()
                if r > best_ratio:
                    best_match = df2_str
                    best_ratio = r
        myList.append([df1_str, best_match, best_ratio])
    return myList

You could also consider difflib.get_close_matches(string, possibilities, n, cutoff). It compares string against a list of possibilities and returns a list of up to n matches that score better than cutoff.

import sys
from difflib import SequenceMatcher, get_close_matches

def wuzzyfuzzy(df1, df2):
    myList = []
    total = len(df1)
    possibilities = list(df2.col1)
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1.col1):
        my_str = 'Progress : ' + str(round((idx1 / total) * 100, 3)) + '%'
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        # get the 1 best match that has a ratio of at least 0.7, if any
        matches = get_close_matches(df1_str, possibilities, 1, 0.7)
        if matches:
            s.set_seqs(df1_str, matches[0])
            myList.append([df1_str, matches[0], s.ratio()])
        else:
            myList.append([df1_str, '', 0])
    return myList
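For reference, here is get_close_matches on its own, with made-up example strings:

```python
from difflib import get_close_matches

possibilities = ["apple pie", "cherry tart", "banana bread"]
# up to n=1 match scoring at least cutoff=0.7; an empty list if none qualify
matches = get_close_matches("appel pie", possibilities, n=1, cutoff=0.7)
print(matches)  # → ['apple pie']
```

Internally it applies the same real_quick_ratio/quick_ratio filtering before computing the full ratio, so it is a convenient packaged version of the hand-rolled loop above.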
answered Jul 26, 2019 at 2:37
