
I was wondering what the fastest method is to compare the similarity between two lists of strings (e.g. two dataframes, documents, etc.) using Levenshtein distance or similar procedures.

I am currently using:

import sys
from difflib import SequenceMatcher

def wuzzyfuzzy(df1, df2):
    myList = []
    total = len(df1)
    for idx1, df1_str in enumerate(df1.col1):
        myDict = {}
        my_str = 'Progress : ' + str(round((idx1 / total) * 100, 3)) + '%'
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        for idx2, df2_str in enumerate(df2.col1):
            s = SequenceMatcher(None, df1_str, df2_str)
            r = s.ratio()
            myDict.update({df2_str: r})
        best_match = max(myDict, key=myDict.get)
        myList.append([df1_str, best_match, myDict[best_match]])
    return myList

As the dataframes passed to the function both have more than 30,000 values, it currently takes around 6 hours to compare each value in df1 with all values in df2 to find the best match.

Of course I cleaned the strings as well as possible beforehand (converted everything to lowercase, removed punctuation, etc.).

What's the most efficient way to perform such a task?

200_success
asked Jul 25, 2019 at 8:39
  • It might help to know what the strings look like. Can you provide sample data? (Commented Jul 26, 2019 at 2:38)

1 Answer

Comparing 30,000 strings against 30,000 other strings is 900 million comparisons. It's going to take a while.

Run the Python profiler on a small data set to see where it is spending the most time, so you can focus your efforts.
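As an illustrative sketch, the hot spots can be located with cProfile on a reduced sample (the function name and sample strings here are made up for demonstration):

```python
import cProfile
import io
import pstats
from difflib import SequenceMatcher

def compare_sample(strings_a, strings_b):
    # toy stand-in for the nested comparison loop in wuzzyfuzzy
    for a in strings_a:
        for b in strings_b:
            SequenceMatcher(None, a, b).ratio()

profiler = cProfile.Profile()
profiler.enable()
compare_sample(['apple', 'banana'] * 25, ['appel', 'bananna'] * 25)
profiler.disable()

# sort by cumulative time and show the top entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```

The report will show how much of the runtime sits inside `ratio()` and its helpers, which tells you whether the filtering tricks below are worth pursuing.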

difflib

The documentation for SequenceMatcher says it caches information about the second sequence. So to compare one string against a bunch of other strings, use .set_seq2() to set the one string once, and then use .set_seq1() to check it against each of the other strings.

It also says that `ratio()` is expensive to compute, so you might want to try `quick_ratio()` or `real_quick_ratio()` first; both are cheap upper bounds on `ratio()`.
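A minimal sketch of both ideas on made-up data — cache one side with set_seq2(), vary the other with set_seq1(), and use the cheap upper bounds to skip most calls to ratio():

```python
from difflib import SequenceMatcher

query = "apple pie"
candidates = ["apple pies", "cherry tart", "aple pie"]

s = SequenceMatcher(None, autojunk=False)
s.set_seq2(query)  # cache the expensive-to-index side once

best_match, best_ratio = None, 0.0
for cand in candidates:
    s.set_seq1(cand)  # only the cheap side changes per candidate
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    # so they can rule out candidates without the expensive computation
    if s.real_quick_ratio() > best_ratio and s.quick_ratio() > best_ratio:
        r = s.ratio()
        if r > best_ratio:
            best_match, best_ratio = cand, r

print(best_match, round(best_ratio, 3))
```

On this toy input only two of the three candidates ever reach the full ratio() computation; on 30,000 candidates the filter pays off far more once a good best_ratio has been found.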

import sys
from difflib import SequenceMatcher

def wuzzyfuzzy(df1, df2):
    myList = []
    total = len(df1)
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1.col1):
        s.set_seq2(df1_str)  # cache df1_str; only seq1 changes below
        my_str = 'Progress : ' + str(round((idx1 / total) * 100, 3)) + '%'
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        best_match = ''
        best_ratio = 0
        for df2_str in df2.col1:
            s.set_seq1(df2_str)
            # cheap upper bounds avoid most calls to the expensive ratio()
            if s.real_quick_ratio() > best_ratio and s.quick_ratio() > best_ratio:
                r = s.ratio()
                if r > best_ratio:
                    best_match = df2_str
                    best_ratio = r
        myList.append([df1_str, best_match, best_ratio])
    return myList

You could also consider difflib.get_close_matches(string, possibilities, n, cutoff). It compares string against a list of possibilities and returns a list of up to n matches that score better than cutoff.

import sys
from difflib import SequenceMatcher, get_close_matches

def wuzzyfuzzy(df1, df2):
    myList = []
    total = len(df1)
    possibilities = list(df2.col1)
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1.col1):
        my_str = 'Progress : ' + str(round((idx1 / total) * 100, 3)) + '%'
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        # get the 1 best match that has a ratio of at least 0.7, if any
        matches = get_close_matches(df1_str, possibilities, 1, 0.7)
        if matches:
            s.set_seqs(df1_str, matches[0])
            myList.append([df1_str, matches[0], s.ratio()])
        else:
            myList.append([df1_str, '', 0])
    return myList
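For reference, here is get_close_matches on its own, with made-up example strings:

```python
from difflib import get_close_matches

possibilities = ["apple pie", "cherry tart", "banana bread"]
# up to n=1 match scoring at least cutoff=0.7; an empty list if none qualify
matches = get_close_matches("appel pie", possibilities, n=1, cutoff=0.7)
print(matches)  # → ['apple pie']
```

Internally it applies the same real_quick_ratio/quick_ratio filtering before computing the full ratio, so it is a convenient packaged version of the hand-rolled loop above.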
answered Jul 26, 2019 at 2:37
