Python softmatcher using difflib impracticably slow

Question 1

I have a softmatch function (below) that takes a donor list and a new entry and looks to see if the given donor already exists.

The data is not exact, so I have to use a softmatch to determine whether a given record exists (ex: Jon Doe at 123 Sesame St. is the same as John P. Doe at 123 Sesame Street). I have this implemented with difflib, and it works; it's just slow as Christmas.

The program currently takes about two days to process 10mb of data. The profiler indicates it's the difflib operations in the softmatch function that is causing the slowness.

Is there a way to optimize my matching function to work better?

def softMatch(self, new, donors):
 #Takes new, a contribution record from a DG report, and attempts to soft match it with a record in donors
 #Returns the position of the matched record or -1 if no match 
 #Methodology
 #Name 
 #Address
 #Affiliation,Occupation,Employer not analyzed 
 #Dependences cleanAddr(str), re(from cleanAddr), difflib
 match = -1
 name = new[0]
 address = self.cleanAddr(new[1] + " " + new[2] + " " + new[3] + " " + new[4] + " " + new[5][:5])
 while address.find(" ") != -1:
 address = address.replace(" "," ")
 diff = 0.6
 for x in range(0, len(donors)):
 ratio = difflib.SequenceMatcher(None, name, donors[x].getBestName()[0]).ratio() * difflib.SequenceMatcher(None, address, donors[x].getBestAddr()).ratio()**2
 if ratio > diff:
 diff = ratio
 match = x
 return match

And the profiler output:

 Ordered by: internal time
 ncalls tottime percall cumtime percall filename:lineno(function)
 3091394 53.737 0.000 66.798 0.000 difflib.py:353(find_longest_match)
 368770 10.030 0.000 80.702 0.000 difflib.py:463(get_matching_blocks)
 59454412 7.683 0.000 7.683 0.000 {method 'get' of 'dict' objects}
 368770 6.906 0.000 10.521 0.000 difflib.py:300(__chain_b)
 5210450 4.009 0.000 4.009 0.000 {built-in method __new__ of type object at 0x1001e14a0}
 16895496 3.056 0.000 3.056 0.000 {method 'append' of 'list' objects}
 756 2.863 0.004 104.674 0.138 june12.py:74(softMatch)
 3091394 1.875 0.000 1.875 0.000 {method 'pop' of 'list' objects}
 10299176 1.850 0.000 1.850 0.000 {method 'setdefault' of 'dict' objects}
 2487807 1.808 0.000 4.634 0.000 difflib.py:661(<genexpr>)
 3091394 1.620 0.000 4.345 0.000 <string>:12(__new__)
 368770 1.461 0.000 88.032 0.000 difflib.py:639(ratio)
 2119037 1.258 0.000 2.826 0.000 <string>:16(_make)
 8127502 1.032 0.000 1.032 0.000 {method '__contains__' of 'set' objects}
 184385 0.919 0.000 1.500 0.000 Donor.py:63(getBestAddr)
 368770 0.859 0.000 5.493 0.000 {built-in method sum}
 368770 0.554 0.000 12.131 0.000 difflib.py:154(__init__)
 368817 0.552 0.000 0.552 0.000 {method 'sort' of 'list' objects}
3965153/3965134 0.538 0.000 0.538 0.000 {built-in method len}
 368770 0.421 0.000 11.577 0.000 difflib.py:218(set_seqs)
 368770 0.403 0.000 10.924 0.000 difflib.py:256(set_seq2)
 337141 0.371 0.000 0.371 0.000 {method 'find' of 'str' objects}
 368770 0.281 0.000 0.281 0.000 difflib.py:41(_calculate_ratio)
 368770 0.232 0.000 0.232 0.000 difflib.py:230(set_seq1)
 159562 0.216 0.000 0.216 0.000 {method 'replace' of 'str' objects}
 184385 0.134 0.000 0.134 0.000 Donor.py:59(getBestName)
 1 0.017 0.017 104.718 104.718 june12.py:56(loadDonors)
 ...

(softmatch is called from loadDonors)

Question 2

I recommend you try different approaches:

Use quick_ratio instead of ratio
Apply .lower().split(' ') to the data before sending it to the SequenceMatcher
Add a line somewhere that checks if the two strings are equal. In those cases don't call SequenceMatcher

The first approach will most likely reduce the overall run time, but decrease precision. The second approach should work pretty well, since SequenceMatcher knows how to handle lists and will match only two items: ['john', 'doe'] and ['john', 'doe'] instead of all the letters in the name.

andres.riancho andres.riancho 1213 bronze badges · Answer 1 · 2018-05-07 19:07:40Z

I recommend you try different approaches:

Use quick_ratio instead of ratio
Apply .lower().split(' ') to the data before sending it to the SequenceMatcher
Add a line somewhere that checks if the two strings are equal. In those cases don't call SequenceMatcher

The first approach will most likely reduce the overall run time, but decrease precision. The second approach should work pretty well, since SequenceMatcher knows how to handle lists and will match only two items: ['john', 'doe'] and ['john', 'doe'] instead of all the letters in the name.

Stack Exchange Network

Python softmatcher using difflib impracticably slow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python softmatcher using difflib impracticably slow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions