I have a SQL (MS SQL Server) database of ~22 million companies (called target_db
in my code). For example:
+-----------------------+----------------+-----------+
| CompanyName           | ReFcode        | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany        |     12345 |
| Apple Corporation     | United States  |     67899 |
| Aunt Mary Butcher     | United Kingdom |     56789 |
+-----------------------+----------------+-----------+
Then, I have another list of companies (called input in my example) and I would like to assign ID_number
based on an approximate company name match. My sample is 1000 companies, but it can be much larger.
+--------------------+----------------+
| name               | ReFcode        |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany        |
| Apple Corp.        | United States  |
| Butcher Aunt Mary  | United Kingdom |
| Volkswagen Gmbh    | Germany        |
+--------------------+----------------+
I wrote a script in Python that does the following for each of the companies in my sample:

- Connect to the database and filter target_db to only get the names that start with the same 3 letters, have similar length (+/- 7 characters) and a Soundex DIFFERENCE of 4
- Calculate the Levenshtein similarity against all possible matches (using the fuzzywuzzy library)
- Choose the match with the highest similarity
This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds were spent connecting to the DB (in every iteration of the loop), running the query and fetching the results; the rest was spent calculating the similarity, which I found quite surprising.
I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:
SELECT DISTINCT input.name,
       target_db.CompanyName,
       input.Country,
       input.ReFcode,
       [dbo].[edit_distance](input.name, target_db.CompanyName) AS levenshtein
FROM dbo.Sample AS input
JOIN dbo.Company AS target_db
    ON input.ReFcode = target_db.Refcode
   AND LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3)
   AND SOUNDEX(input.name) = SOUNDEX(target_db.CompanyName)
WHERE ABS(LEN(input.name) - LEN(target_db.CompanyName)) <= 7
ORDER BY levenshtein ASC
Then I would take the match with the lowest edit distance (Levenshtein) below a certain threshold. This works, but the process now takes almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?
- Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the SELECT. And poor Aunt Mary Butcher will not be found; using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316 – dnoeth, Feb 11, 2018
- If you are still interested in getting a review of this code, can you post an execution plan for both versions (before and after) and provide the before code so we can look at that as well? – Dan Oberlam, Aug 27, 2019
- One red flag: you mentioned "connecting to the database on every iteration of the loop". Definitely don't do this. Your Python might be fine if you just maintain a single connection. Please post your Python code. – Reinderien, Aug 28, 2019
1 Answer
Without knowing exactly what your old code was, it is hard to pinpoint where the slowdown came from. That being said, here are some things you're doing in your query that are going to make it slower:
- Using a UDF prohibits parallel plans. You could throw 1000 CPUs at this problem and it would still only use one of them. If computing this is too complex to do inline, then use an inline table-valued function instead, which enables parallelism.
- JOINing on functions (UDF or otherwise) performs really poorly. SQL Server has no cardinality estimate for LEFT(input.name, 3), but it does for input.name. If at all possible find another way to do this join (a suggestion is below, and a computed-column sketch follows this list). Similarly, filtering on functions in a WHERE clause has the same restrictions.
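For the computed-column route, here is a minimal sketch of the idea, assuming you are able to alter dbo.Company (the column and index names are illustrative, not part of the original schema):

-- Persist the derived values so the optimizer gets real statistics for them.
-- LEFT and SOUNDEX are both deterministic, so the columns can be PERSISTED
-- and indexed.
ALTER TABLE dbo.Company
    ADD NamePrefix  AS LEFT( CompanyName, 3 ) PERSISTED,
        NameSoundex AS SOUNDEX( CompanyName ) PERSISTED;

CREATE INDEX IX_Company_Refcode_Prefix_Soundex
    ON dbo.Company ( Refcode, NamePrefix, NameSoundex )
    INCLUDE ( CompanyName, ID_number );

The join can then compare LEFT(input.name, 3) and SOUNDEX(input.name), computed once per sample row, against these persisted columns, turning the function-on-function join into an ordinary indexed lookup with usable cardinality estimates.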
To address this, I would likely change your join to just be on RefCode - this gets you a good hash match (probably), but will increase your granularity. Then, make your edit_distance function table-valued, like so:
SELECT <<stuff>>
FROM dbo.Sample input
INNER JOIN dbo.Company target_db
    ON input.ReFcode = target_db.Refcode
CROSS APPLY dbo.edit_distance( input.name, target_db.CompanyName ) levenshtein
Then, inside of dbo.edit_distance I would include the LEFT, SOUNDEX and LEN calculations (the LEN calculation would also be better written as LEN( input.name ) - LEN( target_db.CompanyName ) BETWEEN -7 AND 7).
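As a rough illustration of that shape, here is a minimal sketch of such a filtering wrapper, written as an inline table-valued function. The function name is hypothetical, and it assumes the existing scalar dbo.edit_distance is kept for the distance itself. One caveat: because the scalar UDF is still invoked for surviving pairs, this alone does not restore parallelism (before SQL Server 2019's scalar UDF inlining); the win is that the cheap prefix, Soundex and length checks are inlined, and pairs failing them never reach the expensive call.

-- Hypothetical inline TVF wrapper: returns no row unless the cheap filters
-- all pass, so CROSS APPLY silently drops non-candidate pairs before the
-- expensive distance computation runs.
CREATE FUNCTION dbo.edit_distance_filtered
(
    @name    nvarchar(255),
    @company nvarchar(255)
)
RETURNS TABLE
AS
RETURN
    SELECT dbo.edit_distance( @name, @company ) AS levenshtein_score
    WHERE  LEFT( @name, 3 ) = LEFT( @company, 3 )
       AND SOUNDEX( @name ) = SOUNDEX( @company )
       AND LEN( @name ) - LEN( @company ) BETWEEN -7 AND 7;

Used with CROSS APPLY as in the query above, input rows with no surviving candidate simply disappear from the result.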
If levenshtein returns some score, then you could do something like this to only get the best score per input name. Alternatively, if you want them all, have it return everything and let the consuming app do whatever calculations it needs.
SELECT name,
       CompanyName,
       Country,
       ReFcode,
       levenshtein_score
FROM ( SELECT input.name,
              target_db.CompanyName,
              input.Country,
              input.ReFcode,
              levenshtein.levenshtein_score,
              ROW_NUMBER() OVER ( PARTITION BY input.name
                                  ORDER BY levenshtein.levenshtein_score ASC ) AS RowNumber
       FROM dbo.Sample input
       INNER JOIN dbo.Company target_db
           ON input.ReFcode = target_db.Refcode
       CROSS APPLY dbo.edit_distance( input.name, target_db.CompanyName ) levenshtein ) ComputedLevenshteinScores
WHERE ComputedLevenshteinScores.RowNumber = 1
ORDER BY levenshtein_score ASC;