I have a SQL (MS SQL Server) database of ~22 million companies (called target_db
in my code). For example:
+-----------------------+----------------+-----------+
| CompanyName           | ReFcode        | ID_number |
+-----------------------+----------------+-----------+
| Mercedes Benz Limited | Germany        |     12345 |
| Apple Corporation     | United States  |     67899 |
| Aunt Mary Butcher     | United Kingdom |     56789 |
+-----------------------+----------------+-----------+
Then, I have another list of companies (called input in my example) and I would like to assign ID_number
based on an approximate company name match. My sample is 1000 companies, but it can be much larger.
+--------------------+----------------+
| name               | ReFcode        |
+--------------------+----------------+
| Mercedes Benz Ltd. | Germany        |
| Apple Corp.        | United States  |
| Butcher Aunt Mary  | United Kingdom |
| Volkswagen Gmbh    | Germany        |
+--------------------+----------------+
I wrote a script in Python that does the following for each of the companies in my sample:

- Connect to the database and filter target_db to only get the names that start with the same 3 letters, have similar length (+/- 7 characters) and a Soundex DIFFERENCE of 4
- Calculate the Levenshtein similarity against all possible matches (using the fuzzywuzzy library)
- Choose the match with the highest similarity
This worked mostly fine and accomplished the task in around 6 minutes. I did some profiling and noticed that 5 minutes and 30 seconds were spent connecting to the DB (in every iteration of the loop), running the query and fetching the results; the rest was spent calculating the similarity, which I found quite surprising.
I thought that I could save time and improve performance by doing the process entirely in SQL, so I started writing this query:
SELECT DISTINCT input.name,
       target_db.CompanyName,
       input.Country,
       input.ReFcode,
       [dbo].[edit_distance](input.name, target_db.CompanyName) AS levenshtein
FROM dbo.Sample AS input
JOIN dbo.Company AS target_db
    ON input.ReFcode = target_db.Refcode
   AND LEFT(input.name, 3) = LEFT(target_db.CompanyName, 3)
   AND SOUNDEX(input.name) = SOUNDEX(target_db.CompanyName)
WHERE ABS(LEN(input.name) - LEN(target_db.CompanyName)) <= 7
ORDER BY levenshtein ASC
Then I would take the match with the lowest edit distance (Levenshtein) below a certain threshold. This works, but the process now takes almost 30 minutes. What am I doing wrong? Any suggestions on how to improve it?
- Don't know exactly why it's so much slower, but you can add a ROW_NUMBER to apply the lowest EditDistance filter within the SELECT. And poor Aunt Mary Butcher will not be found; using NGrams is probably a better way: sqlservercentral.com/articles/Tally+Table/142316 – dnoeth, Feb 11, 2018
- If you are still interested in getting a review of this code, can you post an execution plan for both versions (before and after) and provide the before code so we can look at that as well? – Dan Oberlam, Aug 27, 2019
- One red flag: you mentioned "connecting to the database on every iteration of the loop". Definitely don't do this. Your Python might be fine if you just maintain a single connection. Please post your Python code. – Reinderien, Aug 28, 2019
1 Answer
Without knowing exactly what your old code was, it is hard to pinpoint where the slowdown came from. That being said, here are some things you're doing in your query that are going to make it slower:
- Using a UDF prohibits parallel plans. You could throw 1000 CPUs at this problem and it would still only use one of them. If computing this is too complex to do inline, then use an inline table-valued function instead, which enables parallelism.
- JOINing on functions (UDF or otherwise) performs really poorly. SQL Server has no cardinality estimate for LEFT(input.name, 3), but it does for input.name. If at all possible find another way to do this join (a suggestion is below, and a computed-column sketch follows this list). Similarly, filtering on functions in a WHERE clause has the same restrictions.
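For the computed-column route, here is a minimal sketch of the idea, assuming you are able to alter dbo.Company (the column and index names are illustrative, not part of the original schema):

-- Persist the derived values so the optimizer gets real statistics for them.
-- LEFT and SOUNDEX are both deterministic, so the columns can be PERSISTED
-- and indexed.
ALTER TABLE dbo.Company
    ADD NamePrefix  AS LEFT( CompanyName, 3 ) PERSISTED,
        NameSoundex AS SOUNDEX( CompanyName ) PERSISTED;

CREATE INDEX IX_Company_Refcode_Prefix_Soundex
    ON dbo.Company ( Refcode, NamePrefix, NameSoundex )
    INCLUDE ( CompanyName, ID_number );

The join can then compare LEFT(input.name, 3) and SOUNDEX(input.name), computed once per sample row, against these persisted columns, turning the function-on-function join into an ordinary indexed lookup with usable cardinality estimates.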
To address this, I would likely change your join to just be on RefCode - this gets you a good hash match (probably), but will increase your granularity. Then, make your edit_distance function table-valued, like so:
SELECT <<stuff>>
FROM dbo.Sample input
INNER JOIN dbo.Company target_db
    ON input.ReFcode = target_db.Refcode
CROSS APPLY dbo.edit_distance( input.name, target_db.CompanyName ) levenshtein
Then, inside of dbo.edit_distance I would include the LEFT, SOUNDEX and LEN calculations (the LEN calculation would also be better written as LEN( input.name ) - LEN( target_db.CompanyName ) BETWEEN -7 AND 7).
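As a rough illustration of that shape, here is a minimal sketch of such a filtering wrapper, written as an inline table-valued function. The function name is hypothetical, and it assumes the existing scalar dbo.edit_distance is kept for the distance itself. One caveat: because the scalar UDF is still invoked for surviving pairs, this alone does not restore parallelism (before SQL Server 2019's scalar UDF inlining); the win is that the cheap prefix, Soundex and length checks are inlined, and pairs failing them never reach the expensive call.

-- Hypothetical inline TVF wrapper: returns no row unless the cheap filters
-- all pass, so CROSS APPLY silently drops non-candidate pairs before the
-- expensive distance computation runs.
CREATE FUNCTION dbo.edit_distance_filtered
(
    @name    nvarchar(255),
    @company nvarchar(255)
)
RETURNS TABLE
AS
RETURN
    SELECT dbo.edit_distance( @name, @company ) AS levenshtein_score
    WHERE  LEFT( @name, 3 ) = LEFT( @company, 3 )
       AND SOUNDEX( @name ) = SOUNDEX( @company )
       AND LEN( @name ) - LEN( @company ) BETWEEN -7 AND 7;

Used with CROSS APPLY as in the query above, input rows with no surviving candidate simply disappear from the result.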
If levenshtein returns some score, then you could do something like this to only get the best score per input name. Alternatively, if you want them all, have it return everything and let the consuming app do whatever calculations it needs.
SELECT name,
       CompanyName,
       Country,
       ReFcode,
       levenshtein_score
FROM ( SELECT input.name,
              target_db.CompanyName,
              input.Country,
              input.ReFcode,
              levenshtein.levenshtein_score,
              ROW_NUMBER() OVER ( PARTITION BY input.name
                                  ORDER BY levenshtein.levenshtein_score ASC ) AS RowNumber
       FROM dbo.Sample input
       INNER JOIN dbo.Company target_db
           ON input.ReFcode = target_db.Refcode
       CROSS APPLY dbo.edit_distance( input.name, target_db.CompanyName ) levenshtein ) ComputedLevenshteinScores
WHERE ComputedLevenshteinScores.RowNumber = 1
ORDER BY levenshtein_score ASC;