3
\$\begingroup\$

I have a table with over 1,000,000 records. I need to replace any names in the text fields with aliases to help de-identify the data. For this example, let's assume the table is TemporaryTest and has two fields: Id (the key field) and IndexedXML (the text field).

I have a second table, AppellationSubstitution, that has the following columns: TextEntry (a name needing replacement), Length (length of TextEntry), Replacement (the replacement name, which may be of a different length). That table has about 110,000 rows.

The first step I use is (the regex matches words in the text field -- it looks a bit odd because of some odd characters that show up in this database):

SELECT id, 
 matchindex, 
 matchlength, 
 replacement 
FROM TemporaryTest 
 CROSS APPLY
master.dbo.Regexmatches('([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?', [IndexedXML], 
master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0)) 
 INNER JOIN dbo.appellationsubstitution 
 ON match = textentry
 ORDER BY Id, MatchIndex DESC;--if replace in forward order, insertion point gets moved 

This produces a table with over 100,000 rows, which the following shows a few lines:

Id matchindex matchlength replacement
99309 122 5 "Demarcus"
108639 106 5 "Demarcus"
109809 84 6 "Rehbein"
110373 89 7 "Reginald"
111156 105 5 "Demarcus"
112452 129 6 "Thie"
112896 113 6 "Diberardino"
112896 92 6 "Diberardino"
113503 119 3 "Rubin"

The full procedure I'm currently trying out is:

SET NOCOUNT ON;
SET XACT_ABORT ON;
BEGIN TRANSACTION;
DECLARE ReplaceCursor CURSOR LOCAL FOR
SELECT id, 
 matchindex, 
 matchlength, 
 replacement
FROM TemporaryTest 
 CROSS APPLY
master.dbo.Regexmatches('([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?', [IndexedXML], 
master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0)) 
 INNER JOIN dbo.appellationsubstitution 
 ON match = textentry
 ORDER BY Id, MatchIndex DESC;--if replace in forward order, insertion point gets moved 
DECLARE @Rid int, @Rmi AS int, @Rml AS int, @Rrep AS nvarchar(255);
OPEN ReplaceCursor;
FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;
WHILE @@FETCH_STATUS = 0
BEGIN
 UPDATE TemporaryTest
 Set IndexedXML = STUFF([IndexedXML],@Rmi+1,@Rml,@Rrep) 
 WHERE Id = @Rid;
 FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;
END;
CLOSE ReplaceCursor;
DEALLOCATE ReplaceCursor;
COMMIT TRANSACTION

This works, but takes a very long time to run (over an hour and not yet completed), and IndexedXML is one of the smallest text fields I have in the production database.

I resorted to using a cursor as I didn't know any other way to manage sequential STUFF calls on the same cell, where subsequent STUFF calls use the result of the previous ones.

Am I taking the right course with this, or is there a faster/cleaner way of achieving this?

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Nov 23, 2016 at 22:34
\$\endgroup\$
1
  • \$\begingroup\$ I've now run this on my full dataset and all the fields that had to be deidentified (around 10 fields). The run took 43 hours on a machine running SQL Server Standard with an intel i7-4790 and the bottleneck was the processor (running at over 90% utilization). So any ideas about optimization would be welcome! \$\endgroup\$ Commented Nov 28, 2016 at 17:44

1 Answer 1

1
\$\begingroup\$

You seem to run a separate UPDATE command on each row of the cursor. That's inherently slow and will likely be a lot quicker when you run the UPDATE based on a SELECT statement that does the required projection for all matching rows at once.

answered Feb 5, 2017 at 0:59
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.