I have a table with over 1,000,000 records. I need to replace any names in the text fields with aliases to help de-identify the data. For this example, let's assume the table is `TemporaryTest` and has two fields: `Id` (the key field) and `IndexedXML` (the text field).
I have a second table, `AppellationSubstitution`, that has the following columns: `TextEntry` (a name needing replacement), `Length` (the length of `TextEntry`), and `Replacement` (the replacement name, which may be of a different length). That table has about 110,000 rows.
The first step I use is the following query (the regex matches words in the text field; it looks a bit unusual because of some stray characters that show up in this database):
```sql
SELECT id,
       matchindex,
       matchlength,
       replacement
FROM   TemporaryTest
       CROSS APPLY master.dbo.Regexmatches(
           '([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?',
           [IndexedXML],
           master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0))
       INNER JOIN dbo.appellationsubstitution
               ON match = textentry
-- if replacing in forward order, the insertion point gets moved
ORDER  BY Id, MatchIndex DESC;
```
This produces a result set with over 100,000 rows; here are a few of them:
```
Id     matchindex matchlength replacement
99309  122        5           "Demarcus"
108639 106        5           "Demarcus"
109809 84         6           "Rehbein"
110373 89         7           "Reginald"
111156 105        5           "Demarcus"
112452 129        6           "Thie"
112896 113        6           "Diberardino"
112896 92         6           "Diberardino"
113503 119        3           "Rubin"
```
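To see why the descending `MatchIndex` order matters: whenever a replacement has a different length than the original name, every later character in the string shifts, so any not-yet-applied match to the right of it would point at the wrong offset. Working right-to-left leaves all earlier offsets untouched. A tiny illustration with made-up names (`STUFF` positions are 1-based):

```sql
-- 'Bob' starts at position 1, 'Ann' at position 9.
-- Replacing the right-most name first keeps position 1 valid
-- even though 'Annette' is longer than 'Ann':
SELECT STUFF(STUFF('Bob met Ann', 9, 3, 'Annette'), 1, 3, 'Robert');
-- -> 'Robert met Annette'
```

Applying the same two `STUFF` calls left-to-right would instead have to recompute the second offset after the first replacement lengthened the string.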
The full procedure I'm currently trying out is:
```sql
SET NOCOUNT ON;
SET XACT_ABORT ON;

BEGIN TRANSACTION;

DECLARE ReplaceCursor CURSOR LOCAL FOR
    SELECT id,
           matchindex,
           matchlength,
           replacement
    FROM   TemporaryTest
           CROSS APPLY master.dbo.Regexmatches(
               '([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?',
               [IndexedXML],
               master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0))
           INNER JOIN dbo.appellationsubstitution
                   ON match = textentry
    -- if replacing in forward order, the insertion point gets moved
    ORDER  BY Id, MatchIndex DESC;

DECLARE @Rid int, @Rmi int, @Rml int, @Rrep nvarchar(255);

OPEN ReplaceCursor;

FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- MatchIndex is 0-based but STUFF is 1-based, hence the +1
    UPDATE TemporaryTest
    SET    IndexedXML = STUFF([IndexedXML], @Rmi + 1, @Rml, @Rrep)
    WHERE  Id = @Rid;

    FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;
END;

CLOSE ReplaceCursor;
DEALLOCATE ReplaceCursor;

COMMIT TRANSACTION;
```
This works, but it takes a very long time to run (over an hour and still not complete), and `IndexedXML` is one of the smallest text fields I have in the production database.

I resorted to a cursor because I didn't know any other way to manage sequential `STUFF` calls on the same cell, where each subsequent `STUFF` call operates on the result of the previous one.
Am I taking the right course with this, or is there a faster/cleaner way of achieving this?
Comment: I've now run this on my full dataset and all the fields that had to be de-identified (around 10 fields). The run took 43 hours on a machine running SQL Server Standard with an Intel i7-4790, and the bottleneck was the processor (running at over 90% utilization). So any ideas about optimization would be welcome! – rsjaffe, Nov 28, 2016
1 Answer
You seem to run a separate UPDATE statement for each row of the cursor. That is inherently slow; it will likely be a lot quicker to run a single UPDATE based on a SELECT that does the required projection for all matching rows at once.
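A sketch of that idea, with one caveat: a single set-based UPDATE can apply at most one `STUFF` per row, so rows containing several matches need repeated passes. The loop below (my illustration against the question's schema, not a tested solution) applies only the right-most remaining match in each row per pass, then re-runs the match query and repeats until nothing changes. It assumes the replacement names don't themselves match any `TextEntry`; otherwise a later pass could re-substitute text that was already replaced:

```sql
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    -- Per pass, take only the right-most match in each row (rn = 1),
    -- so the offsets of matches further left stay valid.
    UPDATE t
    SET    IndexedXML = STUFF(t.IndexedXML, x.MatchIndex + 1,
                              x.MatchLength, x.Replacement)
    FROM   TemporaryTest AS t
           INNER JOIN
           (SELECT t2.Id, m.MatchIndex, m.MatchLength, s.Replacement,
                   ROW_NUMBER() OVER (PARTITION BY t2.Id
                                      ORDER BY m.MatchIndex DESC) AS rn
            FROM   TemporaryTest AS t2
                   CROSS APPLY master.dbo.Regexmatches(
                       '([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?',
                       t2.IndexedXML,
                       master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0)) AS m
                   INNER JOIN dbo.AppellationSubstitution AS s
                           ON m.match = s.TextEntry) AS x
               ON x.Id = t.Id
              AND x.rn = 1;

    SET @rows = @@ROWCOUNT;
END;
```

Each pass still scans the table through the regex function, so the number of passes equals the maximum match count in any single row; but each pass updates every qualifying row in one statement instead of one round trip per match, which removes the per-row UPDATE overhead the cursor incurs.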