I have a table with over 1,000,000 records. I need to replace any names in the text fields with aliases to help de-identify the data. For this example, let's assume the table is `TemporaryTest` and has two fields: `Id` (the key field) and `IndexedXML` (the text field).
I have a second table, `AppellationSubstitution`, that has the following columns: `TextEntry` (a name needing replacement), `Length` (the length of `TextEntry`), and `Replacement` (the replacement name, which may be of a different length). That table has about 110,000 rows.
The first step I use is the following query (the regex matches words in the text field; it looks a bit unusual because of some stray characters that show up in this database):
```sql
SELECT id,
       matchindex,
       matchlength,
       replacement
FROM   TemporaryTest
       CROSS APPLY master.dbo.Regexmatches(
           '([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?',
           [IndexedXML],
           master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0))
       INNER JOIN dbo.appellationsubstitution
               ON match = textentry
-- if replacing in forward order, the insertion point gets moved
ORDER  BY Id, MatchIndex DESC;
```
This produces a result set with over 100,000 rows; here are a few of them:
```
Id     matchindex matchlength replacement
99309  122        5           "Demarcus"
108639 106        5           "Demarcus"
109809 84         6           "Rehbein"
110373 89         7           "Reginald"
111156 105        5           "Demarcus"
112452 129        6           "Thie"
112896 113        6           "Diberardino"
112896 92         6           "Diberardino"
113503 119        3           "Rubin"
```
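To see why the descending `MatchIndex` order matters: whenever a replacement has a different length than the original name, every later character in the string shifts, so any not-yet-applied match to the right of it would point at the wrong offset. Working right-to-left leaves all earlier offsets untouched. A tiny illustration with made-up names (`STUFF` positions are 1-based):

```sql
-- 'Bob' starts at position 1, 'Ann' at position 9.
-- Replacing the right-most name first keeps position 1 valid
-- even though 'Annette' is longer than 'Ann':
SELECT STUFF(STUFF('Bob met Ann', 9, 3, 'Annette'), 1, 3, 'Robert');
-- -> 'Robert met Annette'
```

Applying the same two `STUFF` calls left-to-right would instead have to recompute the second offset after the first replacement lengthened the string.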
The full procedure I'm currently trying out is:
```sql
SET NOCOUNT ON;
SET XACT_ABORT ON;

BEGIN TRANSACTION;

DECLARE ReplaceCursor CURSOR LOCAL FOR
    SELECT id,
           matchindex,
           matchlength,
           replacement
    FROM   TemporaryTest
           CROSS APPLY master.dbo.Regexmatches(
               '([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?',
               [IndexedXML],
               master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0))
           INNER JOIN dbo.appellationsubstitution
                   ON match = textentry
    -- if replacing in forward order, the insertion point gets moved
    ORDER  BY Id, MatchIndex DESC;

DECLARE @Rid int, @Rmi int, @Rml int, @Rrep nvarchar(255);

OPEN ReplaceCursor;

FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- MatchIndex is 0-based but STUFF is 1-based, hence the +1
    UPDATE TemporaryTest
    SET    IndexedXML = STUFF([IndexedXML], @Rmi + 1, @Rml, @Rrep)
    WHERE  Id = @Rid;

    FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;
END;

CLOSE ReplaceCursor;
DEALLOCATE ReplaceCursor;

COMMIT TRANSACTION;
```
This works, but it takes a very long time to run (over an hour and still not complete), and `IndexedXML` is one of the smallest text fields I have in the production database.

I resorted to a cursor because I didn't know any other way to manage sequential `STUFF` calls on the same cell, where each subsequent `STUFF` call operates on the result of the previous one.
Am I taking the right course with this, or is there a faster/cleaner way of achieving this?
Comment: I've now run this on my full dataset and all the fields that had to be de-identified (around 10 fields). The run took 43 hours on a machine running SQL Server Standard with an Intel i7-4790, and the bottleneck was the processor (running at over 90% utilization). So any ideas about optimization would be welcome! – rsjaffe, Nov 28, 2016
1 Answer
You seem to run a separate UPDATE statement for each row of the cursor. That is inherently slow; it will likely be a lot quicker to run a single UPDATE based on a SELECT that does the required projection for all matching rows at once.
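A sketch of that idea, with one caveat: a single set-based UPDATE can apply at most one `STUFF` per row, so rows containing several matches need repeated passes. The loop below (my illustration against the question's schema, not a tested solution) applies only the right-most remaining match in each row per pass, then re-runs the match query and repeats until nothing changes. It assumes the replacement names don't themselves match any `TextEntry`; otherwise a later pass could re-substitute text that was already replaced:

```sql
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    -- Per pass, take only the right-most match in each row (rn = 1),
    -- so the offsets of matches further left stay valid.
    UPDATE t
    SET    IndexedXML = STUFF(t.IndexedXML, x.MatchIndex + 1,
                              x.MatchLength, x.Replacement)
    FROM   TemporaryTest AS t
           INNER JOIN
           (SELECT t2.Id, m.MatchIndex, m.MatchLength, s.Replacement,
                   ROW_NUMBER() OVER (PARTITION BY t2.Id
                                      ORDER BY m.MatchIndex DESC) AS rn
            FROM   TemporaryTest AS t2
                   CROSS APPLY master.dbo.Regexmatches(
                       '([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?',
                       t2.IndexedXML,
                       master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0)) AS m
                   INNER JOIN dbo.AppellationSubstitution AS s
                           ON m.match = s.TextEntry) AS x
               ON x.Id = t.Id
              AND x.rn = 1;

    SET @rows = @@ROWCOUNT;
END;
```

Each pass still scans the table through the regex function, so the number of passes equals the maximum match count in any single row; but each pass updates every qualifying row in one statement instead of one round trip per match, which removes the per-row UPDATE overhead the cursor incurs.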