I would like to obfuscate (scramble) sensitive data from a SQL Server database in a way that provides:
- irreversibility (the plaintext can't be derived from the obfuscated data),
- same length (the obfuscated data must be exactly as long as the data before obfuscation),
- no uniqueness requirement (the obfuscated value does not need to differ across repeated obfuscations of the same input value). To be honest, I would rather get the same value for the same input, since that can be useful (e.g. for matching data in different tables, probably handy in test cases).
Example:
Abc -> zyx (length: 3)
StackOverflow -> a65vr4doqjd (length: 11)
I usually avoid "home-made" algorithms, so are you aware of a built-in Microsoft solution that could provide this kind of obfuscation?
I hope I expressed my problem clearly, otherwise let me know and I'll try to add as much info as needed.
1 Answer
No, I am not aware of any built-in function that does exactly this. But you can still accomplish it without doing anything too complicated.
You could use the built-in CRYPT_GEN_RANDOM function (introduced in SQL Server 2008), which returns the requested number of cryptographically random bytes. Converting that binary value to a string with style 2 represents each byte as two hexadecimal characters, so requesting half the input length plus one byte always yields at least as many characters as the input (hence the / 2 + 1 part below).
DECLARE @InputString NVARCHAR(4000) = 'hello';

SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];

SET @InputString = 'test';

SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];
Returns something along the lines of:
8C108
9A7A
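To make the / 2 + 1 arithmetic concrete, here is a small sketch that lists, for a few input lengths, how many bytes the expression requests and how many hex characters those bytes produce. The derived table v(n) and the column aliases are names made up for this illustration.

```sql
-- For each input length n, show how many bytes (n / 2) + 1 requests
-- and how many hex characters that yields (two characters per byte).
-- n is an INT, so the division is integer division.
SELECT v.n                 AS [InputLength],
       (v.n / 2) + 1       AS [BytesRequested],
       ((v.n / 2) + 1) * 2 AS [HexCharsAvailable]
FROM (VALUES (1), (2), (3), (4), (5), (6)) v(n)
ORDER BY v.n;
```

For 'hello' (length 5) this requests 3 bytes, which convert to 6 hex characters, and SUBSTRING then trims the result to 5. The number of available characters is never smaller than the input length.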
The only real downside here is that this needs to be done inline, as CRYPT_GEN_RANDOM cannot be used in a User-Defined Function (UDF: Scalar or Table-Valued). However, it can still be applied in a set-based approach using a CTE as shown here (just set @MaxLength to the max length of the column being obfuscated):
DECLARE @MaxLength INT = 10;

;WITH cte AS
(
    SELECT CONVERT(VARCHAR(8000),
                   CRYPT_GEN_RANDOM((@MaxLength / 2) + 1),
                   2) AS [Random]
)
SELECT tmp.[String],
       cte.[Random],
       SUBSTRING(cte.[Random], 1, LEN(tmp.[String])) AS [Obfuscated]
FROM (VALUES (N'test'), (N'Hello')) tmp(String)
CROSS JOIN cte;
Returns something along the lines of:
String  Random        Obfuscated
------  ------------  ----------
test    F99B3888F993  F99B
Hello   D3250E74F0A3  D3250
As you can see, CRYPT_GEN_RANDOM returns a different value for each row.
Also, I am not sure whether this is acceptable, but because the output is hexadecimal, the only alpha characters returned are A - F (alongside the digits 0 - 9).
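The same inline expression can also be applied directly in an UPDATE to scramble a column in place. The following is only a sketch: the @Customers table variable, its columns, and the sample names are invented for the example, and on a real table you would want to try it on a copy first.

```sql
-- Hypothetical sample data; a table variable keeps the example self-contained.
DECLARE @Customers TABLE
(
    Id       INT IDENTITY(1, 1) PRIMARY KEY,
    LastName NVARCHAR(50)
);

INSERT INTO @Customers (LastName)
VALUES (N'Smith'), (N'Johnson'), (N'Li');

-- Both LEN(LastName) references see the value from before the update,
-- so each row is replaced by random hex characters of the same length.
UPDATE @Customers
SET    LastName = SUBSTRING(CONVERT(VARCHAR(8000),
                                    CRYPT_GEN_RANDOM((LEN(LastName) / 2) + 1),
                                    2),
                            1,
                            LEN(LastName));

SELECT Id,
       LastName,
       LEN(LastName) AS [Length]
FROM   @Customers;
```

Because the values are random, the same original value in two different rows or tables will generally end up with different obfuscated values, so this variant does not preserve matches between tables.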
OR, if you want the obfuscation to be repeatable for the same input value (or at least don't mind it being repeatable), and would prefer that this code be in a function so that it is easier to apply to multiple columns, you can use the HASHBYTES function which, like CRYPT_GEN_RANDOM, returns binary bytes that convert to hex characters. Unlike CRYPT_GEN_RANDOM, the output length is fixed (in this case at 64 characters, since I am using SHA2_256, which returns 32 bytes), so I used REPLICATE to repeat the hashed value if the length of the input string is more than 64 characters. Also unlike CRYPT_GEN_RANDOM, HASHBYTES can be used in a User-Defined Function (UDF) :-).
CREATE FUNCTION dbo.Obfuscate(@InputString NVARCHAR(4000))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
    SELECT SUBSTRING(REPLICATE(CONVERT(VARCHAR(8000),
                                       HASHBYTES('SHA2_256', @InputString),
                                       2),
                               (LEN(@InputString) / 64) + 1),
                     1,
                     LEN(@InputString)) AS [Obfuscated];
GO
And that can be used as follows:
SELECT tmp.[String],
       LEN(tmp.[String]) AS [InputLength],
       ob.[Obfuscated],
       LEN(ob.[Obfuscated]) AS [OutputLength]
FROM (VALUES (N'test'), (N'Hello'), (REPLICATE(N'A', 63)),
             (REPLICATE(N'B', 64)), (REPLICATE(N'C', 65)),
             (REPLICATE(N'D', 4000))) tmp(String)
CROSS APPLY dbo.Obfuscate(tmp.[String]) ob;
Returns something along the lines of:
String                     InputLength  Obfuscated                  OutputLength
-------------------------  -----------  --------------------------  ------------
test                       4            FE52                        4
Hello                      5            A07E4                       5
AAAAAAAAAAAAAAAAAAAAAA...  63           4B589C85DE74E76487730F3...  63
BBBBBBBBBBBBBBBBBBBBBB...  64           79813FB6480F354F1C6017A...  64
CCCCCCCCCCCCCCCCCCCCCC...  65           FB4B38FBA41ECC24B5B0F68...  65
DDDDDDDDDDDDDDDDDDDDDD...  4000         5D01CC6508C164E652B5C77...  4000
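Since the question mentions matching data across tables, here is a sketch of how the repeatable variant might be used for that. It assumes dbo.Obfuscate has already been created as shown above; the @Customers and @Orders table variables, their columns, and the e-mail addresses are hypothetical.

```sql
-- Assumes dbo.Obfuscate (defined earlier in this answer) already exists.
-- The tables and values below are made up for illustration.
DECLARE @Customers TABLE (CustomerId INT PRIMARY KEY, Email NVARCHAR(100));
DECLARE @Orders    TABLE (OrderId INT PRIMARY KEY, CustomerEmail NVARCHAR(100));

INSERT INTO @Customers (CustomerId, Email)
VALUES (1, N'alice@example.com'), (2, N'bob@example.com');

INSERT INTO @Orders (OrderId, CustomerEmail)
VALUES (10, N'alice@example.com'), (11, N'bob@example.com'), (12, N'alice@example.com');

-- Obfuscate the same logical value in both tables.
UPDATE c
SET    c.Email = ob.[Obfuscated]
FROM   @Customers c
CROSS APPLY dbo.Obfuscate(c.Email) ob;

UPDATE o
SET    o.CustomerEmail = ob.[Obfuscated]
FROM   @Orders o
CROSS APPLY dbo.Obfuscate(o.CustomerEmail) ob;

-- Equal inputs produced equal outputs, so the join still finds its matches.
SELECT o.OrderId,
       c.CustomerId,
       c.Email AS [ObfuscatedEmail]
FROM   @Orders o
INNER JOIN @Customers c
        ON c.Email = o.CustomerEmail;
```

One thing to keep in mind: HASHBYTES works on the exact bytes of its input, so values that differ only in case or trailing spaces hash differently even when the column collation would treat them as equal.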
PLEASE NOTE: If you need alpha characters beyond A - F and/or need to have distinct obfuscated values for distinct input values (i.e. reduce chances of collisions), then either method above can be adapted easily enough to do that.
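As one possible adaptation (a sketch, not the only way to do it), the hash can be Base64-encoded instead of hex-encoded, which widens the alphabet to A - Z, a - z, 0 - 9, + and /. The XQuery-based Base64 conversion, the variable names, and the choice to drop the trailing padding character are assumptions of this example and not part of the methods above.

```sql
DECLARE @InputString NVARCHAR(4000) = N'StackOverflow';

-- SHA2_256 always returns 32 bytes.
DECLARE @Hash VARBINARY(32) = HASHBYTES('SHA2_256', @InputString);

-- Base64-encode the hash via XQuery: 32 bytes become 44 characters,
-- the last of which is '=' padding.
DECLARE @Base64 VARCHAR(44);
SET @Base64 = CAST(N'' AS XML).value('xs:base64Binary(sql:variable("@Hash"))',
                                     'VARCHAR(44)');

-- Drop the padding character, repeat the remaining 43 characters as often
-- as needed, and trim the result to the length of the input.
SELECT SUBSTRING(REPLICATE(LEFT(@Base64, 43),
                           (LEN(@InputString) / 43) + 1),
                 1,
                 LEN(@InputString)) AS [Obfuscated];
```

Each output character now carries 6 bits of the hash rather than 4, which lowers the chance of two short inputs colliding. Base64 is case-sensitive, though, so under a case-insensitive collation two outputs that differ only in letter case would still compare as equal.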