The purpose of this code is to search the entire database for duplicate records and produce a script that a user could then run to delete all of the duplicates.
The stored procedure takes the following parameters:
@DiscludeTables varchar(max)
: a string of Table Names separated by semicolons. These tables are not included in the search for duplicate records and therefore will not be affected in any way by the resulting script when it is run.
@DiscludeColumns varchar(max)
: a string of Column Names separated by semicolons. These columns are not used to determine whether records are duplicates or not. Including any column here will effectively identify that records can have varying information within this column and still be identified as a duplicate.
@Output varchar(max)
: this is the string that the output will be passed to.
The output string defines the use database, a cte of duplicates and delete statement for each table in the database.
I've just started to get my hands dirty with SQL and could use some enlightenment on how to improve this script in any way.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[sp_getDatabaseWideDuplicateDeleteScript] (@DiscludeTables varchar(max), @DiscludeColumns varchar(max), @Output varchar(max) output )
AS
--START SETUP DISCLUDE TABLES AND DISCLUDE COLUMNS
--***************************************************
--this section basically just splits the strings @discludetables and @discludecolumns by ';' occurance
-- and sticks the resulting strings in to tables @@discludecolumns and @@discludeTables
--Debug Variables
--DECLARE @DiscludeTables varchar(max)
--SET @DiscludeTables = '__MigrationHistory'
--DECLARE @DiscludeColumns varchar(max)
--SET @DiscludeColumns = 'Name'
--DECLARE @Output varchar(max)
DECLARE @pos INT
DECLARE @string varchar(max)
DECLARE @@DiscludeTables TABLE
(
tableName varchar(max)
)
DECLARE @@DiscludeColumns TABLE
(
columnName varchar(max)
)
DECLARE @stringToSplit varchar(max)
SET @stringToSplit = @DiscludeTables
WHILE CHARINDEX(';', @stringToSplit) > 0
BEGIN
SELECT @pos = CHARINDEX(';', @stringToSplit)
SELECT @string = SUBSTRING(@stringToSplit, 1, @pos-1)
INSERT INTO @@DiscludeTables
SELECT @string
SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
END
IF @stringToSplit IS NOT NULL AND @stringToSplit != ''
BEGIN
INSERT INTO @@DiscludeTables
SELECT @stringToSplit
END
SET @stringToSplit = @DiscludeColumns
WHILE CHARINDEX(';', @stringToSplit) > 0
BEGIN
SELECT @pos = CHARINDEX(';', @stringToSplit)
SELECT @string = SUBSTRING(@stringToSplit, 1, @pos-1)
INSERT INTO @@DiscludeColumns
SELECT @string
SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
END
IF @stringToSplit IS NOT NULL AND @stringToSplit != ''
BEGIN
INSERT INTO @@DiscludeColumns
SELECT @stringToSplit
END
SET @Output = ''
DECLARE @DBName varchar(max)
SET @DBName = DB_NAME()
SELECT * FROM @@DiscludeColumns
SELECT * FROM @@DiscludeTables
--**********************************************
--END SETUP DISCLUDE TABLES AND DISCLUDE COLUMNS
--LOOP THROUGH ALL TABLES IN DATABASE NOT IN DISCLUDE TABLES
--***********************************
DECLARE @TableName nvarchar(256), @ColumnName nvarchar(128)
SET @TableName = ''
WHILE @TableName IS NOT NULL
BEGIN
--GET THE CURRENT TABLE
SET @TableName =
(
SELECT MIN(QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME))
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
AND QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) > @TableName
AND OBJECTPROPERTY(
OBJECT_ID(
QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME)
), 'IsMSShipped'
) = 0
AND TABLE_NAME NOT in (select tablename from @@DiscludeTables)
)
--GET THE PRIMARY KEY COLUMN FOR THIS TABLE IF ANY
--************************************************
DECLARE @PkColumnName NVARCHAR(128) =
(
SELECT top 1 ccu.COLUMN_NAME
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu ON tc.CONSTRAINT_NAME = ccu.Constraint_name
WHERE tc.CONSTRAINT_TYPE = 'Primary Key' AND '[dbo].['+TC.TABLE_NAME+']' = @TableName
)
--LOOP THROUGH ALL COLUMNS NOT IN DISCLUDE COLUMNS
--************************************************
SET @ColumnName = ''
DECLARE @columns nvarchar(max) = ' '
WHILE @ColumnName IS NOT NULL
BEGIN
SET @ColumnName =
(
SELECT MIN(QUOTENAME(COLUMN_NAME))
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2)
AND TABLE_NAME = PARSENAME(@TableName, 1)
AND QUOTENAME(COLUMN_NAME) > @ColumnName
AND COLUMN_NAME NOT in (SELECT columnname FROM @@DiscludeColumns)
)
if @columnname != '['+@pkColumnName+']'
BEGIN
if @columns != ' '
begin
set @columns = @columns + ', '
END
set @columns = @columns + 't.'+@ColumnName
END
END
if @columns != ' '
BEGIN
DECLARE @SQLSTRING varchar(max) =
'
;WITH CTE_META AS (
SELECT [TableName] = ' + ''''+@tablename+''', t.*
, [Rank] = ROW_NUMBER() OVER(
PARTITION BY ' + @columns + '
ORDER BY ' + 't.['+@pkColumnName+']
) FROM ' + @tablename + ' t
)
DELETE FROM ' + @tablename + '
WHERE ' + @pkColumnName + ' IN (SELECT cte.'+@pkColumnName+' FROM CTE_META cte WHERE cte.Rank > 1)
--SELECT cte.*
--FROM CTE_META cte
--WHERE cte.RANK > 1
'
SET @Output = @Output + @SQLSTRING
END
END
--END OF LOOP THROUGH TABLES
--**************************
if @OUTPUT IS NOT NULL AND @OUTPUT != ''
BEGIN
SET @OUTPUT =
'use [' + @DBName + ']
' + @Output
END
PRINT @OUTPUT
-
1\$\begingroup\$ Have you considered just using a couple of tables in dbo instead of comma separated strings? It would simplify the code a bit. Administrative code like this is why I don't allow application objects in dbo. Reserving the schema makes sure the tables you could be using here don't "get lost in the shuffle". \$\endgroup\$RubberDuck– RubberDuck2015年09月16日 00:28:00 +00:00Commented Sep 16, 2015 at 0:28
-
1\$\begingroup\$ This script assumes that all tables to be checked have no more than one primary key column. Is that guaranteed? \$\endgroup\$joranvar– joranvar2015年12月19日 16:06:38 +00:00Commented Dec 19, 2015 at 16:06
-
1\$\begingroup\$ Also, which (minimal) version of MSSQL are you running this on? At least since 2008 you could pass table valued parameters instead of strings which need to be split. \$\endgroup\$joranvar– joranvar2015年12月19日 16:08:50 +00:00Commented Dec 19, 2015 at 16:08
2 Answers 2
You started out with all your keywords being capitalized and then you abandoned your capitalization haphazardly throughout the script, this makes it a little hard to read because your brain says that keywords are capitalized and then all of a sudden you see a begin
and your brain says,
"hold up a second, no that's a keyword, never mind"
make sure that all your keywords are capitalized.
Here is a good example
if @columnname != '['+@pkColumnName+']' BEGIN if @columns != ' ' begin set @columns = @columns + ', ' END set @columns = @columns + 't.'+@ColumnName END
This also brings up something else that can really go either way, but I will explain my reasoning.
Your if statements should be indented, and this is how I would have written it,
IF @columnname != '['+@pkColumnName+']'
BEGIN
IF @columns != ' '
BEGIN
SET @columns = @columns + ', '
END
SET @columns = @columns + 't.'+@ColumnName
END
I know that is a lot of indentation for an if statement nested inside of an if statement, but I always want to make certain I know exactly what my code is doing, especially when writing SQL Stored Procedures that are going to be creating DELETE
scripts.
I may also consider doing it this way to save on indentation
IF @columnname != '['+@pkColumnName+']'
BEGIN
IF @columns != ' '
BEGIN
SET @columns = @columns + ', '
END
SET @columns = @columns + 't.'+@ColumnName
END
I also indent the SELECT
statements a little differently as well, and if I recall correctly, I think that many other SQL writers do it the same way that I do.
I would take this,
SELECT MIN(QUOTENAME(COLUMN_NAME)) FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2) AND TABLE_NAME = PARSENAME(@TableName, 1) AND QUOTENAME(COLUMN_NAME) > @ColumnName AND COLUMN_NAME NOT in (SELECT columnname FROM @@DiscludeColumns)
and indent everything after the SELECT
like this,
SELECT MIN(QUOTENAME(COLUMN_NAME))
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2)
AND TABLE_NAME = PARSENAME(@TableName, 1)
AND QUOTENAME(COLUMN_NAME) > @ColumnName
AND COLUMN_NAME NOT IN (SELECT columnname
FROM @@DiscludeColumns)
And the last and most important thing that I could advise is that you put intentional commenting on the DELETE
portion of the script and uncomment the SELECT
portion of the rendered script, so that you always run a SELECT
to see what you are planning on deleting, before you delete it.
Aside from Malachi's remarks about capitalization of keywords and indentation of the code, I would add two points about variable naming:
DON'T use variable names that start with
@@
, as they are reserved for T-SQL functions. This would now give a conflict between the parameter@DiscludeTables
and the private table variable@@DiscludeTables
, but I would suggest renaming the parameter to@tablesToDisclude
.DO be consistent in capitalization of variable names. It is customary to use camelCasing, so change
@TableName
to@tableName
, for instance. I would do the same for parameter names.
Some remarks about clean habits in SQL code:
Use
nvarchar
when passing strings that are meant to reference dabatase object names. (Actually, it's a good idea to considernvarchar
for most values that will contain user-entered data which could easily need to include unicode characters, but that is outside the scope of this script).Use semicolons to end SQL statements. In your script you add a semicolon before the
WITH
for the CTE, because it needs to be separated from the statement preceding it. Always using semicolons keeps the code sane for future versions.You don't need to check whether something
IS NULL
if you are also comparing it against something else. For instance:if @OUTPUT IS NOT NULL AND @OUTPUT != ''
could be writtenIF @OUTPUT != ''
, because if@OUTPUT
isNULL
,@OUTPUT
compared to anything will always be false.IS NOT IN
will only work for single column comparisons. When comparing two sets of data against each other, prefer to do aLEFT JOIN
and check whether the right sideIS NULL
:LEFT JOIN @@DiscludeColumns dc ON dc.columnName = c.COLUMN_NAME WHERE dc.columnName IS NULL
will yield the same result, and is extensible to more columns (if you would, for instance, exclude a column only if both column name AND table name are equal).
Write a simple
JOIN
as inINNER JOIN
explicitly. This may be a matter of taste, but once you go using multipleLEFT
andRIGHT JOIN
s, this makes theINNER
stand out.
Now, on to specific things in this script. In the beginning of the stored procedure, you have some duplicate code. Splitting the incoming parameters into multiple values. It is not always easy to remove duplicated code in SQL, but in this case, a table valued function makes most sense:
CREATE FUNCTION [dbo].[fn_splitString] (@stringToSplit nvarchar(max), @separator nchar(1))
RETURNS @output TABLE (i int identity(0,1), s nvarchar(max))
AS
BEGIN
WHILE CHARINDEX(@separator, @stringToSplit) > 0
BEGIN
DECLARE @pos INT;
SELECT @pos = CHARINDEX(@separator, @stringToSplit);
INSERT INTO @output(s) SELECT SUBSTRING(@stringToSplit, 1, @pos-1);
SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos);
END
IF @stringToSplit != ''
BEGIN
INSERT INTO @output SELECT @stringToSplit;
END
RETURN
END
GO
This function will take two inputs, the @stringToSplit
, which you already use, and the @separator
, for which you'd just enter the ';'
. This function returns a table with two columns, namely i
, which is an identity column, in case you would like to remember the order in which the strings were given, and s
, the strings themselves. The rest of the function is actually quite unchanged from the code you used, except that it stores the results in @output
instead of e.g. @@DiscludeTables
. You would call it in the stored procedure as such:
INSERT @@DiscludeTables (tableName)
SELECT s FROM [dbo].[fn_splitString] (@DiscludeTables, N';');
INSERT @@DiscludeColumns (columnName)
SELECT s FROM [dbo].[fn_splitString] (@DiscludeColumns, N';');
The last thing I would advise you, is to avoid loops if possible in SQL code, and use set based operations such as queries with joins. SQL Server was made to do set based operations, and you should take advantage of that.
For instance, instead of querying the INFORMATION_SCHEMA for each table to get the name, then querying it again to get the primary key column (only one?) and then repeatedly querying it to get the names of all the columns that you don't want to ignore, consider a larger query that takes all that information for all tables at once:
DECLARE @COLUMN_INFORMATION TABLE (tableName sysname, columnName sysname, isPrimaryKey bit, primary key (tableName, columnName));
INSERT @COLUMN_INFORMATION (tableName, columnName, isPrimaryKey)
SELECT [tableName] = (QUOTENAME(t.TABLE_SCHEMA) + N'.' + QUOTENAME(t.TABLE_NAME))
, [columnName] = c.COLUMN_NAME
, [isPrimaryKey] = CASE WHEN tc.TABLE_NAME IS NULL THEN 0 ELSE 1 END
FROM INFORMATION_SCHEMA.TABLES t
INNER JOIN INFORMATION_SCHEMA.COLUMNS c
ON c.TABLE_NAME = t.TABLE_NAME
AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
LEFT JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu
ON tc.CONSTRAINT_NAME = ccu.Constraint_name
ON tc.CONSTRAINT_TYPE = 'Primary Key'
AND tc.TABLE_NAME = t.TABLE_NAME
AND tc.TABLE_SCHEMA = t.TABLE_SCHEMA
AND c.COLUMN_NAME = ccu.COLUMN_NAME
LEFT JOIN @@DiscludeTables dt
ON dt.tableName = t.TABLE_NAME
LEFT JOIN @@DiscludeColumns dc
ON dc.columnName = c.COLUMN_NAME
WHERE t.TABLE_TYPE = 'BASE TABLE'
AND OBJECTPROPERTY( OBJECT_ID( QUOTENAME(t.TABLE_SCHEMA) + N'.' + QUOTENAME(t.TABLE_NAME) ) , 'IsMSShipped' ) = 0
AND dt.tableName IS NULL
AND dc.columnName IS NULL;
(This is not the most usual indentation style for SQL, by the way, so ignore it if it's not your taste).
This will return a table containing all non-ignored table names, combined with all non-ignored (and non-primary-key) column names and a bit that says whether that column is a primary key.
Now we have all the information, we can group the information per table. We want only one script for each table, so we need to combine the column names, which is something that SQL does not yet know how to do (there is no CONCAT
function that would work like a MAX
or SUM
). In T-SQL, we can use FOR XML PATH
though, to simulate that behavior:
DECLARE @GROUPED_COLUMN_INFORMATION TABLE (tableName sysname primary key, columns nvarchar(max), primaryColumns nvarchar(max), primaryColumnsCompare nvarchar(max));
INSERT @GROUPED_COLUMN_INFORMATION (tableName, columns, primaryColumns, primaryColumnsCompare)
SELECT [tableName] = ci.tableName
, [columns] =
( STUFF((SELECT N', t.' + columnName
FROM @COLUMN_INFORMATION c2
WHERE ci.tableName = c2.tableName
AND c2.isPrimaryKey = 0
FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
)
, [primaryColumns] =
( STUFF((SELECT N', t.' + columnName
FROM @COLUMN_INFORMATION c2
WHERE ci.tableName = c2.tableName
AND c2.isPrimaryKey = 1
FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
)
, [primaryColumnsCompare] =
( N'ON ' + STUFF((SELECT N' AND t.' + columnName + N' = cte.' + columnName
FROM @COLUMN_INFORMATION c2
WHERE ci.tableName = c2.tableName
AND c2.isPrimaryKey = 1
FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 5, N'')
)
FROM @COLUMN_INFORMATION ci
GROUP BY ci.tableName;
While grouping the information, we also prepare some of the strings that we need in the script to output. For instance, in [columns]
, we combine all the column names from the @COLUMN_INFORMATION
table that are not a primary key, and we join them with the text ', t.'
, so that they are separated by commas and show up as t.columnName
. The STUFF(nvarchar, first, last, N'')
function makes sure that positions first
through last
are replaced with the empty string (N''
). This gets rid of the comma in the beginning.
In [primaryColumnsCompare]
, all primary columns are listed twice, showing up as AND t.primaryColumn = cte.primaryColumn
. STUFF
removes the inital ' AND '
part for us.
Finally, now we have combined all information into a table with one entry per table to check, we generate the script:
SELECT @output = '';
SELECT @output += '
;WITH CTE_META AS (
SELECT [TableName] = ' + ''''+ gci.[tablename] +''', t.*
, [Rank] = ROW_NUMBER() OVER(
PARTITION BY ' + gci.[columns] + '
ORDER BY ' + gci.[primaryColumns] + '
) FROM ' + gci.[tablename] + ' t
)
SELECT cte.*
--DELETE t
FROM ' + gci.[tablename] + ' t
INNER JOIN CTE_META cte ' + gci.[primaryColumnsCompare] + '
WHERE cte.Rank > 1;
'
FROM @GROUPED_COLUMN_INFORMATION gci;
IF @output != ''
BEGIN
SELECT @output = 'USE [' + db_name() + ']' + @output;
PRINT @output;
END
(In this case, I didn't bother with the FOR XML PATH
, because we are summing up one single string. Although the behavior of SELECT @output += '...'
is not guaranteed to work with future versions of SQL Server, so you may want to do an extra step to combine it using a FOR XML PATH
.)
Also, the output script changed a bit: the DELETE
is commented out in favor of the SELECT
to verify which data you are deleting. And also here, the IS IN
has been replaced by a simple INNER JOIN
.