Find & Delete Duplicate Records on All Tables in current MSSQL Database

Question 1

The purpose of this code is to search the entire database for duplicate records and produce a script that a user could then run to delete all of the duplicates.

The stored procedure takes the following parameters:

@DiscludeTables varchar(max): a string of Table Names separated by semicolons. These tables are not included in the search for duplicate records and therefore will not be affected in any way by the resulting script when it is run.

@DiscludeColumns varchar(max): a string of Column Names separated by semicolons. These columns are not used to determine whether records are duplicates or not. Including any column here will effectively identify that records can have varying information within this column and still be identified as a duplicate.

@Output varchar(max): this is the string that the output will be passed to.

The output string defines the use database, a cte of duplicates and delete statement for each table in the database.

I've just started to get my hands dirty with SQL and could use some enlightenment on how to improve this script in any way.

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[sp_getDatabaseWideDuplicateDeleteScript] (@DiscludeTables varchar(max), @DiscludeColumns varchar(max), @Output varchar(max) output )
AS
--START SETUP DISCLUDE TABLES AND DISCLUDE COLUMNS
--***************************************************
--this section basically just splits the strings @discludetables and @discludecolumns by ';' occurance
-- and sticks the resulting strings in to tables @@discludecolumns and @@discludeTables
--Debug Variables
--DECLARE @DiscludeTables varchar(max)
--SET @DiscludeTables = '__MigrationHistory'
--DECLARE @DiscludeColumns varchar(max)
--SET @DiscludeColumns = 'Name'
--DECLARE @Output varchar(max)
DECLARE @pos INT
DECLARE @string varchar(max)
DECLARE @@DiscludeTables TABLE
(
 tableName varchar(max)
)
DECLARE @@DiscludeColumns TABLE
(
 columnName varchar(max)
)
DECLARE @stringToSplit varchar(max)
SET @stringToSplit = @DiscludeTables
WHILE CHARINDEX(';', @stringToSplit) > 0
BEGIN
 SELECT @pos = CHARINDEX(';', @stringToSplit) 
 SELECT @string = SUBSTRING(@stringToSplit, 1, @pos-1)
 INSERT INTO @@DiscludeTables 
 SELECT @string
 SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
END
IF @stringToSplit IS NOT NULL AND @stringToSplit != ''
BEGIN
 INSERT INTO @@DiscludeTables
 SELECT @stringToSplit
END
SET @stringToSplit = @DiscludeColumns
WHILE CHARINDEX(';', @stringToSplit) > 0
BEGIN
 SELECT @pos = CHARINDEX(';', @stringToSplit) 
 SELECT @string = SUBSTRING(@stringToSplit, 1, @pos-1)
 INSERT INTO @@DiscludeColumns
 SELECT @string
 SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
END
IF @stringToSplit IS NOT NULL AND @stringToSplit != ''
BEGIN
 INSERT INTO @@DiscludeColumns
 SELECT @stringToSplit
END
SET @Output = ''
DECLARE @DBName varchar(max)
SET @DBName = DB_NAME()
SELECT * FROM @@DiscludeColumns
SELECT * FROM @@DiscludeTables
--**********************************************
--END SETUP DISCLUDE TABLES AND DISCLUDE COLUMNS
--LOOP THROUGH ALL TABLES IN DATABASE NOT IN DISCLUDE TABLES
--***********************************
DECLARE @TableName nvarchar(256), @ColumnName nvarchar(128)
SET @TableName = ''
WHILE @TableName IS NOT NULL
BEGIN
 --GET THE CURRENT TABLE
 SET @TableName = 
 (
 SELECT MIN(QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME))
 FROM INFORMATION_SCHEMA.TABLES
 WHERE TABLE_TYPE = 'BASE TABLE'
 AND QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) > @TableName
 AND OBJECTPROPERTY(
 OBJECT_ID(
 QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME)
 ), 'IsMSShipped'
 ) = 0
 AND TABLE_NAME NOT in (select tablename from @@DiscludeTables)
 )
 --GET THE PRIMARY KEY COLUMN FOR THIS TABLE IF ANY
 --************************************************
 DECLARE @PkColumnName NVARCHAR(128) = 
 (
 SELECT top 1 ccu.COLUMN_NAME
 FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
 JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu ON tc.CONSTRAINT_NAME = ccu.Constraint_name
 WHERE tc.CONSTRAINT_TYPE = 'Primary Key' AND '[dbo].['+TC.TABLE_NAME+']' = @TableName
 )
 --LOOP THROUGH ALL COLUMNS NOT IN DISCLUDE COLUMNS
 --************************************************
 SET @ColumnName = ''
 DECLARE @columns nvarchar(max) = ' '
 WHILE @ColumnName IS NOT NULL
 BEGIN
 SET @ColumnName = 
 (
 SELECT MIN(QUOTENAME(COLUMN_NAME))
 FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2)
 AND TABLE_NAME = PARSENAME(@TableName, 1)
 AND QUOTENAME(COLUMN_NAME) > @ColumnName
 AND COLUMN_NAME NOT in (SELECT columnname FROM @@DiscludeColumns)
 )
 if @columnname != '['+@pkColumnName+']'
 BEGIN
 if @columns != ' '
 begin
 set @columns = @columns + ', '
 END 
 set @columns = @columns + 't.'+@ColumnName
 END 
 END
 if @columns != ' '
 BEGIN
 DECLARE @SQLSTRING varchar(max) =
 '
 ;WITH CTE_META AS (
 SELECT [TableName] = ' + ''''+@tablename+''', t.*
 , [Rank] = ROW_NUMBER() OVER(
 PARTITION BY ' + @columns + ' 
 ORDER BY ' + 't.['+@pkColumnName+']
 ) FROM ' + @tablename + ' t
 ) 
 DELETE FROM ' + @tablename + '
 WHERE ' + @pkColumnName + ' IN (SELECT cte.'+@pkColumnName+' FROM CTE_META cte WHERE cte.Rank > 1)
 --SELECT cte.* 
 --FROM CTE_META cte
 --WHERE cte.RANK > 1
 '
 SET @Output = @Output + @SQLSTRING
 END
END
--END OF LOOP THROUGH TABLES
--**************************
if @OUTPUT IS NOT NULL AND @OUTPUT != ''
BEGIN
 SET @OUTPUT = 
 'use [' + @DBName + ']
 ' + @Output
END 
PRINT @OUTPUT

Question 2

Have you considered just using a couple of tables in dbo instead of comma separated strings? It would simplify the code a bit. Administrative code like this is why I don't allow application objects in dbo. Reserving the schema makes sure the tables you could be using here don't "get lost in the shuffle".

Question 3

This script assumes that all tables to be checked have no more than one primary key column. Is that guaranteed?

Question 4

Also, which (minimal) version of MSSQL are you running this on? At least since 2008 you could pass table valued parameters instead of strings which need to be split.

Question 5

You started out with all your keywords being capitalized and then you abandoned your capitalization haphazardly throughout the script, this makes it a little hard to read because your brain says that keywords are capitalized and then all of a sudden you see a begin and your brain says,

"hold up a second, no that's a keyword, never mind"

make sure that all your keywords are capitalized.

Here is a good example

 if @columnname != '['+@pkColumnName+']'
 BEGIN
 if @columns != ' '
 begin
 set @columns = @columns + ', '
 END 
 set @columns = @columns + 't.'+@ColumnName
 END

This also brings up something else that can really go either way, but I will explain my reasoning.

Your if statements should be indented, and this is how I would have written it,

IF @columnname != '['+@pkColumnName+']'
 BEGIN
 IF @columns != ' '
 BEGIN
 SET @columns = @columns + ', '
 END 
 SET @columns = @columns + 't.'+@ColumnName
 END

I know that is a lot of indentation for an if statement nested inside of an if statement, but I always want to make certain I know exactly what my code is doing, especially when writing SQL Stored Procedures that are going to be creating DELETE scripts.

I may also consider doing it this way to save on indentation

IF @columnname != '['+@pkColumnName+']'
BEGIN
 IF @columns != ' '
 BEGIN
 SET @columns = @columns + ', '
 END 
 SET @columns = @columns + 't.'+@ColumnName
END

I also indent the SELECT statements a little differently as well, and if I recall correctly, I think that many other SQL writers do it the same way that I do.

I would take this,

SELECT MIN(QUOTENAME(COLUMN_NAME))
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2)
 AND TABLE_NAME = PARSENAME(@TableName, 1)
 AND QUOTENAME(COLUMN_NAME) > @ColumnName
 AND COLUMN_NAME NOT in (SELECT columnname FROM @@DiscludeColumns)

and indent everything after the SELECT like this,

SELECT MIN(QUOTENAME(COLUMN_NAME))
 FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2)
 AND TABLE_NAME = PARSENAME(@TableName, 1)
 AND QUOTENAME(COLUMN_NAME) > @ColumnName
 AND COLUMN_NAME NOT IN (SELECT columnname 
 FROM @@DiscludeColumns)

And the last and most important thing that I could advise is that you put intentional commenting on the DELETE portion of the script and uncomment the SELECT portion of the rendered script, so that you always run a SELECT to see what you are planning on deleting, before you delete it.

Question 6

Aside from Malachi's remarks about capitalization of keywords and indentation of the code, I would add two points about variable naming:

DON'T use variable names that start with @@, as they are reserved for T-SQL functions. This would now give a conflict between the parameter @DiscludeTables and the private table variable @@DiscludeTables, but I would suggest renaming the parameter to @tablesToDisclude.
DO be consistent in capitalization of variable names. It is customary to use camelCasing, so change @TableName to @tableName, for instance. I would do the same for parameter names.

Some remarks about clean habits in SQL code:

Use nvarchar when passing strings that are meant to reference dabatase object names. (Actually, it's a good idea to consider nvarchar for most values that will contain user-entered data which could easily need to include unicode characters, but that is outside the scope of this script).
Use semicolons to end SQL statements. In your script you add a semicolon before the WITH for the CTE, because it needs to be separated from the statement preceding it. Always using semicolons keeps the code sane for future versions.
You don't need to check whether something IS NULL if you are also comparing it against something else. For instance: if @OUTPUT IS NOT NULL AND @OUTPUT != '' could be written IF @OUTPUT != '', because if @OUTPUT is NULL, @OUTPUT compared to anything will always be false.
IS NOT IN will only work for single column comparisons. When comparing two sets of data against each other, prefer to do a LEFT JOIN and check whether the right side IS NULL:
```
 LEFT JOIN @@DiscludeColumns dc
 ON dc.columnName = c.COLUMN_NAME
WHERE dc.columnName IS NULL
```
will yield the same result, and is extensible to more columns (if you would, for instance, exclude a column only if both column name AND table name are equal).
Write a simple JOIN as in INNER JOIN explicitly. This may be a matter of taste, but once you go using multiple LEFT and RIGHT JOINs, this makes the INNER stand out.

Now, on to specific things in this script. In the beginning of the stored procedure, you have some duplicate code. Splitting the incoming parameters into multiple values. It is not always easy to remove duplicated code in SQL, but in this case, a table valued function makes most sense:

CREATE FUNCTION [dbo].[fn_splitString] (@stringToSplit nvarchar(max), @separator nchar(1))
RETURNS @output TABLE (i int identity(0,1), s nvarchar(max))
AS
BEGIN
 WHILE CHARINDEX(@separator, @stringToSplit) > 0
 BEGIN
 DECLARE @pos INT;
 SELECT @pos = CHARINDEX(@separator, @stringToSplit);
 INSERT INTO @output(s) SELECT SUBSTRING(@stringToSplit, 1, @pos-1);
 SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos);
 END
 IF @stringToSplit != ''
 BEGIN
 INSERT INTO @output SELECT @stringToSplit;
 END
 RETURN
END
GO

This function will take two inputs, the @stringToSplit, which you already use, and the @separator, for which you'd just enter the ';'. This function returns a table with two columns, namely i, which is an identity column, in case you would like to remember the order in which the strings were given, and s, the strings themselves. The rest of the function is actually quite unchanged from the code you used, except that it stores the results in @output instead of e.g. @@DiscludeTables. You would call it in the stored procedure as such:

INSERT @@DiscludeTables (tableName)
 SELECT s FROM [dbo].[fn_splitString] (@DiscludeTables, N';');
INSERT @@DiscludeColumns (columnName)
 SELECT s FROM [dbo].[fn_splitString] (@DiscludeColumns, N';');

The last thing I would advise you, is to avoid loops if possible in SQL code, and use set based operations such as queries with joins. SQL Server was made to do set based operations, and you should take advantage of that.

For instance, instead of querying the INFORMATION_SCHEMA for each table to get the name, then querying it again to get the primary key column (only one?) and then repeatedly querying it to get the names of all the columns that you don't want to ignore, consider a larger query that takes all that information for all tables at once:

DECLARE @COLUMN_INFORMATION TABLE (tableName sysname, columnName sysname, isPrimaryKey bit, primary key (tableName, columnName));
INSERT @COLUMN_INFORMATION (tableName, columnName, isPrimaryKey)
SELECT [tableName] = (QUOTENAME(t.TABLE_SCHEMA) + N'.' + QUOTENAME(t.TABLE_NAME))
 , [columnName] = c.COLUMN_NAME
 , [isPrimaryKey] = CASE WHEN tc.TABLE_NAME IS NULL THEN 0 ELSE 1 END 
 FROM INFORMATION_SCHEMA.TABLES t
 INNER JOIN INFORMATION_SCHEMA.COLUMNS c
 ON c.TABLE_NAME = t.TABLE_NAME
 AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
 LEFT JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
 INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu
 ON tc.CONSTRAINT_NAME = ccu.Constraint_name
 ON tc.CONSTRAINT_TYPE = 'Primary Key'
 AND tc.TABLE_NAME = t.TABLE_NAME
 AND tc.TABLE_SCHEMA = t.TABLE_SCHEMA
 AND c.COLUMN_NAME = ccu.COLUMN_NAME
 LEFT JOIN @@DiscludeTables dt
 ON dt.tableName = t.TABLE_NAME
 LEFT JOIN @@DiscludeColumns dc
 ON dc.columnName = c.COLUMN_NAME
 WHERE t.TABLE_TYPE = 'BASE TABLE'
 AND OBJECTPROPERTY( OBJECT_ID( QUOTENAME(t.TABLE_SCHEMA) + N'.' + QUOTENAME(t.TABLE_NAME) ) , 'IsMSShipped' ) = 0
 AND dt.tableName IS NULL
 AND dc.columnName IS NULL;

(This is not the most usual indentation style for SQL, by the way, so ignore it if it's not your taste).

This will return a table containing all non-ignored table names, combined with all non-ignored (and non-primary-key) column names and a bit that says whether that column is a primary key.

Now we have all the information, we can group the information per table. We want only one script for each table, so we need to combine the column names, which is something that SQL does not yet know how to do (there is no CONCAT function that would work like a MAX or SUM). In T-SQL, we can use FOR XML PATH though, to simulate that behavior:

DECLARE @GROUPED_COLUMN_INFORMATION TABLE (tableName sysname primary key, columns nvarchar(max), primaryColumns nvarchar(max), primaryColumnsCompare nvarchar(max));
INSERT @GROUPED_COLUMN_INFORMATION (tableName, columns, primaryColumns, primaryColumnsCompare)
 SELECT [tableName] = ci.tableName
 , [columns] =
 ( STUFF((SELECT N', t.' + columnName 
 FROM @COLUMN_INFORMATION c2
 WHERE ci.tableName = c2.tableName
 AND c2.isPrimaryKey = 0
 FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
 )
 , [primaryColumns] =
 ( STUFF((SELECT N', t.' + columnName 
 FROM @COLUMN_INFORMATION c2
 WHERE ci.tableName = c2.tableName
 AND c2.isPrimaryKey = 1
 FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
 )
 , [primaryColumnsCompare] =
 ( N'ON ' + STUFF((SELECT N' AND t.' + columnName + N' = cte.' + columnName
 FROM @COLUMN_INFORMATION c2
 WHERE ci.tableName = c2.tableName
 AND c2.isPrimaryKey = 1
 FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 5, N'')
 )
 FROM @COLUMN_INFORMATION ci
 GROUP BY ci.tableName;

While grouping the information, we also prepare some of the strings that we need in the script to output. For instance, in [columns], we combine all the column names from the @COLUMN_INFORMATION table that are not a primary key, and we join them with the text ', t.', so that they are separated by commas and show up as t.columnName. The STUFF(nvarchar, first, last, N'') function makes sure that positions first through last are replaced with the empty string (N''). This gets rid of the comma in the beginning.

In [primaryColumnsCompare], all primary columns are listed twice, showing up as AND t.primaryColumn = cte.primaryColumn. STUFF removes the inital ' AND ' part for us.

Finally, now we have combined all information into a table with one entry per table to check, we generate the script:

SELECT @output = '';
SELECT @output += '
 ;WITH CTE_META AS (
 SELECT [TableName] = ' + ''''+ gci.[tablename] +''', t.*
 , [Rank] = ROW_NUMBER() OVER(
 PARTITION BY ' + gci.[columns] + ' 
 ORDER BY ' + gci.[primaryColumns] + '
 ) FROM ' + gci.[tablename] + ' t
 ) 
 SELECT cte.*
 --DELETE t
 FROM ' + gci.[tablename] + ' t
 INNER JOIN CTE_META cte ' + gci.[primaryColumnsCompare] + '
 WHERE cte.Rank > 1;
 '
 FROM @GROUPED_COLUMN_INFORMATION gci;
IF @output != ''
BEGIN
 SELECT @output = 'USE [' + db_name() + ']' + @output;
 PRINT @output;
END

(In this case, I didn't bother with the FOR XML PATH, because we are summing up one single string. Although the behavior of SELECT @output += '...' is not guaranteed to work with future versions of SQL Server, so you may want to do an extra step to combine it using a FOR XML PATH.)

Also, the output script changed a bit: the DELETE is commented out in favor of the SELECT to verify which data you are deleting. And also here, the IS IN has been replaced by a simple INNER JOIN.

Malachi MalachiMalachi 29k11 gold badges86 silver badges188 bronze badges · Answer 1 · 2015-12-15 01:17:27Z

You started out with all your keywords being capitalized and then you abandoned your capitalization haphazardly throughout the script, this makes it a little hard to read because your brain says that keywords are capitalized and then all of a sudden you see a begin and your brain says,

"hold up a second, no that's a keyword, never mind"

make sure that all your keywords are capitalized.

Here is a good example

 if @columnname != '['+@pkColumnName+']'
 BEGIN
 if @columns != ' '
 begin
 set @columns = @columns + ', '
 END 
 set @columns = @columns + 't.'+@ColumnName
 END

This also brings up something else that can really go either way, but I will explain my reasoning.

Your if statements should be indented, and this is how I would have written it,

IF @columnname != '['+@pkColumnName+']'
 BEGIN
 IF @columns != ' '
 BEGIN
 SET @columns = @columns + ', '
 END 
 SET @columns = @columns + 't.'+@ColumnName
 END

I know that is a lot of indentation for an if statement nested inside of an if statement, but I always want to make certain I know exactly what my code is doing, especially when writing SQL Stored Procedures that are going to be creating DELETE scripts.

I may also consider doing it this way to save on indentation

IF @columnname != '['+@pkColumnName+']'
BEGIN
 IF @columns != ' '
 BEGIN
 SET @columns = @columns + ', '
 END 
 SET @columns = @columns + 't.'+@ColumnName
END

I also indent the SELECT statements a little differently as well, and if I recall correctly, I think that many other SQL writers do it the same way that I do.

I would take this,

SELECT MIN(QUOTENAME(COLUMN_NAME))
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2)
 AND TABLE_NAME = PARSENAME(@TableName, 1)
 AND QUOTENAME(COLUMN_NAME) > @ColumnName
 AND COLUMN_NAME NOT in (SELECT columnname FROM @@DiscludeColumns)

and indent everything after the SELECT like this,

SELECT MIN(QUOTENAME(COLUMN_NAME))
 FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_SCHEMA = PARSENAME(@TableName, 2)
 AND TABLE_NAME = PARSENAME(@TableName, 1)
 AND QUOTENAME(COLUMN_NAME) > @ColumnName
 AND COLUMN_NAME NOT IN (SELECT columnname 
 FROM @@DiscludeColumns)

And the last and most important thing that I could advise is that you put intentional commenting on the DELETE portion of the script and uncomment the SELECT portion of the rendered script, so that you always run a SELECT to see what you are planning on deleting, before you delete it.

joranvar joranvarjoranvar 9927 silver badges16 bronze badges · Answer 2 · 2015-12-19 19:58:57Z

Aside from Malachi's remarks about capitalization of keywords and indentation of the code, I would add two points about variable naming:

DON'T use variable names that start with @@, as they are reserved for T-SQL functions. This would now give a conflict between the parameter @DiscludeTables and the private table variable @@DiscludeTables, but I would suggest renaming the parameter to @tablesToDisclude.
DO be consistent in capitalization of variable names. It is customary to use camelCasing, so change @TableName to @tableName, for instance. I would do the same for parameter names.

Some remarks about clean habits in SQL code:

Use nvarchar when passing strings that are meant to reference dabatase object names. (Actually, it's a good idea to consider nvarchar for most values that will contain user-entered data which could easily need to include unicode characters, but that is outside the scope of this script).
Use semicolons to end SQL statements. In your script you add a semicolon before the WITH for the CTE, because it needs to be separated from the statement preceding it. Always using semicolons keeps the code sane for future versions.
You don't need to check whether something IS NULL if you are also comparing it against something else. For instance: if @OUTPUT IS NOT NULL AND @OUTPUT != '' could be written IF @OUTPUT != '', because if @OUTPUT is NULL, @OUTPUT compared to anything will always be false.
IS NOT IN will only work for single column comparisons. When comparing two sets of data against each other, prefer to do a LEFT JOIN and check whether the right side IS NULL:
```
 LEFT JOIN @@DiscludeColumns dc
 ON dc.columnName = c.COLUMN_NAME
WHERE dc.columnName IS NULL
```
will yield the same result, and is extensible to more columns (if you would, for instance, exclude a column only if both column name AND table name are equal).
Write a simple JOIN as in INNER JOIN explicitly. This may be a matter of taste, but once you go using multiple LEFT and RIGHT JOINs, this makes the INNER stand out.

Now, on to specific things in this script. In the beginning of the stored procedure, you have some duplicate code. Splitting the incoming parameters into multiple values. It is not always easy to remove duplicated code in SQL, but in this case, a table valued function makes most sense:

CREATE FUNCTION [dbo].[fn_splitString] (@stringToSplit nvarchar(max), @separator nchar(1))
RETURNS @output TABLE (i int identity(0,1), s nvarchar(max))
AS
BEGIN
 WHILE CHARINDEX(@separator, @stringToSplit) > 0
 BEGIN
 DECLARE @pos INT;
 SELECT @pos = CHARINDEX(@separator, @stringToSplit);
 INSERT INTO @output(s) SELECT SUBSTRING(@stringToSplit, 1, @pos-1);
 SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos);
 END
 IF @stringToSplit != ''
 BEGIN
 INSERT INTO @output SELECT @stringToSplit;
 END
 RETURN
END
GO

This function will take two inputs, the @stringToSplit, which you already use, and the @separator, for which you'd just enter the ';'. This function returns a table with two columns, namely i, which is an identity column, in case you would like to remember the order in which the strings were given, and s, the strings themselves. The rest of the function is actually quite unchanged from the code you used, except that it stores the results in @output instead of e.g. @@DiscludeTables. You would call it in the stored procedure as such:

INSERT @@DiscludeTables (tableName)
 SELECT s FROM [dbo].[fn_splitString] (@DiscludeTables, N';');
INSERT @@DiscludeColumns (columnName)
 SELECT s FROM [dbo].[fn_splitString] (@DiscludeColumns, N';');

The last thing I would advise you, is to avoid loops if possible in SQL code, and use set based operations such as queries with joins. SQL Server was made to do set based operations, and you should take advantage of that.

For instance, instead of querying the INFORMATION_SCHEMA for each table to get the name, then querying it again to get the primary key column (only one?) and then repeatedly querying it to get the names of all the columns that you don't want to ignore, consider a larger query that takes all that information for all tables at once:

DECLARE @COLUMN_INFORMATION TABLE (tableName sysname, columnName sysname, isPrimaryKey bit, primary key (tableName, columnName));
INSERT @COLUMN_INFORMATION (tableName, columnName, isPrimaryKey)
SELECT [tableName] = (QUOTENAME(t.TABLE_SCHEMA) + N'.' + QUOTENAME(t.TABLE_NAME))
 , [columnName] = c.COLUMN_NAME
 , [isPrimaryKey] = CASE WHEN tc.TABLE_NAME IS NULL THEN 0 ELSE 1 END 
 FROM INFORMATION_SCHEMA.TABLES t
 INNER JOIN INFORMATION_SCHEMA.COLUMNS c
 ON c.TABLE_NAME = t.TABLE_NAME
 AND c.TABLE_SCHEMA = t.TABLE_SCHEMA
 LEFT JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
 INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE ccu
 ON tc.CONSTRAINT_NAME = ccu.Constraint_name
 ON tc.CONSTRAINT_TYPE = 'Primary Key'
 AND tc.TABLE_NAME = t.TABLE_NAME
 AND tc.TABLE_SCHEMA = t.TABLE_SCHEMA
 AND c.COLUMN_NAME = ccu.COLUMN_NAME
 LEFT JOIN @@DiscludeTables dt
 ON dt.tableName = t.TABLE_NAME
 LEFT JOIN @@DiscludeColumns dc
 ON dc.columnName = c.COLUMN_NAME
 WHERE t.TABLE_TYPE = 'BASE TABLE'
 AND OBJECTPROPERTY( OBJECT_ID( QUOTENAME(t.TABLE_SCHEMA) + N'.' + QUOTENAME(t.TABLE_NAME) ) , 'IsMSShipped' ) = 0
 AND dt.tableName IS NULL
 AND dc.columnName IS NULL;

(This is not the most usual indentation style for SQL, by the way, so ignore it if it's not your taste).

This will return a table containing all non-ignored table names, combined with all non-ignored (and non-primary-key) column names and a bit that says whether that column is a primary key.

Now we have all the information, we can group the information per table. We want only one script for each table, so we need to combine the column names, which is something that SQL does not yet know how to do (there is no CONCAT function that would work like a MAX or SUM). In T-SQL, we can use FOR XML PATH though, to simulate that behavior:

DECLARE @GROUPED_COLUMN_INFORMATION TABLE (tableName sysname primary key, columns nvarchar(max), primaryColumns nvarchar(max), primaryColumnsCompare nvarchar(max));
INSERT @GROUPED_COLUMN_INFORMATION (tableName, columns, primaryColumns, primaryColumnsCompare)
 SELECT [tableName] = ci.tableName
 , [columns] =
 ( STUFF((SELECT N', t.' + columnName 
 FROM @COLUMN_INFORMATION c2
 WHERE ci.tableName = c2.tableName
 AND c2.isPrimaryKey = 0
 FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
 )
 , [primaryColumns] =
 ( STUFF((SELECT N', t.' + columnName 
 FROM @COLUMN_INFORMATION c2
 WHERE ci.tableName = c2.tableName
 AND c2.isPrimaryKey = 1
 FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
 )
 , [primaryColumnsCompare] =
 ( N'ON ' + STUFF((SELECT N' AND t.' + columnName + N' = cte.' + columnName
 FROM @COLUMN_INFORMATION c2
 WHERE ci.tableName = c2.tableName
 AND c2.isPrimaryKey = 1
 FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 5, N'')
 )
 FROM @COLUMN_INFORMATION ci
 GROUP BY ci.tableName;

While grouping the information, we also prepare some of the strings that we need in the script to output. For instance, in [columns], we combine all the column names from the @COLUMN_INFORMATION table that are not a primary key, and we join them with the text ', t.', so that they are separated by commas and show up as t.columnName. The STUFF(nvarchar, first, last, N'') function makes sure that positions first through last are replaced with the empty string (N''). This gets rid of the comma in the beginning.

In [primaryColumnsCompare], all primary columns are listed twice, showing up as AND t.primaryColumn = cte.primaryColumn. STUFF removes the inital ' AND ' part for us.

Finally, now we have combined all information into a table with one entry per table to check, we generate the script:

SELECT @output = '';
SELECT @output += '
 ;WITH CTE_META AS (
 SELECT [TableName] = ' + ''''+ gci.[tablename] +''', t.*
 , [Rank] = ROW_NUMBER() OVER(
 PARTITION BY ' + gci.[columns] + ' 
 ORDER BY ' + gci.[primaryColumns] + '
 ) FROM ' + gci.[tablename] + ' t
 ) 
 SELECT cte.*
 --DELETE t
 FROM ' + gci.[tablename] + ' t
 INNER JOIN CTE_META cte ' + gci.[primaryColumnsCompare] + '
 WHERE cte.Rank > 1;
 '
 FROM @GROUPED_COLUMN_INFORMATION gci;
IF @output != ''
BEGIN
 SELECT @output = 'USE [' + db_name() + ']' + @output;
 PRINT @output;
END

(In this case, I didn't bother with the FOR XML PATH, because we are summing up one single string. Although the behavior of SELECT @output += '...' is not guaranteed to work with future versions of SQL Server, so you may want to do an extra step to combine it using a FOR XML PATH.)

Also, the output script changed a bit: the DELETE is commented out in favor of the SELECT to verify which data you are deleting. And also here, the IS IN has been replaced by a simple INNER JOIN.

Stack Exchange Network

Find & Delete Duplicate Records on All Tables in current MSSQL Database

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find & Delete Duplicate Records on All Tables in current MSSQL Database

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions