So I posted this question yesterday. Some of the responses I got were helpful; however, it seems my issue is a bit more complex than I originally thought.
After doing some looking, the reason I was getting errors with my INSERT statement was that I had rows like this:
| part_number | description | information |
|-------------|-------------|-------------|
| 331335A11   | Desc1       | Info1       |
| 331335A11   | Desc2       | Info1       |
Essentially, there are a number of entries that have the same value for the part_number field (which is supposed to be a UNIQUE column) but different values for their other columns. The query was trying to insert all of them into the database, and that is where my problem comes from.
Because I am unsure just how many records in my table have this problem, what I am trying to do is run the INSERT into my parts table, but every time I get a repeated part_number value, have that row inserted into a table called parts_duplicates instead of into the parts table. The parts_duplicates table won't have the unique restriction on the part_number column (but will still have all the same columns as the parts table). From there I can analyze my incorrect data points and fix them (hopefully).
My only problem is... I have no idea where to even get started on tackling this. In the question I posted above, one of the responses suggested using MERGE, and I am currently in the process of testing that, but I am wondering if there is a better way to go about this.
3 Answers
Here is a possible solution that seems to work and doesn't require triggers - you'd have to test it against your real data.
--Demo setup
Declare @Parts table (part_number varchar(30), description varchar(30), information varchar(30))
Declare @PartsTemp table (part_number varchar(30), description varchar(30), information varchar(30))
Declare @PartsDuplicates table (part_number varchar(30), description varchar(30), information varchar(30))
insert into @Parts(part_number,description,information) values
('331335A10', 'Desc1', 'Info1') --Row already exists on the @Parts table
insert into @PartsTemp(part_number,description,information) values
('331335A00', 'Desc1', 'Info1'), --No row on the @Parts table and no duplicate
('331335A10', 'Desc1', 'Info1'), --Row already exists on the @Parts table
('331335A11', 'Desc1', 'Info1'), --No row on the @Parts table
('331335A11', 'Desc2', 'Info1') --Duplicate row on the @PartsTemp table
--The solution
--Common table expression to add a row number to each @PartsTemp row
;WITH PartsTempAndRowNumber
AS (
    SELECT *
        ,ROW_NUMBER() OVER (
            PARTITION BY part_number ORDER BY description
            ) AS rn
    FROM @PartsTemp
    )
--Insert into @PartsDuplicates where either:
--  rn <> 1, meaning a duplicate within the @PartsTemp table
--OR
--  the part number already exists on the @Parts table
INSERT INTO @PartsDuplicates (
    part_number
    ,description
    ,information
    )
SELECT part_number
    ,description
    ,information
FROM PartsTempAndRowNumber ptarn
WHERE rn <> 1
UNION ALL
SELECT ptarn.part_number
    ,ptarn.description
    ,ptarn.information
FROM PartsTempAndRowNumber ptarn
JOIN @Parts pt
    ON pt.part_number = ptarn.part_number
        AND ptarn.rn = 1
--Insert rows to @Parts selecting from @PartsTemp where the part_number can't be found
--on the @PartsDuplicates table
--Note: this skips every @PartsTemp row whose part_number appears in @PartsDuplicates,
--including the rn = 1 row of a pair that is duplicated only within @PartsTemp
INSERT INTO @Parts (
    part_number
    ,description
    ,information
    )
SELECT part_number
    ,description
    ,information
FROM @PartsTemp pt
WHERE NOT EXISTS (
        SELECT *
        FROM @PartsDuplicates
        WHERE part_number = pt.part_number
        )
--Verify @Parts rows
SELECT *
FROM @Parts
ORDER BY part_number
--Verify @PartsDuplicates rows
SELECT *
FROM @PartsDuplicates
ORDER BY part_number
After execution @Parts
| part_number | description | information |
|-------------|-------------|-------------|
| 331335A00 | Desc1 | Info1 |
| 331335A10 | Desc1 | Info1 |
After execution @PartsDuplicates
| part_number | description | information |
|-------------|-------------|-------------|
| 331335A10 | Desc1 | Info1 |
| 331335A11 | Desc2 | Info1 |
The reason I suggested a trigger is because I always assume that you can't control all of the ways that data can get into the table (assuming otherwise can be dangerous). Your inserts might be ad hoc, distributed in apps, auto-generated by ORMs, etc.
Given these tables:
CREATE TABLE dbo.parts(PartID int PRIMARY KEY, descr sysname /*, other cols */);
CREATE TABLE dbo.parts_duplicates(PartID int, descr sysname /*, other cols */);
CREATE CLUSTERED INDEX x ON dbo.parts_duplicates(PartID);
GO
We can build this trigger:
CREATE TRIGGER dbo.ShelvePartsDupes ON dbo.parts INSTEAD OF INSERT
AS
BEGIN
  SET NOCOUNT ON;
  -- first, stuff rows that already exist in parts
  -- or that are duplicates from this batch only into dupes
  INSERT dbo.parts_duplicates(PartID, descr /*, other cols */)
    SELECT PartID, descr /*, other cols */
    FROM
    (
      SELECT PartID, c = COUNT(*) OVER (PARTITION BY PartID), descr
        /*, other cols */
      FROM inserted
    ) AS x
    WHERE c > 1
    OR EXISTS (SELECT 1 FROM dbo.parts WHERE PartID = x.PartID);
  -- rows that are both singular and don't already exist:
  INSERT dbo.parts(PartID, descr /*, other cols */)
    SELECT PartID, descr /*, other cols */
    FROM
    (
      -- aggregating here is ok because it'll only ever be one row
      SELECT PartID, descr = MAX(descr) /*, other cols = MAX(other cols) */
      FROM inserted AS i
      WHERE NOT EXISTS
      (
        SELECT 1 FROM dbo.parts WHERE PartID = i.PartID
      )
      GROUP BY PartID
      HAVING COUNT(*) = 1
    ) AS x;
END
So here are three sample inserts. The first creates an initial row. The second simulates (a) a new single row whose PartID already exists, (b) a new single row that doesn't already exist, and (c) a new pair of rows whose PartID doesn't already exist. The third simulates a new pair of rows whose PartID is already in the target.
INSERT dbo.parts(PartID, descr)
VALUES(1, N'floob');
GO
INSERT dbo.parts(PartID, descr)
VALUES(1, N'bar'), (2, N'New'), (3, N'New dupe 1'), (3, N'New dupe 2');
GO
INSERT dbo.parts(PartID, descr)
VALUES(2, N'New dupe 3'), (2, N'New dupe 4');
Let's check what we have:
SELECT * FROM dbo.parts;
SELECT * FROM dbo.parts_duplicates;
Results:
If you want to build in some kind of logic that would have picked an arbitrary duplicate from the PartID = 3 rows, you can, but your comments seemed to indicate you want to manually determine which row to keep.
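If you did want one row per PartID to be kept automatically, one way to sketch the idea is with ROW_NUMBER(). The table variable @incoming below is a hypothetical stand-in for the trigger's inserted pseudo-table, nvarchar(128) is used in place of sysname, and ordering by descr is only an assumed tie-breaker; substitute whatever rule fits your data.

```sql
-- Hypothetical stand-in for the trigger's "inserted" pseudo-table
DECLARE @incoming TABLE (PartID int, descr nvarchar(128));

INSERT @incoming (PartID, descr)
VALUES (3, N'New dupe 1'), (3, N'New dupe 2'), (4, N'Only one');

-- Number the rows within each PartID; rn = 1 is the row that is kept.
-- ORDER BY descr is an assumed tie-breaker, not a recommendation.
WITH ranked AS
(
  SELECT PartID, descr,
         rn = ROW_NUMBER() OVER (PARTITION BY PartID ORDER BY descr)
  FROM @incoming
)
SELECT PartID, descr
FROM ranked
WHERE rn = 1
ORDER BY PartID;
```

Inside the trigger, the rn = 1 rows would be candidates for dbo.parts and the rn > 1 rows would still go to dbo.parts_duplicates.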
You can use queries like the ones below to separate duplicate and unique rows in the parts_temp table, and build on that logic to insert rows into the main parts table or the parts_duplicates table.
-- To get non-duplicate entries in parts_temp based on part number
SELECT * FROM parts_temp
WHERE part_number IN (SELECT part_number FROM parts_temp GROUP BY part_number HAVING COUNT(1) = 1)
-- To get duplicate entries in parts_temp based on part number
SELECT * FROM parts_temp
WHERE part_number IN (SELECT part_number FROM parts_temp GROUP BY part_number HAVING COUNT(1) > 1)
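Building on those two queries, a minimal sketch of the routing might look like the following. It uses table variables as stand-ins for parts_temp, parts, and parts_duplicates (the reduced column list and the sample rows are assumptions), and it additionally checks for part numbers that already exist in parts, which the two queries above do not cover.

```sql
-- Table variables standing in for the real tables (assumed columns)
DECLARE @parts_temp       TABLE (part_number varchar(30), description varchar(30));
DECLARE @parts            TABLE (part_number varchar(30) UNIQUE, description varchar(30));
DECLARE @parts_duplicates TABLE (part_number varchar(30), description varchar(30));

INSERT @parts_temp (part_number, description)
VALUES ('A00', 'Desc1'), ('A11', 'Desc1'), ('A11', 'Desc2');

-- Run this insert first, before @parts is modified.
-- Part numbers that occur more than once in the staging data,
-- or that already exist in @parts, go to the duplicates table.
INSERT @parts_duplicates (part_number, description)
SELECT t.part_number, t.description
FROM @parts_temp AS t
WHERE t.part_number IN (SELECT part_number FROM @parts_temp
                        GROUP BY part_number HAVING COUNT(*) > 1)
   OR EXISTS (SELECT 1 FROM @parts AS p WHERE p.part_number = t.part_number);

-- Everything else is unique in the staging data and new to @parts.
INSERT @parts (part_number, description)
SELECT t.part_number, t.description
FROM @parts_temp AS t
WHERE t.part_number IN (SELECT part_number FROM @parts_temp
                        GROUP BY part_number HAVING COUNT(*) = 1)
  AND NOT EXISTS (SELECT 1 FROM @parts AS p WHERE p.part_number = t.part_number);

SELECT * FROM @parts;
SELECT * FROM @parts_duplicates;
```

Note that the count filter lives in HAVING inside a subquery; an aggregate such as COUNT() cannot be used directly in a WHERE clause.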
INSTEAD OF trigger to accomplish the same, maybe only when duplicates are detected. (Also, personally, I wouldn't use MERGE unless you have a really good reason.)
parts_duplicates table so I can go through all of them to find out which are good, fix the data, and then import them later. I want to be able to insert everything that doesn't have a duplicate into the parts table without issue. Would I need to use like WHERE Count(part_number) > 1 or something for this essentially?