I am having a brain fart figuring this out. I have the following two tables:
Table: parts
part_id INT IDENTITY(1,1) NOT NULL,
part_number VARCHAR(50) UNIQUE NOT NULL,
part_description VARCHAR(MAX) NOT NULL,
information VARCHAR(MAX) NULL,
manufacturer_id INT NOT NULL,
subcategory_id INT NOT NULL
Table: part_temp
part_num VARCHAR(50) NOT NULL,
part_desc VARCHAR(MAX) NULL,
info VARCHAR(MAX) NULL,
man_id INT NULL,
sub_id INT NULL
part_temp is my temporary table that contains data from a CSV file, which is why only one of its columns is set to NOT NULL. I need to insert the data from part_temp into parts.
I have already cleaned the data in the table, so no NULL values are being inserted into columns that require a value. My issue, however, is with the UNIQUE constraint on the part_number column in the parts table. There are duplicate values within the part_temp table, so I need a way to skip over them during the insert. This is what I have tried so far, but it does not work:
INSERT INTO parts
SELECT DISTINCT pt.part_num, pt.part_desc, pt.info, m.manufacturer_id, s.subcategory_id
FROM part_temp AS pt
FULL OUTER JOIN man_temp AS mt ON pt.man_id = mt.man_id
INNER JOIN manufacturers AS m ON mt.man_name = m.manufacturer_name
FULL OUTER JOIN cat_temp AS ct ON pt.sub_id = ct.category_id
INNER JOIN subcategories AS s ON ct.category_name = s.subcategory_name
WHERE NOT EXISTS(SELECT part_number FROM parts WHERE part_number = pt.part_num)
These are the other tables involved in the joins that are not listed above:
Table: man_temp
man_id INT NOT NULL,
man_name VARCHAR(100) NOT NULL
Table: manufacturers
manufacturer_id INT IDENTITY(1,1) NOT NULL,
manufacturer_name VARCHAR(100) NOT NULL
Table: cat_temp
category_id INT NOT NULL,
category_name VARCHAR(100) NOT NULL
Table: subcategories
subcategory_id INT IDENTITY(1,1) NOT NULL,
subcategory_name VARCHAR(100) NOT NULL
What is wrong with my INSERT query?
The specific error I am getting is:
Msg 2627, Level 14, State 1, Line 1
Violation of UNIQUE KEY constraint. Cannot insert duplicate key in object 'dbo.parts'. The duplicate key value is (31335A11)
part_num 31335A11 appears in the CSV file more than once, so it appears in the part_temp table more than once. It would be easy if it were just this one entry, but I have more than 1,000 repeat entries, so it would take forever to remove all the duplicates by hand. Nothing exists in parts yet; it is a brand-new empty table I am trying to put values into.
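A quick way to see how widespread the duplication is (a plain GROUP BY sketch, nothing exotic):
SELECT part_num, COUNT(*) AS copies
FROM part_temp
GROUP BY part_num
HAVING COUNT(*) > 1
ORDER BY copies DESC;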
2 Answers
If I run into a Violation of UNIQUE KEY constraint error, the first thing I do is ask myself the following questions:
- Does the duplicate key already exist in the target table?
- Is the key duplicated in any source table?
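For the case in the question, those two checks are only a couple of quick queries against your tables (a sketch, using the key from the error message):
-- 1. Does the key already exist in the target table?
SELECT part_number
FROM parts
WHERE part_number = '31335A11';

-- 2. Is the key duplicated in the source table?
SELECT part_num, COUNT(*) AS copies
FROM part_temp
WHERE part_num = '31335A11'
GROUP BY part_num;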
From answering the above you will have a good idea of where the duplicate value is sourced from and a reasonable idea of the cause. Example possibilities are:
- Duplicate exists in source table
- Duplicate does not exist at source, but a transform of the data causes duplicates to be returned
Ultimately you need to strip this problem back to basics. Since you have the duplicate value in question, you need to start digging into the data and pay close attention to the source query. Luckily, in your example it is a simple query, so you can easily strip out joins to see whether any of them introduce the duplicate, or whether the duplicate exists because non-key attributes differ, causing your DISTINCT to not do what you expect it to do.
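One quick way to see those non-key differences is to pull back every staging row for the duplicated part numbers (a sketch against the question's part_temp table):
SELECT *
FROM part_temp
WHERE part_num IN (
    SELECT part_num
    FROM part_temp
    GROUP BY part_num
    HAVING COUNT(*) > 1
)
ORDER BY part_num;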
I guess what I am trying to say is that there isn't necessarily a quick win for this kind of problem and you need to do some good ol' fashioned detective work using the clues provided by SQL.
Minor point, but I like to explicitly list the columns on an insert so that nothing breaks if the order of columns on the target table is different to that of the source query.
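For the insert in the question that would look something like this (a sketch; the rest of the SELECT is unchanged):
INSERT INTO parts (part_number, part_description, information, manufacturer_id, subcategory_id)
SELECT DISTINCT pt.part_num, pt.part_desc, pt.info, m.manufacturer_id, s.subcategory_id
FROM part_temp AS pt
-- ... joins and WHERE clause exactly as in the question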
First of all: there is no CONFLICT command in SQL Server.
Second:
You are going in the wrong direction using DISTINCT and WHERE NOT EXISTS. Neither of them will prevent you from inserting duplicated data:
Suppose you have the following data in part_temp:
part_num part_desc info man_id sub_id
============================================================
000345 something1 some info 2 1
000345 something2 some info1 4 6
000345 something3 some info2 5 8
Suppose part_num 000345 does not exist in the parts table; the query will try to insert all 3 records (which are distinct from each other), and that immediately violates the UNIQUE constraint on part_number.
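A minimal repro of that behaviour, using throwaway temp tables instead of the real ones, could look like this:
-- throwaway copies of the relevant columns only
CREATE TABLE #parts (part_number VARCHAR(50) UNIQUE NOT NULL, part_description VARCHAR(MAX) NULL);
CREATE TABLE #part_temp (part_num VARCHAR(50) NOT NULL, part_desc VARCHAR(MAX) NULL);

INSERT INTO #part_temp VALUES ('000345', 'something1'), ('000345', 'something2'), ('000345', 'something3');

-- DISTINCT keeps all three rows (they differ in part_desc) and
-- NOT EXISTS lets all three through (#parts is still empty),
-- so this still fails with Msg 2627
INSERT INTO #parts (part_number, part_description)
SELECT DISTINCT pt.part_num, pt.part_desc
FROM #part_temp AS pt
WHERE NOT EXISTS (SELECT 1 FROM #parts WHERE part_number = pt.part_num);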
Finally:
To accomplish what you want, SQL Server provides a statement called MERGE.
Using MERGE, you can decide what to do every time you hit a conflict: either update the previously inserted record or simply skip the new candidate for insertion.
Use the following code to skip inserting the conflicting cases:
MERGE INTO parts A
USING (
    --YOUR SELECT QUERY HERE; NO NEED TO USE DISTINCT / WHERE
) B
ON (A.part_number = B.part_num)
WHEN MATCHED THEN
    UPDATE SET A.part_number = A.part_number
WHEN NOT MATCHED THEN
    INSERT (part_number, part_description, ....)
    VALUES (B.part_num, B.part_desc, ....);
Good luck!
If you want to store the repeated data in a second table, add an OUTPUT clause. In a MERGE the OUTPUT clause goes at the end of the statement, after all of the WHEN clauses (the dummy UPDATE still works as a DO NOTHING). To capture only the duplicates, wrap the MERGE in an INSERT ... SELECT and filter on $action:
INSERT INTO part_duplicates (part_id, part_number, part_description, ....)
SELECT part_id, part_num, part_desc, ....
FROM (
    MERGE INTO parts A
    ... -- same USING and ON clauses as above
    WHEN MATCHED THEN
        UPDATE SET A.part_number = A.part_number
    WHEN NOT MATCHED THEN
        INSERT (part_number, part_description, ....)
        VALUES (B.part_num, B.part_desc, ....)
    OUTPUT $action AS merge_action, DELETED.part_id, B.part_num, B.part_desc, ....
) AS changes
WHERE merge_action = 'UPDATE';
- Personally I wouldn't use MERGE purely to avoid inserting a duplicate. I think people would be better off looking at why the duplicate exists in the first place and correcting it, otherwise perhaps use a window function like DENSE_RANK so that you can reliably pick a candidate row for insert. Otherwise you are going to pseudo-randomly pick which row is the end state for the row in the target table. (A sketch of that approach follows these comments.) (CasualFisher, Aug 16, 2018)
- My solution focuses on a recurring loading process. If this is a single one-off loading task, I agree with cleaning up the data in every table. But if you face several input CSV files every day, the best you can do is to establish rules for what to do with repeated data and process them accordingly. Otherwise you will lose hours of work doing this check. (AMG, Aug 17, 2018)
- @AMG Let's say I used MERGE and wanted to take all of the duplicates and put them in a table called part_duplicates that has all the same columns as the parts table (so I can find specifically what entries are causing problems, and potentially edit them in the future). How would I do that? (Skitzafreak, Aug 17, 2018)
- @Skitzafreak, check my answer extension... But an important thing here: you can have several repeated parts for a single part_id, so don't treat this column as a primary key in the part_duplicates table. (AMG, Aug 17, 2018)
- @AMG Agreed you can't fix them one by one, but by using MERGE I think you will have an unreliable end state for the row, which could cause more pain down the line with reconciliation issues. In which case I think you'd be better off using a mechanism that will reliably/predictably return a specific row. I guess ultimately the OP needs to define rules on what is acceptable for the data processing they're doing. (CasualFisher, Aug 17, 2018)
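For reference, the window-function approach from the first comment might look something like this. It is only a sketch: it uses ROW_NUMBER() to keep exactly one row per part_num, simplifies the question's joins to inner joins, and uses an arbitrary tie-break ordering, all of which are assumptions the asker would need to replace with their own rules:
INSERT INTO parts (part_number, part_description, information, manufacturer_id, subcategory_id)
SELECT part_num, part_desc, info, manufacturer_id, subcategory_id
FROM (
    SELECT pt.part_num, pt.part_desc, pt.info, m.manufacturer_id, s.subcategory_id,
           -- pick your own deliberate tie-break rule in the ORDER BY
           ROW_NUMBER() OVER (PARTITION BY pt.part_num ORDER BY pt.man_id, pt.sub_id) AS rn
    FROM part_temp AS pt
    INNER JOIN man_temp AS mt ON pt.man_id = mt.man_id
    INNER JOIN manufacturers AS m ON mt.man_name = m.manufacturer_name
    INNER JOIN cat_temp AS ct ON pt.sub_id = ct.category_id
    INNER JOIN subcategories AS s ON ct.category_name = s.subcategory_name
) AS ranked
WHERE rn = 1;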
- DISTINCT isn't helping because the other attributes you are selecting out are making the rows non-unique, so this looks to be a bog-standard data quality issue. If you stick a WHERE part_num = '31335A11' on it, what output do you get?
- With WHERE part_num = '31335A11' I get the same error, but for the next duplicate item in the table.
- Run just the SELECT and you should be able to visually see why 31335A11 shows up twice - unless of course it is an invisible control character in one of the text columns.
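If the invisible-character theory needs checking, something along these lines shows the raw bytes behind each candidate value (a sketch; the LIKE pattern is just one way to catch near-matches):
SELECT part_num,
       LEN(part_num) AS char_count,
       DATALENGTH(part_num) AS byte_count,
       CAST(part_num AS VARBINARY(100)) AS raw_bytes
FROM part_temp
WHERE part_num LIKE '%31335A11%';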