How do I remove duplicate records in a join table in PostgreSQL?

Question 1

I have a table that has a schema like this:

create_table "questions_tags", :id => false, :force => true do |t|
  t.integer "question_id"
  t.integer "tag_id"
end
add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"

I would like to remove records that are duplicates, i.e. they have both the same tag_id and question_id as another record.

What does the SQL look like for that?

Answer 1

In my experience (and as shown in many tests), NOT IN as demonstrated by @gsiems is rather slow and scales terribly. The inverse IN is typically faster (where you can reformulate that way, as in this case), but this query with EXISTS (doing exactly what you asked) should be much faster still, by orders of magnitude with big tables:

DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

This deletes every row for which another row with the same (tag_id, question_id) and a smaller ctid exists. (Effectively, it keeps the first instance according to the physical order of tuples.) I use ctid in the absence of a better alternative, since your table does not seem to have a PK or any other unique (set of) column(s).
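Since a committed DELETE cannot be undone, it may be worth checking first how many rows the predicate matches. The following is a minimal sketch, assuming the same questions_tags table. The transaction wrapper and the verification query are additions for illustration and were not part of the test described in this answer.

```sql
BEGIN;

-- Count the rows the DELETE would remove: every row for which an
-- identical (question_id, tag_id) pair with a smaller ctid exists.
SELECT count(*) AS rows_to_delete
FROM   questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

-- Same predicate, now deleting.
DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

-- Check for remaining duplicate pairs. This should return no rows,
-- assuming neither column contains NULL (the = comparison never
-- matches NULL) and nobody else writes to the table meanwhile.
SELECT question_id, tag_id, count(*) AS copies
FROM   questions_tags
GROUP  BY question_id, tag_id
HAVING count(*) > 1;

ROLLBACK;  -- replace with COMMIT once the result looks right
```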

ctid is the internal tuple identifier present in every row and necessarily unique within a single table. You need to do more if multiple tables can be involved under the hood, like with inheritance or partitioning. See:

How to delete duplicate rows without unique identifier
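For the inheritance or partitioning case, one possible approach is to compare the pair (tableoid, ctid) instead of ctid alone, because tableoid identifies the physical table a row is stored in. This is only a sketch under that assumption and not necessarily the method used in the linked answer:

```sql
-- ctid is unique only within one physical table. Combining it with
-- tableoid gives an identifier that is unique across all child
-- tables or partitions reached through questions_tags.
DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  (q1.tableoid, q1.ctid) < (q.tableoid, q.ctid)  -- row-wise comparison
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );
```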

Test

I ran a test case with a table matching your question and 100k rows:

CREATE TABLE questions_tags(
 question_id integer NOT NULL
, tag_id integer NOT NULL
);
INSERT INTO questions_tags (question_id, tag_id)
SELECT (random()* 100)::int, (random()* 100)::int
FROM generate_series(1, 100000);
ANALYZE questions_tags;

Indexes do not help in this case.
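To time the statement on the test table locally, EXPLAIN with the ANALYZE option is one way to go. A minimal sketch follows. Keep in mind that EXPLAIN ANALYZE really executes the DELETE, which is why it is wrapped in a transaction that gets rolled back.

```sql
BEGIN;

-- Shows the chosen plan together with actual row counts and timings.
-- BUFFERS adds block I/O statistics and can be left out.
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM questions_tags q
WHERE  EXISTS (
   SELECT FROM questions_tags q1
   WHERE  q1.ctid < q.ctid
   AND    q1.question_id = q.question_id
   AND    q1.tag_id = q.tag_id
   );

ROLLBACK;  -- undo the delete so the test can be repeated on the same data
```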

Results

NOT IN
The SQLfiddle times out.
I tried the same locally but canceled it, too, after several minutes.

EXISTS
Finishes in half a second in this SQLfiddle.

Alternatives

If you are going to delete most of the rows, it will be faster to select the survivors into another table, drop the original and rename the survivors' table. Careful, this has implications if you have views or foreign keys (or other dependencies) defined on the original.
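A minimal sketch of that approach, assuming the table has only the two columns shown in the question, so that DISTINCT is enough to pick the survivors. The name questions_tags_new is hypothetical; the index names are taken from the schema in the question.

```sql
BEGIN;

-- Keep one row per (question_id, tag_id).
CREATE TABLE questions_tags_new AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- Without CASCADE this raises an error if views or foreign keys
-- depend on the table, which is the safe default.
DROP TABLE questions_tags;

ALTER TABLE questions_tags_new RENAME TO questions_tags;

-- CREATE TABLE AS copies neither indexes nor constraints,
-- so they have to be recreated on the new table.
CREATE INDEX index_questions_tags_on_question_id ON questions_tags (question_id);
CREATE INDEX index_questions_tags_on_tag_id      ON questions_tags (tag_id);

COMMIT;
```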

If you have dependencies and want to keep them, you could:

1. Drop all foreign keys and indexes, for performance.
2. SELECT survivors to a temporary table.
3. TRUNCATE the original.
4. Re-INSERT survivors.
5. Re-CREATE indexes and foreign keys. Views can just stay; they have no impact on performance.
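The steps above could be sketched as follows. This assumes the two indexes from the question and no foreign keys (the question shows none). The temporary table name questions_tags_survivors is hypothetical.

```sql
BEGIN;

-- Step 1: drop indexes (and any foreign keys, if your table has them).
DROP INDEX index_questions_tags_on_question_id;
DROP INDEX index_questions_tags_on_tag_id;

-- Step 2: copy one row per (question_id, tag_id) to a temporary table
-- that disappears automatically at the end of the transaction.
CREATE TEMP TABLE questions_tags_survivors ON COMMIT DROP AS
SELECT DISTINCT question_id, tag_id
FROM   questions_tags;

-- Step 3: empty the original. The table itself stays in place,
-- so views defined on it keep working.
TRUNCATE questions_tags;

-- Step 4: put the survivors back.
INSERT INTO questions_tags (question_id, tag_id)
SELECT question_id, tag_id
FROM   questions_tags_survivors;

-- Step 5: recreate the indexes (and foreign keys, if any).
CREATE INDEX index_questions_tags_on_question_id ON questions_tags (question_id);
CREATE INDEX index_questions_tags_on_tag_id      ON questions_tags (tag_id);

COMMIT;
```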

Comment 1

++ for the exists solution. Much better than my suggestion.

Comment 2

Could you please explain the ctid comparison in your WHERE clause?

Comment 3

@KevinMeredith: I added some explanation.

Answer 2

You can use the ctid to accomplish that. For example:

Create a table with duplicates:

=# create table foo (id1 integer, id2 integer);
CREATE TABLE
=# insert into foo values (1,1), (1, 2), (1, 2), (1, 3);
INSERT 0 4
=# select * from foo;
 id1 | id2 
-----+-----
   1 |   1
   1 |   2
   1 |   2
   1 |   3
(4 rows)

Select the duplicate data:

=# select foo.ctid, foo.id1, foo.id2, foo2.min_ctid
-# from foo
-# join (
-# select id1, id2, min(ctid) as min_ctid 
-# from foo 
-# group by id1, id2 
-# having count (*) > 1
-# ) foo2 
-# on foo.id1 = foo2.id1 and foo.id2 = foo2.id2
-# where foo.ctid <> foo2.min_ctid ;
 ctid  | id1 | id2 | min_ctid 
-------+-----+-----+----------
 (0,3) |   1 |   2 | (0,2)
(1 row)

Delete the duplicate data:

=# delete from foo
-# where ctid not in (select min (ctid) as min_ctid from foo group by id1, id2);
DELETE 1
=# select * from foo;
 id1 | id2 
-----+-----
   1 |   1
   1 |   2
   1 |   3
(3 rows)

In your case the following should work:

delete from questions_tags
where ctid not in (
    select min (ctid) as min_ctid
    from questions_tags
    group by question_id, tag_id
);

Comment 1

Where can I read more about this ctid? Thanks.

Comment 2

@marcamillion -- The documentation has a short blurb on ctids at postgresql.org/docs/current/static/ddl-system-columns.html

Comment 3

What does ctid stand for?

Comment 4

@marcamillion -- tid == "tuple id", not sure what the c means.

score 15 · Accepted Answer · 2013-03-13 18:52:27Z
