How can I delete repeated rows?

Question 1

I have a system that gathers RSS from many sites including UTF-8 based sites (non-english sites). I have tried to avoid repeated rows but I have repeats in my table.

RSS_items (id(auto_increment) PRIMARY KEY, url ,title, body)

Is there any way to delete the repeat? The problem is that sometimes this repeat happens and sometimes not. The first thing I think I must find those who are repeated, then delete the extra rows. But how ?

Question 2

What RDBMS? What criteria are you using to determine a duplicate? This is a very common problem in databases.

Question 3

I am using Mysql and the i have a robot gathering the information, before inserting it checks if the url exists or not,if not it inserts it, but still i have duplication, i think thats because of the languages other than english in the url. is there a way find duplications?

Question 4

So a duplicate is defined by having the same URL? Or same URL and title? or all 3? Once you know what defines a distinct record, you can group and find those with more than one.

Question 5

This would depend on which one of the repeat you want to keep. For this example, I will dedup the table based on url and title and keeping the first occurrence.

First thing to do: run these queries

SELECT COUNT(1) FROM RSS_items;
SELECT COUNT(1) dupcount_two,url,title FROM RSS_items
GROUP BY url,title HAVING COUNT(1) = 2 WITH ROLLUP;
SELECT COUNT(1) dupcount_morethantwo,url,title FROM RSS_items
GROUP BY url,title HAVING COUNT(1) > 2 WITH ROLLUP;

This will show total table count and how all (url,title) combinations that repeat twice and more than twice with the total for each at the bottom. If the sum of the repeats is more than 5% of the total:

CREATE TABLE RSS_items_URLTitle LIKE RSS_items;
INSERT INTO RSS_items_URLTitle (id,url,title)
 SELECT MIN(id),url,title
 FROM RSS_items GROUP BY url,title
 HAVING 
;
DELETE B.*
 FROM RSS_items_URLTitle A
 INNER JOIN RSS_items B
 USING (url,title) WHERE A.id <> B.id
;
DROP TABLE RSS_items_URLTitle;

This method made be very slow if you do not have an index on (url,title)

Otherwise, run this

CREATE TABLE RSS_items_New LIKE RSS_items;
CREATE TABLE RSS_items_URLTitle LIKE RSS_items;
INSERT INTO RSS_items_URLTitle (id,url,title)
 SELECT MIN(id),url,title
 FROM RSS_items GROUP BY url,title
;
INSERT INTO RSS_items_New
 SELECT B.*
 FROM RSS_items_URLTitle A
 INNER JOIN RSS_items USING (id)
;
ALTER TABLE RSS_items RENAME RSS_items_Zap;
ALTER TABLE RSS_items_New RENAME RSS_items;
DROP TABLE RSS_items_URLTitle;
DROP TABLE RSS_items_Zap;

This method may be better since it only operates on the table id.

Please look over both methods. Try testing it on test databases with copies of the data before running anything.

If you want to keep the last occurrence of duplicates, replace MIN(id) with MAX(id).

Give it a Try !!!

Question 6

This query will return rows that have different IDs but everything else is the same:

Select RSS1.*
from RSS_items RSS1
inner join RSS_items RSS2
 on RSS1.id <> RSS2.id --IDs must be different
 and RSS1.url = RSS2.url
 and RSS1.title = RSS2.title
 and RSS1.body = RSS2.body

Then you can choose which record to delete.

score 1 · Answer 1 · 2012-06-22 15:49:12Z

This would depend on which one of the repeat you want to keep. For this example, I will dedup the table based on url and title and keeping the first occurrence.

First thing to do: run these queries

SELECT COUNT(1) FROM RSS_items;
SELECT COUNT(1) dupcount_two,url,title FROM RSS_items
GROUP BY url,title HAVING COUNT(1) = 2 WITH ROLLUP;
SELECT COUNT(1) dupcount_morethantwo,url,title FROM RSS_items
GROUP BY url,title HAVING COUNT(1) > 2 WITH ROLLUP;

This will show total table count and how all (url,title) combinations that repeat twice and more than twice with the total for each at the bottom. If the sum of the repeats is more than 5% of the total:

CREATE TABLE RSS_items_URLTitle LIKE RSS_items;
INSERT INTO RSS_items_URLTitle (id,url,title)
 SELECT MIN(id),url,title
 FROM RSS_items GROUP BY url,title
 HAVING 
;
DELETE B.*
 FROM RSS_items_URLTitle A
 INNER JOIN RSS_items B
 USING (url,title) WHERE A.id <> B.id
;
DROP TABLE RSS_items_URLTitle;

This method made be very slow if you do not have an index on (url,title)

Otherwise, run this

CREATE TABLE RSS_items_New LIKE RSS_items;
CREATE TABLE RSS_items_URLTitle LIKE RSS_items;
INSERT INTO RSS_items_URLTitle (id,url,title)
 SELECT MIN(id),url,title
 FROM RSS_items GROUP BY url,title
;
INSERT INTO RSS_items_New
 SELECT B.*
 FROM RSS_items_URLTitle A
 INNER JOIN RSS_items USING (id)
;
ALTER TABLE RSS_items RENAME RSS_items_Zap;
ALTER TABLE RSS_items_New RENAME RSS_items;
DROP TABLE RSS_items_URLTitle;
DROP TABLE RSS_items_Zap;

This method may be better since it only operates on the table id.

Please look over both methods. Try testing it on test databases with copies of the data before running anything.

If you want to keep the last occurrence of duplicates, replace MIN(id) with MAX(id).

Give it a Try !!!

score 0 · Answer 2 · 2012-06-22 15:02:19Z

This query will return rows that have different IDs but everything else is the same:

Select RSS1.*
from RSS_items RSS1
inner join RSS_items RSS2
 on RSS1.id <> RSS2.id --IDs must be different
 and RSS1.url = RSS2.url
 and RSS1.title = RSS2.title
 and RSS1.body = RSS2.body

Then you can choose which record to delete.

Stack Exchange Network

How can I delete repeated rows?

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How can I delete repeated rows?

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions