I tried to create a peice of code that helps me remove duplicates from my sql database and reset the ID's as well. I am not sure that it's error proof.
import mysql.connector
# Database connection parameters
host = 'host
database = 'database'
user = 'user'
password = 'password'
port = numbers
try:
# Establishing the connection
conn = mysql.connector.connect(
host=host,
database=database,
user=user,
password=password,
port=port
)
if conn.is_connected():
print('Connected to the database')
cursor = conn.cursor()
# Remove duplicates
delete_duplicates_query = """
DELETE t1 FROM articles t1
INNER JOIN articles t2
WHERE
t1.ID < t2.ID AND
t1.headline = t2.headline AND
t1.description = t2.description AND
t1.link = t2.link AND
t1.date = t2.date AND
t1.hour = t2.hour;
"""
cursor.execute(delete_duplicates_query)
conn.commit()
print('Duplicates removed')
# Reset primary key
reset_primary_key_query = """
SET @new_id = 0;
UPDATE articles SET ID = (@new_id := @new_id + 1);
ALTER TABLE articles AUTO_INCREMENT = 1;
"""
cursor.execute(reset_primary_key_query)
conn.commit()
print('Primary key reset')
except mysql.connector.Error as e:
print(f"Error: {e}")
finally:
if conn.is_connected():
cursor.close()
conn.close()
print('Database connection closed')
For reference
Table name : articles
ID
headline
description
link
date
hour
1 Answer 1
UNIQUE index
The OP statement to DELETE dups looks fine.
It's unclear why we would want to INSERT those rows in the first place,
only to change our mind and rip them out later.
Consider creating
a UNIQUE articles index on
(headline, description, link, date, hour)
.
Then errant INSERT attempts will immediately report an error,
and there will be no need for any DELETE cleanup.
timestamp
The separate {date
, hour
} columns seem inconvenient.
Consider storing a single timestamp
column.
You can always extract a date or an hour from it later if you need that.
It should store UTC time in seconds since the epoch (1970), with no timezone information. Worry about zone offsets on the way in and the way out, preferring to store just UTC on disk.
reset primary key
This seems ill advised.
Recommend you not attempt to change a
PK
id
after it has been stored.
Suppose a line of code syslog's an id
,
and then you change it.
Now the logfile no longer speaks truth.
Suppose you wind up with
FK
relationships. Now UPDATE articles SET id ...
will be updating random rows on those other tables.
Prefer to think of each id
as immutable.