2
\$\begingroup\$

I tried to create a peice of code that helps me remove duplicates from my sql database and reset the ID's as well. I am not sure that it's error proof.

import mysql.connector
# Database connection parameters
host = 'host
database = 'database'
user = 'user'
password = 'password'
port = numbers
try:
 # Establishing the connection
 conn = mysql.connector.connect(
 host=host,
 database=database,
 user=user,
 password=password,
 port=port
 )
 
 if conn.is_connected():
 print('Connected to the database')
 cursor = conn.cursor()
 # Remove duplicates
 delete_duplicates_query = """
 DELETE t1 FROM articles t1
 INNER JOIN articles t2 
 WHERE 
 t1.ID < t2.ID AND 
 t1.headline = t2.headline AND 
 t1.description = t2.description AND 
 t1.link = t2.link AND 
 t1.date = t2.date AND 
 t1.hour = t2.hour;
 """
 cursor.execute(delete_duplicates_query)
 conn.commit()
 print('Duplicates removed')
 # Reset primary key
 reset_primary_key_query = """
 SET @new_id = 0;
 UPDATE articles SET ID = (@new_id := @new_id + 1);
 ALTER TABLE articles AUTO_INCREMENT = 1;
 """
 cursor.execute(reset_primary_key_query)
 conn.commit()
 print('Primary key reset')
except mysql.connector.Error as e:
 print(f"Error: {e}")
finally:
 if conn.is_connected():
 cursor.close()
 conn.close()
 print('Database connection closed')

For reference

Table name : articles

ID

headline

description

link

date

hour

toolic
14.5k5 gold badges29 silver badges203 bronze badges
asked Jun 23, 2024 at 23:10
\$\endgroup\$

1 Answer 1

1
\$\begingroup\$

UNIQUE index

The OP statement to DELETE dups looks fine.

It's unclear why we would want to INSERT those rows in the first place, only to change our mind and rip them out later. Consider creating a UNIQUE articles index on (headline, description, link, date, hour). Then errant INSERT attempts will immediately report an error, and there will be no need for any DELETE cleanup.

timestamp

The separate {date, hour} columns seem inconvenient. Consider storing a single timestamp column. You can always extract a date or an hour from it later if you need that.

It should store UTC time in seconds since the epoch (1970), with no timezone information. Worry about zone offsets on the way in and the way out, preferring to store just UTC on disk.

reset primary key

This seems ill advised. Recommend you not attempt to change a PK id after it has been stored.

Suppose a line of code syslog's an id, and then you change it. Now the logfile no longer speaks truth.

Suppose you wind up with FK relationships. Now UPDATE articles SET id ... will be updating random rows on those other tables.

Prefer to think of each id as immutable.

answered Jun 24, 2024 at 17:59
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.