
We have a master-master replication setup on MySQL 5.6. Only one master is actively used; the other is for backup and failover (we'll call that the slave). binlog_format is set to ROW, and the auto-increment settings are configured to avoid conflicts.
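For context, the relevant settings look roughly like this (a sketch with hypothetical values; the increment/offset values depend on how many masters are in the ring):

```ini
# my.cnf fragment on master A -- values are illustrative
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
binlog_format            = ROW
auto_increment_increment = 2    # step size = number of masters
auto_increment_offset    = 1    # 1 on master A, 2 on master B
```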

The problem: the slave halts due to duplicate-key errors.

We traced the cause to bulk deletes run by cron jobs on the master that did not execute (completely?!?) on the slave. We are talking about tens of thousands of records that are NOT deleted from the slave. We found no errors in the MySQL error log.

That leaves the replica out of sync and produces errors when inserts reuse primary keys that should already have been deleted on the slave.

The tables are MyISAM.

Any idea why the bulk delete doesn't propagate properly on the replica?

asked Mar 29, 2016 at 14:31
  • It seems you have a multi-threaded slave, and on 5.6 you don't have the slave_preserve_commit_order variable, which often leads to issues like yours. Are you using a multi-threaded slave? Commented Mar 29, 2016 at 14:35
  • @NawazSohail It seems slave_preserve_commit_order was introduced in MySQL 5.7.5, and slave_parallel_workers is set to 0 on both servers Commented Mar 29, 2016 at 14:45
  • "Commit order" -- Isn't that InnoDB only? Commented Mar 30, 2016 at 23:56
  • Switch to InnoDB, then see if you can use PARTITIONing for bulk delete. Commented Mar 30, 2016 at 23:56

3 Answers


In the absence of any obvious, logical reason why this might be happening, I'm going to shamelessly invoke the MyISAM boogey-man.

This question reminds me of one from a few years back. It's not a duplicate question, but the underlying mechanisms could be similar. The workaround I provided in that case was specifically based on the premise of duplicate "unique" values existing in a table.

I suspect you have latent, undetected defects in your MyISAM tables, where there are actually -- via an unknown (to me) mechanism -- duplicate primary (or unique) key values hiding below the surface.

When MyISAM selects or deletes a row by primary key, it won't see these duplicates: as soon as it finds one row, it stops looking, because there "can't" be more to find. Yet if you SELECT * you'll get the duplicates... and deleting one of the rows would then unmask the other.

One way to test this might be:

mysql> CREATE TABLE test_table LIKE real_table;
mysql> INSERT INTO test_table SELECT * FROM real_table;

If this is indeed at the root of what's going on, you might get "lucky" and get a duplicate key error, which would prove the theory... because a duplicate key error should be impossible if the original table's data is intact.
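A more direct probe for latent index corruption, if that's what's going on, is MySQL's own consistency check (EXTENDED does a full row-by-key lookup, so it is slow on large tables; real_table stands in for your table name as above):

```sql
mysql> CHECK TABLE real_table EXTENDED;
mysql> -- only if corruption is reported, rebuild the MyISAM indexes:
mysql> REPAIR TABLE real_table;
```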

You could, of course, review the binary logs using mysqlbinlog, to confirm that the deletions were logged... but I suspect you will find that the deletes did get logged correctly... but after the rows were deleted, the phantom rows were then visible, and caused the subsequent replication error.
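For example (the binlog file name below is a placeholder), ROW events can be decoded into readable pseudo-SQL to confirm the deletes were logged:

```shell
# -v decodes row events into commented pseudo-SQL;
# --base64-output=decode-rows suppresses the raw base64 payload
mysqlbinlog --base64-output=decode-rows -v mysql-bin.000123 | grep -c 'DELETE FROM'
```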

answered Mar 30, 2016 at 2:01
  • It seems that this is not the case, as we found tens of thousands of undeleted rows and could not find any duplicates. Commented Mar 30, 2016 at 6:31

After some more digging we found the problem.

At some point this week, the main server was restarted. Because of this restart, some of the inserts/updates/deletes weren't written to the binlog; we don't know why yet. As a result, some changes weren't replicated to the secondary server.

Some duplicate-key errors appeared on the slave when inserts were made on the master. Also, when the cron jobs (bulk deletes) were executed, we got record-not-found errors. We decided to skip these errors.

When we skipped the error generated by the delete operation, it didn't skip just one record; it skipped all of them. And this is why those tens of thousands of records weren't deleted from the slave.
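This matches how sql_slave_skip_counter behaves: it skips whole event groups, not individual rows, so skipping "one" error can discard an entire bulk DELETE's row events. A sketch of the skip procedure (1062 is the usual duplicate-key error code, 1032 record-not-found):

```sql
-- On the slave, after replication stops:
SHOW SLAVE STATUS\G                     -- inspect Last_SQL_Errno / Last_SQL_Error
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skips the NEXT WHOLE event group,
                                        -- e.g. an entire bulk DELETE's events
START SLAVE;
```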

answered Mar 30, 2016 at 10:42
  • An un-graceful shutdown (e.g., a power failure) leaves many things in a questionable state. Look at sync_binlog as a likely villain. Commented Mar 30, 2016 at 23:55

To counter such issues in a production environment where data consistency is a priority and slave servers are actively used for reads, it's best practice to deploy an automated discrepancy-check-and-sync script (as a daily cron) between the master and slave servers using pt-table-sync (part of the Percona Toolkit).

I have tested and deployed such scripts in my production environment, which has large databases (more than 100 GB), so I don't need to worry in such failure and disaster cases: when run, these scripts check for discrepancies in record counts between master and slave, automatically sync the databases, and provide the necessary stats if needed.
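A minimal sketch of such a run (host and user names are placeholders; always dry-run with --print before using --execute):

```shell
# Show what would be changed on the slave (hypothetical DSN values)
pt-table-sync --print --sync-to-master h=slave1,u=ptuser,p=secret

# Then actually repair the slave from the master
pt-table-sync --execute --sync-to-master h=slave1,u=ptuser,p=secret
```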

answered Jun 13, 2016 at 14:33
