
We have a master-master replication setup on MySQL 5.6. Only one master is actively used; the other is for backup and failover (we'll call that the slave). binlog_format is set to ROW, and the auto-increment settings are configured to avoid conflicts.
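For context, the relevant settings look roughly like this (a sketch with hypothetical values; the increment/offset values depend on how many masters are in the ring):

```ini
# my.cnf fragment on master A -- values are illustrative
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
binlog_format            = ROW
auto_increment_increment = 2    # step size = number of masters
auto_increment_offset    = 1    # 1 on master A, 2 on master B
```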

The problem: the slave halts due to duplicate-key errors.

We traced the cause to bulk deletes run by cron jobs on the master that did not execute (completely?!?) on the slave. We are talking about tens of thousands of records that are NOT deleted from the slave. We found no errors in the MySQL error log.

That leaves the replica out of sync and produces errors when inserts reuse primary keys that should already have been deleted on the slave.

The tables are MyISAM.

Any idea why the bulk delete doesn't propagate properly on the replica?

asked Mar 29, 2016 at 14:31
  • It seems you have a multi-threaded slave, and on 5.6 you don't have the slave_preserve_commit_order variable, which often leads to issues like yours. Are you using a multi-threaded slave? Commented Mar 29, 2016 at 14:35
  • @NawazSohail It seems slave_preserve_commit_order was introduced in MySQL 5.7.5, and slave_parallel_workers is set to 0 on both servers Commented Mar 29, 2016 at 14:45
  • "Commit order" -- Isn't that InnoDB only? Commented Mar 30, 2016 at 23:56
  • Switch to InnoDB, then see if you can use PARTITIONing for bulk delete. Commented Mar 30, 2016 at 23:56

3 Answers


In the absence of any obvious, logical reason why this might be happening, I'm going to shamelessly invoke the MyISAM boogey-man.

This question reminds me of one from a few years back. It's not a duplicate question, but the underlying mechanisms could be similar. The workaround I provided in that case was specifically based on the premise of duplicate "unique" values existing in a table.

I suspect you have latent, undetected defects in your MyISAM tables, where there are actually -- via an unknown (to me) mechanism -- duplicate primary (or unique) key values hiding below the surface.

When MyISAM selects or deletes a row by primary key, it won't see these duplicates: as soon as it finds one row, it stops looking, because there "can't" be more to find. Yet if you SELECT * you'll get the duplicates... and deleting one of the rows would then unmask the other.

One way to test this might be:

mysql> CREATE TABLE test_table LIKE real_table;
mysql> INSERT INTO test_table SELECT * FROM real_table;

If this is indeed at the root of what's going on, you might get "lucky" and get a duplicate key error, which would prove the theory... because a duplicate key error should be impossible if the original table's data is intact.
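A more direct probe for latent index corruption, if that's what's going on, is MySQL's own consistency check (EXTENDED does a full row-by-key lookup, so it is slow on large tables; real_table stands in for your table name as above):

```sql
mysql> CHECK TABLE real_table EXTENDED;
mysql> -- only if corruption is reported, rebuild the MyISAM indexes:
mysql> REPAIR TABLE real_table;
```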

You could, of course, review the binary logs using mysqlbinlog, to confirm that the deletions were logged... but I suspect you will find that the deletes did get logged correctly... but after the rows were deleted, the phantom rows were then visible, and caused the subsequent replication error.
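For example (the binlog file name below is a placeholder), ROW events can be decoded into readable pseudo-SQL to confirm the deletes were logged:

```shell
# -v decodes row events into commented pseudo-SQL;
# --base64-output=decode-rows suppresses the raw base64 payload
mysqlbinlog --base64-output=decode-rows -v mysql-bin.000123 | grep -c 'DELETE FROM'
```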

answered Mar 30, 2016 at 2:01
  • It seems that this is not the case, as we found tens of thousands of undeleted rows and could not find any duplicates. Commented Mar 30, 2016 at 6:31

After some more digging we found the problem.

At some point this week, the main server was restarted. Because of this restart, some of the inserts/updates/deletes weren't written to the binlog; we don't know why yet. As a result, some changes weren't replicated to the secondary server.

Some duplicate-key errors appeared on the slave when inserts were made on the master. Also, when the cron jobs (bulk deletes) were executed, we got record-not-found errors. We decided to skip these errors.

When we skipped the error generated by the delete operation, it didn't skip just one record; it skipped all of them. And this is why those tens of thousands of records weren't deleted from the slave.
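This matches how sql_slave_skip_counter behaves: it skips whole event groups, not individual rows, so skipping "one" error can discard an entire bulk DELETE's row events. A sketch of the skip procedure (1062 is the usual duplicate-key error code, 1032 record-not-found):

```sql
-- On the slave, after replication stops:
SHOW SLAVE STATUS\G                     -- inspect Last_SQL_Errno / Last_SQL_Error
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skips the NEXT WHOLE event group,
                                        -- e.g. an entire bulk DELETE's events
START SLAVE;
```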

answered Mar 30, 2016 at 10:42
  • An un-graceful shutdown (e.g., a power failure) leaves many things in a questionable state. Look at sync_binlog as a likely villain. Commented Mar 30, 2016 at 23:55

To counter such issues in a production environment where data consistency is a priority and slave servers are actively used for reads, it's best practice to deploy an automated discrepancy-check-and-sync script (as a daily cron) between the master and slave servers using pt-table-sync (part of the Percona Toolkit).

I have tested and deployed such scripts in my production environment, which has large databases (more than 100 GB), so I don't need to worry in such failure and disaster cases: when run, these scripts check for discrepancies in record counts between master and slave, automatically sync the databases, and provide the necessary stats if needed.
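A minimal sketch of such a run (host and user names are placeholders; always dry-run with --print before using --execute):

```shell
# Show what would be changed on the slave (hypothetical DSN values)
pt-table-sync --print --sync-to-master h=slave1,u=ptuser,p=secret

# Then actually repair the slave from the master
pt-table-sync --execute --sync-to-master h=slave1,u=ptuser,p=secret
```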

answered Jun 13, 2016 at 14:33
