Distributed Availability Group Direct Seeding FAILED, failure_state SQL Error, failure_state 2

Question 1

We just started setting up Distributed Availability groups to replicate our production databases into a new reporting cluster. The first availability group that we setup for replication worked great without any issues, however when we then moved on to the second availability group with much larger databases (over 3TB total) it took much longer and two of the 5 databases failed. We setup the distributed availability group to use direct seeding and when querying the sys.dm_hadr_automatic_seeding table it indicates the current_state as FAILED, with failure_state 2 (SQL Error) or 21 (Seeding Check Message Timeout):

dm_hadr_automatic_seeding

What can we do to troubleshoot this issue?

Question 2

that link says it is fixed in CU1 for 2017 but i am on CU17 and still an issue. Use Backup and restore option to avoid this nightmare! I sometimes had to retry like 7 times to get a db onto the AG using automaitc seeding

Question 3

The AlwaysOn Professional blog has some general troubleshooting steps for direct seeding and also includes some details about trace flag 9567 to enable compression during seeding, but I didn't find any details about the SQL Error or Seeding Timeout.

We previously have had issues with large databases causing problems in availability groups, but this usually is resolved by applying the latest transaction logs from the primary against the replica.

In this case the databases were listed on the secondary availability group as recovering, so I tried applying the latest transaction log backups from the primary and then joining the database to the secondary availability group:

--Restore transaction logs from primary and stay in recovery mode. Multiple backup files may need to be restored from oldest to newest.
RESTORE LOG stackoverflow from disk = '\\Backups\SQL\_Trans\StackOverflow_AG\StackOverflow\StackOverflow_LOG_20170810_175400.trn' WITH NORECOVERY;
ALTER DATABASE stackoverflow SET HADR AVAILABILITY GROUP = [StackOverflow_RAG];
ALTER DATABASE stackoverflow SET HADR RESUME;

This worked for both of the failed databases and fixed the replication issues. Our reporting cluster now has all databases kept in sync from the primary availability group:

enter image description here

Question 4

I see you are using opserver in your screenshot. I am struggling to get it working from ground up. Do you mind writing a blog post on how external people can get this up and running + do you store historical data so that you can go back and see what happened last week ? Appreciate it and it will be huge help if you can blog a series :-)

Question 5

There's a bug in SQL Server 2016/2017 that is still not fixed, just hit this problem on SP2 with CU2. Doubt trace flag 9567 is the solution, as my databases were small, 1GB.

https://support.microsoft.com/en-us/help/4040519/fix-automatic-seeding-in-availability-groups-randomly-causes-error-411

Note the number_of_attempts in the query below, value is 1 for me, I'd expect a few retries before the direct seeding gives up, but no idea where to set that.

USE [master];
SELECT TOP 100 * FROM sys.dm_hadr_automatic_seeding WHERE current_state = 'FAILED' ORDER BY start_time DESC;

Greg Bray Greg Bray 1,0533 gold badges13 silver badges20 bronze badges · Answer 1 · 2017-08-10 18:52:19Z

The AlwaysOn Professional blog has some general troubleshooting steps for direct seeding and also includes some details about trace flag 9567 to enable compression during seeding, but I didn't find any details about the SQL Error or Seeding Timeout.

We previously have had issues with large databases causing problems in availability groups, but this usually is resolved by applying the latest transaction logs from the primary against the replica.

In this case the databases were listed on the secondary availability group as recovering, so I tried applying the latest transaction log backups from the primary and then joining the database to the secondary availability group:

--Restore transaction logs from primary and stay in recovery mode. Multiple backup files may need to be restored from oldest to newest.
RESTORE LOG stackoverflow from disk = '\\Backups\SQL\_Trans\StackOverflow_AG\StackOverflow\StackOverflow_LOG_20170810_175400.trn' WITH NORECOVERY;
ALTER DATABASE stackoverflow SET HADR AVAILABILITY GROUP = [StackOverflow_RAG];
ALTER DATABASE stackoverflow SET HADR RESUME;

This worked for both of the failed databases and fixed the replication issues. Our reporting cluster now has all databases kept in sync from the primary availability group:

enter image description here

I see you are using opserver in your screenshot. I am struggling to get it working from ground up. Do you mind writing a blog post on how external people can get this up and running + do you store historical data so that you can go back and see what happened last week ? Appreciate it and it will be huge help if you can blog a series :-)

score 0 · Answer 2 · 2018-08-03 18:37:54Z

There's a bug in SQL Server 2016/2017 that is still not fixed, just hit this problem on SP2 with CU2. Doubt trace flag 9567 is the solution, as my databases were small, 1GB.

https://support.microsoft.com/en-us/help/4040519/fix-automatic-seeding-in-availability-groups-randomly-causes-error-411

Note the number_of_attempts in the query below, value is 1 for me, I'd expect a few retries before the direct seeding gives up, but no idea where to set that.

USE [master];
SELECT TOP 100 * FROM sys.dm_hadr_automatic_seeding WHERE current_state = 'FAILED' ORDER BY start_time DESC;

Stack Exchange Network

Distributed Availability Group Direct Seeding FAILED, failure_state SQL Error, failure_state 2

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Distributed Availability Group Direct Seeding FAILED, failure_state SQL Error, failure_state 2

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions