SQL Server Assertion error On Availability Groups

Question 1

We are getting constant memory dumps from our SQL Server 2019 CU21 instance. We upgraded the instance to CU26 (the latest patch) hoping it would resolve the issue, but it did not.

The error log is filled with the error below. We have Availability Groups configured on this server.

SQL Server Assertion: File: <HadrAvailabilityGroupStore.cpp>, line=373 Failed Assertion = 'cbDecoded == cbDecodedData'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Any idea on what could be causing this and how to address this?

History

This happened in our staging environment. We are migrating from an old staging server to a new one. The old staging server runs Windows Server 2016 and SQL Server 2019; the new one runs Windows Server 2022 and SQL Server 2022. There are 15 AGs (each database is in its own AG) on the old staging server and 15 AGs on the new one. We created a DAG between the old and new AGs. All of this work was done a few weeks ago and was working fine; we verified that by comparing the LSNs between the old and new AGs. We also verified the DAG status before initiating the failover, and it was fine.

After we started the migration, the first memory dump on the global primary (old AG) reported "Access Violation occurred writing address" and the old AGs rolled over. The instance never recovered from that event, and the AGs were stuck in the resolving state (they would go offline, show a not-synchronizing state, etc.). We could not even connect to either of the old staging instances. The new staging instance is fine. Based on the suggestion above, I removed a few AGs from the WSFC and the instance stabilized after that. There are 5 AGs left on the old infrastructure, and they are stable now.

Other Observations

We have about 15 AGs on these servers (2 replicas). After dropping a few AGs, the memory dumps stopped and the instance stabilized. We dropped the AGs more or less at random, so I assume we must have dropped the AG, or the few AGs, that had the corrupted registry data.

The old staging servers were on SQL Server 2019 CU21. The first memory dump was "Access Violation occurred writing address 0000000000000000". The command in the input buffer that generated the access violation was "Drop Availability Group <>". I noticed a fix in CU22 that addresses an access violation when dropping a DAG if the AG is in a suspect state. I am wondering whether something like that might have happened.

New Questions

I reviewed the registry settings for these AGs; they have entries in the Configuration folder. I am not sure whether they are correct or expected.

1. Is it possible for the DAG migration to have caused this?
2. Is it possible for something like this to happen even on a regular AG (no DAG)?
3. Can we remove the AG from the WSFC if we cannot access the instance from SQL?

Answer 1

Original Response

SQL Server Assertion: File: <HadrAvailabilityGroupStore.cpp>, line=373 Failed Assertion = 'cbDecoded == cbDecodedData'

What this is saying is that the data read from the registry (assuming a WSFC-integrated AG) was decoded, and the number of bytes in the registry entry did not match the number of bytes decoded. That's... not a good thing.
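To make the failed condition concrete, here is a minimal Python sketch of the kind of check an assertion like `cbDecoded == cbDecodedData` expresses. It is purely illustrative: the 4-byte length header, the base64 encoding, and the names `encode_blob`, `decode_blob`, and `DecodeLengthMismatch` are assumptions made up for this example, not SQL Server's actual storage format or code.

```python
import base64
import struct


class DecodeLengthMismatch(Exception):
    """Raised when the decoded size differs from the size recorded in the header."""


def encode_blob(payload: bytes) -> bytes:
    # Hypothetical format: 4-byte little-endian payload length, then base64 text.
    return struct.pack("<I", len(payload)) + base64.b64encode(payload)


def decode_blob(blob: bytes) -> bytes:
    (expected_len,) = struct.unpack("<I", blob[:4])
    decoded = base64.b64decode(blob[4:])
    # Analogue of the assertion: the bytes actually decoded must match
    # the byte count that was recorded when the value was written.
    if len(decoded) != expected_len:
        raise DecodeLengthMismatch(
            f"expected {expected_len} bytes, decoded {len(decoded)}"
        )
    return decoded


good = encode_blob(b"replica-config")
assert decode_blob(good) == b"replica-config"

# Simulate damage by dropping the last 4 base64 characters.
damaged = good[:-4]
try:
    decode_blob(damaged)
except DecodeLengthMismatch as exc:
    print("mismatch:", exc)
```

The point is only that the stored value no longer agrees with itself; the check says nothing about what damaged it.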

The dump might be some help, but without private symbols it's going to take some time to find the structures.

If this happens on a single node (move the primary to another node to check), then either there is software on that node that is bad, interfering (antivirus, etc.), or broken (including SQL Server), or the registry entry became corrupt (or was manually edited).

If this happens on all nodes, then most likely the registry was corrupted or there is a third-party component interfering.

You'll have to investigate the configuration data in the registry by going to the cluster database location for that replica and for the availability group and see if it looks like the other ones:

HKEY_LOCAL_MACHINE\Cluster\Resources\{AG_RESOURCE_GUID}\Configuration
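One rough way to do the "does it look like the other ones" comparison is to summarize each resource's Configuration values by name and size and flag the odd ones out. The Python sketch below is an illustration under stated assumptions, not a supported tool: `read_cluster_configs` uses the standard `winreg` module and assumes it runs on a cluster node with read access to the hive, while `find_outliers` and the sample keys and value names are hypothetical. It compares shapes only (value names and lengths) and cannot tell whether the contents are valid.

```python
from collections import Counter

try:
    import winreg  # standard library, Windows only
except ImportError:
    winreg = None


def read_cluster_configs(root=r"Cluster\Resources"):
    """Return {resource_guid: {value_name: length}} for resources that have
    a Configuration subkey. Read-only."""
    if winreg is None:
        raise RuntimeError("winreg is only available on Windows")
    configs = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, root) as resources:
        for i in range(winreg.QueryInfoKey(resources)[0]):
            guid = winreg.EnumKey(resources, i)
            try:
                cfg = winreg.OpenKey(resources, guid + r"\Configuration")
            except OSError:
                continue  # this resource has no Configuration subkey
            with cfg:
                sizes = {}
                for j in range(winreg.QueryInfoKey(cfg)[1]):
                    name, data, _type = winreg.EnumValue(cfg, j)
                    sizes[name] = len(data) if hasattr(data, "__len__") else None
                configs[guid] = sizes
    return configs


def find_outliers(configs):
    """Flag resources whose value names differ from the most common set,
    or that contain zero-length values."""
    if not configs:
        return {}
    shapes = {guid: frozenset(sizes) for guid, sizes in configs.items()}
    common_shape, _ = Counter(shapes.values()).most_common(1)[0]
    report = {}
    for guid, shape in shapes.items():
        problems = []
        missing = common_shape - shape
        extra = shape - common_shape
        if missing:
            problems.append(f"missing values: {sorted(missing)}")
        if extra:
            problems.append(f"unexpected values: {sorted(extra)}")
        empty = sorted(n for n, size in configs[guid].items() if size == 0)
        if empty:
            problems.append(f"empty values: {empty}")
        if problems:
            report[guid] = problems
    return report


# Hypothetical data standing in for read_cluster_configs() output.
sample = {
    "{guid-1}": {"ValueA": 512, "ValueB": 128},
    "{guid-2}": {"ValueA": 498, "ValueB": 130},
    "{guid-3}": {"ValueA": 0},
}
print(find_outliers(sample))
```

Treat the output as hints only: value names can legitimately differ between resources, and other resource types with a Configuration subkey will skew the "most common" shape unless they are filtered out first.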

You should be able to reset the registry entry by removing the replica and adding it back in. This should generate a new replica configuration key.

Response To New Questions

The first memory dump was "Access Violation occurred writing address 0000000000000000". The command in the input buffer that generated the access violation was "Drop Availability Group <>". Noticed this fix in CU22 which talks about an access violation when dropping a DAG if the AG is in a suspect state.

It's highly doubtful, as the assertion clearly stated that the problem was decoding data from the registry. Could there be some interaction between these? Eh. Unlikely (possible, but highly unlikely).

Is it possible for the DAG migration to have caused this?

It's doubtful; this is quite a common scenario and is used constantly. That's not to say it couldn't happen, just that these code paths are highly exercised.

The error was with the data in the registry; it's more likely that someone or something was mucking around in the registry (or even bad drivers) than that the migration caused it.

Is it possible for something like this to happen even on a regular AG (no DAG)?

Yes, absolutely. The thing is that a distributed availability group doesn't have a corresponding WSFC resource, so the distributed availability group can't be the immediate source of the problem.

Can we remove the AG from the WSFC if we cannot access the instance from SQL?

There is nothing stopping you from removing the availability group. It'll stop the registry keys from being read (it should clean them up) and thus stop the issue from occurring. Re-creating it should create a new resource GUID, and thus new key data, and thus no issues. If it happens immediately after re-creating it, then you'll want to get some tracing, such as procmon, to see what's possibly interfering with this process.

Comment 1

Hello Sean, it is happening on both nodes. All AGs are down and we cannot even expand the "Availability Groups" folder in Object Explorer. Most of the time, we cannot even connect to the instance. We were migrating from SQL Server 2019 to SQL Server 2022 using a DAG, and this happened on the old AG (SQL Server 2019) in the middle of the migration.

Comment 2

@SqlData Can you post/upload/whatever the replica configuration values from the registry? Something is definitely hosed.

Comment 3

Thank you, Sean, for answering the questions. Is there anything we can do to ensure this won't happen during the prod migration? Anything specific to check for in the registry? We failed over the prod AGs recently due to maintenance work and they came online without issues. We do not have a DAG set up on the prod databases yet.

Comment 4

Since there was no data gathered to figure out what happened or where it may have happened, there's no data to suggest you can do anything proactive. @SqlData

Sean Gallardy · Answer 1 · 2024-04-25 11:19:36Z

