Backups/CHECKDB errors - "Cannot find server certificate"

Question 1

I have been exchanging TDE certificates on various servers to make restores easier - some had the wrong name, the wrong thumbprint, or even both. This involves transferring databases to a temporary cert, dropping old certs, creating the intended cert (from the certificate & key files), and transferring databases to it. This has all gone fine and dandy, except for databases on one server.

On executing DBCC CHECKDB(SomeDatabase) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY, I receive this:

Msg 33111, Level 16, State 3, Line 106
Cannot find server certificate with thumbprint '<old removed thumbprint>'.
Msg 1823, Level 16, State 2, Line 106
A database snapshot cannot be created because it failed to start.
Msg 1823, Level 16, State 8, Line 106
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 106
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.

If rerun with info messages, it does the expected CHECKDB behavior afterwards, so it appears to get the necessary access for offline.

Backups have a similar output. I typically use the SSMS UI for ad hoc backups, which scripts out as this command: BACKUP DATABASE [SomeDatabase] TO DISK = N'some\path\SomeDatabase.bak' WITH DIFFERENTIAL , NOFORMAT, NOINIT, NAME = N'SomeDatabase-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10. This results in:

Msg 33111, Level 16, State 3, Line 108
Cannot find server certificate with thumbprint '<old removed thumbprint>'.
Msg 3013, Level 16, State 1, Line 108
BACKUP DATABASE is terminating abnormally.

The same thing happens with full and copy-only backups.

Observing the encryption status:

SELECT a.dbid, a.name AS DatabaseName
,b.encryption_state
,CASE b.encryption_state
WHEN 0 THEN 'No database encryption key present, no encryption'
WHEN 1 THEN 'Unencrypted - Encryption enabled, but not turned on'
WHEN 2 THEN 'Encryption in progress'
WHEN 3 THEN 'Encrypted'
WHEN 4 THEN 'Key change in progress'
WHEN 5 THEN 'Decryption in progress'
WHEN 6 THEN 'Protection change in progress (The certificate or asymmetric key that is encrypting the database encryption key is being changed.)'
END AS encryption_state_desc
,percent_complete
,encryptor_type
,key_algorithm
,key_length
,encryptor_thumbprint
,create_date
,regenerate_date
,modify_date
,set_date
,opened_date
FROM master.dbo.sysdatabases a
LEFT JOIN sys.dm_database_encryption_keys b
ON a.dbid = b.database_id
ORDER BY b.encryption_state desc,
DatabaseName;

The failing databases are in state 1 (database encryption key created but encryption not turned on) and show the new TdeCert thumbprint as the encryptor_thumbprint column. There are a few databases that are in state 3 (encrypted), and those will happily back up without the old certificate. If I simply put the certificate back on the server, then CHECKDB/backups will happily execute - no additional commands against the database are required.

Questions

Why is it still trying to use the old certificate? I thought that since the database encryption key was encrypted by the new certificate, there would be no need for the old certificate anymore.
Why does simply putting the old cert back on the server make things work again? I would have expected to have to change something in the database again.
Why isn't there an error when I remove the old certificate if something is using it?
Is there a likely reason that I encountered this behavior on only one server?

General information

Version: Microsoft SQL Server 2019 (RTM-CU27) (KB5037331) - 15.0.4375.4 (X64) ... Standard Edition
TDE is set up with symmetric keys, no external key vaults.
Everything is on-prem, no Azure/cloud factors.
This is a non-prod server; all databases use the SIMPLE recovery model (no transaction logs).
These errors occur in automated jobs as well as statements executed directly in SSMS. Both myself and the account used for jobs have sysadmin on the server.
This is happening on a single server. Every other server I have exchanged certificates on (including ones on the same version) has not had this problem, and I cannot replicate this problem on other servers.
The old cert had the same name as the certificate I would like it to use. I wouldn't expect this to matter - it didn't matter to other servers - but given that something seems to be "stuck", it may be relevant.

Tests

Failure: CHECKDB outputs an error message and goes to work offline
Success: No error message, commands execute successfully

Fully enable encryption (ALTER DATABASE SomeDatabase SET ENCRYPTION ON) - failure
Just bring the old cert back, don't alter any databases - success
1. But as soon as the old cert is dropped again, it goes back to failure
Change database encryption key to something else, then run CHECKDB
1. New temporary certificate - failure
2. Restoring the desired certificate and changing the DEK to that cert - success
3. Transfer back to temporary cert while restored cert is still present - success
4. Drop old/restored cert - failure
Regenerate database encryption key - failure
Get rid of the new standard TdeCert certificate and rerun the transfer script - still a failure
Drop the database encryption key, then run CHECKDB - success
1. Create a new database encryption key with any cert - success

Related items that didn't solve my problem

This question demonstrates something quite similar, but doesn't seem relevant as I am on a different version, several releases after the described fix.
This Microsoft article describing an error where "Cannot find server certificate" occurred when specifying COMPRESSION and MAXTRANSFERSIZE options. The backup command I am using does not specify these options, and the described issue was resolved in older versions.
There are many questions and articles out there about encountering "Cannot find server certificate" when attempting a restore, none of which appear relevant as I am not doing a restore.

Question 2

Totally removing TDE would involve dropping the DEKs. This will make the database work "permanently", whether or not the old cert is present for any part of the process (test 6). I'm not opposed to this as a method of resolution, but am curious as to the cause (and why changing the key doesn't fix it but dropping the key does).

Question 3

The way that rotating DEK protecting certificates work and the way TDE works requires that the old certificate be available for some time after the change, unless you've properly planned log truncation and backups. Let me explain.

The certificate which protects the DEK is stamped in the database configuration, as such it knows which certificate is currently protecting it and thus can attempt to find it in the instance master (or in the case of a contained AG, also the replicated master). When a new certificate is used to protect the DEK, the old certificate thumbprint is moved to the historical certificate protection spot and the new one overwrites the current protector spot in the database configuration, there is only ever history for 2 protectors (you'll get an error in certain situations if you attempt to rotate keys too quickly).

When the protection of the DEK is changed, the current VLF which has a stamp of similar information in its header, needs to end and thus the VLF is ended at this point with the old protector. The next VLF is used and this is stamped with the new protector.

The reason you're receiving errors running CheckDB and Backups is due to those processes needing to go back to VLFs which are protected by the old certificates. To fix this you'll need to put the old certificates back and run the log backups/truncate the logs until only the new protector is used. Then you can remove the old protector from the instance - but note that you'll still need it to restore backups, etc., - when you've validated that the old protector isn't needed by using the vlf_encryptor_thumbprint column in sys.dm_db_log_info and either joining it or looking up the thumbprints in sys.certificates in master.

Question 4

If the problem if that it needs the old certificate to read VLFs, why does dropping the key also resolve the matter? The DROP DEK page mentions that the transaction log may remain encrypted, so I would expect to have the same problem. Dropping the key did not make a discernible difference on sys.dm_db_log_info. whether I had shrunk the files before or not.

Question 5

Dropping the keys may switch VLFs, depending on the recovery model of the database this may be enough (multiple vlf switches) to cause log truncation (simple) or past the needed VLFs via checkpoint and other internal items. @ImperviousInclemency

Sean Gallardy Sean Gallardy 38.5k3 gold badges49 silver badges91 bronze badges · Accepted Answer · 2024-07-03 21:04:35Z

The way that rotating DEK protecting certificates work and the way TDE works requires that the old certificate be available for some time after the change, unless you've properly planned log truncation and backups. Let me explain.

The certificate which protects the DEK is stamped in the database configuration, as such it knows which certificate is currently protecting it and thus can attempt to find it in the instance master (or in the case of a contained AG, also the replicated master). When a new certificate is used to protect the DEK, the old certificate thumbprint is moved to the historical certificate protection spot and the new one overwrites the current protector spot in the database configuration, there is only ever history for 2 protectors (you'll get an error in certain situations if you attempt to rotate keys too quickly).

When the protection of the DEK is changed, the current VLF which has a stamp of similar information in its header, needs to end and thus the VLF is ended at this point with the old protector. The next VLF is used and this is stamped with the new protector.

The reason you're receiving errors running CheckDB and Backups is due to those processes needing to go back to VLFs which are protected by the old certificates. To fix this you'll need to put the old certificates back and run the log backups/truncate the logs until only the new protector is used. Then you can remove the old protector from the instance - but note that you'll still need it to restore backups, etc., - when you've validated that the old protector isn't needed by using the vlf_encryptor_thumbprint column in sys.dm_db_log_info and either joining it or looking up the thumbprints in sys.certificates in master.

If the problem if that it needs the old certificate to read VLFs, why does dropping the key also resolve the matter? The DROP DEK page mentions that the transaction log may remain encrypted, so I would expect to have the same problem. Dropping the key did not make a discernible difference on sys.dm_db_log_info. whether I had shrunk the files before or not.
Dropping the keys may switch VLFs, depending on the recovery model of the database this may be enough (multiple vlf switches) to cause log truncation (simple) or past the needed VLFs via checkpoint and other internal items. @ImperviousInclemency

Stack Exchange Network

Backups/CHECKDB errors - "Cannot find server certificate"

Questions

General information

Tests

Related items that didn't solve my problem

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions