A server I administer recently experienced a power outage that caused issues with the SSRS ReportServerTempDB. After the outage, the nightly job that runs Ola's DatabaseIntegrityCheck started reporting problems:
sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d Master
-Q "EXECUTE [dbo].[DatabaseIntegrityCheck]
@Databases = 'USER_DATABASES,SYSTEM_DATABASES',
@LogToTable = 'Y'" -b
Command: DBCC CHECKDB ([ReportServerTempDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130168) is missing references from parent (unknown) and previous (page (1:123239)) nodes. Possible bad root entry in system catalog.
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130169) is missing references from parent (unknown) and previous (page (1:123239)) nodes. Possible bad root entry in system catalog.
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130170) is missing references from parent (unkn...
Process Exit Code 1. The step failed.
However, our nightly backup job continued to report success every time it ran. SSRS didn't seem to have any problems either. The backup job does have multiple backup commands rolled up into one step:
BACKUP DATABASE [ReportServer] TO DISK = N'C:\Backup-SQLServer\ReportServer.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT, NAME = N'ReportServer-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10
GO
BACKUP DATABASE [ReportServerTempDB] TO DISK = N'C:\Backup-SQLServer\ReportServerTempDB.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT, NAME = N'ReportServerTempDB-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10
GO
BACKUP DATABASE [TEST-PH] TO DISK = N'C:\Backup-SQLServer\TEST-PH.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT, NAME = N'TEST-PH-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10
GO
Step two of our backup job checks the files. It ran every night and reported success each time:
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\ReportServer.bak'
GO
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\ReportServerTempDB.bak'
GO
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\TEST-PH.bak'
GO
This went on for a few days until a convenient downtime. I resolved the issue by restoring/overwriting ReportServerTempDB with a nightly backup from prior to the outage.
Our databases' PAGE_VERIFY_OPTION is set to CHECKSUM.
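Since the page verify option matters here, the current setting can be confirmed per database with a query like the one below; the ALTER statement is only a sketch of how it would be changed if it were not already CHECKSUM:

```sql
-- Check the page verify option for every database
SELECT name, page_verify_option_desc
FROM sys.databases;

-- Enable CHECKSUM if needed (note: only pages written after the
-- change receive a checksum; existing pages are untouched)
ALTER DATABASE [ReportServerTempDB] SET PAGE_VERIFY CHECKSUM;
```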
What scenarios would explain this behavior?
1 Answer
sp_BlitzErik wrote in a comment (since deleted):

...is asking [about the PAGE_VERIFY_OPTION] because of this: Blitz Results: Page Verification Not Optimal. The CHECKSUM option for backups will only throw errors if you're using the correct page verification option. It can sometimes be suboptimal if your database came from an older version of SQL Server.

And it will only throw errors, even with the CHECKSUM option in use for the page verify option, if the page became corrupted after writing to disk. And I don't suppose they're necessarily asking the question just because of that post - my guess is they have the option set to CHECKSUM already.
The scenario described in the question can and does happen when a page corruption happens in memory (either because of bad memory, a memory scribbler, or a SQL Server corruption bug), and then the corrupt page is written to disk with a valid checksum.
Subsequent reads of the page, including testing of the page checksum, will succeed. However, when DBCC CHECKDB (or a hapless query) runs, the corruption will become evident. In this case, it doesn't matter whether the BACKUP and RESTORE statements use CHECKSUM; they won't detect that corruption.
This is why you cannot just rely on using CHECKSUM everywhere and not run consistency checks.
Hope this helps explain what you saw.
From the errors you posted, it looks like the root page of that clustered index got whacked somehow.
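To see which table that Object ID belongs to, it can be resolved inside the affected database; a minimal sketch using the ID from the errors above:

```sql
USE [ReportServerTempDB];
GO
-- Resolve the Object ID reported by DBCC CHECKDB to a table name
SELECT OBJECT_NAME(133575514) AS corrupted_object;
GO
```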
- Would rebuilding that index have been an acceptable way to resolve this specific situation? – poke, Oct 5, 2017
- Well, rebuilding the index would have removed the corruption, but not identified or resolved the root cause of the corruption. I'd run full memory diagnostics on that server next time it gets rebooted. – Paul S. Randal, Oct 6, 2017
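For completeness, such a rebuild would look like the sketch below; the table and index names here are hypothetical placeholders, since the real ones would first have to be resolved from the Object ID in the DBCC output:

```sql
-- Hypothetical table/index names - resolve the real ones via OBJECT_NAME() first
ALTER INDEX [PK_SessionData_Hypothetical] ON [dbo].[SessionData] REBUILD;
```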
- VERIFYONLY is not going to help much; you have to use CHECKSUM while taking the backup, and then while verifying integrity you have to use both VERIFYONLY and CHECKSUM. This will do a lot more checks compared to plain VERIFYONLY.
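Following that suggestion, the verification step from the question could be extended like this (paths copied from the question):

```sql
RESTORE VERIFYONLY FROM DISK = N'C:\Backup-SQLServer\ReportServer.bak' WITH CHECKSUM;
GO
RESTORE VERIFYONLY FROM DISK = N'C:\Backup-SQLServer\ReportServerTempDB.bak' WITH CHECKSUM;
GO
```

Note that even WITH CHECKSUM this only re-validates the per-page checksums stored in the backup, so it still would not have caught corruption that received a valid checksum after happening in memory.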