A server I administer recently experienced a power outage that caused issues with the SSRS ReportServerTempDB. After the outage, the nightly job that runs Ola's DatabaseIntegrityCheck started reporting problems:
sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d Master
-Q "EXECUTE [dbo].[DatabaseIntegrityCheck]
@Databases = 'USER_DATABASES,SYSTEM_DATABASES',
@LogToTable = 'Y'" -b
Command: DBCC CHECKDB ([ReportServerTempDB]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130168) is missing references from parent (unknown) and previous (page (1:123239)) nodes. Possible bad root entry in system catalog.
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130169) is missing references from parent (unknown) and previous (page (1:123239)) nodes. Possible bad root entry in system catalog.
Msg 8979, Level 16, State 1, Server XXXXXXX, Line 1
Table error: Object ID 133575514, index ID 1, partition ID 72057594040156160, alloc unit ID 72057594041466880 (type In-row data). Page (1:130170) is missing references from parent (unkn...
Process Exit Code 1. The step failed.
However, our nightly backup job continued to report success every time it ran. SSRS didn't seem to have any problems either. The backup job does have multiple backup commands rolled up into one step:
BACKUP DATABASE [ReportServer] TO DISK = N'C:\Backup-SQLServer\ReportServer.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT, NAME = N'ReportServer-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10
GO
BACKUP DATABASE [ReportServerTempDB] TO DISK = N'C:\Backup-SQLServer\ReportServerTempDB.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT, NAME = N'ReportServerTempDB-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10
GO
BACKUP DATABASE [TEST-PH] TO DISK = N'C:\Backup-SQLServer\TEST-PH.bak' WITH NOFORMAT, COMPRESSION, CHECKSUM, INIT, NAME = N'TEST-PH-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10
GO
Step two of our backup job checks the files. It ran every night and reported success each time:
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\ReportServer.bak'
GO
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\ReportServerTempDB.bak'
GO
RESTORE VERIFYONLY FROM DISK= N'C:\Backup-SQLServer\TEST-PH.bak'
GO
This went on for a few days until a convenient downtime. I resolved the issue by restoring/overwriting ReportServerTempDB with a nightly backup from prior to the outage.
Our databases' PAGE_VERIFY_OPTION is set to CHECKSUM.
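Since the page verify option matters here, the current setting can be confirmed per database with a query like the one below; the ALTER statement is only a sketch of how it would be changed if it were not already CHECKSUM:

```sql
-- Check the page verify option for every database
SELECT name, page_verify_option_desc
FROM sys.databases;

-- Enable CHECKSUM if needed (note: only pages written after the
-- change receive a checksum; existing pages are untouched)
ALTER DATABASE [ReportServerTempDB] SET PAGE_VERIFY CHECKSUM;
```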
What scenarios would explain this behavior?
1 Answer
sp_BlitzErik wrote in a comment (since deleted):

...is asking [about the PAGE_VERIFY_OPTION] because of this: Blitz Results: Page Verification Not Optimal. The CHECKSUM option for backups will only throw errors if you're using the correct page verification option. It can sometimes be suboptimal if your database came from an older version of SQL Server.

And it will only throw errors, even with the CHECKSUM option in use for the page verify option, if the page became corrupted after writing to disk. And I don't suppose they're necessarily asking the question just because of that post - my guess is they have the option set to CHECKSUM already.
The scenario described in the question can and does happen when a page corruption happens in memory (either because of bad memory, a memory scribbler, or a SQL Server corruption bug), and then the corrupt page is written to disk with a valid checksum.
Subsequent reads of the page, including testing of the page checksum, will succeed. However, when DBCC CHECKDB (or a hapless query) runs, the corruption will become evident. In this case, it doesn't matter whether the BACKUP and RESTORE statements use CHECKSUM; they won't detect that corruption.
This is why you cannot just rely on using CHECKSUM everywhere and not run consistency checks.
Hope this helps explain what you saw.
From the errors you posted, it looks like the root page of that clustered index got whacked somehow.
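To see which table that Object ID belongs to, it can be resolved inside the affected database; a minimal sketch using the ID from the errors above:

```sql
USE [ReportServerTempDB];
GO
-- Resolve the Object ID reported by DBCC CHECKDB to a table name
SELECT OBJECT_NAME(133575514) AS corrupted_object;
GO
```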
- Would rebuilding that index have been an acceptable way to resolve this specific situation? – poke, Oct 5, 2017
- Well, rebuilding the index would have removed the corruption, but not identified or resolved the root cause of the corruption. I'd run full memory diagnostics on that server next time it gets rebooted. – Paul S. Randal, Oct 6, 2017
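For completeness, such a rebuild would look like the sketch below; the table and index names here are hypothetical placeholders, since the real ones would first have to be resolved from the Object ID in the DBCC output:

```sql
-- Hypothetical table/index names - resolve the real ones via OBJECT_NAME() first
ALTER INDEX [PK_SessionData_Hypothetical] ON [dbo].[SessionData] REBUILD;
```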
- VERIFYONLY is not going to help much; you have to use CHECKSUM while taking the backup, and then while verifying integrity you have to use both VERIFYONLY and CHECKSUM. This will do a lot more checks compared to plain VERIFYONLY.
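Following that suggestion, the verification step from the question could be extended like this (paths copied from the question):

```sql
RESTORE VERIFYONLY FROM DISK = N'C:\Backup-SQLServer\ReportServer.bak' WITH CHECKSUM;
GO
RESTORE VERIFYONLY FROM DISK = N'C:\Backup-SQLServer\ReportServerTempDB.bak' WITH CHECKSUM;
GO
```

Note that even WITH CHECKSUM this only re-validates the per-page checksums stored in the backup, so it still would not have caught corruption that received a valid checksum after happening in memory.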