I have a PostgreSQL 9.3 database running on an Ubuntu server.
About a month ago we had some problems with the server's hardware, which our VPS hosting company resolved fairly quickly, and everything seemed to be working fine.
We have backups running with Barman on a different server, and backups and restores were working (I checked).
The first sign of data corruption came a few days ago: I decided to do a full pg_dump of our DB, as I do every once in a while, and it failed ("invalid page header in block..."). The data seemed to have been corrupted a long time ago, around the time of the hardware problem (that was the date on the corrupted record). I tracked down the corrupted record, deleted it, and restored it manually.
After that I was able to do a full pg_dump.
To check for additional corruption, I've set up a different DB server from a backup and run pg_repack on all the tables to verify that I'm able to rebuild all the indexes and tables.
My questions are:
1. How can I be certain that I don't have any additional corruption in my DB?
2. What can I do to periodically check my data integrity?
3. What else can I do to verify the integrity of our DB besides dumping the whole DB and re-indexing it (which I already did)?
P.S. - I don't have block checksums enabled.
3 Answers
If you have initialized your cluster with checksums, you can verify them while backing up with pgBackRest.
PostgreSQL has supported page-level checksums since 9.3. If page checksums are enabled pgBackRest will validate the checksums for every file that is copied during a backup.
You will find more information about checksums here: https://www.postgresql.org/docs/current/static/app-initdb.html#APP-INITDB-DATA-CHECKSUMS
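For illustration, a minimal sketch of both steps, assuming example paths and a pgBackRest stanza named "main" that is already configured (none of these names come from the original answer):

```
# initialize a new cluster with page checksums enabled
# (data directory path is only an example)
initdb --data-checksums -D /var/lib/postgresql/9.3/main

# take a full backup with pgBackRest; page checksums are validated
# for every file copied ("main" is an assumed stanza name)
pgbackrest --stanza=main --type=full backup
```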
You can also run a pg_dump to /dev/null (if you don't have enough space to keep a logical dump). This should assure you that every data block can be read.
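A hedged example of such a throw-away dump (database name and user are assumptions):

```
# read every row of every table; the output is discarded,
# so this only verifies that the data can be read
pg_dump -U postgres mydb > /dev/null

# or, for every database in the cluster
pg_dumpall -U postgres > /dev/null
```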
The PGDG had a discussion this year because some wanted to activate checksums by default. You can find it here: http://www.postgresql-archive.org/Checksums-by-default-td5940158.html
If I may, there are now new tools for this:
- pg_verify_checksums (v11)
- pg_checksums (v12)
Both can run checks on an offline cluster, and pg_checksums can additionally enable or disable checksums on it. If you are worried about corruption, you can run either on a whole cluster or just portion(s) of it. Note that, from personal experience, pg_verify_checksums works fine on older versions, and pg_checksums is expected to have the same behavior.
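A minimal sketch of how these might be invoked (data directory paths are assumptions, and the server must be shut down cleanly before running them):

```
# PostgreSQL 11: verify checksums on a stopped cluster
pg_verify_checksums -D /var/lib/postgresql/11/main

# PostgreSQL 12+: the renamed tool can also enable/disable checksums offline
pg_checksums --check  -D /var/lib/postgresql/12/main
pg_checksums --enable -D /var/lib/postgresql/12/main
```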
Also, even though it is now quite old, I would advise reading this post: http://blog.ringerc.id.au/2012/10/avoiding-postgresql-database-corruption.html
As stated elsewhere, there are no corruption-specific tools in core PostgreSQL similar to those found in SQL Server (DBCC CHECKDB) or Oracle. That does not mean you cannot watch for corruption and try to prevent it as much as possible. A few thoughts:
- Use good hardware. This hardware should properly honor fsync. Not tunable in your case, but important nevertheless.
- Use checksums when you initialize your cluster. Sure, there is a little penalty for doing so, but it really is worth the hassle. You could create a new cluster, dump and reload your database into this new instance, and enjoy a more robust environment (see the sketch after this list). It is available since 9.3, so fine for you.
- Run regular, automated logical backups with pg_dump. It's good to have an automated physical backup with Barman, but this will not check corruption. Only pg_dump (which will read all rows in all tables) will give you insurance against table corruption. This is something you should definitely automate. If you don't have the space, redirect pg_dump to /dev/null. Corrupted indexes can always be rebuilt if you have a correct table. I think there are some index corruption finding tools out there.
- I also had a corrupted index which did not cause pg_dump to fail - only pg_repack/vacuum full were failing because of it - so I guess just doing a pg_dump is not enough (maybe automate a pg_dump and a pg_restore on a different server?). I will most probably re-create the cluster with checksums and restore it with pg_restore - my main problem is the downtime that comes along with it. Is it possible to enable checksums only on a slave server with streaming replication, or will replication not work in that case? – Rubinsh, May 15, 2017 at 8:33
- It's always good to have an automated backup/restore mechanism, and I can only encourage you to have one. If you want to replicate between a cluster without checksums and one with checksums and minimize downtime, you will have to go with logical replication (Slony, Londiste, Bucardo, ...). It will take longer to set up, but the switchover will be shorter. – KookieMonster, May 15, 2017 at 8:57
- When you're "pg_dumping", you read only data, not indexes. pg_restore will recreate your indexes from scratch, and you still won't know if an index is corrupted on your production server. Of course, having an automated backup/restore mechanism is good, but I suggest using a physical backup instead of a logical one. – Arkhena, May 16, 2017 at 7:11
- I agree with your first part; I even said "Only pg_dump [...] will give you insurance against table corruption". And I highly recommend having a physical backup solution such as the excellent Barman, but please plan a logical backup from time to time too if you care about corruption. – KookieMonster, May 16, 2017 at 7:25
If you pg_dump and later on pg_restore everything on a different server, I'd check that the SELECT count(*) FROM every table coincides with the source (and/or your expectations)... If that's meaningful, I'd assume that the restored database is not corrupt. Use this new one... If that's not meaningful...
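A minimal sketch of that comparison, assuming psql 9.6+ for \gexec and excluding only the system schemas (both assumptions on my part); run it against the source and the restored database and diff the output:

```
-- generate one "SELECT count(*)" statement per user table,
-- then execute them all with \gexec
SELECT format('SELECT %L AS table_name, count(*) FROM %I.%I;',
              schemaname || '.' || tablename, schemaname, tablename)
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1
\gexec
```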