This is on our dev systems. We send ZFS snapshots from production to the dev systems, clone the received snapshot, run a full reindex in PostgreSQL, and then take an offline ZFS snapshot of the database (the @fresh-index snapshot). At this point everything works great, and we've done this for at least 8 years. We do it so we can do development and revert quickly.
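For context, the refresh looks roughly like this (dataset, snapshot, and service names are placeholders, not our real ones):
zfs clone tank/recv/pgdata@from-prod tank/dev/pgdata   # clone the received snapshot
systemctl start postgresql-15
sudo -u postgres reindexdb --all                       # full reindex of every database
systemctl stop postgresql-15                           # shut down for an offline snapshot
zfs snapshot tank/dev/pgdata@fresh-index               # the @fresh-index snapshot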
It's only after we run our new development DB schema upgrades and data migrations, and then roll back to @fresh-index, that we get corruption. We get errors when refreshing a materialized view, for example: "ERROR: expected one dependency record for TOAST table, found 0". We also can't re-run our conversion script in PG; it errors with messages like "already exists", which it shouldn't after a rollback to an offline snapshot. It's almost as if something isn't fully written to disk, or maybe something is stored by Postgres outside the data dir.
What never works is rolling back to that fresh-index point. This is what I'm trying to understand and would like to get working.
What does work is re-cloning the received snapshot and starting over. What also works: after the schema updates, any further snapshots roll back fine, provided the rollback target isn't before the schema updates.
Rolling back snapshots also works if we do it before running the schema and data upgrades.
It's only the combination of A) doing big schema/data updates and B) rolling back to the fresh index that fails.
Online snapshots or offline snapshots make no difference. This happens on two dev systems, both with the same setup on ZFS mirrors, on enterprise SSDs.
I'm at a loss as to why Postgres is not able to switch back to a previous state, especially when the service is shut down.
I've tried setting sync=always on the filesystem; no difference. When I take the snapshots I run "sync" and also "zpool sync", then sleep for up to 30 seconds; that doesn't seem to help either.
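Concretely, that attempt looks something like this (pool and dataset names here are placeholders, not our real ones):
zfs set sync=always tank/dev/pgdata          # force synchronous write semantics on the dataset
sync                                         # flush OS-level caches
zpool sync tank                              # force pending transaction groups out to disk
sleep 30
zfs snapshot tank/dev/pgdata@before-upgrade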
I've added some config options in Postgres to limit dirty buffers and also to fsync on WAL writes. But even then, why would any of that matter if I shut Postgres down and take the snapshot offline?
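I'm not convinced the exact parameters matter here, but they're settings of this kind (illustrative values, not a recommendation):
# postgresql.conf (illustrative)
bgwriter_lru_maxpages = 1000        # let the background writer clean dirty buffers more aggressively
backend_flush_after = 256kB         # ask the kernel to start writeback of backend writes sooner
checkpoint_flush_after = 256kB      # same for checkpoint writes
wal_sync_method = fsync             # flush WAL with fsync()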
I have Postgres 15.7 currently, on AlmaLinux 8. ZFS is, I think, 2.1.15. This runs in a virtual machine (KVM) with the disks passed through to the guest with cache=none. It happens the same way on both systems. The Postgres data directory, all WAL, and the log directories are on a single ZFS filesystem.
If there are any configs or log info I can provide to help make sense of it, let me know. I feel like this is something with Linux or ZFS, or something incorrectly configured, or just me.
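For example, I can pull version and ZFS property details like this if it helps (placeholder pool/dataset names):
psql -U postgres -c "SELECT version();"            # exact PostgreSQL build
zfs version                                        # OpenZFS userland and kernel module versions
cat /etc/os-release                                # OS release
zfs get sync,recordsize,logbias tank/dev/pgdata    # relevant dataset properties
zpool status tank                                  # pool layout and health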
Thanks for any help.
EDIT - POSSIBLY, PARTIALLY SOLVED: I may be having some success by taking online snapshots with the ZFS filesystem at sync=always. Also, the backup script now runs pg_backup_start(), with a 30-second sleep before and after it, then a filesystem sync and zpool sync (which shouldn't be needed, but I do it anyway), and then takes the snapshot.
I still can't explain why the offline snapshot doesn't roll back, though. Maybe ZFS caching and timing. I need to do more testing, but I'll be happy if I can at least roll back with the current process.
Also, my backup script was issuing a CHECKPOINT before pg_backup_start(), but the log output shows that pg_backup_start() runs a checkpoint anyway. So I took that out; maybe that was causing some issues.
1 Answer
EDIT: The real problem has been found. ZFS and/or Linux were silently failing to unmount the filesystem on ZFS rollbacks. The rollback itself worked, but if any files were open or the directory was in use, the filesystem wouldn't unmount, leaving cached data in place. I only found it because I manually unmounted the filesystem and got an error, which led me to the log file I was tailing. Not a database problem at all, just my fault.
As for the other changes below, all of that helped and I'm using it, except sync=always, which I'm skipping because it's too slow. The improved backup/snapshot script works great, and rollbacks work perfectly now, provided I verify the unmounts actually happen.
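That verification amounts to something like this (placeholder dataset and service names; the key point is that the unmount must succeed before trusting the rollback):
systemctl stop postgresql-15                 # shut the database down first
fuser -vm /tank/dev/pgdata || true           # show anything still holding the mountpoint open
zfs unmount tank/dev/pgdata                  # fails loudly if the filesystem is still busy
zfs rollback -r tank/dev/pgdata@fresh-index  # -r also destroys any later snapshots
zfs mount tank/dev/pgdata
systemctl start postgresql-15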
Original...
I'm marking this as solved, although I'm not 100% satisfied because offline snapshots need more testing. My primary needs are resolved because I can stress the system, roll the filesystem back to a previous state, and it works 100% of the time now.
What I changed that works well:
- Get enterprise SSDs! Consumer SSDs just cannot handle the sustained writes.
- Set the ZFS filesystem property "sync" to "always". This does slow down our development conversions and data loads, but regular use of the system during development seems to run at the same speed.
- We changed our backup script to take the ZFS snapshot this way: we now take an online snapshot rather than shutting down first. Here's the key script snippet.
psql -U $PG_USER $PG_DB <<EOF
BEGIN;
-- let the disks settle, then start a non-exclusive backup with an immediate checkpoint
\! echo "Allowing disks to settle writes, then pg_backup_start ..."
\! sleep 30
SELECT pg_backup_start('$BACKUP_LABEL', true);
-- flush the filesystem and the pool before taking the snapshot
\! echo "FS SYNC..."; sleep 30; sync; sync; zpool sync $ZFS_POOL; sleep 30
\! zfs snapshot $ZFS_DATASET@$BACKUP_LABEL && echo "ZFS snapshot created: $ZFS_DATASET@$BACKUP_LABEL" >> "$LOG_FILE"
SELECT pg_backup_stop();
COMMIT;
EOF
- We set one Postgres config parameter: wal_sync_method = fsync. If I understand it correctly, that makes WAL flushes use fsync(), which also forces file metadata out to disk.
Note that with pg_backup_start() we are passing "true" as the second argument. My understanding is that requests an immediate (fast) checkpoint rather than one spread out over time. Maybe I'm using that part wrong, but it seems to me you'd want that so all data gets synced to disk sooner before the snapshot runs. Plus, it's working with my stress tests and rollbacks.
After these changes, all filesystem rollbacks worked perfectly for running our schema updates over and over again. Each snapshot now takes roughly 90 seconds (due to the "sleep" commands).
At a later date I may retest the offline snapshots situation.
Is the WAL directory (pg_wal?) on a different file system? That would be a problem.