
This is on our dev systems. We send ZFS snapshots from production to dev systems, clone the received snapshot, run a full reindex in PostgreSQL, and then take an offline ZFS snapshot of the database (the @fresh-index snapshot). At this point everything works great, and we've done this for at least 8 years. We do this so we can do development and revert quickly.
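Roughly, the flow looks like this (the dataset, host, database, and service names below are placeholders, not our actual scripts):

# on production: send a snapshot of the data dir to the dev box
zfs send tank/pgdata@prod-snap | ssh devhost zfs recv tank/pgdata-recv

# on dev: clone the received snapshot and reindex
zfs clone tank/pgdata-recv@prod-snap tank/pgdata-dev
psql -U postgres -d mydb -c "REINDEX DATABASE mydb;"

# shut Postgres down and take the offline @fresh-index snapshot
systemctl stop postgresql-15      # service name assumed
zfs snapshot tank/pgdata-dev@fresh-index
systemctl start postgresql-15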

It's only after we start running our new development db schema upgrades and data migrations, and then do the rollback to @fresh-index, that we get corruption. We get errors when refreshing a materialized view, for example: "ERROR: expected one dependency record for TOAST table, found 0". We also can't re-run our conversion script in PG; it errors with messages like "already exists", which it shouldn't after a rollback to an offline snapshot. It's almost as if something isn't fully written to disk, or maybe something is stored by Postgres outside the data dir.

What never works is rolling back to that fresh-index point. This is what I'm trying to understand and would like to get working.

What does work is to re-clone the received snapshot and start over. What also works: after the schema updates, rollbacks to any further snapshots work, provided the target isn't from before the schema updates.

Rolling back to snapshots also works as long as we haven't run the schema and data upgrades yet.

The failure needs both: A) doing big schema/data updates, and then B) rolling back to the fresh-index snapshot.

I can even take online snapshots or offline snapshots; it makes no difference. This happens on 2 dev systems, both with the same setup on ZFS mirrors, on enterprise SSDs.

I'm at a loss as to why Postgres is not able to switch back to a previous state, especially when the service is shut down.

I've tried setting sync=always on the filesystem; no difference. When I take the snapshots I run "sync" and also "zpool sync", then sleep for up to 30 seconds; that doesn't seem to help either.

I've added some config options in Postgres to limit dirty buffers and to fsync on WAL writes. But even then, why would any of that matter if I shut down Postgres and take the snapshot offline?

I'm currently on Postgres 15.7, on Alma Linux 8. ZFS is, I think, 2.1.15. This runs in a virtual machine (KVM) and the disks are passed through to the guest with cache=none. It happens on 2 systems the same way. The Postgres data directory, along with all WAL and log directories, is on one ZFS filesystem.

If there are any configs or log info I can provide to help make sense of it, let me know. I feel like this is something with Linux or ZFS, or something incorrectly configured, or just me.

Thanks for any help.

EDIT - POSSIBLY, PARTIALLY SOLVED: I may be having some success by taking online snapshots with the ZFS filesystem at sync=always. The backup script now runs pg_backup_start(), with a 30 second sleep before it and again after it, then a filesystem sync and zpool sync (which shouldn't be needed, but I do them anyway), and then takes the snapshot.

I still can't explain why the offline snapshot doesn't roll back, though. Maybe ZFS caching and timing. I need to do more testing, but I'll be happy if I can at least roll back with the current process.

Also, my backup script was issuing a CHECKPOINT before pg_backup_start(), but the log output shows that pg_backup_start() runs a checkpoint anyway. So I took that out; maybe that was causing some issues.

asked Dec 22, 2024 at 16:29
  • Since you are asking a database audience, you should explain terms like "re-clone the received snapshot" or "offline ZFS snapshot" in greater detail. Also, what exactly is the "rollback to @fresh-index"? Is that resetting the file system to the point where you took the snapshot? If yes, and there was no problem at this point earlier, perhaps there is a problem with this rollback procedure or the file system. Another idea: does the whole database reside on a single file system, or are certain parts of it (pg_wal?) on a different file system? That would be a problem. Commented Dec 23, 2024 at 7:46
  • Thanks for the questions. Yes, sorry, I realize this question could be directed to the ZFS/Storage/Sysadmin communities as well, but I'm unsure where the problem lies and started here. Yes, a rollback in ZFS terms means you revert to the file system snapshot named @fresh-index. The problem could easily be there. About the pg_wal dir, yes all PG data is in one "file system" or directory. The good news is, after many, many test loops, I may have this resolved to the point I'm satisfied. I do want to test more with offline snapshots, but I need more time later. Commented Dec 23, 2024 at 20:23

1 Answer


EDIT: The real problem was found. ZFS and/or Linux was silently not unmounting the file system on ZFS rollbacks. The rollback itself worked, but if any files were open or the directory was in use, the filesystem wouldn't unmount, leaving cached data in place. I only found it because I manually unmounted the file system and got an error, which led me to the log file I was tailing. Not a database problem at all, just my fault.

As for the other solutions below, all of that helped and I'm using it, except sync=always, which I'm skipping because it's too slow. The improved backup/snapshot script works great, and rollbacks work perfectly now, provided I verify that the unmounts actually happen.
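My rollback procedure now explicitly checks that the filesystem can be unmounted before rolling back. A rough sketch (the dataset, mount point, and service names are illustrative, not my exact script):

# stop anything that could hold files open under the mountpoint
systemctl stop postgresql-15        # service name assumed
fuser -vm /tank/pgdata-dev          # list processes still using the mount
                                    # (my tail -f on a log file showed up here)

# prove the unmount works before rolling back, so no stale cache survives
zfs unmount tank/pgdata-dev         # fails loudly if the mount is still busy
zfs rollback tank/pgdata-dev@fresh-index   # add -r if newer snapshots exist
zfs mount tank/pgdata-dev
systemctl start postgresql-15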

Original...

I'm marking this as solved, although I'm not 100% satisfied because offline snapshots need more testing. My primary needs are resolved: I can stress the system, roll the file system back to a previous state, and it works 100% of the time now.

What I changed that works well:

  1. Get enterprise SSDs! Consumer SSDs just cannot handle the sustained writes.
  2. Set the ZFS file system property "sync" to "always" (the exact setting is in the sketch after this list).
    This does slow down our development conversions and data loads, but regular use of the system during development seems to run at the same speed.
  3. We changed our backup script to work this way while taking a ZFS snapshot. We now take an online snapshot, rather than shutting down first. Here's the key script snippet.
psql -U $PG_USER $PG_DB <<EOF
BEGIN;
\! echo "Allowing disks to settle writes, then pg_backup_start ..."
\! sleep 30
SELECT pg_backup_start('$BACKUP_LABEL', true);
\! echo "FS SYNC..."; sleep 30; sync; sync; zpool sync $ZFS_POOL; sleep 30
\! zfs snapshot $ZFS_DATASET@$BACKUP_LABEL && echo "ZFS snapshot created: $ZFS_DATASET@$BACKUP_LABEL" >> "$LOG_FILE"
SELECT pg_backup_stop();
COMMIT;
EOF
  4. We set one Postgres config parameter: wal_sync_method = fsync (also in the sketch after this list). If I understand that one, it forces WAL writes to flush both the data and the file metadata to disk with fsync().
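For reference, items 2 and 4 boil down to roughly this (the dataset name is illustrative):

# ZFS property from item 2: force synchronous semantics on the dataset
zfs set sync=always tank/pgdata-dev

# postgresql.conf setting from item 4: flush WAL with plain fsync()
wal_sync_method = fsync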

Note that with pg_backup_start() we are passing "true" as the second argument. My understanding is that this makes the required checkpoint run immediately rather than being spread out over time. Maybe I am using that part wrong, but it seems to me that you'd want this to get all data synced to disk faster before the snapshot runs. Plus, it's working with my stress tests and rollbacks.

All file system rollbacks worked perfectly for running our schema updates over and over again after the changes. Each snapshot now takes roughly 90 seconds (due to the "sleep" commands).

At a later date I may retest the offline snapshots situation.

answered Dec 23, 2024 at 21:34
