Here is what I am doing:
- I have a Postgres 16 docker container which is continuously archiving WAL files
- After `00012` and `00013` were archived, I take a base backup of the database using `pg_basebackup`, which generates a `00014..backup` file (these are example file names; I know actual WAL names are longer)
- Now I copy the base backup and the archived WALs into another Postgres 16 docker container, newly created from the same docker image (including version)
- I make the `postgres` user the owner of all these files
- I remove the WALs from the `pg_wal` directory of the base backup and update the `restore_command` in `postgresql.conf` (additionally, I undo the archive configs which were there on the primary)
- I remove the `00012` and `00013` WALs, as those are already contained in the base backup, so the archived WALs are now only `00014` and the `.backup` file that was created
- I create a `recovery.signal` file, which is empty and sits in the data directory
- Then finally I rename the current `pgdata` to `pgdata_ini` and the backup directory to `pgdata` so that it acts as my data directory
- Then I stop the container and start it again, but database startup fails with `invalid checkpoint record` and `could not find required checkpoint record`
Can someone point out what I am doing wrong here?
1 Answer
It turns out that the `invalid checkpoint record` error happened because the database system identifiers of the two containers were different, which we can check by running `select system_identifier from pg_control_system();` in each of them.
This may sound obvious, but there is a slight caveat. When we do PITR, the first step is to load the base backup. After loading the backup and renaming it to `pgdata`, but before restarting, the above query already reports the same identifier as the old system (even though the running system does not actually have it yet).
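As a complement to the SQL query, the identifier can also be read from a data directory that is not running, because the `pg_controldata` utility prints a `Database system identifier` line. The following is a minimal sketch, assuming you have captured that output as text from each container; the function names and the sample outputs are made up for illustration and are not part of the original setup.

```python
import re


def system_identifier(controldata_output: str) -> str:
    """Return the system identifier found in `pg_controldata` text output."""
    match = re.search(
        r"^Database system identifier:\s*(\d+)\s*$",
        controldata_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Database system identifier' line found")
    return match.group(1)


def same_cluster(output_a: str, output_b: str) -> bool:
    """True when both outputs report the same system identifier."""
    return system_identifier(output_a) == system_identifier(output_b)


# Hypothetical, trimmed sample outputs; the identifier values are invented.
SAMPLE_PRIMARY = """\
Database system identifier:           7300000000000000001
Database cluster state:               in production
"""

SAMPLE_NEW_CONTAINER = """\
Database system identifier:           7300000000000000002
Database cluster state:               shut down
"""
```

Calling `same_cluster(SAMPLE_PRIMARY, SAMPLE_NEW_CONTAINER)` on real output from both containers gives a quick check before attempting recovery; if it returns `False`, the archived WAL files do not belong to the cluster that is being started.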
So we have to do the following to get around this:
- Make two copies of the base backup on the new container
- Restart the container with one of the copies so that the system identifier truly reflects the old one, even though the data is not up to date
- Then copy the archived WAL files and update the second copy of the base backup (removing the `pg_wal` files, updating `restore_command` in the conf, adding `recovery.signal`, etc.)
- Now we can rename this second base backup copy to `pgdata` and restart the container
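The preparation of the second base backup copy could be scripted roughly as below. This is only a sketch under stated assumptions: `prepare_for_recovery` and its parameters are hypothetical names, the `cp`-based `restore_command` assumes the archived WAL files sit in a plain directory reachable from the container, and the archive path is assumed to contain no spaces or quotes.

```python
from pathlib import Path


def prepare_for_recovery(backup_dir: Path, archive_dir: Path) -> None:
    """Prepare a base backup copy so Postgres replays archived WAL on start."""
    # Remove WAL segment files that came with the backup, but keep the
    # pg_wal directory itself (and any subdirectories) in place.
    pg_wal = backup_dir / "pg_wal"
    if pg_wal.is_dir():
        for entry in pg_wal.iterdir():
            if entry.is_file():
                entry.unlink()

    # Append restore_command; when a setting appears more than once in
    # postgresql.conf, the last occurrence is the one that takes effect.
    conf = backup_dir / "postgresql.conf"
    with conf.open("a", encoding="utf-8") as fh:
        fh.write(f"\nrestore_command = 'cp {archive_dir.as_posix()}/%f \"%p\"'\n")

    # An empty recovery.signal in the data directory requests archive recovery.
    (backup_dir / "recovery.signal").touch()
```

For example, `prepare_for_recovery(Path("/backups/copy2"), Path("/archive"))` would be run before renaming the copy to `pgdata`; both paths here are placeholders. File ownership (the `postgres` user) and undoing the archive settings from the primary still have to be handled separately.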
Now it will correctly pick up the archived files and recover up-to-date data, because the system identifiers are actually the same. It just needs one extra restart with the initial base backup.