0

I'm studying postgresql recently. Now I'm learning how to configure PG replication with slot. -- So I'm following this sample -- https://girders.org/postgresql/2021/11/05/setup-postgresql14-replication/

But all attempts are failed with "No such file or directory" when there's new tran raised from primary side. I can't find the issue and the fix. Please kind have a look and help.

Environment --

VM1 -- 100.70.224.70/23 -- primary side
VM2 -- 100.70.225.241/23 -- replica
PG version -- 14.10

Steps:

  1. On the both sides , there're a PG instance already, the PGDATA are both /var/lib/pgsql/data and the archive folder are both /tmp/pgbak Both are configured archive with same configuration --
archive_command = 'test ! -f /tmp/pgbak/%f && cp %p /tmp/pgbak/%f'
archive_timeout = '1min'
archive_mode = 'on'
archive_cleanup_command = 'pg_archivecleanup archivelocation %r'
restore_command = 'cp /tmp/pgbak/%f %p'
  1. On replica, clean thePG data folder -- rm -rf /var/lib/pgsql/data/* and also clean the archive folder -- rm -rf /tmp/pgbak
  2. On primary , edit postgresql.conf --
wal_level = replica
max_wal_senders = 10
wal_keep_size = '1GB'
wal_compression = on
  1. On primary, create the replication user --
 createuser -U postgres --replication repl
  1. On primary, edit the pg_hba.conf added 2 lines--
host all all 0.0.0.0/0 trust
host replication all 0.0.0.0/0 trust

And restart the PG instance on primary. Then tested on replica with psql , it works fine.

  1. Create replication slot on primary --
select * from pg_create_physical_replication_slot('db02_repl_slot');
  1. On replica, make a basebackup --
pg_basebackup --pgdata /var/lib/pgsql/data --format=p --write-recovery-conf --checkpoint=fast --label=mffb --progress --host=100.70.224.70 -R --username=repl
  1. On replica, add the replication slot to postgresql.auto.conf --
primary_conninfo = 'user=repl passfile=''/var/lib/pgsql/.pgpass'' channel_binding=prefer host=100.70.224.70 port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable application_name=db02.repl'
primary_slot_name = 'db02_repl_slot'
  1. Start the PG instance on replica, and check the slot status on primary, it looks fine --
eisendb=# select slot_name, slot_type, active, wal_status from pg_replication_slots;
 slot_name | slot_type | active | wal_status
----------------+-----------+--------+------------
 db02_repl_slot | physical | t | reserved
  1. Then I tested with data modification on primary, then I found no data transferred to replica, and from the error log file on replica , I found this error --
2024年01月01日 12:30:07.066 UTC [4737]CONTEXT: WAL redo at D5/75000060 for Standby/RUNNING_XACTS: nextXid 10361 latestCompletedXid 10360 oldestRunningXid 10361
2024年01月01日 12:30:07.066 UTC [4737]DEBUG: executing restore command "cp /tmp/pgbak/00000001000000D500000076 pg_wal/RECOVERYXLOG"
2024年01月01日 12:30:07.068 UTC [4741]DEBUG: checkpointer updated shared memory configuration values
cp: cannot stat '/tmp/pgbak/00000001000000D500000076': No such file or directory
2024年01月01日 12:30:07.069 UTC [4737]DEBUG: could not restore file "00000001000000D500000076" from archive: child process exited with exit code 1
2024年01月01日 12:30:07.069 UTC [4737]DEBUG: prune KnownAssignedXids to 10361
And this is the log records filtered the debug messages --
2024年01月01日 13:09:12.622 UTC [8367]LOG: database system was interrupted; last known up at 2024年01月01日 13:05:59 UTC
cp: cannot stat '/tmp/pgbak/00000002.history': No such file or directory
2024年01月01日 13:09:12.634 UTC [8367]LOG: entering standby mode
cp: cannot stat '/tmp/pgbak/00000001000000D500000082': No such file or directory
2024年01月01日 13:09:12.639 UTC [8367]LOG: redo starts at D5/82000028
2024年01月01日 13:09:12.640 UTC [8367]LOG: consistent recovery state reached at D5/82000138
2024年01月01日 13:09:12.640 UTC [8362]LOG: database system is ready to accept read-only connections
cp: cannot stat '/tmp/pgbak/00000001000000D500000083': No such file or directory
2024年01月01日 13:09:12.649 UTC [8374]LOG: started streaming WAL from primary at D5/83000000 on timeline 1
2024年01月01日 13:10:28.316 UTC [8367]LOG: recovery stopping before commit of transaction 10363, time 2024年01月01日 13:10:28.315075+00
2024年01月01日 13:10:28.316 UTC [8367]LOG: pausing at the end of recovery
2024年01月01日 13:10:28.316 UTC [8367]HINT: Execute pg_wal_replay_resume() to promote.

It looks the replica is searching for some archived wal in the archive folder but no found. And meanwhile I checked the /tmp/pgbak on replica and find it's empty as well...I'm not familiar with PG replication details, so I'm thinking if there's any error in my configuration so that the archived wal on primary can't be replicated to replica? If yes, please kind correct me. Thanks in advance.

asked Jan 1, 2024 at 14:42
5
  • I don't see any error here. Those are just informational messages. It is normal to request a file which does not exist. The error would be in how it responds to that, but you didn't include enough of the log file to show that. Commented Jan 1, 2024 at 16:08
  • The blog post you link to is for a streaming-only setup. It doesn't have a wal archive. Clearly you have an archive configured (or at least have attempted to), but you haven't said how or why. Commented Jan 1, 2024 at 16:23
  • @jjanes Thanks for your comments. Now I've added the archive config and the log file messages to the post. Please have a look again. Thanks Commented Jan 2, 2024 at 1:26
  • 1
    It looks like you have a recovery_target_* parameter set on the standby, so the recovery is suspended when you reach that point. Commented Jan 2, 2024 at 15:13
  • @LaurenzAlbe Thanks a lot. Yes! It's a legacy parameter left on primary for a PITR test before. After remove it, the replication slot works fine. Please send a reply and I'll mark it as correct answer. Thanks again. Commented Jan 3, 2024 at 0:57

1 Answer 1

1

That looks very much like you have one of the recovery_target_* parameters set on the standby, so that recovery stops at that point. The setting was probably copied from the primary when you ran pg_basebackup.

answered Jan 3, 2024 at 10:38

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.