postgresql replication failed with "No such file or directory" for wal file

Question 1

I'm studying postgresql recently. Now I'm learning how to configure PG replication with slot. -- So I'm following this sample -- https://girders.org/postgresql/2021/11/05/setup-postgresql14-replication/

But all attempts are failed with "No such file or directory" when there's new tran raised from primary side. I can't find the issue and the fix. Please kind have a look and help.

Environment --

VM1 -- 100.70.224.70/23 -- primary side
VM2 -- 100.70.225.241/23 -- replica
PG version -- 14.10

Steps:

On the both sides , there're a PG instance already, the PGDATA are both /var/lib/pgsql/data and the archive folder are both /tmp/pgbak Both are configured archive with same configuration --

archive_command = 'test ! -f /tmp/pgbak/%f && cp %p /tmp/pgbak/%f'
archive_timeout = '1min'
archive_mode = 'on'
archive_cleanup_command = 'pg_archivecleanup archivelocation %r'
restore_command = 'cp /tmp/pgbak/%f %p'

On replica, clean thePG data folder -- rm -rf /var/lib/pgsql/data/* and also clean the archive folder -- rm -rf /tmp/pgbak
On primary , edit postgresql.conf --

wal_level = replica
max_wal_senders = 10
wal_keep_size = '1GB'
wal_compression = on

On primary, create the replication user --

 createuser -U postgres --replication repl

On primary, edit the pg_hba.conf added 2 lines--

host all all 0.0.0.0/0 trust
host replication all 0.0.0.0/0 trust

And restart the PG instance on primary. Then tested on replica with psql , it works fine.

Create replication slot on primary --

select * from pg_create_physical_replication_slot('db02_repl_slot');

On replica, make a basebackup --

pg_basebackup --pgdata /var/lib/pgsql/data --format=p --write-recovery-conf --checkpoint=fast --label=mffb --progress --host=100.70.224.70 -R --username=repl

On replica, add the replication slot to postgresql.auto.conf --

primary_conninfo = 'user=repl passfile=''/var/lib/pgsql/.pgpass'' channel_binding=prefer host=100.70.224.70 port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable application_name=db02.repl'
primary_slot_name = 'db02_repl_slot'

Start the PG instance on replica, and check the slot status on primary, it looks fine --

eisendb=# select slot_name, slot_type, active, wal_status from pg_replication_slots;
 slot_name | slot_type | active | wal_status
----------------+-----------+--------+------------
 db02_repl_slot | physical | t | reserved

Then I tested with data modification on primary, then I found no data transferred to replica, and from the error log file on replica , I found this error --

2024年01月01日 12:30:07.066 UTC [4737]CONTEXT: WAL redo at D5/75000060 for Standby/RUNNING_XACTS: nextXid 10361 latestCompletedXid 10360 oldestRunningXid 10361
2024年01月01日 12:30:07.066 UTC [4737]DEBUG: executing restore command "cp /tmp/pgbak/00000001000000D500000076 pg_wal/RECOVERYXLOG"
2024年01月01日 12:30:07.068 UTC [4741]DEBUG: checkpointer updated shared memory configuration values
cp: cannot stat '/tmp/pgbak/00000001000000D500000076': No such file or directory
2024年01月01日 12:30:07.069 UTC [4737]DEBUG: could not restore file "00000001000000D500000076" from archive: child process exited with exit code 1
2024年01月01日 12:30:07.069 UTC [4737]DEBUG: prune KnownAssignedXids to 10361
And this is the log records filtered the debug messages --
2024年01月01日 13:09:12.622 UTC [8367]LOG: database system was interrupted; last known up at 2024年01月01日 13:05:59 UTC
cp: cannot stat '/tmp/pgbak/00000002.history': No such file or directory
2024年01月01日 13:09:12.634 UTC [8367]LOG: entering standby mode
cp: cannot stat '/tmp/pgbak/00000001000000D500000082': No such file or directory
2024年01月01日 13:09:12.639 UTC [8367]LOG: redo starts at D5/82000028
2024年01月01日 13:09:12.640 UTC [8367]LOG: consistent recovery state reached at D5/82000138
2024年01月01日 13:09:12.640 UTC [8362]LOG: database system is ready to accept read-only connections
cp: cannot stat '/tmp/pgbak/00000001000000D500000083': No such file or directory
2024年01月01日 13:09:12.649 UTC [8374]LOG: started streaming WAL from primary at D5/83000000 on timeline 1
2024年01月01日 13:10:28.316 UTC [8367]LOG: recovery stopping before commit of transaction 10363, time 2024年01月01日 13:10:28.315075+00
2024年01月01日 13:10:28.316 UTC [8367]LOG: pausing at the end of recovery
2024年01月01日 13:10:28.316 UTC [8367]HINT: Execute pg_wal_replay_resume() to promote.

It looks the replica is searching for some archived wal in the archive folder but no found. And meanwhile I checked the /tmp/pgbak on replica and find it's empty as well...I'm not familiar with PG replication details, so I'm thinking if there's any error in my configuration so that the archived wal on primary can't be replicated to replica? If yes, please kind correct me. Thanks in advance.

Question 2

I don't see any error here. Those are just informational messages. It is normal to request a file which does not exist. The error would be in how it responds to that, but you didn't include enough of the log file to show that.

Question 3

The blog post you link to is for a streaming-only setup. It doesn't have a wal archive. Clearly you have an archive configured (or at least have attempted to), but you haven't said how or why.

Question 4

@jjanes Thanks for your comments. Now I've added the archive config and the log file messages to the post. Please have a look again. Thanks

Question 5

It looks like you have a recovery_target_* parameter set on the standby, so the recovery is suspended when you reach that point.

Question 6

@LaurenzAlbe Thanks a lot. Yes! It's a legacy parameter left on primary for a PITR test before. After remove it, the replication slot works fine. Please send a reply and I'll mark it as correct answer. Thanks again.

Question 7

That looks very much like you have one of the recovery_target_* parameters set on the standby, so that recovery stops at that point. The setting was probably copied from the primary when you ran pg_basebackup.

Laurenz Albe Laurenz Albe 62k4 gold badges57 silver badges93 bronze badges · Accepted Answer · 2024-01-03 10:38:47Z

That looks very much like you have one of the recovery_target_* parameters set on the standby, so that recovery stops at that point. The setting was probably copied from the primary when you ran pg_basebackup.

Stack Exchange Network

postgresql replication failed with "No such file or directory" for wal file

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

postgresql replication failed with "No such file or directory" for wal file

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions