We have noticed logical replication delay at times (even during non-peak hours) in a PostgreSQL 16 bidirectional logical replication setup.
Even though it is configured bidirectionally, traffic flows from prod1 to prod2 only.
During the delay (about 150 GB of lag) there was no extra traffic, no network issue, no I/O issue, and no transaction running longer than 1 second (this is an OLTP application; all transactions complete in milliseconds).
We noticed from pg_stat_replication_slots that the total_txns rate drops to nearly zero, as shown in the graph below, and we cannot understand why logical decoding goes to zero.
[Grafana graph: logical replication lag]
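For reference, the graph is assumed to be built from deltas of the cumulative counters in pg_stat_replication_slots; a snapshot query along these lines (a sketch, PostgreSQL 14 and later) shows the per-slot decoding statistics:

```sql
-- Sketch: snapshot the logical decoding statistics per replication slot.
-- total_txns / total_bytes are cumulative, so the "rate" in the graph is the
-- difference between two snapshots divided by the sampling interval.
SELECT slot_name,
       total_txns,
       total_bytes,
       spill_txns,    -- transactions the decoder had to spill to disk
       spill_bytes,
       stream_txns,
       stats_reset
FROM pg_stat_replication_slots;
```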
The only thing observed during the lag window was **aggressive vacuum**; apart from that, nothing stood out.
The wait event WalSenderWriteData was also noticed in pg_stat_activity (query sketch below).
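This is the kind of query assumed here for checking the walsender wait events on the publisher (a sketch):

```sql
-- Sketch: what are the walsender backends on the publisher waiting on?
SELECT pid,
       application_name,
       state,
       wait_event_type,
       wait_event,          -- e.g. WalSenderWriteData
       backend_start
FROM pg_stat_activity
WHERE backend_type = 'walsender';
```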
We want to know why logical decoding slowed down.

During the issue we noticed the following strace summary on the subscriber side:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
27.69 0.168862 5 29348 epoll_wait
25.18 0.153546 2 73853 73521 recvfrom
19.43 0.118466 2 44281 29343 sendto
16.32 0.099533 3 29238 read
10.11 0.061676 5 12255 pread64
1.27 0.007728 1 4827 rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00 0.609811 3 193802 102864 total
The sendto failures were "resource unavailable" errors.
- Perhaps the answer can be found in the Postgres log file; did you have a chance to look there? – mustaccio, May 21, 2024 at 15:00
- I verified; nothing much there. I noticed a few issues at the OS level and have updated the question above with them. Thanks. – RK DBArchitect, May 22, 2024 at 10:15
1 Answer
The documentation describes WalSenderWriteData as "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
So perhaps there are network problems, or replay on the subscriber is delayed by a lock or something else. See what the apply workers on the subscriber are doing.
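For example, a query along these lines on the subscriber (a sketch; it simply joins the subscription workers to pg_stat_activity) shows what the apply workers are waiting on:

```sql
-- Sketch: on the subscriber, check what the subscription workers are doing
-- (joining on pid picks up the apply / tablesync workers of each subscription).
SELECT s.subname,
       s.relid,             -- non-NULL means a table synchronization worker
       a.pid,
       a.wait_event_type,
       a.wait_event,
       a.state
FROM pg_stat_subscription s
JOIN pg_stat_activity a ON a.pid = s.pid;
```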
The many errors in the sendto and recvfrom system calls indicate a network problem. So you should fix your network; then your problem will probably be solved.
- Thanks for the response, @Laurenz. I have edited the post and copied in the subscription-side behaviour during that time. – RK DBArchitect, May 22, 2024 at 10:14
- You should edit the question and add the information there, preferably not as an image. Some more clarity would also be useful. For example, is the aggressive autovacuum on the publisher or the subscriber? Does the problem only happen during aggressive autovacuum? – Laurenz Albe, May 22, 2024 at 10:49
- It happens occasionally (once or twice every two weeks); no other service is running, and no transaction runs longer than 1 second. – RK DBArchitect, May 31, 2024 at 16:50
- We also noticed that last_end_time in the subscription is advancing very slowly (if we knew why this is happening, we would have found the issue). – RK DBArchitect, May 31, 2024 at 17:58
- Typo: it is latest_end_time from pg_stat_subscription that is not up to date; all the other columns are close to the current time. select received_lsn, last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_time from pg_catalog.pg_stat_subscription; – RK DBArchitect, Jun 1, 2024 at 3:16
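A sketch of a follow-up check on the same view, expressing the slow-moving latest_end_time as an age (column names as in pg_stat_subscription; reading it as a "feedback age" is an assumption):

```sql
-- Sketch: how old is the last confirmed position the subscriber reported back?
SELECT subname,
       received_lsn,
       latest_end_lsn,
       latest_end_time,
       now() - latest_end_time AS latest_end_age
FROM pg_catalog.pg_stat_subscription;
```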