
We have noticed logical replication lag at times (even during non-peak hours) with PostgreSQL 16 bidirectional logical replication.

Although replication is configured bidirectionally, traffic flows only from prod1 to prod2.

During the delay (150 GB of lag) there is no extra traffic, no network issue, no I/O issue, and no transaction running longer than 1 second (this is an OLTP application; all transactions complete in milliseconds).

In pg_stat_replication_slots, the total_txns rate drops to nearly zero, as shown in the graph. I cannot understand why logical decoding goes to zero.
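For reference, a minimal sketch of how the decoding counters and the retained lag could be sampled on the publisher. It assumes PostgreSQL 14 or later and uses only the standard views pg_stat_replication_slots and pg_replication_slots; the column alias lag_size is illustrative. The counters are cumulative, so a rate has to be derived from the difference between two samples.

```sql
-- Run on the publisher (assumed here to be prod1).
-- total_txns and total_bytes are cumulative since the last stats reset;
-- sample twice and subtract to obtain a rate.
SELECT s.slot_name,
       st.total_txns,
       st.total_bytes,
       st.spill_txns,
       st.spill_bytes,
       -- WAL not yet confirmed by the subscriber for this slot
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), s.confirmed_flush_lsn)
       ) AS lag_size
FROM pg_replication_slots AS s
JOIN pg_stat_replication_slots AS st ON st.slot_name = s.slot_name
WHERE s.slot_type = 'logical';
```

If total_txns stops advancing while lag_size keeps growing, the WAL sender is not handing decoded transactions to the output plugin, which narrows the search to the sender or its connection rather than the workload.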

[Figure: Grafana graph of logical replication lag]

The only thing observed during the lag is an **aggressive vacuum**; apart from that, nothing unusual was seen.

The wait event WalSenderWriteData also shows up in pg_stat_activity.
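A sketch of one way to capture that wait event together with the sender's progress, assuming it is run on the publisher while the lag is building. It joins the standard views pg_stat_activity and pg_stat_replication on pid; the alias replication_state is illustrative.

```sql
-- Run on the publisher: one row per WAL sender process.
SELECT a.pid,
       a.wait_event_type,
       a.wait_event,
       a.state,
       r.application_name,
       r.state AS replication_state,
       r.sent_lsn,
       r.flush_lsn,
       r.replay_lsn,
       r.reply_time          -- time of the last reply from the subscriber
FROM pg_stat_activity AS a
JOIN pg_stat_replication AS r ON r.pid = a.pid
WHERE a.backend_type = 'walsender';
```

Sampling this every few seconds shows whether sent_lsn still advances and whether reply_time stays current while the wait event is present.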

I want to know why logical decoding slowed down.

During the issue, I noticed the following system call statistics on the subscription end:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 27.69    0.168862           5     29348           epoll_wait
 25.18    0.153546           2     73853     73521 recvfrom
 19.43    0.118466           2     44281     29343 sendto
 16.32    0.099533           3     29238           read
 10.11    0.061676           5     12255           pread64
  1.27    0.007728           1      4827           rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00    0.609811           3    193802    102864 total
The sendto errors are "resource unavailable" errors.
Laurenz Albe
asked May 21, 2024 at 14:10
  • Perhaps the answer can be found in the Postgres log file; did you have a chance to look there? Commented May 21, 2024 at 15:00
  • I checked; there is nothing notable there. I did notice a few issues at the OS level and have added them to the question above. Thanks. Commented May 22, 2024 at 10:15

1 Answer


The documentation describes WalSenderWriteData as

Waiting for any activity when processing replies from WAL receiver in WAL sender process.

So perhaps there are network problems, or replay on the subscriber is delayed by a lock or something else. See what the apply workers on the subscriber are doing.
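As a rough sketch of how to check the apply workers, the following query could be run on the subscriber while the lag is present. It assumes only the standard views pg_stat_subscription and pg_stat_activity and the function pg_blocking_pids(); the alias blocked_by is illustrative.

```sql
-- Run on the subscriber: what each logical replication worker is doing
-- and which backends, if any, hold locks it is waiting for.
SELECT s.subname,
       s.pid,
       a.state,
       a.wait_event_type,
       a.wait_event,
       a.xact_start,
       pg_blocking_pids(s.pid) AS blocked_by
FROM pg_stat_subscription AS s
JOIN pg_stat_activity AS a ON a.pid = s.pid;
```

A non-empty blocked_by array would point to a lock conflict; a wait event of type IO or Lock that persists across samples would point to slow replay rather than the network.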

The many errors in the sendto and recvfrom system calls indicate a network problem. So you should fix your network, then your problem will probably be solved.

answered May 22, 2024 at 6:24
  • Thanks for the response, @Laurenz. I have edited the response and added the subscription's behavior during that time. Commented May 22, 2024 at 10:14
  • You should edit the question and add information there, preferably not as an image. Some more clarity would also be useful. For example, is the aggressive autovacuum on the publisher or the subscriber? Does the problem only happen during aggressive autovacuum? Commented May 22, 2024 at 10:49
  • It happens occasionally (once or twice biweekly); no service is running and no transaction runs longer than 1 second. Commented May 31, 2024 at 16:50
  • I also noticed that last_end_time in the subscription is moving very slowly (if we knew why this happens, we would have found the issue). Commented May 31, 2024 at 17:58
  • Typo: it is latest_end_time from pg_stat_subscription that is not up to date; all the other columns are almost at the current time. select received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time from pg_catalog.pg_stat_subscription; Commented Jun 1, 2024 at 3:16
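The observation in the last comment can be made easier to track by turning the timestamps into ages. This is only a sketch built on the same pg_stat_subscription columns; the aliases since_last_message and since_latest_end are illustrative.

```sql
-- Run on the subscriber: how stale each timestamp is right now.
SELECT subname,
       received_lsn,
       latest_end_lsn,
       now() - last_msg_receipt_time AS since_last_message,
       now() - latest_end_time       AS since_latest_end
FROM pg_stat_subscription
WHERE relid IS NULL;  -- skip table synchronization workers
```

If since_last_message stays small while since_latest_end keeps growing, messages are still arriving but the position reported back to the publisher is not advancing.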
