We have noticed logical replication delay at times (even during non-peak hours) in a PostgreSQL 16 bidirectional logical replication setup.
Even though it is configured bidirectionally, traffic flows from prod1 to prod2 only.
During the delay (about 150 GB of lag) there was no extra traffic, no network issue, no I/O issue, and no transaction running longer than 1 second (this is an OLTP application; all transactions complete in milliseconds).
We noticed from pg_stat_replication_slots that the total_txns rate drops to nearly zero, as shown in the graph below, and we cannot understand why logical decoding goes to zero.
[Grafana graph: logical replication lag]
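For reference, the graph is assumed to be built from deltas of the cumulative counters in pg_stat_replication_slots; a snapshot query along these lines (a sketch, PostgreSQL 14 and later) shows the per-slot decoding statistics:

```sql
-- Sketch: snapshot the logical decoding statistics per replication slot.
-- total_txns / total_bytes are cumulative, so the "rate" in the graph is the
-- difference between two snapshots divided by the sampling interval.
SELECT slot_name,
       total_txns,
       total_bytes,
       spill_txns,    -- transactions the decoder had to spill to disk
       spill_bytes,
       stream_txns,
       stats_reset
FROM pg_stat_replication_slots;
```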
The only thing observed during the lag window was **aggressive vacuum**; apart from that, nothing stood out.
The wait event WalSenderWriteData was also noticed in pg_stat_activity (query sketch below).
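This is the kind of query assumed here for checking the walsender wait events on the publisher (a sketch):

```sql
-- Sketch: what are the walsender backends on the publisher waiting on?
SELECT pid,
       application_name,
       state,
       wait_event_type,
       wait_event,          -- e.g. WalSenderWriteData
       backend_start
FROM pg_stat_activity
WHERE backend_type = 'walsender';
```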
We want to know why logical decoding slowed down.

During the issue we noticed the following strace summary on the subscriber side:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
27.69 0.168862 5 29348 epoll_wait
25.18 0.153546 2 73853 73521 recvfrom
19.43 0.118466 2 44281 29343 sendto
16.32 0.099533 3 29238 read
10.11 0.061676 5 12255 pread64
1.27 0.007728 1 4827 rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00 0.609811 3 193802 102864 total
The sendto failures were "resource unavailable" errors.
- Perhaps the answer can be found in the Postgres log file; did you have a chance to look there? – mustaccio, May 21, 2024 at 15:00
- I verified; nothing much there. I noticed a few issues at the OS level and have updated the question above with them. Thanks. – RK DBArchitect, May 22, 2024 at 10:15
1 Answer
The documentation describes WalSenderWriteData as "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
So perhaps there are network problems, or replay on the subscriber is delayed by a lock or something else. See what the apply workers on the subscriber are doing.
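For example, a query along these lines on the subscriber (a sketch; it simply joins the subscription workers to pg_stat_activity) shows what the apply workers are waiting on:

```sql
-- Sketch: on the subscriber, check what the subscription workers are doing
-- (joining on pid picks up the apply / tablesync workers of each subscription).
SELECT s.subname,
       s.relid,             -- non-NULL means a table synchronization worker
       a.pid,
       a.wait_event_type,
       a.wait_event,
       a.state
FROM pg_stat_subscription s
JOIN pg_stat_activity a ON a.pid = s.pid;
```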
The many errors in the sendto and recvfrom system calls indicate a network problem. So you should fix your network; then your problem will probably be solved.
- Thanks for the response, @Laurenz. I have edited the post and copied in the subscription-side behaviour during that time. – RK DBArchitect, May 22, 2024 at 10:14
- You should edit the question and add the information there, preferably not as an image. Some more clarity would also be useful. For example, is the aggressive autovacuum on the publisher or the subscriber? Does the problem only happen during aggressive autovacuum? – Laurenz Albe, May 22, 2024 at 10:49
- It happens occasionally (once or twice every two weeks); no other service is running, and no transaction runs longer than 1 second. – RK DBArchitect, May 31, 2024 at 16:50
- We also noticed that last_end_time in the subscription is advancing very slowly (if we knew why this is happening, we would have found the issue). – RK DBArchitect, May 31, 2024 at 17:58
- Typo: it is latest_end_time from pg_stat_subscription that is not up to date; all the other columns are close to the current time. select received_lsn, last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_time from pg_catalog.pg_stat_subscription; – RK DBArchitect, Jun 1, 2024 at 3:16
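A sketch of a follow-up check on the same view, expressing the slow-moving latest_end_time as an age (column names as in pg_stat_subscription; reading it as a "feedback age" is an assumption):

```sql
-- Sketch: how old is the last confirmed position the subscriber reported back?
SELECT subname,
       received_lsn,
       latest_end_lsn,
       latest_end_time,
       now() - latest_end_time AS latest_end_age
FROM pg_catalog.pg_stat_subscription;
```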