I'm running two PostgreSQL 11 servers - a master and a slave, set up with logical replication.
The problem I'm facing is that today, after weeks of uninterrupted operation, the slave got out of sync with this error message:
2019-09-16 07:39:44.332 CEST [30117] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:44.542 CEST [27972] LOG: background worker "logical replication worker" (PID 30117) exited with exit code 1
I had seen this error message before, and my fix back then was to increase wal_sender_timeout
on the master (more details on this here: logical replication in postgresql - "server closed the connection unexpectedly").
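For reference, this is roughly how I raised the timeout on the master - either by editing postgresql.conf and reloading, or with something like:
master=# ALTER SYSTEM SET wal_sender_timeout = 1200000;  -- milliseconds, i.e. 20 minutes
master=# SELECT pg_reload_conf();                        -- this setting only needs a reload, not a restart
master=# SHOW wal_sender_timeout;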
I then wanted to restore replication, however the replication state is stuck on catchup:
master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
86864 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46110 | 2019-09-16 12:45:56.491325+02 | | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 | 0 | async
(1 row)
I tried restarting the slave a few times, with different combinations of the subscription enabled and disabled - nothing helps, the replication state keeps staying on catchup. I can see that the sent_lsn and write_lsn values are changing, so something is being sent through...
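For clarity, by enabling and disabling the subscription I mean commands along these lines, run on the slave (subscription name as in the logs above):
ALTER SUBSCRIPTION logical_from_master DISABLE;
ALTER SUBSCRIPTION logical_from_master ENABLE;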
This is my slave configuration:
wal_level=logical
max_replication_slots=2
max_logical_replication_workers=4
wal_receiver_timeout=1200000
And this is my master:
wal_level=logical
max_replication_slots=10
max_wal_senders=10
# maximum wait time in milliseconds that the walsender process on the master
# waits for a status message from the walreceiver process on the standby.
wal_sender_timeout=1200000
I have no idea what to do (even worse, at this stage I have no idea what to check next...).
Can you help me understand what I should do to make my slave catch up, so it's back to the streaming state?
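If it helps to narrow things down, I can also query the state of the replication slot on the master - this is just the standard catalog view, nothing custom on my side:
master=# SELECT slot_name, plugin, slot_type, active, restart_lsn, confirmed_flush_lsn
master-#   FROM pg_replication_slots;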
Edit (12 hours later)
When I checked in the morning, synchronisation was still in the catchup state:
master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
12965 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46630 | 2019-09-17 06:40:18.801262+02 | | catchup | D56/248E13A0 | D56/247E3908 | D56/247E3908 | D56/247E3908 | | | | 0 | async
(1 row)
But when I checked again 60 seconds later, the result set was empty...
The logs now show multiple incarnations of the same error:
2019-09-16 22:43:33.841 CEST [20260] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 22:43:33.959 CEST [26087] LOG: background worker "logical replication worker" (PID 20260) exited with exit code 1
2019-09-16 22:43:34.112 CEST [3510] LOG: logical replication apply worker for subscription "logical_from_master" has started
(...)
In order to make the replication show up as catchup on the master, I now have to restart the slave first...
Edit (in response to @LaurenzAlbe comment)
I rebuilt the replica yesterday morning and observed replication failing again starting from 19:53. Logs from the replica and from the master are below:
2019-09-18 19:15:13.767 CEST [8611] LOG: logical replication table synchronization worker for subscription "logical_replica_from_master", table "lasttable" has finished
2019-09-18 19:54:14.875 CEST [11469] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-18 19:54:14.969 CEST [10330] LOG: logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-18 19:54:15.031 CEST [11217] LOG: background worker "logical replication worker" (PID 11469) exited with exit code 1
The corresponding log from the master:
2019-09-18 19:50:36.386 CEST,,,111051,,5d826e6a.1b1cb,1,,2019-09-18 19:50:34 CEST,138/28493452,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43783 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:51:36.402 CEST,,,1714,,5d826ea6.6b2,1,,2019-09-18 19:51:34 CEST,316/16529009,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.01 s, system: 0.07 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:52:36.421 CEST,,,2649,,5d826ee2.a59,1,,2019-09-18 19:52:34 CEST,153/19807659,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:53:36.424 CEST,,,2945,,5d826f1e.b81,1,,2019-09-18 19:53:34 CEST,317/15405278,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:54:15.123 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,1,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/7A4D40, reading WAL from D5B/7A4D40.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:15.124 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,2,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"logical decoding found consistent point at D5B/7A4D40","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:36.442 CEST,,,3152,,5d826f5a.c50,1,,2019-09-18 19:54:34 CEST,362/5175766,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.02 s, system: 0.06 s, elapsed: 1.88 s",,,,,,,,,""
Then, around midnight, on the slave:
2019-09-19 00:16:48.167 CEST [10330] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-19 00:16:48.276 CEST [19530] LOG: logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-19 00:16:48.324 CEST [11217] LOG: background worker "logical replication worker" (PID 10330) exited with exit code 1
and the corresponding log on the master:
2019-09-19 00:15:41.104 CEST,,,74257,,5d82ac89.12211,1,,2019-09-19 00:15:37 CEST,78/34511468,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27234 hits, 0 misses, 1 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.003 MB/s
system usage: CPU: user: 0.03 s, system: 0.08 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:13.688 CEST,,,35656,,5d382555.8b48,11190,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 1748 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=174.932 s, sync=0.000 s, total=174.936 s; sync files=75, longest=0.000 s, average=0.000 s; distance=11366 kB, estimate=13499 kB",,,,,,,,,""
2019-09-19 00:16:41.121 CEST,,,75038,,5d82acc5.1251e,1,,2019-09-19 00:16:37 CEST,185/19338019,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27233 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,1,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/1D1F1C0, reading WAL from D5B/1CA07F8.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,2,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"logical decoding found consistent point at D5B/1CA07F8","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:17:41.141 CEST,,,75484,,5d82ad01.126dc,1,,2019-09-19 00:17:37 CEST,330/18178915,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13613 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64866 remain, 64863 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27254 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""
Edit
The following error message (on the master) appears exactly one wal_sender_timeout interval after replication is enabled on the slave:
2019-09-19 13:33:58.015 CEST,"core","nzdb",112432,"10.194.132.16:50886",5d8362f5.1b730,5,"idle",2019-09-19 13:13:57 CEST,379/2076197,0,LOG,00000,"terminating walsender process due to replication timeout",,,,,"slot ""logical_replica_from_master"", output plugin ""pgoutput"", in the change callback, associated LSN D5B/6782CF0",,,"WalSndCheckTimeOut, walsender.c:2100","logical_replica_from_master"
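For completeness, the timeout currently in effect on the master (1200000 ms, i.e. 20 minutes, matching the config above) can be double-checked with:
master=# SELECT name, setting, unit FROM pg_settings
master-#  WHERE name IN ('wal_sender_timeout', 'wal_receiver_timeout');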
Edit
I added more RAM to that server, but the observation is still the same - after wal_sender_timeout the worker on the slave logs the error mentioned above, and on the master I'm left with the following in pg_stat_replication:
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+--------------------------------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
87820 | 16680 | core | logical_replica_from_master_27004_sync_21691 | 10.10.10.10 | | 55548 | 2019-09-19 15:31:40.032662+02 | 3142872730 | startup | | | | | | | | 0 | async
(1 row)
Then, after a very long while, it's back to catchup but with a different sent_lsn.
If I run an INSERT into a test table on the master, I do not see the changes on the slave.
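To be explicit, the test I mean is trivial - the table and column names below are just placeholders for one of the replicated tables:
master=# INSERT INTO replicated_test_table (id, note) VALUES (1, 'replication probe');
-- then on the slave, which should show the new row once replication works again:
slave=#  SELECT id, note FROM replicated_test_table WHERE note = 'replication probe';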
2 Answers
As you already found out, your master database is too busy for a single replication worker to handle all of the changes.
You need to split your tables into groups, but make sure that tables related by foreign keys are handled by the same worker; otherwise you might get into a situation where a foreign key constraint prevents data from being inserted into one table because the table its foreign key points at has not been updated yet.
I underestimated how busy the master database was, because back in July I had observed replication working without errors. Apparently July is a holiday month, so the database sustained slightly less load - just enough for the problem not to manifest.
One of my colleagues pointed out that there are multiple processes writing to that database concurrently, so maybe a single WAL sender is simply not coping with the volume of information. This was very valid advice, and afterwards I was scratching my head wondering why I had not thought of that in the first place. @jjanes touched on this in the first comment too. I had put too much trust in how Postgres adapts, even with default options, to such different workloads.
So what I'm doing now is avoiding CREATE PUBLICATION .. FOR ALL TABLES and instead creating multiple publications, with multiple corresponding subscriptions on the replica side.
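A rough sketch of what that looks like (the publication, subscription, table and connection values below are placeholders - the point is that tables linked by foreign keys go into the same publication, so they are applied by the same worker):
-- on the master: several smaller publications instead of FOR ALL TABLES
CREATE PUBLICATION pub_orders FOR TABLE orders, order_items;  -- FK-related tables stay together
CREATE PUBLICATION pub_events FOR TABLE events;
-- on the replica: one subscription (and therefore one apply worker) per publication
CREATE SUBSCRIPTION sub_orders
  CONNECTION 'host=master-host dbname=my_db user=my_user'
  PUBLICATION pub_orders;
CREATE SUBSCRIPTION sub_events
  CONNECTION 'host=master-host dbname=my_db user=my_user'
  PUBLICATION pub_events;
Each subscription gets its own replication slot on the master and its own apply worker on the replica, so max_replication_slots and max_logical_replication_workers have to be at least the number of subscriptions.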
-
How to handle a duplicate key error on the original primary node which is now serving as a standby node? When the subscription is started on the primary and the secondary is the publisher, the subscription fails with a duplicate key error because the primary already has the data from before the node went down. – nishi, Aug 4, 2021 at 12:26
-
I was using replicas to share the load and serve read-only clients; I never tested the scenario you describe. My guess is that your new primary would have to stream everything back to your new secondary, but please don't take my word for it as I have not used logical replicas in your scenario. – Greg0ry, Aug 4, 2021 at 14:30
pg_stat_wal_receiver is empty because WAL receivers only run with physical replication; that's normal. I guess that replication is regularly in the catchup state because it keeps being interrupted. The problem seems to be on the master server, which closes the connections unexpectedly. Could you add the master's log from the time of the slave's error log entries to the question?
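(For logical replication, the subscriber-side view to look at instead is pg_stat_subscription - a sketch of what to run on the slave to see whether the apply worker is running and how far it has got:)
slave=# SELECT subname, pid, received_lsn, latest_end_lsn, latest_end_time
slave-#   FROM pg_stat_subscription;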