I'm running two PostgreSQL 11 servers - a master and a slave, set up with logical replication.
The problem I'm facing is that today, after weeks of uninterrupted operation, the slave got out of sync with this error message:
2019-09-16 07:39:44.332 CEST [30117] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 07:39:44.539 CEST [12932] LOG: logical replication apply worker for subscription "logical_from_master" has started
2019-09-16 07:44.542 CEST [27972] LOG: background worker "logical replication worker" (PID 30117) exited with exit code 1
I had seen this error message before, and my fix back then was to increase wal_sender_timeout
on the master (more details on this here: logical replication in postgresql - "server closed the connection unexpectedly").
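For reference, this is roughly how I raised the timeout on the master - either by editing postgresql.conf and reloading, or with something like:
master=# ALTER SYSTEM SET wal_sender_timeout = 1200000;  -- milliseconds, i.e. 20 minutes
master=# SELECT pg_reload_conf();                        -- this setting only needs a reload, not a restart
master=# SHOW wal_sender_timeout;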
I then wanted to restore replication, however the replication state is stuck on catchup:
master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------
86864 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46110 | 2019-09-16 12:45:56.491325+02 | | catchup | D55/FA04D4B8 | D55/F9E74158 | D55/F9E44CD8 | D55/F9E74030 | 00:00:03.603104 | 00:00:03.603104 | 00:00:03.603104 | 0 | async
(1 row)
I tried restarting the slave a few times, with different combinations of the subscription enabled and disabled - nothing helps, the replication state keeps staying on catchup. I can see that the sent_lsn and write_lsn values are changing, so something is being sent through...
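For clarity, by enabling and disabling the subscription I mean commands along these lines, run on the slave (subscription name as in the logs above):
ALTER SUBSCRIPTION logical_from_master DISABLE;
ALTER SUBSCRIPTION logical_from_master ENABLE;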
This is my slave configuration:
wal_level=logical
max_replication_slots=2
max_logical_replication_workers=4
wal_receiver_timeout=1200000
And this is my master:
wal_level=logical
max_replication_slots=10
max_wal_senders=10
# maximum wait time in milliseconds that the walsender process on the master
# waits for a status message from the walreceiver process on the standby.
wal_sender_timeout=1200000
I have no idea what to do (even worse, at this stage I have no idea what to check next...).
Can you help me understand what I should do to make my slave catch up, so it's back to the streaming state?
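If it helps to narrow things down, I can also query the state of the replication slot on the master - this is just the standard catalog view, nothing custom on my side:
master=# SELECT slot_name, plugin, slot_type, active, restart_lsn, confirmed_flush_lsn
master-#   FROM pg_replication_slots;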
Edit (12 hours later)
When I checked in the morning, synchronisation was still in the catchup state:
master=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+-------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
12965 | 16680 | my_user | logical_from_master | 10.10.10.10 | | 46630 | 2019-09-17 06:40:18.801262+02 | | catchup | D56/248E13A0 | D56/247E3908 | D56/247E3908 | D56/247E3908 | | | | 0 | async
(1 row)
But when I checked again 60 seconds later, the result set was empty...
The logs now show multiple incarnations of the same error:
2019-09-16 22:43:33.841 CEST [20260] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-16 22:43:33.959 CEST [26087] LOG: background worker "logical replication worker" (PID 20260) exited with exit code 1
2019-09-16 22:43:34.112 CEST [3510] LOG: logical replication apply worker for subscription "logical_from_master" has started
(...)
In order to make the replication show up as catchup on the master, I now have to restart the slave first...
Edit (in response to @LaurenzAlbe comment)
I rebuilt the replica yesterday morning and observed replication failing again starting from 19:53. Logs from the replica and from the master are below:
2019-09-18 19:15:13.767 CEST [8611] LOG: logical replication table synchronization worker for subscription "logical_replica_from_master", table "lasttable" has finished
2019-09-18 19:54:14.875 CEST [11469] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-18 19:54:14.969 CEST [10330] LOG: logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-18 19:54:15.031 CEST [11217] LOG: background worker "logical replication worker" (PID 11469) exited with exit code 1
The corresponding log from the master:
2019-09-18 19:50:36.386 CEST,,,111051,,5d826e6a.1b1cb,1,,2019-09-18 19:50:34 CEST,138/28493452,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43783 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:51:36.402 CEST,,,1714,,5d826ea6.6b2,1,,2019-09-18 19:51:34 CEST,316/16529009,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17925 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.01 s, system: 0.07 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:52:36.421 CEST,,,2649,,5d826ee2.a59,1,,2019-09-18 19:52:34 CEST,153/19807659,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.87 s",,,,,,,,,""
2019-09-18 19:53:36.424 CEST,,,2945,,5d826f1e.b81,1,,2019-09-18 19:53:34 CEST,317/15405278,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.03 s, system: 0.05 s, elapsed: 1.88 s",,,,,,,,,""
2019-09-18 19:54:15.123 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,1,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/7A4D40, reading WAL from D5B/7A4D40.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:15.124 CEST,"core","my_db",3073,"10.194.132.16:50372",5d826f47.c01,2,"idle",2019-09-18 19:54:15 CEST,317/0,0,LOG,00000,"logical decoding found consistent point at D5B/7A4D40","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-18 19:54:36.442 CEST,,,3152,,5d826f5a.c50,1,,2019-09-18 19:54:34 CEST,362/5175766,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 8949 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 43798 remain, 43795 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 17924 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.02 s, system: 0.06 s, elapsed: 1.88 s",,,,,,,,,""
Then, around midnight, on the slave:
2019-09-19 00:16:48.167 CEST [10330] ERROR: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2019-09-19 00:16:48.276 CEST [19530] LOG: logical replication apply worker for subscription "logical_replica_from_master" has started
2019-09-19 00:16:48.324 CEST [11217] LOG: background worker "logical replication worker" (PID 10330) exited with exit code 1
and the corresponding log on the master:
2019-09-19 00:15:41.104 CEST,,,74257,,5d82ac89.12211,1,,2019-09-19 00:15:37 CEST,78/34511468,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27234 hits, 0 misses, 1 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.003 MB/s
system usage: CPU: user: 0.03 s, system: 0.08 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:13.688 CEST,,,35656,,5d382555.8b48,11190,,2019-07-24 11:31:01 CEST,,0,LOG,00000,"checkpoint complete: wrote 1748 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=174.932 s, sync=0.000 s, total=174.936 s; sync files=75, longest=0.000 s, average=0.000 s; distance=11366 kB, estimate=13499 kB",,,,,,,,,""
2019-09-19 00:16:41.121 CEST,,,75038,,5d82acc5.1251e,1,,2019-09-19 00:16:37 CEST,185/19338019,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13603 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64816 remain, 64813 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27233 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,1,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"starting logical decoding for slot ""logical_replica_from_master""","Streaming transactions committing after D5B/1D1F1C0, reading WAL from D5B/1CA07F8.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:16:48.335 CEST,"core","my_db",75294,"10.194.132.16:50480",5d82acd0.1261e,2,"idle",2019-09-19 00:16:48 CEST,315/0,0,LOG,00000,"logical decoding found consistent point at D5B/1CA07F8","There are no running transactions.",,,,,,,,"logical_replica_from_master"
2019-09-19 00:17:41.141 CEST,,,75484,,5d82ad01.126dc,1,,2019-09-19 00:17:37 CEST,330/18178915,0,LOG,00000,"automatic vacuum of table ""my_db.pg_toast.pg_toast_22314"": index scans: 0
pages: 0 removed, 13613 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 64866 remain, 64863 are dead but not yet removable, oldest xmin: 3141915780
buffer usage: 27254 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.04 s, system: 0.07 s, elapsed: 2.85 s",,,,,,,,,""
Edit
The following error message (on the master) appears exactly one wal_sender_timeout interval after replication is enabled on the slave:
2019-09-19 13:33:58.015 CEST,"core","nzdb",112432,"10.194.132.16:50886",5d8362f5.1b730,5,"idle",2019-09-19 13:13:57 CEST,379/2076197,0,LOG,00000,"terminating walsender process due to replication timeout",,,,,"slot ""logical_replica_from_master"", output plugin ""pgoutput"", in the change callback, associated LSN D5B/6782CF0",,,"WalSndCheckTimeOut, walsender.c:2100","logical_replica_from_master"
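For completeness, the timeout currently in effect on the master (1200000 ms, i.e. 20 minutes, matching the config above) can be double-checked with:
master=# SELECT name, setting, unit FROM pg_settings
master-#  WHERE name IN ('wal_sender_timeout', 'wal_receiver_timeout');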
Edit
I added more RAM to that server, but the observation is still the same - after wal_sender_timeout the worker on the slave logs the error mentioned above, and on the master I'm left with the following in pg_stat_replication:
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+--------------------------------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
87820 | 16680 | core | logical_replica_from_master_27004_sync_21691 | 10.10.10.10 | | 55548 | 2019-09-19 15:31:40.032662+02 | 3142872730 | startup | | | | | | | | 0 | async
(1 row)
Then, after a very long while, it's back to catchup but with a different sent_lsn.
If I run an INSERT into a test table on the master, I do not see the changes on the slave.
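To be explicit, the test I mean is trivial - the table and column names below are just placeholders for one of the replicated tables:
master=# INSERT INTO replicated_test_table (id, note) VALUES (1, 'replication probe');
-- then on the slave, which should show the new row once replication works again:
slave=#  SELECT id, note FROM replicated_test_table WHERE note = 'replication probe';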
2 Answers
As you already found out, your master database is too busy for a single replication worker to handle all of the changes.
You need to split your tables into groups, but make sure that tables related by foreign keys are handled by the same worker; otherwise you might get into a situation where a foreign key constraint prevents data from being inserted into one table because the table its foreign key points at has not been updated yet.
I underestimated how busy the master database was, because back in July I had observed replication working without errors. Apparently July is a holiday month, so the database sustained slightly less load - just enough for the problem not to manifest.
One of my colleagues pointed out that there are multiple processes writing to that database concurrently, so maybe a single WAL sender is simply not coping with the volume of information. This was very valid advice, and afterwards I was scratching my head wondering why I had not thought of that in the first place. @jjanes touched on this in the first comment too. I had put too much trust in how Postgres adapts, even with default options, to such different workloads.
So what I'm doing now is avoiding CREATE PUBLICATION .. FOR ALL TABLES and instead creating multiple publications, with multiple corresponding subscriptions on the replica side.
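A rough sketch of what that looks like (the publication, subscription, table and connection values below are placeholders - the point is that tables linked by foreign keys go into the same publication, so they are applied by the same worker):
-- on the master: several smaller publications instead of FOR ALL TABLES
CREATE PUBLICATION pub_orders FOR TABLE orders, order_items;  -- FK-related tables stay together
CREATE PUBLICATION pub_events FOR TABLE events;
-- on the replica: one subscription (and therefore one apply worker) per publication
CREATE SUBSCRIPTION sub_orders
  CONNECTION 'host=master-host dbname=my_db user=my_user'
  PUBLICATION pub_orders;
CREATE SUBSCRIPTION sub_events
  CONNECTION 'host=master-host dbname=my_db user=my_user'
  PUBLICATION pub_events;
Each subscription gets its own replication slot on the master and its own apply worker on the replica, so max_replication_slots and max_logical_replication_workers have to be at least the number of subscriptions.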
-
How to handle a duplicate key error on the original primary node which is now serving as a standby node? When the subscription is started on the primary and the secondary is the publisher, the subscription fails with a duplicate key error because the primary already has the data from before the node went down. – nishi, Aug 4, 2021 at 12:26
-
I was using replicas to share the load and serve read-only clients; I never tested the scenario you describe. My guess is that your new primary would have to stream everything back to your new secondary, but please don't take my word for it as I have not used logical replicas in your scenario. – Greg0ry, Aug 4, 2021 at 14:30
pg_stat_wal_receiver is empty because WAL receivers only run with physical replication; that's normal. I guess that replication is regularly in the catchup state because it keeps being interrupted. The problem seems to be on the master server, which closes the connections unexpectedly. Could you add the master's log from the time of the slave's error log entries to the question?
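(For logical replication, the subscriber-side view to look at instead is pg_stat_subscription - a sketch of what to run on the slave to see whether the apply worker is running and how far it has got:)
slave=# SELECT subname, pid, received_lsn, latest_end_lsn, latest_end_time
slave-#   FROM pg_stat_subscription;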