4

I have a PostgreSQL 9.5 cluster with master and standby server that are replicated using repmgr. Master is configured to keep 4000 WAL files:

wal_level = logical
hot_standby = on
archive_command = 'test ! -f /mnt/share/psql/archive/psql/%f && cp %p /mnt/share/psql/archive/psql/%f'
max_worker_processes = 10 
max_replication_slots = 10
max_wal_senders = 10
wal_keep_segments = 4000

At some point the standby server falls behind master (I have no idea why is this happening):

2017年01月28日 23:49:24 UTC ERROR: current transaction is aborted, commands ignored until end of transaction block
2017年01月28日 23:49:24 UTC STATEMENT: 
2017年01月28日 23:51:10 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5A000000C8, offset 5201920
2017年01月28日 23:51:10 UTC LOG: started streaming WAL from primary at 1E5A/C8000000 on timeline 2
2017年01月29日 00:04:59 UTC FATAL: could not send data to WAL stream: server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
2017年01月29日 00:16:10 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5B000000C5, offset 5242880
2017年01月29日 00:16:10 UTC LOG: started streaming WAL from primary at 1E5B/C5000000 on timeline 2
2017年01月29日 00:50:07 UTC FATAL: could not send data to WAL stream: server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
2017年01月29日 00:50:07 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5D0000000A, offset 5373952
2017年01月29日 00:50:22 UTC LOG: started streaming WAL from primary at 1E5D/A000000 on timeline 2
2017年01月29日 03:29:35 UTC FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000200001E64000000B7 has already been removed

Obviously master removed needed WAL 0000000200001E64000000B7, but the file is still present in the archive. Anyway, after standby restart the database seems to reach consistent state:

...
2017年01月29日 13:45:35 UTC LOG: restored log file "0000000200001E64000000B1" from archive
2017年01月29日 13:45:36 UTC LOG: restored log file "0000000200001E64000000B2" from archive
2017年01月29日 13:45:36 UTC LOG: restored log file "0000000200001E64000000B3" from archive
2017年01月29日 13:45:36 UTC LOG: restored log file "0000000200001E64000000B4" from archive
2017年01月29日 13:45:37 UTC LOG: restored log file "0000000200001E64000000B5" from archive
2017年01月29日 13:45:37 UTC LOG: restored log file "0000000200001E64000000B6" from archive
2017年01月29日 13:45:37 UTC LOG: restored log file "0000000200001E64000000B7" from archive
2017年01月29日 13:45:38 UTC LOG: consistent recovery state reached at 1E64/B7DFFD78
2017年01月29日 13:45:38 UTC LOG: recovery stopping after reaching consistency
2017年01月29日 13:45:38 UTC LOG: recovery has paused
2017年01月29日 13:45:38 UTC HINT: Execute pg_xlog_replay_resume() to continue.
2017年01月29日 13:45:38 UTC LOG: database system is ready to accept read only connections

PostgreSQL suggest to execute pg_xlog_replay_resume(), but this would lead to promoting standby to master, thus a split-brain situation.

$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master | psql01a | | host=psql01a user=repmgr
 standby | psql01b | psql01a | host=psql01b user=repmgr

recovery.conf:

restore_command = 'cp /mnt/share/psql/archive/psql/%f %p'
recovery_target_inclusive = true
recovery_target = 'immediate'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'user=repmgr port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres host=psql01a application_name=psql01b password=ZDIzNjk2OTM2MWYyNjNiYzk5ZDVhMWIw'
recovery_min_apply_delay = 0

PostgreSQL 9.5 comes with a new parameter recovery_target_action that can have 3 values in recovery.conf:

  • pause (default) recovery will be paused
  • shutdown will stop the server after reaching the recovery target
  • promote means the recovery process will finish and the server will start to accept connections

promote seems to be logical choice according to the documentation, however it leads again to split-brain situation:

2017年01月29日 19:31:27 UTC LOG: consistent recovery state reached at 1E64/B7DFFD78
2017年01月29日 19:31:27 UTC LOG: recovery stopping after reaching consistency
2017年01月29日 19:31:27 UTC LOG: redo done at 1E64/B7DFFD78
2017年01月29日 19:31:27 UTC LOG: last completed transaction was at log time 2017年01月29日 00:32:06.442239+00
2017年01月29日 19:31:27 UTC LOG: database system is ready to accept read only connections
cp: cannot stat '/mnt/share/psql/archive/psql/00000003.history': No such file or directory
2017年01月29日 19:31:27 UTC LOG: selected new timeline ID: 3
cp: cannot stat '/mnt/share/psql/archive/psql/00000002.history': No such file or directory
2017年01月29日 19:31:28 UTC LOG: archive recovery complete
$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master | psql01a | | host=psql01a user=repmgr
* master | psql01b | psql01a | host=psql01b user=repmgr

The question is how to resume WAL streaming without promoting standby to master? Is it possible to do this in an automatic manner?

asked Jan 29, 2017 at 14:22

1 Answer 1

1

Don't specify a recovery_target. Then recovery_target_action doesn't matter.

answered Jan 30, 2017 at 1:20
2
  • Thanks, I'll give it a try. Is it good idea to have recovery.conf present even in case that replication is working? Commented Jan 30, 2017 at 9:09
  • You need it around to set primary_conninfo, and standby_mode. Those are the only two entries you need if you are purely streaming (not optionally getting from WAL archive as well) Commented Jan 30, 2017 at 15:18

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.