I have a PostgreSQL 9.5 cluster with a master and a standby server, replicated using repmgr. The master is configured to keep 4000 WAL files (roughly 62 GB at the default 16 MB segment size):
wal_level = logical
hot_standby = on
archive_command = 'test ! -f /mnt/share/psql/archive/psql/%f && cp %p /mnt/share/psql/archive/psql/%f'
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
wal_keep_segments = 4000
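For reference, the values actually in effect on the master can be cross-checked against this file with a query like the following (just a sketch against the standard pg_settings view, nothing specific to this setup):

SELECT name, setting, source
FROM pg_settings
WHERE name IN ('wal_level', 'wal_keep_segments', 'max_wal_senders', 'max_replication_slots');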
At some point the standby server falls behind the master (I have no idea why this happens):
2017-01-28 23:49:24 UTC ERROR: current transaction is aborted, commands ignored until end of transaction block
2017-01-28 23:49:24 UTC STATEMENT:
2017-01-28 23:51:10 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5A000000C8, offset 5201920
2017-01-28 23:51:10 UTC LOG: started streaming WAL from primary at 1E5A/C8000000 on timeline 2
2017-01-29 00:04:59 UTC FATAL: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-01-29 00:16:10 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5B000000C5, offset 5242880
2017-01-29 00:16:10 UTC LOG: started streaming WAL from primary at 1E5B/C5000000 on timeline 2
2017-01-29 00:50:07 UTC FATAL: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-01-29 00:50:07 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5D0000000A, offset 5373952
2017-01-29 00:50:22 UTC LOG: started streaming WAL from primary at 1E5D/A000000 on timeline 2
2017-01-29 03:29:35 UTC FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000200001E64000000B7 has already been removed
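For context, how far the standby has fallen behind can be measured on the master with something like this (a sketch using the 9.5 names pg_current_xlog_location, pg_xlog_location_diff and the replay_location column of pg_stat_replication):

SELECT application_name, state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;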
Obviously the master removed the needed WAL segment 0000000200001E64000000B7, but the file is still present in the archive.
Anyway, after restarting the standby the database seems to reach a consistent state:
...
2017-01-29 13:45:35 UTC LOG: restored log file "0000000200001E64000000B1" from archive
2017-01-29 13:45:36 UTC LOG: restored log file "0000000200001E64000000B2" from archive
2017-01-29 13:45:36 UTC LOG: restored log file "0000000200001E64000000B3" from archive
2017-01-29 13:45:36 UTC LOG: restored log file "0000000200001E64000000B4" from archive
2017-01-29 13:45:37 UTC LOG: restored log file "0000000200001E64000000B5" from archive
2017-01-29 13:45:37 UTC LOG: restored log file "0000000200001E64000000B6" from archive
2017-01-29 13:45:37 UTC LOG: restored log file "0000000200001E64000000B7" from archive
2017-01-29 13:45:38 UTC LOG: consistent recovery state reached at 1E64/B7DFFD78
2017-01-29 13:45:38 UTC LOG: recovery stopping after reaching consistency
2017-01-29 13:45:38 UTC LOG: recovery has paused
2017-01-29 13:45:38 UTC HINT: Execute pg_xlog_replay_resume() to continue.
2017-01-29 13:45:38 UTC LOG: database system is ready to accept read only connections
PostgreSQL suggests executing pg_xlog_replay_resume(), but this would lead to promoting the standby to master and thus a split-brain situation.
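For what it's worth, the pause state can be inspected without resuming anything; pg_is_xlog_replay_paused() is the 9.x name of this read-only check:

SELECT pg_is_xlog_replay_paused();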
$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master | psql01a | | host=psql01a user=repmgr
standby | psql01b | psql01a | host=psql01b user=repmgr
recovery.conf:
restore_command = 'cp /mnt/share/psql/archive/psql/%f %p'
recovery_target_inclusive = true
recovery_target = 'immediate'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'user=repmgr port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres host=psql01a application_name=psql01b password=ZDIzNjk2OTM2MWYyNjNiYzk5ZDVhMWIw'
recovery_min_apply_delay = 0
PostgreSQL 9.5 comes with a new parameter, recovery_target_action, that can take 3 values in recovery.conf:
pause (default): recovery will be paused
shutdown: the server will be stopped after reaching the recovery target
promote: the recovery process will finish and the server will start to accept connections
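In recovery.conf this is a single-line setting, e.g. (a sketch of the value tried below):

recovery_target_action = 'promote'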
promote seems to be the logical choice according to the documentation; however, it again leads to a split-brain situation:
2017-01-29 19:31:27 UTC LOG: consistent recovery state reached at 1E64/B7DFFD78
2017-01-29 19:31:27 UTC LOG: recovery stopping after reaching consistency
2017-01-29 19:31:27 UTC LOG: redo done at 1E64/B7DFFD78
2017-01-29 19:31:27 UTC LOG: last completed transaction was at log time 2017-01-29 00:32:06.442239+00
2017-01-29 19:31:27 UTC LOG: database system is ready to accept read only connections
cp: cannot stat '/mnt/share/psql/archive/psql/00000003.history': No such file or directory
2017-01-29 19:31:27 UTC LOG: selected new timeline ID: 3
cp: cannot stat '/mnt/share/psql/archive/psql/00000002.history': No such file or directory
2017-01-29 19:31:28 UTC LOG: archive recovery complete
$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master | psql01a | | host=psql01a user=repmgr
* master | psql01b | psql01a | host=psql01b user=repmgr
The question is: how can WAL streaming be resumed without promoting the standby to master? Is it possible to do this automatically?
1 Answer
Don't specify a recovery_target. Then recovery_target_action doesn't matter.
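A sketch of what the recovery.conf from the question could look like with the recovery target dropped (values copied from the question; recovery_target_inclusive is omitted as well, since it only applies when a target is set):

restore_command = 'cp /mnt/share/psql/archive/psql/%f %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'user=repmgr port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres host=psql01a application_name=psql01b password=ZDIzNjk2OTM2MWYyNjNiYzk5ZDVhMWIw'
recovery_min_apply_delay = 0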
- Thanks, I'll give it a try. Is it a good idea to have recovery.conf present even when replication is working? – Tombart, Jan 30, 2017 at 9:09
- You need it around to set primary_conninfo and standby_mode. Those are the only two entries you need if you are purely streaming (not optionally getting from the WAL archive as well). – jjanes, Jan 30, 2017 at 15:18
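Following that comment, a streaming-only recovery.conf could be as small as this sketch (connection options trimmed for illustration; without a restore_command the standby never falls back to the archive):

standby_mode = on
primary_conninfo = 'host=psql01a port=5432 user=repmgr application_name=psql01b'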