I have a PostgreSQL 9.5 cluster with a master and a standby server, replicated using repmgr. The master is configured to keep 4000 WAL files (roughly 62 GB at the default 16 MB segment size):
wal_level = logical
hot_standby = on
archive_command = 'test ! -f /mnt/share/psql/archive/psql/%f && cp %p /mnt/share/psql/archive/psql/%f'
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
wal_keep_segments = 4000
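For reference, the values actually in effect on the master can be cross-checked against this file with a query like the following (just a sketch against the standard pg_settings view, nothing specific to this setup):

SELECT name, setting, source
FROM pg_settings
WHERE name IN ('wal_level', 'wal_keep_segments', 'max_wal_senders', 'max_replication_slots');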
At some point the standby server falls behind the master (I have no idea why this happens):
2017-01-28 23:49:24 UTC ERROR: current transaction is aborted, commands ignored until end of transaction block
2017-01-28 23:49:24 UTC STATEMENT:
2017-01-28 23:51:10 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5A000000C8, offset 5201920
2017-01-28 23:51:10 UTC LOG: started streaming WAL from primary at 1E5A/C8000000 on timeline 2
2017-01-29 00:04:59 UTC FATAL: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-01-29 00:16:10 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5B000000C5, offset 5242880
2017-01-29 00:16:10 UTC LOG: started streaming WAL from primary at 1E5B/C5000000 on timeline 2
2017-01-29 00:50:07 UTC FATAL: could not send data to WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-01-29 00:50:07 UTC LOG: invalid magic number 0000 in log segment 0000000200001E5D0000000A, offset 5373952
2017-01-29 00:50:22 UTC LOG: started streaming WAL from primary at 1E5D/A000000 on timeline 2
2017-01-29 03:29:35 UTC FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000200001E64000000B7 has already been removed
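For context, how far the standby has fallen behind can be measured on the master with something like this (a sketch using the 9.5 names pg_current_xlog_location, pg_xlog_location_diff and the replay_location column of pg_stat_replication):

SELECT application_name, state,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;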
Obviously the master removed the needed WAL segment 0000000200001E64000000B7, but the file is still present in the archive.
Anyway, after restarting the standby the database seems to reach a consistent state:
...
2017-01-29 13:45:35 UTC LOG: restored log file "0000000200001E64000000B1" from archive
2017-01-29 13:45:36 UTC LOG: restored log file "0000000200001E64000000B2" from archive
2017-01-29 13:45:36 UTC LOG: restored log file "0000000200001E64000000B3" from archive
2017-01-29 13:45:36 UTC LOG: restored log file "0000000200001E64000000B4" from archive
2017-01-29 13:45:37 UTC LOG: restored log file "0000000200001E64000000B5" from archive
2017-01-29 13:45:37 UTC LOG: restored log file "0000000200001E64000000B6" from archive
2017-01-29 13:45:37 UTC LOG: restored log file "0000000200001E64000000B7" from archive
2017-01-29 13:45:38 UTC LOG: consistent recovery state reached at 1E64/B7DFFD78
2017-01-29 13:45:38 UTC LOG: recovery stopping after reaching consistency
2017-01-29 13:45:38 UTC LOG: recovery has paused
2017-01-29 13:45:38 UTC HINT: Execute pg_xlog_replay_resume() to continue.
2017-01-29 13:45:38 UTC LOG: database system is ready to accept read only connections
PostgreSQL suggests executing pg_xlog_replay_resume(), but this would lead to promoting the standby to master and thus a split-brain situation.
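For what it's worth, the pause state can be inspected without resuming anything; pg_is_xlog_replay_paused() is the 9.x name of this read-only check:

SELECT pg_is_xlog_replay_paused();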
$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master | psql01a | | host=psql01a user=repmgr
standby | psql01b | psql01a | host=psql01b user=repmgr
recovery.conf:
restore_command = 'cp /mnt/share/psql/archive/psql/%f %p'
recovery_target_inclusive = true
recovery_target = 'immediate'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'user=repmgr port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres host=psql01a application_name=psql01b password=ZDIzNjk2OTM2MWYyNjNiYzk5ZDVhMWIw'
recovery_min_apply_delay = 0
PostgreSQL 9.5 comes with a new parameter, recovery_target_action, that can take 3 values in recovery.conf:
pause (default): recovery will be paused
shutdown: the server will be stopped after reaching the recovery target
promote: the recovery process will finish and the server will start to accept connections
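In recovery.conf this is a single-line setting, e.g. (a sketch of the value tried below):

recovery_target_action = 'promote'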
promote seems to be the logical choice according to the documentation; however, it again leads to a split-brain situation:
2017-01-29 19:31:27 UTC LOG: consistent recovery state reached at 1E64/B7DFFD78
2017-01-29 19:31:27 UTC LOG: recovery stopping after reaching consistency
2017-01-29 19:31:27 UTC LOG: redo done at 1E64/B7DFFD78
2017-01-29 19:31:27 UTC LOG: last completed transaction was at log time 2017-01-29 00:32:06.442239+00
2017-01-29 19:31:27 UTC LOG: database system is ready to accept read only connections
cp: cannot stat '/mnt/share/psql/archive/psql/00000003.history': No such file or directory
2017-01-29 19:31:27 UTC LOG: selected new timeline ID: 3
cp: cannot stat '/mnt/share/psql/archive/psql/00000002.history': No such file or directory
2017-01-29 19:31:28 UTC LOG: archive recovery complete
$ repmgr cluster show
Role | Name | Upstream | Connection String
----------+-------------------------|-------------------------|-----------------------------------------
* master | psql01a | | host=psql01a user=repmgr
* master | psql01b | psql01a | host=psql01b user=repmgr
The question is: how can WAL streaming be resumed without promoting the standby to master? Is it possible to do this automatically?
1 Answer
Don't specify a recovery_target. Then recovery_target_action doesn't matter.
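A sketch of what the recovery.conf from the question could look like with the recovery target dropped (values copied from the question; recovery_target_inclusive is omitted as well, since it only applies when a target is set):

restore_command = 'cp /mnt/share/psql/archive/psql/%f %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'user=repmgr port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres host=psql01a application_name=psql01b password=ZDIzNjk2OTM2MWYyNjNiYzk5ZDVhMWIw'
recovery_min_apply_delay = 0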
- Thanks, I'll give it a try. Is it a good idea to have recovery.conf present even when replication is working? – Tombart, Jan 30, 2017 at 9:09
- You need it around to set primary_conninfo and standby_mode. Those are the only two entries you need if you are purely streaming (not optionally getting from the WAL archive as well). – jjanes, Jan 30, 2017 at 15:18
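Following that comment, a streaming-only recovery.conf could be as small as this sketch (connection options trimmed for illustration; without a restore_command the standby never falls back to the archive):

standby_mode = on
primary_conninfo = 'host=psql01a port=5432 user=repmgr application_name=psql01b'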