We've had to move the master around a bit. It started on server 01 and got moved to 02 (which was a slave). We need to move it again, so we built 04 and are trying to slave it off 02, but we're getting the following errors:
2018-02-25 17:00:08 UTC FATAL: highest timeline 3 of the primary is behind recovery timeline 4
2018-02-25 17:00:13 UTC FATAL: highest timeline 3 of the primary is behind recovery timeline 4
2018-02-25 17:00:18 UTC FATAL: highest timeline 3 of the primary is behind recovery timeline 4
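(For reference, one way to check which timeline each instance is actually on is pg_controldata; the following is only a sketch, and the binary location is an assumption based on the Debian-style 9.4 paths used elsewhere in this question.)

# Sketch: print the timeline of a cluster (run against 02 and against 04).
/usr/lib/postgresql/9.4/bin/pg_controldata /var/lib/postgresql/9.4/main | grep TimeLineID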
The initial base backup was taken like this:
pg_basebackup --verbose --progress -d "host=10.132.x.x user=backup password=...." -D /var/lib/postgresql/9.4/main/ -l 'instance restore' --xlog-method=stream
The recovery.conf looks like this:
restore_command = 'if [ -f /srv/postgresql/archive/${DATASET}/%f ]; then cp /srv/postgresql/archive/${DATASET}/%f %p; else aws s3 cp --quiet s3://company-backups/postgresql/${DATASET}/archive/%f %p; fi'
standby_mode = 'on'
primary_conninfo = 'host=10.132.x.x user=backup password=....'
recovery_target_timeline = 'latest'
trigger_file = '/var/lib/postgresql/9.4/main/failover'
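(Another thing worth checking, sketched here with the same paths as in restore_command above: timeline history files record where each timeline branched off, so a 00000004.history file in the archive or in pg_xlog would be where the "recovery timeline 4" in the error is coming from.)

# Sketch: look for timeline history files; ${DATASET} is the same placeholder as above.
ls /srv/postgresql/archive/${DATASET}/*.history
ls /var/lib/postgresql/9.4/main/pg_xlog/*.history
cat /srv/postgresql/archive/${DATASET}/00000004.history   # parent timeline and switch point, if the file exists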
1 Answer
By the sound of it, you are in a split-brain situation: the original master (01) was never demoted, so after the promotion of 02 it simply carried on as another master.
Fixing such issues before 9.5 (the version in which pg_rewind became part of core PostgreSQL) is not easy; most probably you will need some manual cleanup. What is certain is that any writes that reached 01 after the promotion of 02 will be lost (or the writes on 02 will, depending on which side you choose to keep).
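For completeness, on 9.5 or later the rewind of the old master would look roughly like the sketch below. The data directory path and the source connection string are assumptions (pg_rewind needs a superuser connection to the source on 9.5), and it only works if the target was shut down cleanly and has wal_log_hints = on or data checksums enabled.

# Sketch only: not applicable to this 9.4 cluster, shown to illustrate the 9.5+ path.
# Run on the old master (01) while it is stopped, pointing at the new master (02).
pg_rewind --target-pgdata=/var/lib/postgresql/9.5/main \
          --source-server='host=10.132.x.x user=postgres dbname=postgres' \
          --progress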
To start, I'd take a logical dump from both 01 and 02 (to check whether anything has to be manually replayed from 01 to 02), stop 01 altogether, remove the older timelines' WAL segments from the archive (or rather move them somewhere else, just in case), and then try to build a slave based on 02 again.
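A rough shell outline of those steps, purely as a sketch: host names, the dumping role, and which timeline prefixes to park are placeholders to be replaced with whatever the inspection of the archive turns up.

# 1. Logical dumps of both masters before touching anything (assumes a superuser role):
pg_dumpall -h server01 -U postgres -f /tmp/server01.sql
pg_dumpall -h server02 -U postgres -f /tmp/server02.sql
# 2. Stop 01 for good (on 01, Debian-style clusters):
pg_ctlcluster 9.4 main stop
# 3. Park (rather than delete) the WAL and history files of the unwanted timelines;
#    the first 8 hex digits of a file name are its timeline ID. For example:
mkdir -p /srv/postgresql/archive/parked
mv /srv/postgresql/archive/${DATASET}/00000004* /srv/postgresql/archive/parked/
# 4. Rebuild 04 from 02 with the same pg_basebackup invocation as before.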
You can also use pg_xlogdump to see which relations (tables, indexes, etc.) got writes since the split brain started. (Note that from version 10 the utility is named pg_waldump.)
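A sketch of such an inspection; the segment name is a made-up example, and the relfilenode lookup at the end assumes you are connected to the right database.

# Dump the records of one archived segment; the rmgr and "rel" fields show what was touched.
/usr/lib/postgresql/9.4/bin/pg_xlogdump /srv/postgresql/archive/${DATASET}/000000030000000A00000042
# Map a relfilenode from the output (e.g. the 16385 in "rel 1663/16384/16385") back to a table:
psql -U postgres -d yourdb -c "SELECT relname FROM pg_class WHERE relfilenode = 16385;"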
-
Thanks for the detailed answer, it makes perfect sense. Is there any way to cut losses and just tell it to ignore the timeline error? – Mike (Feb 26, 2018 at 12:07)
-
No, you cannot ignore it. Imagine some value 'A' changed to 'B' here and to 'C' there: how would you fix this? Even if you could match timestamps, you would miss the fact that the outcome on a single master could be different from that on two separate ones. – András Váczi (Feb 26, 2018 at 12:52)
-
I'm only asking because 01 has been destroyed. They were never managed with something like pg_pool. – Mike (Feb 26, 2018 at 13:42)
-
Look around in /srv/postgresql/archive/${DATASET}/ to see if you find files whose names start with 00000003. Those are the ones belonging to timeline 3. – András Váczi (Feb 26, 2018 at 13:55)
-
@Mike you have to figure out where timeline 4 is coming from. Check on each host what is written to pg_xlog (normally inside the data_directory); see the sketch after these comments. If the confusion is too great, you can also dump and then stop all the different masters, then restore the best one, optionally applying the differences from the other dumps. – András Váczi (Feb 26, 2018 at 21:29)
-
pg_terminate_backend was run to terminate the streaming replication from 2->1 when 2 became master, and traffic was just shifted there.
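Following up on the comments about finding where timeline 4 comes from, a sketch (the data directory is the one from the question, the archive path is the one from restore_command; run on each host):

# Confirm where pg_xlog lives, then look for timeline-4 files there and in the archive.
psql -U postgres -c 'SHOW data_directory;'
ls /var/lib/postgresql/9.4/main/pg_xlog/ | grep '^00000004'
ls /srv/postgresql/archive/${DATASET}/ | grep '^00000004'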