I am running an RDS instance of PostgreSQL 9.6.3: select version();
returns PostgreSQL 9.6.3 on x86_64-pc-linux-gnu [...]
.
I have found that after issuing a stop from the RDS console followed by a start the database always reports that the database system was not properly shutdown, requiring recovery. This behavior has been independently verified by at least one other RDS PostgreSQL user (https://forums.aws.amazon.com/message.jspa?messageID=809401#809401).
The query
select name, setting
from pg_settings
where name in ('fsync', 'wal_sync_method', 'synchronous_commit');
reports that fsync = on
, wal_sync_method = fdatasync
(correct for a Linux system), and synchronous_commit = on
.
I see the following in the logs after issuing a stop:
2017年10月12日 16:37:36 UTC::@:[3464]:LOG: received fast shutdown request
2017年10月12日 16:37:36 UTC::@:[3464]:LOG: aborting any active transactions
2017年10月12日 16:37:36 UTC::@:[3515]:LOG: autovacuum launcher shutting down
2017年10月12日 16:37:36 UTC::@:[3512]:LOG: shutting down
2017年10月12日 16:37:36 UTC::@:[3512]:LOG: checkpoint starting: shutdown immediate
2017年10月12日 16:37:36 UTC::@:[3512]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.007 s, sync=0.002 s, total=0.145 s; sync files=1, longest=0.002 s, average=0.002 s; distance=16384 kB, estimate=16402 kB
2017年10月12日 16:37:36 UTC::@:[3464]:LOG: database system is shut down
And the following after issuing a start:
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: database system was interrupted; last known up at 2017年10月12日 16:37:50 UTC
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: database system was not properly shut down; automatic recovery in progress
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: redo starts at 165/1160
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: unexpected pageaddr 164/D2000000 in log segment 000000010000016500000003, offset 0
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: redo done at 165/20000A0
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: last completed transaction was at log time 2017年10月12日 16:50:53.823582+00
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: checkpoint starting: end-of-recovery immediate
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=0.029 s, sync=0.002 s, total=0.046 s; sync files=2, longest=0.002 s, average=0.001 s; distance=49147 kB, estimate=49147 kB
2017年10月12日 17:05:33 UTC::@:[3293]:LOG: MultiXact member wraparound protections are now enabled
Given my understanding of PostgreSQL's (fast) shutdown and startup sequences, the above log messages seem to indicate that PostgreSQL is writing and completing a final checkpoint before shutdown, and then successfully shutting down.
However, based on the code at xlog.c line 6023 and xlog.c line 6503 (https://github.com/postgres/postgres/blob/ca9cfed883333d5801716eb01cf28b6b5be2b5cd/src/backend/access/transam/xlog.c; can't post more than 2 links), corresponding to the database system was interrupted [...]
and database system was not properly shut down [...]
log lines, respectively, it would appear that at least the pg_control
file was not flushed to disk. This worries me, as I'm expecting Amazon have taken care to make sure that the volumes on which PostgreSQL data and log files are stored are fit-for-purpose (i.e. they don't report fsync
completion until it is really complete).
Is there another explanation for this behavior?
1 Answer 1
An answer to this question was provided on the AWS Developer Forums thread to which it was originally posted.
Summary
RDS PostgreSQL (as of the versions available in April 2018) doesn't always recover after restart; however, if the RDS instance does not shutdown within an unspecified timeout, RDS will force terminate the instance, requiring PostgreSQL to recover on reboot.
Full text of reply
Thanks for using RDS!
I saw this thread referenced somewhere else and thought I'd give a quick input. I work closely with the PostgreSQL service team and I can confirm that the behavior you're observing is correct. As of today (April 2018) it's the case that sometimes, if shutdown hasn't completed quickly enough, the RDS automation will time-out and force termination. Various behaviors like this have in fact existed for a long time, so you may have observed it before. In the handful of specific cases that have been discussed on this thread, I can't really comment on why PostgreSQL (or other database engines) took longer than usual to shut down - there are probably many different possible causes for this.
To be clear, the specifics about if and how and when a force-termination occurs might change over time depending on many variables. What will never change is the RDS team's commitment to durability and availability of customers' data. Meaning that (1) our engineers carefully design with safety in mind, avoiding things that might increase risk of corruption in any way and (2) our engineers are always mindful that downtime - whether measured in minutes or seconds or milliseconds - is worth minimizing as far as safely possible.
I hope this helps, to hear that what you're observing is there by design!
-Jeremy
Explore related questions
See similar questions with these tags.
fsync
, that's just my hypothesis. It's also possible the instance terminates before the database engine canfsync
thepg_control
file. My understanding is thatfsync
reporting success before the data is actually written to disk is not unheard of: some drive controllers cache writes. I suspect RDS PostgreSQL uses EBS, which is SAN (or SAN like), which complicates things.