Recover PostgreSQL database from WAL errors on startup?

Question 1

I'm trying to set up an OpenStreetMap server on an Ubuntu 12.04 machine using the Ubuntu packages listed at switch2osm.org. I initially installed and set up everything using a Northeast US-only map extract, but now I want to install the entire planet of maps. I downloaded planet-latest.osm.bz2 and ran osm2pgsql --slim -C 60000 planet-latest.osm.bz2 as a user with write permission to the database; this was the same command that worked to install us-northeast.osm.pbf earlier. I came back the next day to find this command appeared to finish successfully, but for some reason the rendering daemon wasn't generating new tiles from the new data. I tried restarting renderd, and when that had no effect I tried restarting the PostgreSQL server with sudo /etc/init.d/postgresql restart. However, server startup failed with the following errors in the log:

2012年07月13日 18:54:59 UTC WARNING: page 1525147 of relation base/16385/477861 was uninitialized
2012年07月13日 18:54:59 UTC WARNING: page 2247965 of relation base/16385/477861 was uninitialized
...500 more lines like this...
2012年07月13日 18:54:59 UTC WARNING: page 2262926 of relation base/16385/477861 was uninitialized
2012年07月13日 18:54:59 UTC PANIC: WAL contains references to invalid pages
2012年07月13日 18:55:00 UTC LOG: startup process (PID 22826) was terminated by signal 6: Aborted

(Pastebin of entire log here).

There isn't much information on these kinds of errors on the Internet, but from what I can find it seems to mean that either my indexes are corrupted or my Write-Ahead-Log is. The only way to fix corrupted indexes, though, is to start the database in single-user mode and rebuild them, and I can't even do that because I get the same fatal errors even when I start in single-user mode with indexing disabled.

Is there any way for me to delete the Write-Ahead Log and force the server to start up "from scratch", or a fix for this kind of corruption that doesn't require first starting the database successfully?

Alternatively, is there a way for me to delete the database and just re-import all the planet data, given that I can't start the server to execute the DROP DATABASE command?

UPDATE:

Following Craig Ringer's suggestion, I went and looked through the database logs from before the WAL errors started occurring to see if I could find any suspicious behavior. In the log from immediately before the first instance of WAL errors, I found these suspicious-looking lines:

2012年07月13日 00:20:51 UTC LOG: received fast shutdown request
2012年07月13日 00:20:51 UTC LOG: aborting any active transactions
2012年07月13日 00:20:51 UTC FATAL: terminating connection due to administrator command
2012年07月13日 00:20:51 UTC FATAL: terminating connection due to administrator command
2012年07月13日 00:20:51 UTC FATAL: terminating connection due to administrator command
2012年07月13日 00:20:51 UTC FATAL: terminating connection due to administrator command
2012年07月13日 00:20:54 UTC FATAL: terminating connection due to administrator command
2012年07月13日 00:20:54 UTC STATEMENT: CREATE TABLE planet_osm_polygon_tmp AS SELECT * 
FROM planet_osm_polygon ORDER BY way;
2012年07月13日 00:20:55 UTC FATAL: terminating connection due to administrator command
2012年07月13日 00:20:55 UTC STATEMENT: CREATE INDEX planet_osm_ways_nodes ON planet_osm_ways 
USING gin (nodes) WITH (FASTUPDATE=OFF);
2012年07月13日 00:20:57 UTC FATAL: terminating connection due to administrator command
2012年07月13日 00:20:57 UTC STATEMENT: CREATE TABLE planet_osm_line_tmp AS SELECT * 
FROM planet_osm_line ORDER BY way;
2012年07月13日 00:21:51 UTC LOG: received immediate shutdown request
2012年07月13日 00:21:52 UTC WARNING: terminating connection because of crash of another
server process
2012年07月13日 00:21:52 UTC DETAIL: The postmaster has commanded this server process 
to roll back the current transaction and exit, because another server process 
exited abnormally and possibly corrupted shared memory.
2012年07月13日 00:21:52 UTC HINT: In a moment you should be able to reconnect to the 
database and repeat your command.
2012年07月13日 00:21:52 UTC LOG: could not send data to client: Broken pipe
2012年07月13日 00:21:58 UTC WARNING: terminating connection because of crash of another 
server process
2012年07月13日 00:21:58 UTC DETAIL: The postmaster has commanded this server process 
to roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
2012年07月13日 00:21:58 UTC HINT: In a moment you should be able to reconnect to the
 database and repeat your command.
2012年07月13日 00:21:58 UTC LOG: could not send data to client: Broken pipe

(Pastebin of the entire log is here)

When it says "terminating connection due to administrator command," I assume that was my command to restart the database server. But it looks like the shutdown somehow failed horribly, resulting in corruption of shared memory. This doesn't make sense, because I restarted it "cleanly," using the /etc/init.d/postgres restart script, not an abrupt kill or manually logging in as postgres. Am I interpreting this log incorrectly? Or is there actually a problem with using /etc/init.d/postgres restart to restart a PostgreSQL server?

(Please note, since my question was moved to Database Admin, where I'm a "new user," I no longer have the ability to upvote your answers. This doesn't mean I don't appreciate the help).

Question 2

Likely more appropriate for the Server or DBA stacks.

Question 3

I thought about that, but then I noticed there was a fairly large postgresql tag here, and I don't have an account on ServerFault.

Question 4

I forgot to ask: What's the exact PostgreSQL version in use? 9.1.[?]

Question 5

@CraigRinger According to psql, it's 9.1.4.

Question 6

UPDATE: it looks like this is a bug in the Debian/Ubuntu packaging of PostgreSQL, where the init scripts - extremely unsafely - kill -9 the postmaster and remove postmaster.pid. See this post on pgsql-general.

See:

Personally, I've gone and edited my init scripts to get rid of this rather hairy and dangerous code.

The original answer

Please go back in the logs to before the restart and see if you can find any errors. WAL corruption absolutely should not happen, so if it has it's important to look into why. If you can upload a copy of the whole log to a pastebin or something that'd be really handy.

The only time where WAL corruption is an accepted possibility with PostgreSQL is if you are running with fsync=off set in PostgreSQL.conf and your system crashes or unexpectedly loses power. If that's not the cause, it'd be really good to look into what happened.

Please do not use pg_resetxlog without some idea why your xlogs are damaged. If the transaction logs become damaged something is badly wrong and you need to find out what. If you band-aid it now, you might be bitten by it later when you care about the data.

The transaction logs exist for a reason and just removing them can leave your tables and indexes in an inconsistent, damaged state. After a pg_resetxlog it's a very good idea to pg_dumpall, drop your cluster, re-initdb, and reload the DB. As I said, though, this should not happen and you should look back in the logs for clues about what could've happened.

Now read the comments

Question 7

Unfortunately, by the time I saw this post I had already run pg_resetxlog and started a pg_dump of my database. When it finishes I'll go through the logs from before I started having WAL errors and see if I can find any clues. I'm not running with fsync=off so I'm equally puzzled as to how the WAL got corrupted.

Question 8

I updated my question to include the relevant logs. I hope this helps. (Actually I did that yesterday, but I just realized StackExchange doesn't give you a notification when I edit the question, like it does when I comment).

Question 9

@Edward That's only 220 lines, and doesn't appear to end with postmaster termination. Sure that's the whole log? It looks like maybe a backend crashed during shutdown - which shouldn't hurt anything, but shouldn't happen either. The log seems to end before the shutdown finishes though.

Question 10

Yes, that's the entire log file. The very next log file in the sequence starts with the line 2012年07月13日 00:31:22 UTC LOG: database system was interrupted; last known up at 2012年07月13日 00:15:20 UTC so there are no more messages from the shutdown process. (I added a pastebin of the next log since it's also the one where you see the WAL errors).

Question 11

Thanks for posting that question to the mailing list, though there doesn't seem to be any responses to it yet. In the meantime, I can tell you that I used the /etc/init.d/postgresql restart script to restart PostgresSQL, and never used kill -9. (Question updated, too).

Question 12

To clear the WAL files, see pg_resetxlog. The data directory on Ubuntu 12 should be /var/lib/postgresql/9.1/main

Note that pg_resetxlog is located in /usr/lib/postgresql/9.1/bin, not in /usr/bin, so it's not necessarily in $PATH. Also it should be run as the postgres user.

To clear and recreate the entire cluster if you don't care about the data, run:

pg_dropcluster 9.1 main

To recreate the cluster:

pg_createcluster 9.1 main

Craig Ringer Craig Ringer 57.9k6 gold badges162 silver badges194 bronze badges · Accepted Answer · 2012-07-14 02:04:13Z

UPDATE: it looks like this is a bug in the Debian/Ubuntu packaging of PostgreSQL, where the init scripts - extremely unsafely - kill -9 the postmaster and remove postmaster.pid. See this post on pgsql-general.

See:

Personally, I've gone and edited my init scripts to get rid of this rather hairy and dangerous code.

The original answer

Please go back in the logs to before the restart and see if you can find any errors. WAL corruption absolutely should not happen, so if it has it's important to look into why. If you can upload a copy of the whole log to a pastebin or something that'd be really handy.

The only time where WAL corruption is an accepted possibility with PostgreSQL is if you are running with fsync=off set in PostgreSQL.conf and your system crashes or unexpectedly loses power. If that's not the cause, it'd be really good to look into what happened.

Please do not use pg_resetxlog without some idea why your xlogs are damaged. If the transaction logs become damaged something is badly wrong and you need to find out what. If you band-aid it now, you might be bitten by it later when you care about the data.

The transaction logs exist for a reason and just removing them can leave your tables and indexes in an inconsistent, damaged state. After a pg_resetxlog it's a very good idea to pg_dumpall, drop your cluster, re-initdb, and reload the DB. As I said, though, this should not happen and you should look back in the logs for clues about what could've happened.

Now read the comments

Unfortunately, by the time I saw this post I had already run pg_resetxlog and started a pg_dump of my database. When it finishes I'll go through the logs from before I started having WAL errors and see if I can find any clues. I'm not running with fsync=off so I'm equally puzzled as to how the WAL got corrupted.
I updated my question to include the relevant logs. I hope this helps. (Actually I did that yesterday, but I just realized StackExchange doesn't give you a notification when I edit the question, like it does when I comment).
@Edward That's only 220 lines, and doesn't appear to end with postmaster termination. Sure that's the whole log? It looks like maybe a backend crashed during shutdown - which shouldn't hurt anything, but shouldn't happen either. The log seems to end before the shutdown finishes though.
Yes, that's the entire log file. The very next log file in the sequence starts with the line 2012年07月13日 00:31:22 UTC LOG: database system was interrupted; last known up at 2012年07月13日 00:15:20 UTC so there are no more messages from the shutdown process. (I added a pastebin of the next log since it's also the one where you see the WAL errors).
Thanks for posting that question to the mailing list, though there doesn't seem to be any responses to it yet. In the meantime, I can tell you that I used the /etc/init.d/postgresql restart script to restart PostgresSQL, and never used kill -9. (Question updated, too).

Stack Exchange Network

Recover PostgreSQL database from WAL errors on startup?

2 Answers 2

The original answer

Now read the comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Recover PostgreSQL database from WAL errors on startup?

2 Answers 2

The original answer

Now read the comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions