I am preparing a Postgres Docker image based on the official image. I have an application that fills the database with initial data (this takes about an hour for ~300MB of data). I want to make an image out of that Postgres database so that the data is already initialized. This works fine, but due to the nature of the WAL, the PG_DATA directory is larger than necessary. For Docker image distribution, minifying PG_DATA is important. This will add up if I later think about adding a couple more layers of data, which all add their own set of WAL files that are not strictly necessary.
What options do I have to minimize PG_DATA? My main question is about minimizing the WAL, at the end I talk about minimizing other areas of PG_DATA.
Delete the WAL
When is it safe for me to delete the WAL?
To be clear, this is a throw-away container filled with some initial data. It’s not intended to be a productive system. archive_mode = off
, wal_level = minimal
, no replication.
I am trying to keep the WAL small by the following config:
wal_keep_size = 0
wal_recycle = off
min_wal_size = 32MB
max_wal_senders = 0 # necessary due to wal_level = minimal
However, even after issuing CHECKPOINT;
, the pg_wal
directory is still ~200MB big.
Why is Postgres not deleting the files? Based on the settings, I would assume that only 32MB should be kept.
After CHECKPOINT;
and stopping the server, is it safe to delete all files from pg_wal
?
From what I can tell from other questions and answers, Postgres should automatically delete the WAL files. Some Q&As I am already aware, but do not talk about my specific problem and use-case:
- How can I solve postgresql problem after deleting wal files?
- https://stackoverflow.com/questions/49650016/how-to-reduce-wal-file-count-in-edb-postgresql-9-6-instance
- https://stackoverflow.com/questions/35144403/which-postgresql-wal-files-can-i-safely-remove-from-the-wal-archive-folder
- https://stackoverflow.com/questions/49539938/postgres-wal-file-not-getting-deleted
- https://stackoverflow.com/questions/68858808/postgresql-how-to-safely-remove-files-inside-pg-wal-directory
These questions are mostly concerned about archive, backup, replication, PITR. But that does not apply to my use-case.
Minimizing PG_DATA
My idea is that after executing CHECKPOINT;
and clearing the WAL, I can also do a VACUUM FULL;
to release any space to the file system. This seems to work as intended, I see the base
directory getting a bit smaller. Of course, I expect this to have the most effect after deleting/updating data, not after mostly inserting rows during initialization.
Are there even more options to minimize PG_DATA?
2 Answers 2
After you have populated your database, shut it down cleanly (important!) and run
pg_resetwal -D /path/to/data/directory
That will truncate your WAL. Never manually delete WAL files.
-
Exactly what I was looking for! jjanes’ answer also contains very useful information.Martin Nyolt– Martin Nyolt2022年02月11日 07:25:22 +00:00Commented Feb 11, 2022 at 7:25
At the end of a checkpoint, the checkpointer looks at how much WAL was used recently, and recycles obsolete WAL files (by renaming them with names "from the future") to be ready for them to be reused. It deletes any extra obsolete WAL files beyond the number it wants to recycle. The problem is that during bulk loading, WAL is getting written very fast so a lot of files get recycled. But when bulk loading is suddenly done and WAL is no longer getting written fast, those recycled files are still there. The next checkpoint, when things are quieter, sees it no longer needs so many recycled WAL files, but it is not willing to delete the existing recycled ahead ones, it just declines to recycle even more by deleting the now-obsolete (past) ones. But since things were quite, there will be few if any of those.
Eventually that oversupply of recycled ahead files would get used up and deleted, but if the server is now very quiet, it will take a very long time to use up the supply.
As Laurenz says, you can use pg_resetwal to remove extra files (both past and future) for you. You could remove them manually, but if you screw it up your data might be toast. You could also turn off wal_recycle while bulk loading. This means every WAL file will be created at the time of need, so none will get recycled. This will slow down (or maybe speed up) the bulk load by some amount that depends on your OS and FS.
-
Very good explanation! I actually disabled
wal_recycle
after the initial load. Now I understand why that did not work. I need to keep it disabled from the start.Martin Nyolt– Martin Nyolt2022年02月11日 07:29:23 +00:00Commented Feb 11, 2022 at 7:29
base
dir is 300M,pg_wal
is 200M. Ideally, I would like to deletepg_wal
completely, but 16M or maybe 32M would also be acceptable.max_wal_size
to something lowish and see if it has a huge performance impact. And to live with the image size. Thanks a lot!