I have a situation on my Postgres 13 db on AWS.
- 8TB of storage
- 60GB of memory that it isn't really using
- I regularly check to see if any query is running and if so, kill it.
- Transaction Log Disk Usage in blue, replication slot disk usage in orange. Transaction Log Disk Usage
- Replication lag is over 500GB behind
- FreeStorageSpace is sawtoothing FreeStorageSpace sawtooth
- I combed the logs looking for "timeout", "error", on replication publisher and subscriber in hopes that the
wal_sender_timeout
andwal_receiver_timeout
needed adjustment, however I see nothing. - I see the weird LSN behavior where it advances, the machine runs out of space, and then rewinds as if to start all over again.
-- 3851F/6305C2D0 7:46
-- 38521/3829A280 7:50
-- 38535/9FC44768 8:38
-- 38544/8E82F3D8 9:12
-- 3854F/8BD52F00 9:39pm
-- 38504/9EABBE48 6:29am <- rollover
-- 3851D/9413C9D0 6:11pm
- The amount of data sent between the publisher and subscriber is actually very low when looking at the AWS network sent/received charts.
- I actually had 2 subscribers at the beginning. I took the smaller one offline since it's easily repairable. That helped with the storage pressure, but only a tiny bit.
Questions:
- I don't know for certain that it's the replication that's causing the free space to drop. Is there a way I can verify for certain?
- I don't think Postgres 13 has parallel replication from what I saw in some documentation. Is there a way to fake it?
- Is dropping the subscription, killing the replication slot on the publisher, truncating the tables on the subscriber, and creating a new subscription my only option here?
- Since I have at least 2TB free storage when the drop begins, and the replication is ~500GB, I would think that whatever the publisher has to send would fit in storage. Should I increase storage in hopes that it can overcome the problem? I don't have a sense as to how high it should be increased. Can it be scaled back down after the fact?
- Are there any suggestions as to keywords I should look for in the log files?
Any responses are greatly appreciated.
-
It looks like WAL disk usage is what uses most of the space. Check what is causing that.Laurenz Albe– Laurenz Albe2025年06月08日 02:45:19 +00:00Commented Jun 8 at 2:45
-
1838 GB, active: true, wal_status: extended Well, it looks like my slot is 1.8TB behind. I increased the server size by 4TB. As the primary is replicating, does it require exactly the same amount of bytes to unpack and send the data over on the subscriber?mj_– mj_2025年06月08日 23:10:01 +00:00Commented Jun 8 at 23:10
-
Hopefully not...Laurenz Albe– Laurenz Albe2025年06月10日 07:00:46 +00:00Commented Jun 10 at 7:00
1 Answer 1
I followed the answer by Laurenz Albe in the first comment off the original post.
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as behind,
active,
wal_status
FROM pg_replication_slots
ORDER BY restart_lsn;
This query was in there and it showed me how much the lag was in bytes. I had ~2TB of free storage and ~1.8TB of lag. Out of frustration, I jacked up free storage by 4TB and that allowed the server to recover.
Explore related questions
See similar questions with these tags.