1

I have a situation on my Postgres 13 db on AWS.

  • 8TB of storage
  • 60GB of memory that it isn't really using
  • I regularly check to see if any query is running and if so, kill it.
  • Transaction Log Disk Usage in blue, replication slot disk usage in orange. Transaction Log Disk Usage
  • Replication lag is over 500GB behind
  • FreeStorageSpace is sawtoothing FreeStorageSpace sawtooth
  • I combed the logs looking for "timeout", "error", on replication publisher and subscriber in hopes that the wal_sender_timeout and wal_receiver_timeout needed adjustment, however I see nothing.
  • I see the weird LSN behavior where it advances, the machine runs out of space, and then rewinds as if to start all over again.
-- 3851F/6305C2D0 7:46
-- 38521/3829A280 7:50
-- 38535/9FC44768 8:38
-- 38544/8E82F3D8 9:12
-- 3854F/8BD52F00 9:39pm
-- 38504/9EABBE48 6:29am <- rollover
-- 3851D/9413C9D0 6:11pm
  • The amount of data sent between the publisher and subscriber is actually very low when looking at the AWS network sent/received charts.
  • I actually had 2 subscribers at the beginning. I took the smaller one offline since it's easily repairable. That helped with the storage pressure, but only a tiny bit.

Questions:

  • I don't know for certain that it's the replication that's causing the free space to drop. Is there a way I can verify for certain?
  • I don't think Postgres 13 has parallel replication from what I saw in some documentation. Is there a way to fake it?
  • Is dropping the subscription, killing the replication slot on the publisher, truncating the tables on the subscriber, and creating a new subscription my only option here?
  • Since I have at least 2TB free storage when the drop begins, and the replication is ~500GB, I would think that whatever the publisher has to send would fit in storage. Should I increase storage in hopes that it can overcome the problem? I don't have a sense as to how high it should be increased. Can it be scaled back down after the fact?
  • Are there any suggestions as to keywords I should look for in the log files?

Any responses are greatly appreciated.

asked Jun 7 at 23:46
3
  • It looks like WAL disk usage is what uses most of the space. Check what is causing that. Commented Jun 8 at 2:45
  • 1838 GB, active: true, wal_status: extended Well, it looks like my slot is 1.8TB behind. I increased the server size by 4TB. As the primary is replicating, does it require exactly the same amount of bytes to unpack and send the data over on the subscriber? Commented Jun 8 at 23:10
  • Hopefully not... Commented Jun 10 at 7:00

1 Answer 1

1

I followed the answer by Laurenz Albe in the first comment off the original post.

SELECT slot_name,
 pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as behind,
 active,
 wal_status
FROM pg_replication_slots
ORDER BY restart_lsn;

This query was in there and it showed me how much the lag was in bytes. I had ~2TB of free storage and ~1.8TB of lag. Out of frustration, I jacked up free storage by 4TB and that allowed the server to recover.

answered Jun 9 at 14:39

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.