For reasons I won't go into, I need to run some tests against sample terabyte-sized databases in Postgres. These are not your typical SQL-level benchmark tests; they are more like tests of how quickly I can back the database up, restore it, and so on. So while the test DBs should have some structure, it doesn't really matter what that structure is, as long as there is a minimum of complexity to it, like a few tables.
The requirement is that the tested databases are at least 10 TB in size, preferably 20 TB, where size means the disk space occupied by the database files (and yes, it should be the size after vacuuming etc.).
So I'm trying to use the pgbench tool to generate such sample databases. I started pgbench with the intention of creating a 15 TB DB like this:

pgbench -i -s 1600000 pgbench15t
That run generated the tuples in about 3 days, but pgbench has now been hanging on vacuuming for about 2 days. What's worse, there does not seem to be much activity from either the postmaster process or the disk (I checked that with iotop).
Note: the PG version is 11 (I'm not at liberty to choose the version) and I've used default settings. Can that (i.e. the default settings) be a problem?
OK, so I have two main questions:

- Is there some other quick way of generating such sample TB-sized databases that would not involve me just writing a DB generation script/program?
- Can pgbench be used in some specific way that would result in the successful generation of, say, a 15 TB DB?
1 Answer
It will take a long time to create a 15 TB database (I found this article useful for estimating the database size).
If you left the database at default settings, I am not surprised that it takes a long time. To speed up VACUUM, set max_parallel_maintenance_workers and maintenance_work_mem high (but it will still take a long time).
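A minimal sketch of what that could look like (the specific values here are my assumptions, not part of the answer; size them to your CPU count and RAM):

psql -d pgbench15t -c "ALTER SYSTEM SET max_parallel_maintenance_workers = 8;"   # assumed value
psql -d pgbench15t -c "ALTER SYSTEM SET maintenance_work_mem = '2GB';"           # assumed value
psql -d pgbench15t -c "SELECT pg_reload_conf();"                                 # both settings apply without a restart

Both parameters can also be raised per session with a plain SET before running VACUUM manually; ALTER SYSTEM needs superuser rights.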
- I'm confused - would it not be better to switch autovacuum off completely for the data loading? Also, set the tables to unlogged? Then, after initialisation, take a backup, then set the tables to logged and switch autovacuum back on? Then run a benchmark for x hours and then do a PITR using the first backup and the WAL? Would this not be a possibility for the OP? – Vérace, Feb 8, 2022 at 22:12
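A rough sketch of the workflow that comment proposes (the backup path is an assumed example, and switching autovacuum off cluster-wide is just one way to do it):

psql -c "ALTER SYSTEM SET autovacuum = off;"          # no autovacuum during the load
psql -c "SELECT pg_reload_conf();"
# ... run the pgbench initialisation / data load here ...
pg_basebackup -D /backups/pgbench15t-base -X stream   # base backup once loading is done
psql -c "ALTER SYSTEM SET autovacuum = on;"           # switch autovacuum back on
psql -c "SELECT pg_reload_conf();"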
- Also, separate disks for data and indexes? Maybe even separate tables? – Vérace, Feb 8, 2022 at 22:19
- @Verace Separating tables and indexes is an old myth that refuses to die. But I agree that turning off autovacuum while you load will help. As far as I know, turning an unlogged table into a logged one will write everything to WAL, so the savings are unclear. And running pgbench with unlogged tables will make the test even more unrealistic than pgbench is anyway. – Laurenz Albe, Feb 9, 2022 at 7:30
- Surely some sort of RAID is advisable? I found your answer here - so, when a table is changed to LOGGED, you're saying that WALs are written for the entire table on a block-by-block basis? But what if you'd just taken a backup? I also found dba.stackexchange.com/questions/195780/…. It appears that unless the initialisation can be performed in a single transaction, it's not possible! I'm not sure why - a full backup would mean that you'd have no problem with only new WALs? – Vérace, Feb 9, 2022 at 12:08
- If you could take an LVM snapshot or a ZFS one (on a cold system), you'd be able to apply new WALs without worrying about the "older" data, and it would be quicker - if the OP has the disk space? This appears to me to be a flaw in PostgreSQL - not being able to tell PostgreSQL to dispense with WAL? In any case, thanks for getting back to me and your input! – Vérace, Feb 9, 2022 at 12:08
- --no-vacuum?
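If that last comment is pointing at pgbench's initialization options, a sketch (assuming PostgreSQL 11's pgbench, which supports both forms):

pgbench -i --no-vacuum -s 1600000 pgbench15t   # skip the vacuum step after data generation
pgbench -i -I dtgp -s 1600000 pgbench15t       # same idea: pick the init steps explicitly (the default is dtgvp, where v = vacuum)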