For reasons I won't go into, I need to run some tests against sample terabyte-sized databases in Postgres. These are not your typical SQL-level benchmark tests; they are more like tests of how quickly I can back the database up, restore it, and so on. So while the test DBs should have some structure, it doesn't really matter what that structure is, as long as there is a minimum of complexity to it, like a few tables.
The requirement is that the tested databases are at least 10 TB in size, preferably 20 TB, where size means the disk space occupied by the database files (and yes, it should be the size after vacuuming etc.).
So I'm trying to use the pgbench tool to generate such sample databases. I started pgbench with the intention of creating a 15 TB DB like this:

pgbench -i -s 1600000 pgbench15t
That run generated the tuples in about 3 days, but pgbench has now been hanging on vacuuming for about 2 days. What's worse, there does not seem to be much activity from either the postmaster process or the disk (I checked that with iotop).
Note: the PG version is 11 (I'm not at liberty to choose the version) and I've used default settings. Can that (i.e. the default settings) be a problem?
OK, so I have two main questions:

- Is there some other quick way of generating such sample TB-sized databases that would not involve me just writing a DB generation script/program?
- Can pgbench be used in some specific way that would result in the successful generation of, say, a 15 TB DB?
1 Answer
It will take a long time to create a 15 TB database (I found this article useful for estimating the database size).
If you left the database at default settings, I am not surprised that it takes a long time. To speed up VACUUM, set max_parallel_maintenance_workers and maintenance_work_mem high (but it will still take a long time).
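A minimal sketch of what that could look like (the specific values here are my assumptions, not part of the answer; size them to your CPU count and RAM):

psql -d pgbench15t -c "ALTER SYSTEM SET max_parallel_maintenance_workers = 8;"   # assumed value
psql -d pgbench15t -c "ALTER SYSTEM SET maintenance_work_mem = '2GB';"           # assumed value
psql -d pgbench15t -c "SELECT pg_reload_conf();"                                 # both settings apply without a restart

Both parameters can also be raised per session with a plain SET before running VACUUM manually; ALTER SYSTEM needs superuser rights.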
- I'm confused - would it not be better to switch autovacuum off completely for the data loading? Also, set the tables to unlogged? Then, after initialisation, take a backup, then set the tables to logged and switch autovacuum back on? Then run a benchmark for x hours and then do a PITR using the first backup and the WAL? Would this not be a possibility for the OP? – Vérace, Feb 8, 2022 at 22:12
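A rough sketch of the workflow that comment proposes (the backup path is an assumed example, and switching autovacuum off cluster-wide is just one way to do it):

psql -c "ALTER SYSTEM SET autovacuum = off;"          # no autovacuum during the load
psql -c "SELECT pg_reload_conf();"
# ... run the pgbench initialisation / data load here ...
pg_basebackup -D /backups/pgbench15t-base -X stream   # base backup once loading is done
psql -c "ALTER SYSTEM SET autovacuum = on;"           # switch autovacuum back on
psql -c "SELECT pg_reload_conf();"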
- Also, separate disks for data and indexes? Maybe even separate tables? – Vérace, Feb 8, 2022 at 22:19
- @Verace Separating tables and indexes is an old myth that refuses to die. But I agree that turning off autovacuum while you load will help. As far as I know, turning an unlogged table into a logged one will write everything to WAL, so the savings are unclear. And running pgbench with unlogged tables will make the test even more unrealistic than pgbench is anyway. – Laurenz Albe, Feb 9, 2022 at 7:30
- Surely some sort of RAID is advisable? I found your answer here - so, when a table is changed to LOGGED, you're saying that WALs are written for the entire table on a block-by-block basis? But what if you'd just taken a backup? I also found dba.stackexchange.com/questions/195780/…. It appears that unless the initialisation can be performed in a single transaction, it's not possible! I'm not sure why - a full backup would mean that you'd have no problem with only new WALs? – Vérace, Feb 9, 2022 at 12:08
- If you could take an LVM snapshot or a ZFS one (on a cold system), you'd be able to apply new WALs without worrying about the "older" data, and it would be quicker - if the OP has the disk space? This appears to me to be a flaw in PostgreSQL - not being able to tell PostgreSQL to dispense with WAL? In any case, thanks for getting back to me and your input! – Vérace, Feb 9, 2022 at 12:08
- --no-vacuum?
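If that last comment is pointing at pgbench's initialization options, a sketch (assuming PostgreSQL 11's pgbench, which supports both forms):

pgbench -i --no-vacuum -s 1600000 pgbench15t   # skip the vacuum step after data generation
pgbench -i -I dtgp -s 1600000 pgbench15t       # same idea: pick the init steps explicitly (the default is dtgvp, where v = vacuum)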