We have a database with a single table of around 700 million rows. We update the database by adding new entries on a build server, then transfer it to the production server using pg_dump:
pg_dump -c database > /tmp/database_gen
(By the way, we use Postgres 8.4.) We import the dump on the production server using psql. The pg_dump-generated file contains the instructions to create and fill the table.
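For reference, the import step described above presumably amounts to replaying the dump file with psql; the exact invocation is an assumption, only the dump command appears earlier:

```shell
# replay the plain-SQL dump on the production server;
# the CREATE INDEX statements near the end of the file are
# where the temporary sort space gets consumed
psql database < /tmp/database_gen
```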
The problem is the index creation. Postgres fills the table, then spends days creating the index. That was acceptable until Postgres could no longer create the index at all: it runs out of disk space, because it uses a lot of space for the temporary sort files during index creation. Normally the database takes around 200 GB, but during index creation the used disk space climbs to 600 GB, then drops back to 200 GB once the index is built.
My question is: can we create the index in several steps, e.g. build the index for half the table, then add the rest of the rows and update the index?
Has anyone run into the same issue?
Thanks
-
With HDDs as cheap as never before, I think it's cheaper to just add a few disks and not bother. – DrColossos, Jul 29, 2011
2 Answers
If you create the index before loading the table, the time taken to load the data will be significantly increased.
pre load:
create table my_table1(val integer);
create index my_index1 on my_table1(val);
insert into my_table1(val) select generate_series(1,100000) order by random();
Time: 31755.858 ms
post load:
create table my_table2(val integer);
insert into my_table2(val) select generate_series(1,100000) order by random();
Time: 15344.130 ms
create index my_index2 on my_table2(val);
Time: 4073.686 ms
If you are OK with that, with pg_restore you can:
- load just the schema using --schema-only
- create the index with --index
- load the data using --data-only
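A sketch of those pg_restore invocations, assuming the dump is made in custom format with pg_dump -Fc (pg_restore cannot read a plain SQL dump); file and index names here are illustrative:

```shell
# custom-format dump that pg_restore can read selectively
pg_dump -Fc database > /tmp/database.dump

# 1. restore the schema (note: this includes index definitions by default)
pg_restore --schema-only -d database /tmp/database.dump

# 2. alternatively, restore just one named index definition
pg_restore --index=my_index -d database /tmp/database.dump

# 3. then load the rows
pg_restore --data-only -d database /tmp/database.dump
```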
Of course "Buy more storage" may well be the best answer here...
-
Conceptually, this answer is a lot more concise and straightforward. +1! – RolandoMySQLDBA, Jul 29, 2011
-
This doesn't apply to his case, because he is using pg_dump, which already creates the indexes after loading the data. It says so in the question. – Peter Eisentraut, Jul 30, 2011
-
@Peter - thanks, I've updated the steps to show how to create the index before loading the data. – Jack Douglas, Jul 30, 2011
-
I know you can specify a tablespace in CREATE INDEX. But can you tell PostgreSQL where to put temporary indexing data (or whatever it's doing to consume all that disk space)? – Mike Sherrill 'Cat Recall', Jul 30, 2011
-
@Catcall - Not tried myself, but possibly – Jack Douglas, Jul 30, 2011
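For what it's worth, PostgreSQL does have a temp_tablespaces setting (available since 8.3) that directs temporary sort files to a given tablespace; whether it captures all the space used during an index build is something to verify. A sketch, with a hypothetical tablespace name and path:

```sql
-- create a tablespace on a volume with more free space
-- (the path is hypothetical)
CREATE TABLESPACE bigtemp LOCATION '/mnt/bigdisk/pgtemp';

-- direct this session's temporary files to that tablespace
SET temp_tablespaces = 'bigtemp';

-- the sort spill during this build should now land on the big volume
CREATE INDEX my_index1 ON my_table1 (val);
```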
Basically, the only tuning knob is to increase maintenance_work_mem so that more data is kept in memory rather than on disk. Try it out and see whether it makes a difference.
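A minimal sketch of trying that for a single index build; the 1GB value is illustrative and should be sized to the server's RAM:

```sql
-- raise the sort memory for this session only (default is far lower)
SET maintenance_work_mem = '1GB';

-- the index build can now keep more of the sort in memory
CREATE INDEX my_index1 ON my_table1 (val);

-- restore the configured default
RESET maintenance_work_mem;
```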
-
Like temp_buffers, maintenance_work_mem may improve performance for the index creation but it won't decrease the space needed - that 400GB of temp space is all going to have to be found on disk at some point during the operation, because it is very likely to dwarf the memory on the server, don't you think? – Jack Douglas, Jul 30, 2011
-
The algorithms for sorting on disk and in memory are not the same, so they will use different amounts of memory. But it's actually more likely that the in-memory sort uses more memory. Still, the numbers that have been quoted are quite bizarre, so it's worth a try. – Peter Eisentraut, Aug 1, 2011