We have a database with a single table of around 700 million rows. We update the database by adding new entries on a build server, then transfer it to the production server using pg_dump:
pg_dump -c database > /tmp/database_gen
(By the way, we use Postgres 8.4.) We import the dump on the production server using psql. The pg_dump-generated file contains the instructions to create and fill the table.
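For reference, the import step described above presumably amounts to replaying the dump file with psql; the exact invocation is an assumption, only the dump command appears earlier:

```shell
# replay the plain-SQL dump on the production server;
# the CREATE INDEX statements near the end of the file are
# where the temporary sort space gets consumed
psql database < /tmp/database_gen
```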
The problem is the index creation. Postgres fills the table, then spends days creating the index. That was acceptable until Postgres could no longer create the index at all: it runs out of disk space, because it uses a lot of space for the temporary sort files during index creation. Normally the database takes around 200 GB, but during index creation the used disk space climbs to 600 GB, then drops back to 200 GB once the index is built.
My question is: can we create the index in several steps, e.g. build the index for half the table, then add the rest of the rows and update the index?
Has anyone run into the same issue?
Thanks
-
With HDDs as cheap as never before, I think it's cheaper to just add a few disks and not bother. – DrColossos, Jul 29, 2011
2 Answers
If you create the index before loading the table, the time taken to load the data will be significantly increased.
pre load:
create table my_table1(val integer);
create index my_index1 on my_table1(val);
insert into my_table1(val) select generate_series(1,100000) order by random();
Time: 31755.858 ms
post load:
create table my_table2(val integer);
insert into my_table2(val) select generate_series(1,100000) order by random();
Time: 15344.130 ms
create index my_index2 on my_table2(val);
Time: 4073.686 ms
If you are OK with that, with pg_restore you can:
- load just the schema using --schema-only
- create the index with --index
- load the data using --data-only
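A sketch of those pg_restore invocations, assuming the dump is made in custom format with pg_dump -Fc (pg_restore cannot read a plain SQL dump); file and index names here are illustrative:

```shell
# custom-format dump that pg_restore can read selectively
pg_dump -Fc database > /tmp/database.dump

# 1. restore the schema (note: this includes index definitions by default)
pg_restore --schema-only -d database /tmp/database.dump

# 2. alternatively, restore just one named index definition
pg_restore --index=my_index -d database /tmp/database.dump

# 3. then load the rows
pg_restore --data-only -d database /tmp/database.dump
```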
Of course "Buy more storage" may well be the best answer here...
-
Conceptually, this answer is a lot more concise and straightforward. +1! – RolandoMySQLDBA, Jul 29, 2011
-
This doesn't apply to his case, because he is using pg_dump, which already creates the indexes after loading the data. It says so in the question. – Peter Eisentraut, Jul 30, 2011
-
@Peter - thanks, I've updated the steps to show how to create the index before loading the data. – Jack Douglas, Jul 30, 2011
-
I know you can specify a tablespace in CREATE INDEX. But can you tell PostgreSQL where to put temporary indexing data (or whatever it's doing to consume all that disk space)? – Mike Sherrill 'Cat Recall', Jul 30, 2011
-
@Catcall - Not tried myself, but possibly – Jack Douglas, Jul 30, 2011
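For what it's worth, PostgreSQL does have a temp_tablespaces setting (available since 8.3) that directs temporary sort files to a given tablespace; whether it captures all the space used during an index build is something to verify. A sketch, with a hypothetical tablespace name and path:

```sql
-- create a tablespace on a volume with more free space
-- (the path is hypothetical)
CREATE TABLESPACE bigtemp LOCATION '/mnt/bigdisk/pgtemp';

-- direct this session's temporary files to that tablespace
SET temp_tablespaces = 'bigtemp';

-- the sort spill during this build should now land on the big volume
CREATE INDEX my_index1 ON my_table1 (val);
```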
Basically, the only tuning knob is to increase maintenance_work_mem so that more data is kept in memory rather than on disk. Try it out and see whether it makes a difference.
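A minimal sketch of trying that for a single index build; the 1GB value is illustrative and should be sized to the server's RAM:

```sql
-- raise the sort memory for this session only (default is far lower)
SET maintenance_work_mem = '1GB';

-- the index build can now keep more of the sort in memory
CREATE INDEX my_index1 ON my_table1 (val);

-- restore the configured default
RESET maintenance_work_mem;
```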
-
Like temp_buffers, maintenance_work_mem may improve performance for the index creation but it won't decrease the space needed - that 400GB of temp space is all going to have to be found on disk at some point during the operation, because it is very likely to dwarf the memory on the server, don't you think? – Jack Douglas, Jul 30, 2011
-
The algorithms for sorting on disk and in memory are not the same, so they will use different amounts of memory. But it's actually more likely that the in-memory sort uses more memory. Still, the numbers that have been quoted are quite bizarre, so it's worth a try. – Peter Eisentraut, Aug 1, 2011