I've got an older DB (Postgres 10.15) that hasn't yet been upgraded. One problematic table had a few large indexes on it, some of which were corrupt and needed reindexing. Since it's not on version 12+, I can't concurrently reindex the table, which means I need to do it non-concurrently, and that requires a table write lock. So I wanted to know how I could do some rough calculations on how long the reindex would take, so I can plan some maintenance. Most of my research ends up at "just use pg_stat_progress_create_index!" (which isn't available in 10), or at people simply saying to use CONCURRENTLY.
The table is ~200GB, and there are 7 indexes of ~14GB each (as per pg_relation_size). I can get a constant ~900MB/s read rate on the DB for this task. Is there a simple metric I can use to determine how much data will be required to be read to reindex fully?
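For reference, the sizes above come from queries along these lines (tablename is a placeholder):

-- heap size of the table itself
select pg_size_pretty(pg_relation_size('tablename'));
-- size of each index on the table
select indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
from pg_stat_user_indexes
where relname = 'tablename';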
2 Answers
You could just create a new index with a different name:
create index concurrently index_new on ...
Then drop the corrupted index:
drop index concurrently index_old;
Then rename the new index to the old name:
alter index index_new rename to index_old;
The last step requires a lock, but only for a few milliseconds after the lock is acquired, so you do not need downtime due to a write lock.
The definition of the index can be obtained with pg_dump -s -t tablename --no-acl.
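Putting the steps together (a sketch; tablename, some_column and the index names are placeholders to be replaced with the real definition from pg_dump):

-- build the replacement first; cannot run inside a transaction block
create index concurrently index_new on tablename (some_column);
-- drop the corrupt index once the new one is marked valid
drop index concurrently index_old;
-- brief exclusive lock, held only for milliseconds
alter index index_new rename to index_old;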
This is exactly the same procedure that reindex concurrently performs under the hood, but reindex concurrently is a bit cheaper since it does not need a lock for the index rename phase.
The widely known pg_repack also has a feature to reindex a table via the option --only-indexes. That option is implemented as create + drop index concurrently.
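For example (a sketch, assuming the pg_repack extension is installed in the target database; tablename and dbname are placeholders):

pg_repack --table=tablename --only-indexes dbname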
Is there a simple metric I can use to determine how much data will be required to be read to reindex fully?
Well, any index creation without concurrently will read the entire table sequentially (concurrently will read the table twice). Everything else depends on the access method. B-tree will sort all live tuples; this is the most time-consuming part of create index, and for large indexes the work will be done in temporary files (remember to increase maintenance_work_mem). This part also depends on datatypes and values: text with low selectivity (e.g. some status field) will be noticeably slower to build than integer sequences.
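A minimal sketch of that session-level tuning; the value is illustrative and should be sized to the RAM you can spare:

-- larger values let more of the b-tree sort happen in memory
-- instead of spilling to temporary files
set maintenance_work_mem = '2GB';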
I have no way to estimate except one: measure the creation time of an index on some data sample:
-- copy a sample of recent rows into a scratch table
create table estimate_table as (
  select * from tablename
  where created_at > '2020-01-01'
);
-- check the size of the sample
\dt+ estimate_table
-- time the index build on the sample
\timing on
create index on estimate_table ...
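To scale the measured time up, compare the sample's size against the full table (a rough approach; note the sort phase is O(n log n), so expect the full build to run somewhat slower than a purely linear scale-up suggests):

select pg_size_pretty(pg_relation_size('estimate_table')) as sample_size,
       pg_size_pretty(pg_relation_size('tablename'))      as full_size;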
Reindex is just a special form of index creation. Hmm, and an important point: reindex table is no different from running several reindex index commands in terms of resource usage; reindex table is implemented by calling reindex_index for each individual index on the table. So a table with 5 indexes will be scanned 5 times.
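Applying that to the numbers in the question gives a rough lower bound on read volume alone: 7 indexes × one ~200 GB table scan each ≈ 1.4 TB of sequential reads. At a constant ~900 MB/s that is roughly 1,400,000 MB ÷ 900 MB/s ≈ 1,600 seconds, i.e. about 26 minutes of reading, before any sorting in temporary files and index writing (~7 × 14 GB ≈ 98 GB of new index data) is accounted for.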
Absolutely think this would get around the issue, but I'm still interested in the question I've asked, since I have some even older legacy machines which are <= version 8.1 (i.e. before CONCURRENTLY was added to CREATE INDEX), and I'd still like a way to get a guesstimate on how long a CREATE INDEX is going to take anyway. – Noxville, Dec 2, 2020 at 19:00
The only reliable estimate of how long it will take can come from restoring a physical backup to an identical machine and testing it there.
There are too many factors going into this to come up with a good estimate otherwise.
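If you do restore such a clone, you can time the worst-case operation there directly (a sketch; the index name is a placeholder):

\timing on
reindex index some_corrupt_index;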
CREATE a TABLE by SELECTing 1 in 10 of your records and do a test? We don't know your CPU, RAM and especially your disk config (HDD/SSD; with/without RAID; if with, then which RAID? 0? 1? 5? 0+1? 1+0?). What else will be going on while you reindex? It's impossible to say with the information you've given!

Reindex index will do 1 full table scan and 0 index scans, and will write all live tuples into the new index.