The largest table in our DB (Postgres 9.5) has almost 2 billion rows, and its primary key is about to hit the 32-bit integer limit. I have begun converting the PK (id) from int to bigint as follows:
ALTER TABLE bigtable ADD COLUMN id_bigint BIGINT NULL;
ALTER TABLE bigtable ALTER COLUMN id_bigint DROP DEFAULT;
Then, the following script runs to copy all values from id to id_bigint in chunks:
from django.db import connection


def copy_bigtable_id(model, table_name, batch_count, start_index=0):
    latest_id = model.objects.latest('id').id
    counter = start_index
    if latest_id < counter:
        print("Latest id {} in {} is less than start index {}. Nothing to copy. Exiting.".format(
            latest_id, table_name, start_index))
        return
    print("Starting migration of 'id' to 'id_bigint' on {} from index {} to {}".format(
        table_name, start_index, latest_id))
    print("Copy in batches of {} rows".format(batch_count))
    while counter <= latest_id:
        # Each batch runs in its own transaction: lock the rows in the id range,
        # copy id into id_bigint, commit, then advance the window by batch_count.
        with connection.cursor() as cursor:
            cursor.execute('BEGIN;')
            select = "SELECT * FROM {} WHERE id >= %s AND id < %s FOR UPDATE".format(
                table_name
            )
            update = "UPDATE {} SET id_bigint=id WHERE id >= %s AND id < %s".format(
                table_name
            )
            cursor.execute(select, [counter, counter + batch_count])
            cursor.execute(update, [counter, counter + batch_count])
            cursor.execute('COMMIT;')
        counter = counter + batch_count
        # Re-read the max id so rows inserted while the script runs are covered too.
        latest_id = model.objects.latest('id').id
        print("copied {0:.2f}% of ids in {1}. counter={2}".format(
            min(100, 100.0 * counter / latest_id), table_name, counter)
        )
    print('ALL DONE! Last id in {} = {}'.format(table_name, latest_id))
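For clarity, each iteration of the loop above amounts to running the following SQL in its own transaction (the id bounds shown are just an example for one 1,000,000-row batch; the script parameterizes them):
BEGIN;
-- lock the rows in this id range, then copy id into id_bigint
SELECT * FROM bigtable WHERE id >= 1000000 AND id < 2000000 FOR UPDATE;
UPDATE bigtable SET id_bigint = id WHERE id >= 1000000 AND id < 2000000;
COMMIT;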
Following the script, id is replaced by id_bigint as follows:
ALTER TABLE bigtable DROP CONSTRAINT bigtable_pkey;
DROP SEQUENCE bigtable_id_seq CASCADE;
ALTER TABLE bigtable RENAME COLUMN id TO id_old;
ALTER TABLE bigtable RENAME COLUMN id_bigint TO id;
ALTER TABLE bigtable ADD PRIMARY KEY (id);
CREATE SEQUENCE bigtable_id_seq START WITH <latest_id>;
ALTER TABLE bigtable ALTER COLUMN id SET DEFAULT NEXTVAL('bigtable_id_seq');
ALTER TABLE bigtable ALTER COLUMN id_old DROP NOT NULL;
ALTER TABLE bigtable DROP COLUMN id_old;
This approach was chosen after failing to perform ALTER TABLE on the existing id column: because the table is so large, that ALTER TABLE holds an ACCESS EXCLUSIVE lock for a long time, blocking any access to the table. We wanted to minimize downtime.
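(As a sanity check, you can see the lock such an ALTER TABLE takes by running it in one session and inspecting pg_locks from another; this is a generic sketch, not something specific to our migration:)
-- locks currently held or awaited on bigtable; a full-table
-- ALTER TABLE ... ALTER COLUMN id TYPE bigint shows up as AccessExclusiveLock
SELECT l.pid, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE c.relname = 'bigtable';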
Our script has now been running for over 10 hours and has covered about half of the table. The script does not block other SELECT queries on the table, since it only takes row-level locks on the rows it is changing (plus a ROW EXCLUSIVE table lock, which does not conflict with reads).
However, after covering 50%+ of the rows, normal SELECT queries on the table started taking an extremely long time. It turns out the indexes on the table were no longer being used.
This is the SELECT query which broke:
SELECT "bigtable"."id", "bigtable"."latitude", "bigtable"."longitude",
"bigtable"."timestamp", "bigtable"."speed", "bigtable"."accuracy",
"bigtable"."trip_id", "bigtable"."step_id", "bigtable"."revised"
FROM "bigtable"
WHERE "bigtable"."step_id" = 5907395
ORDER BY "bigtable"."timestamp" ASC LIMIT 1;
This is the EXPLAIN result after the massive table updates (in the script above):
Limit  (cost=0.58..44711.92 rows=1 width=49)
  ->  Index Scan using bigtable_timestamp on bigtable  (cost=0.58..2826516601.39 rows=63217 width=49)
        Filter: (step_id = 5907395)
Notice that there is no Sort node: instead of using the step_id index, the planner walks the timestamp index (which already provides the ORDER BY ordering) and filters every row on step_id.
This is the EXPLAIN of the same query on my localhost replica of the database, which has a much smaller dataset but the exact same schema. This is the plan I'd expect, and it returns results in milliseconds:
Limit  (cost=157.96..157.97 rows=1 width=53)
  ->  Sort  (cost=157.96..158.30 rows=136 width=53)
        Sort Key: "timestamp"
        ->  Index Scan using bigtable_step_id on bigtable  (cost=0.43..157.28 rows=136 width=53)
              Index Cond: (step_id = 5907395)
bigtable has the following indexes:
CREATE UNIQUE INDEX bigtable_pkey ON bigtable(id int8_ops);
CREATE INDEX bigtable_step_id ON bigtable(step_id int4_ops);
CREATE INDEX bigtable_timestamp ON bigtable(timestamp int4_ops);
CREATE INDEX bigtable_trip_id ON bigtable(trip_id int4_ops);
The "broken" query plan in the first EXPLAIN
takes hours to complete. I dug a bit more and found this post which mentions that Postgres query planner sometimes gets confused with ORDER BY... LIMIT 1
. Is this the case I'm running into??
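A commonly suggested way to test that theory is to keep the planner from walking the timestamp index for the ORDER BY, for example by ordering on an expression. A sketch of what that would look like for our query (timestamp is an integer column here, per the int4_ops index); I have not run this against the live table:
SELECT "bigtable"."id", "bigtable"."latitude", "bigtable"."longitude",
       "bigtable"."timestamp", "bigtable"."speed", "bigtable"."accuracy",
       "bigtable"."trip_id", "bigtable"."step_id", "bigtable"."revised"
FROM "bigtable"
WHERE "bigtable"."step_id" = 5907395
ORDER BY "bigtable"."timestamp" + 0 ASC  -- "+ 0" prevents use of the bigtable_timestamp index for ordering
LIMIT 1;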
Rows are inserted regularly into bigtable with ever-increasing step_id and timestamp values. This query used to take a dozen milliseconds at worst before it quickly deteriorated to over 30 minutes; again, this happened after we began populating id_bigint. Could this be the result of index bloat?
Are the updates we are performing on that table screwing up the indexes, or the PostgreSQL planner's ability to use those indexes?
What happened is precisely what we wanted to avoid: downtime. The app is running but effectively not serving clients, because each query on that large table, which used to take milliseconds, now takes over an hour. I don't have EXPLAIN plans from before, but the queries took milliseconds and I'm pretty sure index scans were used. I tried to run EXPLAIN on the same query now but gave up after 15 minutes. Looking at the running queries on the live instance, similar queries had been "active" for longer than 30 minutes. So there has been a fundamental change in the way the query planner handles those queries: it is no longer using the indexes.
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup FROM pg_stat_user_tables
shows a huge number of dead tuples on this table (1331381938) and that no vacuum or autovacuum has ever completed on it. We have autovacuum enabled, and autovacuum is constantly running on this bigtable; at this point it has been running for over 20 hours and has not completed.
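For reference, this is the exact check, filtered to the table in question (n_live_tup and last_analyze added just for context):
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'bigtable';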
What is the fastest way to resolve this situation?
- Should I run VACUUM on this table? If so, should I combine it with ANALYZE to rebuild the query planner statistics?
- Should I run REINDEX on this table?

Given the sheer size of this table, is there a way to speed up VACUUM or ANALYZE?
1 Answer
Then, a script running in a loop copies the values from id to id_bigint using SELECT FOR UPDATE + UPDATE in chunks of 1 million rows.
Could you show us exactly how that was done? Details matter here. How fast are these chunks going by?
Are the updates we are performing on that table screwing up the indexes, or the PostgreSQL planner's ability to use those indexes?
It shouldn't "screw them up", other than by possibly bloating them, or clearing the all-visible flag on so many pages that an index-only-scan doesn't seem worthwhile anymore. Do you have before and after query plans you can show us? Did the before involve index-only-scans, or just regular index scans?
We have autovacuum enabled and autovacuum is constantly running on this bigtable.
Given the sheer size of this table, is there a way to speed up VACUUM or ANALYZE?
Autovacuum has a default level of IO throttling which makes it pretty slow. (This default was reduced by a factor of 10 between v11 and v12, so it is much faster on recent versions; on 9.5 you still have the old, slower default.) A manual vacuum, on the other hand, is unthrottled by default, so if you start a manual VACUUM it should run much faster than autovacuum does.
The autovac should automatically cancel itself after it detects the manual VACUUM is trying to run. There are situations where it won't, but they probably don't apply to you.
This might consume all of the IO capacity on your system, but if you are currently down anyway that should hardly matter.
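Something along these lines shows the throttling that is in effect and kicks off the unthrottled manual run (a sketch; adjust to taste):
-- cost-based throttling settings: autovacuum is delayed, manual VACUUM is not
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_cost_delay', 'autovacuum_vacuum_cost_limit',
               'vacuum_cost_delay', 'vacuum_cost_limit');

-- manual vacuum of just this table; vacuum_cost_delay defaults to 0, i.e. unthrottled
VACUUM VERBOSE bigtable;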
I would not expect ANALYZE to make much difference, as the only column whose distribution is changing is not yet being used. I guess the correlation values might be changing in important ways, but I doubt that can explain what you are seeing.
Should I run VACUUM FULL?
No. Regular manual vacuum should be enough.
Should I run REINDEX on this table?
I see no reason to think that would help more than trivially. And it will certainly take time and resources away from things that possibly could.
- Thank you for the swift answer. I've now edited my question to add some of the missing information. The script copied in chunks of 1000000 (1 million) rows and completed each chunk in about 1 minute. You mentioned I should run VACUUM, but isn't VACUUM intended to reclaim disk space? My problem is the slow queries / indexes not being used properly anymore. – 100grams, Feb 20, 2020 at 3:06
- Ordinary VACUUM makes space available for internal reuse, but generally does not reclaim disk space unless it happens to find a big chunk of absolutely free space at the very end of a table. But in the process it does other housekeeping that can keep the system running smoothly. A big one of those could be maintaining the visibility map, which promotes index usage in some cases. – jjanes, Feb 20, 2020 at 4:14
- Could you comment on the EXPLAIN result I posted in the question? This was taken after the changes, i.e. the query in that EXPLAIN takes over 30 minutes (and was cancelled). I am also now running VACUUM ANALYSE VERBOSE manually, which has been running for 5 hours now, and counting... Running while true; do cat /proc/23691/io | grep read_bytes; sleep 60; done on the VACUUM pid shows read_bytes at 2021039136768 (about 2 TB), which is more than our entire volume (1.1 TB). The table total_bytes is 718 GB. Does this mean VACUUM is making multiple passes over the same bytes? – 100grams, Feb 20, 2020 at 10:02
- VACUUM has to run over each index each time maintenance_work_mem fills up, and each dead tuple takes 6 bytes of maintenance_work_mem. The initial setting (if you have never changed it) of maintenance_work_mem is very low for large systems. You should set maintenance_work_mem to 1GB (assuming you have enough RAM to support that) on the session running the vacuum; a sketch of this follows the comments below. Unfortunately that would mean cancelling the vacuum and restarting it. Some but not all of the work already done will be lost by restarting it. – jjanes, Feb 20, 2020 at 16:00
- Thanks, I've changed maintenance_work_mem to 1GB now, but my VACUUM has already been running for 11 hours. How much of a speed increase could I gain by restarting the VACUUM now? I'd hate to lose the work done already. How do I know what will be retained if I cancel this VACUUM? Time is of the essence here, as my system is down and I'm aiming at completing this process asap. – 100grams, Feb 20, 2020 at 17:01
Further comments (truncated):
- "... EXPLAIN info. Trying VACUUM VERBOSE ANALYZE bigtable now..."
- "... AND id_bigint IS NULL should be a cheap safeguard."
- "... id_bigint has been completed now. I am stuck now for 11 hours and counting in VACUUM ANALYSE VERBOSE bigtable;. No idea if it will help me resolve this issue by fixing the visibility_map or not, and no idea when it will end. See my other question: dba.stackexchange.com/questions/260170/…"
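To make the two suggestions above concrete (the maintenance_work_mem setting from jjanes' comment and the id_bigint IS NULL safeguard from the truncated comments), here is a sketch using the table and column names from the question; the batch bounds and the 1GB value are illustrative only:
-- give the session running the vacuum more room for dead-tuple TIDs,
-- so it has to scan each index fewer times
SET maintenance_work_mem = '1GB';
VACUUM VERBOSE bigtable;

-- if the id -> id_bigint copy ever has to be re-run, adding
-- "AND id_bigint IS NULL" avoids rewriting rows that were already copied
UPDATE bigtable
SET id_bigint = id
WHERE id >= 1000000 AND id < 2000000  -- example batch bounds
  AND id_bigint IS NULL;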