The largest table in our DB (Postgres 9.5) has almost 2 billion rows, and its primary key is about to hit the 32-bit integer limit. I have begun converting the PK (id) from int to bigint as follows:
ALTER TABLE bigtable ADD COLUMN id_bigint BIGINT NULL;
ALTER TABLE bigtable ALTER COLUMN id_bigint DROP DEFAULT;
Then, the following script runs to copy all values from id to id_bigint in chunks:
from django.db import connection


def copy_bigtable_id(model, table_name, batch_count, start_index=0):
    latest_id = model.objects.latest('id').id
    counter = start_index
    if latest_id < counter:
        print("Latest id {} in {} is less than start index {}. Nothing to copy. Exiting.".format(
            latest_id, table_name, start_index))
        return
    print("Starting migration of 'id' to 'id_bigint' on {} from index {} to {}".format(
        table_name, start_index, latest_id))
    print("Copy in batches of {} rows".format(batch_count))
    while counter <= latest_id:
        # Each batch runs in its own transaction: lock the rows in the id range,
        # copy id into id_bigint, commit, then advance the window by batch_count.
        with connection.cursor() as cursor:
            cursor.execute('BEGIN;')
            select = "SELECT * FROM {} WHERE id >= %s AND id < %s FOR UPDATE".format(
                table_name
            )
            update = "UPDATE {} SET id_bigint=id WHERE id >= %s AND id < %s".format(
                table_name
            )
            cursor.execute(select, [counter, counter + batch_count])
            cursor.execute(update, [counter, counter + batch_count])
            cursor.execute('COMMIT;')
        counter = counter + batch_count
        # Re-read the max id so rows inserted while the script runs are covered too.
        latest_id = model.objects.latest('id').id
        print("copied {0:.2f}% of ids in {1}. counter={2}".format(
            min(100, 100.0 * counter / latest_id), table_name, counter)
        )
    print('ALL DONE! Last id in {} = {}'.format(table_name, latest_id))
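For clarity, each iteration of the loop above amounts to running the following SQL in its own transaction (the id bounds shown are just an example for one 1,000,000-row batch; the script parameterizes them):
BEGIN;
-- lock the rows in this id range, then copy id into id_bigint
SELECT * FROM bigtable WHERE id >= 1000000 AND id < 2000000 FOR UPDATE;
UPDATE bigtable SET id_bigint = id WHERE id >= 1000000 AND id < 2000000;
COMMIT;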
Following the script, id is replaced by id_bigint as follows:
ALTER TABLE bigtable DROP CONSTRAINT bigtable_pkey;
DROP SEQUENCE bigtable_id_seq CASCADE;
ALTER TABLE bigtable RENAME COLUMN id TO id_old;
ALTER TABLE bigtable RENAME COLUMN id_bigint TO id;
ALTER TABLE bigtable ADD PRIMARY KEY (id);
CREATE SEQUENCE bigtable_id_seq START WITH <latest_id>;
ALTER TABLE bigtable ALTER COLUMN id SET DEFAULT NEXTVAL('bigtable_id_seq');
ALTER TABLE bigtable ALTER COLUMN id_old DROP NOT NULL;
ALTER TABLE bigtable DROP COLUMN id_old;
This approach was chosen after failing to perform ALTER TABLE on the existing id column: because the table is so large, that ALTER TABLE holds an ACCESS EXCLUSIVE lock for a long time, blocking any access to the table. We wanted to minimize downtime.
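(As a sanity check, you can see the lock such an ALTER TABLE takes by running it in one session and inspecting pg_locks from another; this is a generic sketch, not something specific to our migration:)
-- locks currently held or awaited on bigtable; a full-table
-- ALTER TABLE ... ALTER COLUMN id TYPE bigint shows up as AccessExclusiveLock
SELECT l.pid, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE c.relname = 'bigtable';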
Our script has now been running for over 10 hours and has covered about half of the table. The script does not block other SELECT queries on the table, since it only takes row-level locks on the rows it is changing (plus a ROW EXCLUSIVE table lock, which does not conflict with reads).
However, after covering 50%+ of the rows, normal SELECT queries on the table started taking an extremely long time. It turns out the indexes on the table were no longer being used.
This is the SELECT query which broke:
SELECT "bigtable"."id", "bigtable"."latitude", "bigtable"."longitude",
"bigtable"."timestamp", "bigtable"."speed", "bigtable"."accuracy",
"bigtable"."trip_id", "bigtable"."step_id", "bigtable"."revised"
FROM "bigtable"
WHERE "bigtable"."step_id" = 5907395
ORDER BY "bigtable"."timestamp" ASC LIMIT 1;
This is the EXPLAIN result after the massive table updates (in the script above):
Limit  (cost=0.58..44711.92 rows=1 width=49)
  ->  Index Scan using bigtable_timestamp on bigtable  (cost=0.58..2826516601.39 rows=63217 width=49)
        Filter: (step_id = 5907395)
Notice that there is no Sort node: instead of using the step_id index, the planner walks the timestamp index (which already provides the ORDER BY ordering) and filters every row on step_id.
This is the EXPLAIN of the same query on my localhost replica of the database, which has a much smaller dataset but the exact same schema. This is the plan I'd expect, and it returns results in milliseconds:
Limit  (cost=157.96..157.97 rows=1 width=53)
  ->  Sort  (cost=157.96..158.30 rows=136 width=53)
        Sort Key: "timestamp"
        ->  Index Scan using bigtable_step_id on bigtable  (cost=0.43..157.28 rows=136 width=53)
              Index Cond: (step_id = 5907395)
bigtable has the following indexes:
CREATE UNIQUE INDEX bigtable_pkey ON bigtable(id int8_ops);
CREATE INDEX bigtable_step_id ON bigtable(step_id int4_ops);
CREATE INDEX bigtable_timestamp ON bigtable(timestamp int4_ops);
CREATE INDEX bigtable_trip_id ON bigtable(trip_id int4_ops);
The "broken" query plan in the first EXPLAIN
takes hours to complete. I dug a bit more and found this post which mentions that Postgres query planner sometimes gets confused with ORDER BY... LIMIT 1
. Is this the case I'm running into??
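A commonly suggested way to test that theory is to keep the planner from walking the timestamp index for the ORDER BY, for example by ordering on an expression. A sketch of what that would look like for our query (timestamp is an integer column here, per the int4_ops index); I have not run this against the live table:
SELECT "bigtable"."id", "bigtable"."latitude", "bigtable"."longitude",
       "bigtable"."timestamp", "bigtable"."speed", "bigtable"."accuracy",
       "bigtable"."trip_id", "bigtable"."step_id", "bigtable"."revised"
FROM "bigtable"
WHERE "bigtable"."step_id" = 5907395
ORDER BY "bigtable"."timestamp" + 0 ASC  -- "+ 0" prevents use of the bigtable_timestamp index for ordering
LIMIT 1;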
Rows are inserted regularly into bigtable with ever-increasing step_id and timestamp values. This query used to take a dozen milliseconds at worst before it quickly deteriorated to over 30 minutes; again, this happened after we began populating id_bigint. Could this be the result of index bloat?
Are the updates we are performing on that table screwing up the indexes, or the PostgreSQL planner's ability to use those indexes?
What happened is precisely what we wanted to avoid: downtime. The app is running but effectively not serving clients, because each query on that large table, which used to take milliseconds, now takes over an hour. I don't have EXPLAIN plans from before, but the queries took milliseconds and I'm pretty sure index scans were used. I tried to run EXPLAIN on the same query now but gave up after 15 minutes. Looking at the running queries on the live instance, similar queries had been "active" for longer than 30 minutes. So there has been a fundamental change in the way the query planner handles those queries: it is no longer using the indexes.
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup FROM pg_stat_user_tables
shows a huge number of dead tuples on this table (1331381938) and that no vacuum or autovacuum has ever completed on it. We have autovacuum enabled, and autovacuum is constantly running on this bigtable; at this point it has been running for over 20 hours and has not completed.
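For reference, this is the exact check, filtered to the table in question (n_live_tup and last_analyze added just for context):
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'bigtable';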
What is the fastest way to resolve this situation?
- Should I run VACUUM on this table? If so, should I combine it with ANALYZE to rebuild the query planner statistics?
- Should I run REINDEX on this table?

Given the sheer size of this table, is there a way to speed up VACUUM or ANALYZE?
1 Answer
Then, a script running in a loop copies the values from id to id_bigint using SELECT FOR UPDATE + UPDATE in chunks of 1 million rows.
Could you show us exactly how that was done? Details matter here. How fast are these chunks going by?
Are the updates we are performing on that table screwing up the indexes, or the PostgreSQL planner's ability to use those indexes?
It shouldn't "screw them up", other than by possibly bloating them, or clearing the all-visible flag on so many pages that an index-only-scan doesn't seem worthwhile anymore. Do you have before and after query plans you can show us? Did the before involve index-only-scans, or just regular index scans?
We have autovacuum enabled and autovacuum is constantly running on this bigtable.
Given the sheer size of this table, is there a way to speed up VACUUM or ANALYZE?
Autovacuum has a default level of IO throttling which makes it pretty slow. (This default was reduced by a factor of 10 between v11 and v12, so it is much faster on recent versions; on 9.5 you still have the old, slower default.) A manual vacuum, on the other hand, is unthrottled by default, so if you start a manual VACUUM it should run much faster than autovacuum does.
The autovac should automatically cancel itself after it detects the manual VACUUM is trying to run. There are situations where it won't, but they probably don't apply to you.
This might consume all of the IO capacity on your system, but if you are currently down anyway that should hardly matter.
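Something along these lines shows the throttling that is in effect and kicks off the unthrottled manual run (a sketch; adjust to taste):
-- cost-based throttling settings: autovacuum is delayed, manual VACUUM is not
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_cost_delay', 'autovacuum_vacuum_cost_limit',
               'vacuum_cost_delay', 'vacuum_cost_limit');

-- manual vacuum of just this table; vacuum_cost_delay defaults to 0, i.e. unthrottled
VACUUM VERBOSE bigtable;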
I would not expect ANALYZE to make much difference, as the only column whose distribution is changing is not yet being used. I guess the correlation values might be changing in important ways, but I doubt that can explain what you are seeing.
Should I run VACUUM FULL?
No. Regular manual vacuum should be enough.
Should I run REINDEX on this table?
I see no reason to think that would help more than trivially. And it will certainly take time and resources away from things that possibly could.
- Thank you for the swift answer. I've now edited my question to add some of the missing information. The script copied in chunks of 1000000 (1 million) rows and completed each chunk in about 1 minute. You mentioned I should run VACUUM, but isn't VACUUM intended to reclaim disk space? My problem is the slow queries / indexes not being used properly anymore. – 100grams, Feb 20, 2020 at 3:06
- Ordinary VACUUM makes space available for internal reuse, but generally does not reclaim disk space unless it happens to find a big chunk of absolutely free space at the very end of a table. But in the process it does other housekeeping that can keep the system running smoothly. A big one of those could be maintaining the visibility map, which promotes index usage in some cases. – jjanes, Feb 20, 2020 at 4:14
- Could you comment on the EXPLAIN result I posted in the question? This was taken after the changes, i.e. the query in that EXPLAIN takes over 30 minutes (and was cancelled). I am also now running VACUUM ANALYSE VERBOSE manually, which has been running for 5 hours now, and counting... Running while true; do cat /proc/23691/io | grep read_bytes; sleep 60; done on the VACUUM pid shows read_bytes at 2021039136768 (about 2 TB), which is more than our entire volume (1.1 TB). The table total_bytes is 718 GB. Does this mean VACUUM is making multiple passes over the same bytes? – 100grams, Feb 20, 2020 at 10:02
- VACUUM has to run over each index each time maintenance_work_mem fills up, and each dead tuple takes 6 bytes of maintenance_work_mem. The initial setting (if you have never changed it) of maintenance_work_mem is very low for large systems. You should set maintenance_work_mem to 1GB (assuming you have enough RAM to support that) on the session running the vacuum; a sketch of this follows the comments below. Unfortunately that would mean cancelling the vacuum and restarting it. Some but not all of the work already done will be lost by restarting it. – jjanes, Feb 20, 2020 at 16:00
- Thanks, I've changed maintenance_work_mem to 1GB now, but my VACUUM has already been running for 11 hours. How much of a speed increase could I gain by restarting the VACUUM now? I'd hate to lose the work done already. How do I know what will be retained if I cancel this VACUUM? Time is of the essence here, as my system is down and I'm aiming at completing this process asap. – 100grams, Feb 20, 2020 at 17:01
Further comments (truncated):
- "... EXPLAIN info. Trying VACUUM VERBOSE ANALYZE bigtable now..."
- "... AND id_bigint IS NULL should be a cheap safeguard."
- "... id_bigint has been completed now. I am stuck now for 11 hours and counting in VACUUM ANALYSE VERBOSE bigtable;. No idea if it will help me resolve this issue by fixing the visibility_map or not, and no idea when it will end. See my other question: dba.stackexchange.com/questions/260170/…"
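To make the two suggestions above concrete (the maintenance_work_mem setting from jjanes' comment and the id_bigint IS NULL safeguard from the truncated comments), here is a sketch using the table and column names from the question; the batch bounds and the 1GB value are illustrative only:
-- give the session running the vacuum more room for dead-tuple TIDs,
-- so it has to scan each index fewer times
SET maintenance_work_mem = '1GB';
VACUUM VERBOSE bigtable;

-- if the id -> id_bigint copy ever has to be re-run, adding
-- "AND id_bigint IS NULL" avoids rewriting rows that were already copied
UPDATE bigtable
SET id_bigint = id
WHERE id >= 1000000 AND id < 2000000  -- example batch bounds
  AND id_bigint IS NULL;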