We have a PostgreSQL table with ~5 billion rows that has developed a nasty habit of missing the proper indices and doing a primary-key scan on certain LIMIT operations.
The problem generally manifests on an ORDER BY .. LIMIT .. clause (a common pattern in Django pagination) where the LIMIT is some relatively small subset of the results matched by the index. An extreme example is this:
SELECT * FROM mcqueen_base_imagemeta2
WHERE image_id IN ( 123, ... )
ORDER BY id DESC
LIMIT 1;
where the items in that IN clause number ~20 and the total number of rows matched by the index on image_id is 16.
The EXPLAIN shows that it misses the image_id index and instead does a PK scan of 5B rows:
Limit (cost=0.58..4632.03 rows=1 width=28)
 -> Index Scan Backward using mcqueen_base_imagemeta2_pkey on mcqueen_base_imagemeta2 (cost=0.58..364597074.75 rows=78722 width=28)
 Filter: (image_id = ANY ('{123, ...}'::bigint[]))
If the LIMIT is increased to 2, it works as expected:
Limit (cost=7585.92..7585.93 rows=2 width=28)
 -> Sort (cost=7585.92..7782.73 rows=78722 width=28)
 Sort Key: id DESC
 -> Index Scan using mcqueen_base_imagemeta2_image_id_616fe89c on mcqueen_base_imagemeta2 (cost=0.58..6798.70 rows=78722 width=28)
 Index Cond: (image_id = ANY ('{123, ...}'::bigint[]))
This also happens on queries where the index matches ~3,000 rows and the limit is set to 100, so it's something that easily happens in real-world REST API pagination.
The table definition is:
mcqueen=# \d mcqueen_base_imagemeta2
Table "public.mcqueen_base_imagemeta2"
Column | Type | Modifiers
-------------------+--------------------------+----------------------------------------------------------------------
id | bigint | not null default nextval('mcqueen_base_imagemeta2_id_seq'::regclass)
created_at | timestamp with time zone | not null
image_id | bigint | not null
key_id | smallint | not null
source_version_id | smallint | not null
Indexes:
"mcqueen_base_imagemeta2_pkey" PRIMARY KEY, btree (id)
"mcqueen_base_imagemeta2_image_id_616fe89c" btree (image_id)
"mcqueen_base_imagemeta2_key_id_a4854581" btree (key_id)
"mcqueen_base_imagemeta2_source_version_id_f9b0513e" btree (source_version_id)
Foreign-key constraints:
"mcqueen_base_imageme_image_id_616fe89c_fk_mcqueen_b" FOREIGN KEY (image_id) REFERENCES mcqueen_base_image(id) DEFERRABLE INITIALLY DEFERRED
"mcqueen_base_imageme_key_id_a4854581_fk_mcqueen_b" FOREIGN KEY (key_id) REFERENCES mcqueen_base_metakey(id) DEFERRABLE INITIALLY DEFERRED
"mcqueen_base_imageme_source_version_id_f9b0513e_fk_mcqueen_b" FOREIGN KEY (source_version_id) REFERENCES mcqueen_base_metasourceversion(id) DEFERRABLE INITIALLY DEFERRED
I'm a novice at best when it comes to tuning, but I figure that the defaults for statistics are not up to that table's size and so it naively thinks that a PK scan is faster than an index scan.
The planner thinks it is going to find 78,722 rows, but it really finds 16, so that is going to lead to some bad plans.
When a value in the IN list is not present in the MCV (most common values) list of the stats table, the planner guesses its frequency using the n_distinct value, which is probably way off (you didn't answer my question about that). The way it does this is to take the number of tuples not covered by the MCV frequency list and divide it by the number of distinct values not listed in the MCV list. So basically: ntuples * (1 - sum of MCV frequencies) / (n_distinct - length of MCV list). This simplified formula ignores NULLs.
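To see the inputs that feed this estimate for the column in question, you can inspect the standard pg_stats view (a quick check, not part of the fix):
SELECT n_distinct, null_frac, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'mcqueen_base_imagemeta2'
 AND attname = 'image_id';
-- For a value not in most_common_vals, the planner estimates roughly:
-- reltuples * (1 - sum(most_common_freqs)) / (n_distinct - number of MCV entries)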
As @ErwinBrandstetter suggests, you might be able to improve the situation by increasing the size of the MCV list through a larger statistics sample size. That might also increase the accuracy of the n_distinct estimate. But with billions of rows, it might not be possible to increase the sample size enough. Also, if image_id values are clumped together, with duplicate values likely to occur in the same page, then the sampling method used by PostgreSQL is quite biased when it comes to computing n_distinct, and this is resistant to fixing by just increasing the sample size.
A simpler way to fix this may be to set n_distinct manually:
alter table mcqueen_base_imagemeta2 alter column image_id set (n_distinct=1000000000);
analyze mcqueen_base_imagemeta2;
This method doesn't increase the time or storage required by ANALYZE, the way increasing the sample size does, and is also more likely to succeed.
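If you go this route, you can check what the next ANALYZE actually wrote into the statistics, and undo the override later if needed (the RESET below is the explicit reset mentioned in the comments):
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'mcqueen_base_imagemeta2'
 AND attname = 'image_id'; -- should now report the overridden value

-- to remove the override and go back to the sampled estimate:
ALTER TABLE mcqueen_base_imagemeta2 ALTER COLUMN image_id RESET (n_distinct);
ANALYZE mcqueen_base_imagemeta2;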
- Would you know on what occasions the system overwrites a manual setting of n_distinct? – Erwin Brandstetter, Sep 26, 2019
- @ErwinBrandstetter Nothing that I know of, other than an explicit RESET (or dropping the table and recreating it). Note that the manual setting is not used directly by the planner; it is just copied to overwrite the one in "pg_stats" on the next ANALYZE (and every ANALYZE thereafter). – jjanes, Sep 26, 2019
- Thanks, very useful! – Erwin Brandstetter, Sep 26, 2019
- The n_distinct is currently 2,540,250, which seems about right, but with a number that large and sparsely populated across the 6B rows, I tend to agree that I cannot get a sample size large enough to get meaningful coverage. I will manually increase n_distinct. – Arne Claassen, Sep 26, 2019
- The n_distinct ALTER followed by an ANALYZE did the trick. Thanks! – Arne Claassen, Sep 26, 2019
Why?
For a LIMIT 1, Postgres may estimate it to be faster to traverse the index supporting the ORDER BY and just keep filtering until the first row is found. This is fast as long as more than a few rows qualify and one of those pops up early according to the ORDER BY. But it is (very) slow if no qualifying row pops up early, and a worst-case scenario if no row ends up qualifying at all. Similar for any small LIMIT.
Postgres collects statistics about the most common values (MCV list), but not for the least common ones - for obvious reasons, that would be far too many to be useful. And it has no statistics for correlations between columns by default. (While that can be created manually it won't fit your use case anyway, as ID numbers are typically uncorrelated.)
So Postgres has to base its decision on generic estimates. It's very hard to identify the sweet spot where it should switch from one index to the other. This gets harder yet for a predicate like image_id IN (123, ...) with many items, most of which are typically rare, very rare, or even non-existent. But if you put enough numbers into the list, Postgres will eventually expect that traversing the other index will find the first hit faster.
Solutions?
You may be able to improve the situation somewhat with a larger statistics target:
ALTER TABLE mcqueen_base_imagemeta2 ALTER image_id SET STATISTICS 2000;
That (among other things) increases the size of the MCV list for the column and helps identify more of the (less) common values. But it's not a general solution to the problem, and it makes ANALYZE and query planning a bit more expensive.
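Note that the new target only takes effect once fresh statistics are gathered, so follow it with:
ANALYZE mcqueen_base_imagemeta2;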
Upgrading to the latest version (soon to be Postgres 12) also helps, as general performance has improved and the planner has gotten smarter.
There are various techniques for a workaround, depending on cardinalities, value frequencies, access patterns, ... Completely disabling the ORDER BY index like Laurenz demonstrated is one radical workaround - which can backfire for long lists or very common image_id values, where the ORDER BY index would, in fact, be much faster.
Workaround for your case
Should work well for the given numbers: 5 billion rows, around 20 image_id values in the filter list, small LIMIT. Most efficient for LIMIT 1 and a short list, but good for any small LIMIT and manageable list size:
SELECT m.*
FROM unnest('{123, ...}'::bigint[]) i(image_id)
CROSS JOIN LATERAL (
 SELECT m.*
 FROM mcqueen_base_imagemeta2 m
 WHERE m.image_id = i.image_id
 ORDER BY m.id DESC
 LIMIT 1 -- or N
 ) m
ORDER BY id DESC
LIMIT 1; -- or N
Provide your list as an array and unnest() it. Or use a VALUES expression.
It's essential to support this with a multicolumn index on (image_id, id DESC)!
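For example (the index name here is only a placeholder; on a table of this size you may want CONCURRENTLY to avoid blocking writes while it builds):
CREATE INDEX CONCURRENTLY mcqueen_base_imagemeta2_image_id_id_idx
 ON mcqueen_base_imagemeta2 (image_id, id DESC);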
You might then delete the existing index mcqueen_base_imagemeta2_image_id_616fe89c on just (image_id).
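If you do drop it (the two-column index can serve plain image_id lookups as well, since image_id is its leading column), that would be:
DROP INDEX mcqueen_base_imagemeta2_image_id_616fe89c;
-- or DROP INDEX CONCURRENTLY ... to avoid taking an exclusive lock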
This should result in one very fast index(-only) scan per image_id, and a final, (very) cheap sort step.
Fetching N rows for each image_id guarantees that we have all rows needed in the outer query. If you have meta-knowledge that fewer rows per single image_id can be in the result, you can decrease the nested LIMIT accordingly.
Aside
(a common pattern in Django pagination)
Pagination with LIMIT and OFFSET? OK for the first page, but after that it's just a bad idea.
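For completeness, a minimal keyset ("seek") pagination sketch, assuming the client remembers the id of the last row it saw on the previous page (the literal id value below is purely illustrative):
SELECT *
FROM mcqueen_base_imagemeta2
WHERE image_id IN ( 123, ... )
 AND id < 41000000 -- id of the last row on the previous page (illustrative)
ORDER BY id DESC
LIMIT 100;
Unlike OFFSET, the cost of this pattern does not grow with the page number.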
- Django actually does not use OFFSET (I am well aware of that evil from other frameworks as well). It over-fetches by 1, and each subsequent page uses a >= on that over-fetched id. Hence the requirement for an order. – Arne Claassen, Sep 26, 2019
- Over-fetching reveals whether there are more rows. But the next page would generally better be based on > the last id than on >= the over-fetched id. (Maybe it is?) Not to miss possibly inserted values in between. Anyway, if your queries are as displayed, consider the multicolumn index I suggested. Should help with your old query, too. – Erwin Brandstetter, Sep 26, 2019
The simple solution is to modify the ORDER BY condition so that the semantics are unchanged, but PostgreSQL cannot use the index any more:
SELECT * FROM mcqueen_base_imagemeta2
WHERE image_id IN ( 123, ... )
ORDER BY id + 0 DESC
LIMIT 1;
- That does indeed work. Unfortunately the queries are generated by the Django ORM, and only in certain scenarios would this really be the desirable behavior, so it's probably not a solution we can deploy. – Arne Claassen, Sep 25, 2019
- Any ORM worth its salt will let you write queries in SQL. How about creating a view AS SELECT id + 0 AS id, ...? (See the sketch after these comments.) – Laurenz Albe, Sep 25, 2019
- It's not that the ORM won't let me generate custom SQL, it's that the query is automatically generated as part of the REST API machinery. I can probably suppress that for the couple of places where this is applicable, but I would certainly prefer fixing it at the source rather than through a special case of tech debt. – Arne Claassen, Sep 25, 2019
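A sketch of the view Laurenz suggests above (the view name is a placeholder; the column list is taken from the table definition in the question):
CREATE VIEW mcqueen_base_imagemeta2_noidx AS
SELECT id + 0 AS id, -- the expression hides the plain id column, so ORDER BY id
 -- on this view cannot use the primary-key index
 created_at,
 image_id,
 key_id,
 source_version_id
FROM mcqueen_base_imagemeta2;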
Comment on the question: What does
select * from pg_stats where tablename = 'mcqueen_base_imagemeta2' and attname = 'image_id' \x\g\x
show? Extremely rare values (which these must be if 20 of them only find 16 rows) are often estimated poorly. And what is the actual number of distinct values of image_id in the table?
select count(*) from (select distinct image_id from mcqueen_base_imagemeta2) foo;