We have a PostgreSQL table with ~5 billion rows that has developed a nasty habit of missing the proper indices and doing a primary-key scan on certain LIMIT operations.
The problem generally manifests on an ORDER BY .. LIMIT .. clause (a common pattern in Django pagination) where the LIMIT is some relatively small subset of the results matched by the index. An extreme example is this:
SELECT * FROM mcqueen_base_imagemeta2
WHERE image_id IN ( 123, ... )
ORDER BY id DESC
LIMIT 1;
where the items in that IN clause number ~20 and the total number of rows matched by the index on image_id is 16.
The EXPLAIN shows that it misses the image_id index and instead does a PK scan of 5B rows:
Limit (cost=0.58..4632.03 rows=1 width=28)
 -> Index Scan Backward using mcqueen_base_imagemeta2_pkey on mcqueen_base_imagemeta2 (cost=0.58..364597074.75 rows=78722 width=28)
 Filter: (image_id = ANY ('{123, ...}'::bigint[]))
If the LIMIT is increased to 2, it works as expected:
Limit (cost=7585.92..7585.93 rows=2 width=28)
 -> Sort (cost=7585.92..7782.73 rows=78722 width=28)
 Sort Key: id DESC
 -> Index Scan using mcqueen_base_imagemeta2_image_id_616fe89c on mcqueen_base_imagemeta2 (cost=0.58..6798.70 rows=78722 width=28)
 Index Cond: (image_id = ANY ('{123, ...}'::bigint[]))
This also happens on queries where the index matches ~3,000 rows and the limit is set to 100, so it's something that easily happens in real-world REST API pagination.
The table definition is:
mcqueen=# \d mcqueen_base_imagemeta2
Table "public.mcqueen_base_imagemeta2"
Column | Type | Modifiers
-------------------+--------------------------+----------------------------------------------------------------------
id | bigint | not null default nextval('mcqueen_base_imagemeta2_id_seq'::regclass)
created_at | timestamp with time zone | not null
image_id | bigint | not null
key_id | smallint | not null
source_version_id | smallint | not null
Indexes:
"mcqueen_base_imagemeta2_pkey" PRIMARY KEY, btree (id)
"mcqueen_base_imagemeta2_image_id_616fe89c" btree (image_id)
"mcqueen_base_imagemeta2_key_id_a4854581" btree (key_id)
"mcqueen_base_imagemeta2_source_version_id_f9b0513e" btree (source_version_id)
Foreign-key constraints:
"mcqueen_base_imageme_image_id_616fe89c_fk_mcqueen_b" FOREIGN KEY (image_id) REFERENCES mcqueen_base_image(id) DEFERRABLE INITIALLY DEFERRED
"mcqueen_base_imageme_key_id_a4854581_fk_mcqueen_b" FOREIGN KEY (key_id) REFERENCES mcqueen_base_metakey(id) DEFERRABLE INITIALLY DEFERRED
"mcqueen_base_imageme_source_version_id_f9b0513e_fk_mcqueen_b" FOREIGN KEY (source_version_id) REFERENCES mcqueen_base_metasourceversion(id) DEFERRABLE INITIALLY DEFERRED
I'm a novice at best when it comes to tuning, but I figure that the defaults for statistics are not up to that table's size and so it naively thinks that a PK scan is faster than an index scan.
The planner thinks it is going to find 78,722 rows, but it really finds 16, so that is going to lead to some bad plans.
When a value in the IN list is not present in the MCV (most common values) list of the stats table, the planner guesses its frequency using the n_distinct value, which is probably way off (you didn't answer my question about that). The way it does this is to take the number of tuples not covered by the MCV frequency list and divide it by the number of distinct values not listed in the MCV list. So basically: ntuples * (1 - sum of MCV frequencies) / (n_distinct - length of MCV list). This simplified formula ignores NULLs.
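To see the inputs that feed this estimate for the column in question, you can inspect the standard pg_stats view (a quick check, not part of the fix):
SELECT n_distinct, null_frac, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'mcqueen_base_imagemeta2'
 AND attname = 'image_id';
-- For a value not in most_common_vals, the planner estimates roughly:
-- reltuples * (1 - sum(most_common_freqs)) / (n_distinct - number of MCV entries)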
As @ErwinBrandstetter suggests, you might be able to improve the situation by increasing the size of the MCV list through a larger statistics sample size. That might also increase the accuracy of the n_distinct estimate. But with billions of rows, it might not be possible to increase the sample size enough. Also, if image_id values are clumped together, with duplicate values likely to occur in the same page, then the sampling method used by PostgreSQL is quite biased when it comes to computing n_distinct, and this is resistant to fixing by just increasing the sample size.
A simpler way to fix this may be to set n_distinct manually:
alter table mcqueen_base_imagemeta2 alter column image_id set (n_distinct=1000000000);
analyze mcqueen_base_imagemeta2;
This method doesn't increase the time or storage required by ANALYZE, the way increasing the sample size does, and is also more likely to succeed.
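If you go this route, you can check what the next ANALYZE actually wrote into the statistics, and undo the override later if needed (the RESET below is the explicit reset mentioned in the comments):
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'mcqueen_base_imagemeta2'
 AND attname = 'image_id'; -- should now report the overridden value

-- to remove the override and go back to the sampled estimate:
ALTER TABLE mcqueen_base_imagemeta2 ALTER COLUMN image_id RESET (n_distinct);
ANALYZE mcqueen_base_imagemeta2;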
- Would you know on what occasions the system overwrites a manual setting of n_distinct? – Erwin Brandstetter, Sep 26, 2019
- @ErwinBrandstetter Nothing that I know of, other than an explicit RESET (or dropping the table and recreating it). Note that the manual setting is not used directly by the planner; it is just copied to overwrite the one in "pg_stats" on the next ANALYZE (and every ANALYZE thereafter). – jjanes, Sep 26, 2019
- Thanks, very useful! – Erwin Brandstetter, Sep 26, 2019
- The n_distinct is currently 2,540,250, which seems about right, but with a number that large and sparsely populated across the 6B rows, I tend to agree that I cannot get a sample size large enough to get meaningful coverage. I will manually increase n_distinct. – Arne Claassen, Sep 26, 2019
- The n_distinct ALTER followed by an ANALYZE did the trick. Thanks! – Arne Claassen, Sep 26, 2019
Why?
For a LIMIT 1, Postgres may estimate it to be faster to traverse the index supporting the ORDER BY and just keep filtering until the first row is found. This is fast as long as more than a few rows qualify and one of those pops up early according to the ORDER BY. But it is (very) slow if no qualifying row pops up early, and a worst-case scenario if no row ends up qualifying at all. Similar for any small LIMIT.
Postgres collects statistics about the most common values (MCV list), but not for the least common ones - for obvious reasons, that would be far too many to be useful. And it has no statistics for correlations between columns by default. (While that can be created manually it won't fit your use case anyway, as ID numbers are typically uncorrelated.)
So Postgres has to base its decision on generic estimates. It's very hard to identify the sweet spot where it should switch from one index to the other. This gets harder yet for a predicate like image_id IN (123, ...) with many items, most of which are typically rare, very rare, or even non-existent. But if you put enough numbers into the list, Postgres will eventually expect that traversing the other index will find the first hit faster.
Solutions?
You may be able to improve the situation somewhat with a larger statistics target:
ALTER TABLE mcqueen_base_imagemeta2 ALTER image_id SET STATISTICS 2000;
That (among other things) increases the size of the MCV list for the column and helps identify more of the (less) common values. But it's not a general solution to the problem, and it makes ANALYZE and query planning a bit more expensive.
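Note that the new target only takes effect once fresh statistics are gathered, so follow it with:
ANALYZE mcqueen_base_imagemeta2;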
Upgrading to the latest version (soon to be Postgres 12) also helps, as general performance has improved and the planner has gotten smarter.
There are various techniques for a workaround, depending on cardinalities, value frequencies, access patterns, ... Completely disabling the ORDER BY index like Laurenz demonstrated is one radical workaround - which can backfire for long lists or very common image_id values, where the ORDER BY index would, in fact, be much faster.
Workaround for your case
Should work well for the given numbers: 5 billion rows, around 20 image_id values in the filter list, small LIMIT. Most efficient for LIMIT 1 and a short list, but good for any small LIMIT and manageable list size:
SELECT m.*
FROM unnest('{123, ...}'::bigint[]) i(image_id)
CROSS JOIN LATERAL (
 SELECT m.*
 FROM mcqueen_base_imagemeta2 m
 WHERE m.image_id = i.image_id
 ORDER BY m.id DESC
 LIMIT 1 -- or N
 ) m
ORDER BY id DESC
LIMIT 1; -- or N
Provide your list as an array and unnest() it. Or use a VALUES expression.
It's essential to support this with a multicolumn index on (image_id, id DESC)!
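For example (the index name here is only a placeholder; on a table of this size you may want CONCURRENTLY to avoid blocking writes while it builds):
CREATE INDEX CONCURRENTLY mcqueen_base_imagemeta2_image_id_id_idx
 ON mcqueen_base_imagemeta2 (image_id, id DESC);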
You might then delete the existing index mcqueen_base_imagemeta2_image_id_616fe89c on just (image_id).
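If you do drop it (the two-column index can serve plain image_id lookups as well, since image_id is its leading column), that would be:
DROP INDEX mcqueen_base_imagemeta2_image_id_616fe89c;
-- or DROP INDEX CONCURRENTLY ... to avoid taking an exclusive lock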
This should result in one very fast index(-only) scan per image_id, and a final, (very) cheap sort step.
Fetching N rows for each image_id guarantees that we have all rows needed in the outer query. If you have meta-knowledge that fewer rows per single image_id can be in the result, you can decrease the nested LIMIT accordingly.
Aside
(a common pattern in Django pagination)
Pagination with LIMIT and OFFSET? OK for the first page, but after that it's just a bad idea.
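For completeness, a minimal keyset ("seek") pagination sketch, assuming the client remembers the id of the last row it saw on the previous page (the literal id value below is purely illustrative):
SELECT *
FROM mcqueen_base_imagemeta2
WHERE image_id IN ( 123, ... )
 AND id < 41000000 -- id of the last row on the previous page (illustrative)
ORDER BY id DESC
LIMIT 100;
Unlike OFFSET, the cost of this pattern does not grow with the page number.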
- Django actually does not use OFFSET (I am well aware of that evil from other frameworks as well). It over-fetches by 1, and each subsequent page uses a >= on that over-fetched id. Hence the requirement for an order. – Arne Claassen, Sep 26, 2019
- Over-fetching reveals whether there are more rows. But the next page would generally better be based on > the last id than on >= the over-fetched id. (Maybe it is?) Not to miss possibly inserted values in between. Anyway, if your queries are as displayed, consider the multicolumn index I suggested. Should help with your old query, too. – Erwin Brandstetter, Sep 26, 2019
The simple solution is to modify the ORDER BY condition so that the semantics are unchanged, but PostgreSQL cannot use the index any more:
SELECT * FROM mcqueen_base_imagemeta2
WHERE image_id IN ( 123, ... )
ORDER BY id + 0 DESC
LIMIT 1;
- That does indeed work. Unfortunately the queries are generated by the Django ORM, and only in certain scenarios would this really be the desirable behavior, so it's probably not a solution we can deploy. – Arne Claassen, Sep 25, 2019
- Any ORM worth its salt will let you write queries in SQL. How about creating a view AS SELECT id + 0 AS id, ...? (See the sketch after these comments.) – Laurenz Albe, Sep 25, 2019
- It's not that the ORM won't let me generate custom SQL, it's that the query is automatically generated as part of the REST API machinery. I can probably suppress that for the couple of places where this is applicable, but I would certainly prefer fixing it at the source rather than through a special case of tech debt. – Arne Claassen, Sep 25, 2019
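A sketch of the view Laurenz suggests above (the view name is a placeholder; the column list is taken from the table definition in the question):
CREATE VIEW mcqueen_base_imagemeta2_noidx AS
SELECT id + 0 AS id, -- the expression hides the plain id column, so ORDER BY id
 -- on this view cannot use the primary-key index
 created_at,
 image_id,
 key_id,
 source_version_id
FROM mcqueen_base_imagemeta2;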
Comment on the question: What does
select * from pg_stats where tablename = 'mcqueen_base_imagemeta2' and attname = 'image_id' \x\g\x
show? Extremely rare values (which these must be if 20 of them only find 16 rows) are often estimated poorly. And what is the actual number of distinct values of image_id in the table?
select count(*) from (select distinct image_id from mcqueen_base_imagemeta2) foo;