I created the following table (similar to the example from http://use-the-index-luke.com/sql/example-schema/postgresql/performance-testing-scalability ):
CREATE TABLE scale_data (
section NUMERIC NOT NULL,
id1 NUMERIC NOT NULL, -- unique values simulating ID or Timestamp
id2 NUMERIC NOT NULL -- a kind of Type
);
and populated it with:
INSERT INTO scale_data
SELECT sections.sections, sections.sections*10000 + gen.gen
, CEIL(RANDOM()*100)
FROM GENERATE_SERIES(1, 300) sections,
GENERATE_SERIES(1, 90000) gen
WHERE gen <= sections * 300;
This generated 13,545,000 rows.
Then I created a composite index on it:
CREATE INDEX id1_id2_idx
ON public.scale_data
USING btree
(id1, id2);
Select #1:
select id2 from scale_data
where id2 in (50)
order by id1 desc
limit 500
Explain analyze:
Limit (cost=0.56..1177.67 rows=500 width=11) (actual time=0.046..5.124 rows=500 loops=1)
 -> Index Only Scan Backward using id1_id2_idx on scale_data (cost=0.56..311588.74 rows=132353 width=11) (actual time=0.045..5.060 rows=500 loops=1)
 Index Cond: (id2 = '50'::numeric)
 Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5.177 ms
Select #2: more values in the IN list, and the plan changed:
select id2 from scale_data
where id2 in (50, 52)
order by id1 desc
limit 500
Explain analyze#2:
Limit (cost=0.56..857.20 rows=500 width=11) (actual time=0.061..8.703 rows=500 loops=1)
 -> Index Only Scan Backward using id1_id2_idx on scale_data (cost=0.56..445780.74 rows=260190 width=11) (actual time=0.059..8.648 rows=500 loops=1)
 Filter: (id2 = ANY ('{50,52}'::numeric[]))
 Rows Removed by Filter: 25030
 Heap Fetches: 0
Planning time: 0.153 ms
Execution time: 8.771 ms
Why do the plans differ? Why does #1 show an Index Cond, while #2 shows a Filter plus the number of index entries scanned and discarded? Doesn't select #1 traverse the index the same way the plan for select #2 shows?
On the real/production DB, #2 runs much slower, even though searching by each of the two keys separately is fast.
PG 9.5
I probably miss the point, but would not (id2, id1) be a better index for these queries? – Lennart - Slava Ukraini, Jul 17, 2019 at 19:25
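Building on that comment, a minimal sketch of the suggested alternative index (assuming the schema from the question; the index name is hypothetical):

```sql
-- With id2 leading, each IN value maps to one contiguous, id1-ordered
-- range of index entries, so a single-value lookup becomes a plain
-- Index Cond with no filtering. For multiple values, the per-value
-- runs would still have to be combined to satisfy ORDER BY id1 DESC.
CREATE INDEX id2_id1_idx
 ON public.scale_data
 USING btree (id2, id1);
```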
2 Answers
I wouldn't let this bug you. Filter, in that context, I believe just means that more than one condition is checked per index entry (which is what IN and array operations get translated into, afaik). In either case, everything is executed under the Index Only Scan Backward. It works the same way with OR:
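The query that produced the plan below isn't shown; reconstructed from the Filter line in that plan, it would be along these lines:

```sql
-- Reconstructed from "Filter: ((id2 = '50'::numeric) OR (id2 = '52'::numeric))"
select id2 from scale_data
where id2 = 50 or id2 = 52
order by id1 desc
limit 500;
```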
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..1219.95 rows=500 width=11) (actual time=0.061..16.159 rows=500 loops=1)
-> Index Only Scan Backward using id1_id2_idx on scale_data (cost=0.56..679161.56 rows=278484 width=11) (actual time=0.060..16.086 rows=500 loops=1)
Filter: ((id2 = '50'::numeric) OR (id2 = '52'::numeric))
Rows Removed by Filter: 24673
Heap Fetches: 25173
Planning time: 0.206 ms
Execution time: 16.235 ms
(7 rows)
test=# EXPLAIN ANALYZE select id2 from scale_data
where id2 in (50, 52)
order by id1 desc
limit 500
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..1153.17 rows=500 width=11) (actual time=0.072..18.604 rows=500 loops=1)
-> Index Only Scan Backward using id1_id2_idx on scale_data (cost=0.56..645299.05 rows=279930 width=11) (actual time=0.070..18.506 rows=500 loops=1)
Filter: (id2 = ANY ('{50,52}'::numeric[]))
Rows Removed by Filter: 24673
Heap Fetches: 25173
Planning time: 0.187 ms
Execution time: 18.695 ms
(7 rows)
OK, but why is it much faster for one value on the real DB, and why does it show an Index Cond there? :) – ALZ, May 25, 2017 at 15:57
You can tinker with the environment here, or share both plans and we can look at it. In this case you don't actually have to visit the rows, so it doesn't matter much, but if the index didn't have a useful ordering you would have to visit the rows. Which means an additional value could require visiting tens, hundreds, thousands, millions or EVEN BILLIONZ of extra rows. After billionzzz PostgreSQL slows down and acts like other RDBMS, losing its super powers. – Evan Carroll, May 25, 2017 at 16:03
To me this looks like a good demonstration of SARGable vs. non-SARGable predicates.
where id2 in (50)
is rewritten by the optimizer into a simple equality predicate, which seems to be SARGable by Postgres' standards: even though id2
is not the leading column in the composite key, it still needs just a simple comparison of byte arrays.
where id2 in (50, 52)
cannot be rewritten that way, and the resulting predicate seems to be non-SARGable: it requires that the id2 part of the composite key value be extracted into a variable that can become an operand to =ANY()
. This seems to demand greater smarts than the index scan is capable of, hence the involvement of the Filter step, crossing the border between the index manager and the SQL engine.
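One way to work around this in practice (a sketch, assuming the schema and the existing (id1, id2) index from the question) is to split the IN list into per-value queries, each of which is SARGable on its own, as in select #1, and then re-sort and re-limit the combined result:

```sql
-- Each branch gets its own backward index scan with an Index Cond
-- (as in the single-value plan above); the outer query merges the
-- two 500-row streams and keeps the overall top 500.
SELECT id2
FROM (
 (SELECT id1, id2 FROM scale_data WHERE id2 = 50 ORDER BY id1 DESC LIMIT 500)
 UNION ALL
 (SELECT id1, id2 FROM scale_data WHERE id2 = 52 ORDER BY id1 DESC LIMIT 500)
) u
ORDER BY id1 DESC
LIMIT 500;
```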