Asking this question specifically for Postgres, as it has good support for R-tree/spatial indexes.
We have the following table with a tree structure (Nested Set model) of words and their frequencies:
lexikon
-------
_id integer PRIMARY KEY
word text
frequency integer
lset integer UNIQUE
rset integer UNIQUE
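In concrete DDL that would be something like (a sketch; the exact constraint details are assumptions):

CREATE TABLE lexikon (
 _id integer PRIMARY KEY
, word text
, frequency integer
, lset integer UNIQUE
, rset integer UNIQUE
);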
And the query:
SELECT word
FROM lexikon
WHERE lset BETWEEN @Low AND @High
ORDER BY frequency DESC
LIMIT @N
I suppose a covering index on (lset, frequency, word)
would be useful, but I feel it may not perform well if there are too many lset
values in the (@Low, @High)
range.
A simple index on (frequency DESC)
may also be sufficient sometimes, when a search using that index yields the @N
rows matching the range condition early.
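For concreteness, the two candidate indexes would look like this (a sketch; the index names are mine):

CREATE INDEX lexikon_range_covering ON lexikon (lset, frequency, word);
-- range scan on lset; frequency and word included so the index alone can answer the query

CREATE INDEX lexikon_freq ON lexikon (frequency DESC);
-- scan in frequency order, hoping the first @N matches happen to fall in the lset range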
But it seems that performance depends a lot on the parameter values.
Is there a way to make it perform fast, regardless of whether the range (@Low, @High)
is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?
Would an R-tree/spatial index help?
There are no limitations: adding indexes, rewriting the query, or re-designing the table are all acceptable.
4 Answers
You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:
testbed and lexikon dummy data:
begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial,
word text,
frequency integer,
lset integer,
width_granule integer);
--
insert into lexikon(word, frequency, lset)
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'
granule analysis (mostly for information and tuning):
create table granule as
select width_granule, count(*) as freq,
min(frequency) as granule_start, max(frequency) as granule_end
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
width_granule | freq | granule_start | granule_end
---------------+--------+---------------+-------------
0 | 500000 | 1 | 1
1 | 300000 | 2 | 4
2 | 123077 | 5 | 12
3 | 47512 | 13 | 33
4 | 18422 | 34 | 90
5 | 6908 | 91 | 244
6 | 2580 | 245 | 665
7 | 949 | 666 | 1808
8 | 349 | 1811 | 4901
9 | 129 | 4926 | 13333
10 | 47 | 13513 | 35714
11 | 17 | 37037 | 90909
12 | 7 | 100000 | 250000
13 | 2 | 333333 | 500000
14 | 1 | 1000000 | 1000000
*/
alter table granule drop column freq;
--
function for scanning high frequencies first:
create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
 m integer;
 n integer := 0;
 r record;
begin
 -- step through the granules from the highest frequencies downwards
 for r in (select width_granule from granule order by width_granule desc) loop
 return query( select *
 from lexikon
 where width_granule=r.width_granule
 and lset>=p_lset_low and lset<=p_lset_high );
 get diagnostics m = row_count;
 n = n+m;
 -- stop once at least p_limit rows have been returned;
 -- the caller applies the final ORDER BY frequency DESC LIMIT
 exit when n>=p_limit;
 end loop;
end;$$;
results (timings should probably be taken with a pinch of salt, but each query is run twice to counter any caching), first using the function we've written:
\timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 0.510 ms
*/
and then with a simple index scan:
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 51.250 ms
*/
\timing off
--
rollback;
Depending on your real-world data, you will probably want to vary the number of granules and the function used for assigning rows to them. The actual distribution of frequencies is key here, as are the expected values for the limit
clause and the sizes of the lset
ranges sought.
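For example, a coarser granulation could use a larger logarithm base; a hypothetical variant (the base 4 is arbitrary, and the granule table and index would need rebuilding afterwards):

update lexikon set width_granule=(ln(frequency)/ln(4))::integer;
-- fewer, wider granules: fewer loop iterations, but more rows scanned per step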
Setup
I am building on Jack's setup to make it easier for people to follow and compare. Tested with PostgreSQL 9.1.4. Later re-tested with Postgres 15.
CREATE TABLE lexikon (
lex_id serial PRIMARY KEY
, word text
, frequency int NOT NULL -- we'd need to do more if NULL was allowed
, lset int
);
INSERT INTO lexikon(word, frequency, lset)
SELECT 'w' || g -- shorter with just 'w'
, (1000000 / row_number() OVER (ORDER BY random()))::int
, g
FROM generate_series(1,1000000) g;
From here on I take a different route:
ANALYZE lexikon;
Auxiliary table
This solution does not add columns to the original table; it just needs a tiny helper table. I placed it in the schema public; use any schema of your choice.
CREATE TABLE public.lex_freq AS
WITH x AS (
SELECT DISTINCT ON (f.row_min)
f.row_min, c.row_ct, c.frequency
FROM (
SELECT frequency, sum(count(*)) OVER (ORDER BY frequency DESC) AS row_ct
FROM lexikon
GROUP BY 1
) c
JOIN ( -- list of steps in recursive search
VALUES (400),(1600),(6400),(25000),(100000),(200000),(400000),(600000),(800000)
) f(row_min) ON c.row_ct >= f.row_min -- match next greater number
ORDER BY f.row_min, c.row_ct, c.frequency DESC
)
, y AS (
SELECT DISTINCT ON (frequency)
row_min, row_ct, frequency AS freq_min
, lag(frequency) OVER (ORDER BY row_min) AS freq_max
FROM x
ORDER BY frequency, row_min
-- if one frequency spans multiple ranges, pick the lowest row_min
)
SELECT row_min, row_ct, freq_min
, CASE
WHEN freq_max = freq_min + 1 THEN 'frequency = ' || freq_min
WHEN freq_max > freq_min THEN 'frequency >= ' || freq_min || ' AND frequency < ' || freq_max
ELSE 'frequency >= ' || freq_min
END AS cond
FROM y;
Table looks like this:
row_min | row_ct | freq_min | cond
--------+---------+----------+-------------
400 | 400 | 2500 | frequency >= 2500
1600 | 1600 | 625 | frequency >= 625 AND frequency < 2500
6400 | 6410 | 156 | frequency >= 156 AND frequency < 625
25000 | 25000 | 40 | frequency >= 40 AND frequency < 156
100000 | 100000 | 10 | frequency >= 10 AND frequency < 40
200000 | 200000 | 5 | frequency >= 5 AND frequency < 10
400000 | 500000 | 2 | frequency >= 2 AND frequency < 5
600000 | 1000000 | 1 | frequency = 1
As the column cond
is going to be used in dynamic SQL further down, you have to make this table secure. Always schema-qualify the table if you cannot be sure of an appropriate current search_path
, and revoke write privileges from public
(and any other untrusted role):
REVOKE ALL ON public.lex_freq FROM public;
GRANT SELECT ON public.lex_freq TO public;
The table lex_freq
serves 3 purposes:
- Create the needed partial indexes automatically.
- Provide steps for the iterative function.
- Provide meta information for tuning.
Indexes
This DO
statement creates all needed indexes:
DO
$do$
DECLARE
_cond text;
BEGIN
FOR _cond IN
SELECT cond FROM public.lex_freq
LOOP
IF _cond LIKE 'frequency =%' THEN
EXECUTE 'CREATE INDEX ON lexikon(lset) WHERE ' || _cond;
ELSE
EXECUTE 'CREATE INDEX ON lexikon(lset, frequency DESC) WHERE ' || _cond;
END IF;
END LOOP;
END
$do$;
All of these partial indexes together span the table once. They are about the same size as one basic index on the whole table:
SELECT pg_size_pretty(pg_relation_size('lexikon')); -- 50 MB
SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 71 MB
Only 21 MB of indexes for a 50 MB table so far (in pg 9.1).
I create most of the partial indexes on (lset, frequency DESC)
. The second column only helps in special cases. But as both involved columns are of type integer
, due to the specifics of data alignment in combination with MAXALIGN in PostgreSQL, the second column does not make the index any bigger. It's a small win for hardly any cost.
There is no point in doing that for partial indexes that only span a single frequency. Those are just on (lset)
. Created indexes look like this:
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2500;
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 625 AND frequency < 2500;
-- ...
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2 AND frequency < 5;
CREATE INDEX ON lexikon(lset) WHERE frequency = 1;
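To verify the footprint on your own system, you can list these indexes with their sizes (a sketch, not part of the original setup):

SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) AS idx_size
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'lexikon'::regclass
ORDER BY c.relname;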
Function
The function is somewhat similar in style to Jack's solution:
CREATE OR REPLACE FUNCTION f_search(_lset_min int, _lset_max int, _limit int)
RETURNS SETOF lexikon
LANGUAGE plpgsql STABLE AS
$func$
DECLARE
_n int;
_rest int := _limit; -- init with _limit param
_cond text;
BEGIN
FOR _cond IN
SELECT l.cond FROM public.lex_freq l ORDER BY l.row_min
LOOP
-- RAISE NOTICE '_cond: %, _limit: %', _cond, _rest; -- for debugging
RETURN QUERY EXECUTE '
SELECT *
FROM public.lexikon
WHERE ' || _cond || '
AND lset >= 1ドル
AND lset <= 2ドル
ORDER BY frequency DESC
LIMIT 3ドル'
USING _lset_min, _lset_max, _rest;
GET DIAGNOSTICS _n = ROW_COUNT;
_rest := _rest - _n;
EXIT WHEN _rest < 1;
END LOOP;
END
$func$;
Key differences
- Dynamic SQL with RETURN QUERY EXECUTE.
 A different query plan may be preferable for each iteration. The query plan for static SQL is generated once and then reused, which can save some overhead. But in this case the query is simple, and the values are very different. Dynamic SQL can be a big win.
- Dynamic LIMIT for every iteration.
 This helps in multiple ways: firstly, rows are only fetched as needed, and in combination with dynamic SQL this may also generate different query plans to begin with. Secondly, there is no need for an additional LIMIT in the function call to trim the surplus.
Benchmark
Setup
I picked four examples and ran three different tests with each. I took the best of five to compare with a warm cache:

1. The raw SQL query of the form:

 SELECT * FROM lexikon WHERE lset >= 20000 AND lset <= 30000 ORDER BY frequency DESC LIMIT 5;

2. The same after creating this index:

 CREATE INDEX ON lexikon(lset);

 It needs about the same space as all my partial indexes together:

 SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 93 MB

3. The function:

 SELECT * FROM f_search(20000, 30000, 5);
Results
SELECT * FROM f_search(20000, 30000, 5);
1: Total runtime: 315.458 ms
2: Total runtime: 36.458 ms
3: Total runtime: 0.330 ms
SELECT * FROM f_search(60000, 65000, 100);
1: Total runtime: 294.819 ms
2: Total runtime: 18.915 ms
3: Total runtime: 1.414 ms
SELECT * FROM f_search(10000, 70000, 100);
1: Total runtime: 426.831 ms
2: Total runtime: 217.874 ms
3: Total runtime: 1.611 ms
SELECT * FROM f_search(1, 1000000, 5);
1: Total runtime: 2458.205 ms
2: Total runtime: 2458.205 ms -- for large ranges of lset, seq scan is faster than index.
3: Total runtime: 0.266 ms
Conclusion
As expected, the benefit from the function grows with bigger ranges of lset
and smaller LIMIT
.
With very small ranges of lset
, the raw query in combination with the index is actually faster. You'll want to test and maybe branch: the raw query for small ranges of lset
, else the function call. You could even just build that into the function for the "best of both worlds" - that's what I would do.
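A minimal sketch of such a branch at the top of f_search (the threshold of 10000 is a hypothetical placeholder; determine it by testing):

 IF _lset_max - _lset_min < 10000 THEN -- hypothetical cut-over point
 RETURN QUERY
 SELECT *
 FROM public.lexikon
 WHERE lset BETWEEN _lset_min AND _lset_max
 ORDER BY frequency DESC
 LIMIT _limit;
 RETURN; -- skip the loop over lex_freq entirely
 END IF;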
Depending on your data distribution and typical queries, more steps in lex_freq
may help performance. Test to find the sweet spot. With the tools presented here, it should be easy to test.
I do not see any reason to include the word column in the index. So this index:
CREATE INDEX lexikon_lset_frequency ON lexikon (lset, frequency DESC);
will make your query perform fast.
UPD
Currently there is no way to make a covering index in PostgreSQL. There was a discussion about this feature in the PostgreSQL mailing list: http://archives.postgresql.org/pgsql-performance/2012-06/msg00114.php
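A later note: since this answer was written, PostgreSQL 11 added covering indexes via the INCLUDE clause; with it, the suggested index could cover word like this:

CREATE INDEX lexikon_lset_frequency ON lexikon (lset, frequency DESC) INCLUDE (word);
-- word is stored in the index leaf pages, enabling index-only scans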
Using a GIST index
Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?
It depends on what you mean by fast: you obviously have to visit every row in the range because your query is ORDER BY frequency DESC
. Shy of that, the query planner already covers this, if I understand the question.
Here we create a table with 10k rows of (5::int,random()::double precision)
CREATE EXTENSION IF NOT EXISTS btree_gist; -- btree_gist (not btree_gin) provides the GiST operator classes for plain scalar columns
CREATE TABLE t AS
SELECT 5::int AS foo, random() AS bar
FROM generate_series(1,1e4) AS gs(x);
We index it,
CREATE INDEX ON t USING gist (foo, bar);
ANALYZE t;
We query it,
EXPLAIN ANALYZE
SELECT *
FROM t
WHERE foo BETWEEN 1 AND 6
ORDER BY bar DESC
FETCH FIRST ROW ONLY;
We get a Seq Scan on t
. This is simply because our selectivity estimates let pg conclude that heap access is faster than scanning an index and rechecking. So we make it juicier by inserting 1,000,000 more rows of (42::int,random()::double precision)
that don't fit our "range".
INSERT INTO t(foo,bar)
SELECT 42::int, x
FROM generate_series(1,1e6) AS gs(x);
VACUUM ANALYZE t;
And then we requery,
EXPLAIN ANALYZE
SELECT *
FROM t
WHERE foo BETWEEN 1 AND 6
ORDER BY bar DESC
FETCH FIRST ROW ONLY;
You can see here that we complete in 4.6 ms with an Index Only Scan:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=617.64..617.64 rows=1 width=12) (actual time=4.652..4.652 rows=1 loops=1)
-> Sort (cost=617.64..642.97 rows=10134 width=12) (actual time=4.651..4.651 rows=1 loops=1)
Sort Key: bar DESC
Sort Method: top-N heapsort Memory: 25kB
-> Index Only Scan using t_foo_bar_idx on t (cost=0.29..566.97 rows=10134 width=12) (actual time=0.123..3.623 rows=10000 loops=1)
Index Cond: ((foo >= 1) AND (foo <= 6))
Heap Fetches: 0
Planning time: 0.144 ms
Execution time: 4.678 ms
(9 rows)
Expanding the range to include the whole table produces another seq scan, logically enough, and growing the table with another billion rows would produce another Index Scan.
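For instance, a whole-table range like the following reverts to a seq scan (a sketch; the exact plan depends on your statistics):

EXPLAIN ANALYZE
SELECT *
FROM t
WHERE foo BETWEEN 1 AND 42
ORDER BY bar DESC
FETCH FIRST ROW ONLY;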
So in summary,
- It'll perform fast for this quantity of data.
- "Fast" is relative to the alternative: if the range isn't selective enough, a sequential scan may be as fast as you can get.