The table I need to query:
CREATE TABLE regions (
state text NOT NULL,
zip_begin text NOT NULL, -- 9-digit zip code
zip_end text NOT NULL,
date_begin date NOT NULL,
date_end date,
data ...,
)
There are ~50 states and anywhere from 0 to 5M rows per state. The zip and date ranges might overlap.
The following will be one of the most common queries in my OLTP application:
SELECT data
FROM regions
WHERE state = {state}
AND {date} BETWEEN date_begin AND date_end
AND {zip} BETWEEN zip_begin AND zip_end
This query usually yields one row, but may sometimes yield more.
From the Postgres docs, it sounds like a GiST index might do what I need, but I don't really understand how those work.
1 Answer
Don't dismiss the simple btree index until you try it. It surely won't scale well to a trillion rows, but you surely don't have a trillion rows. It might be good enough.
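For concreteness, that plain btree could be as simple as the following; the column order here is just one plausible guess (it matches the list suggested in a comment further down):

create index on regions (state, date_begin, date_end, zip_begin, zip_end);

The query from the question then works unchanged: the planner can narrow the scan on state and date_begin and check the remaining conditions row by row, which may be perfectly acceptable here.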
To use a gist index, you would need to treat the ranges explicitly as ranges, not as pairs of end points. It would probably be best to store them that way in the table, but you can instead use expression indexes to build the ranges on the fly. But then your query will have to be written to match.
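If you do reformat the table, a sketch of what that could look like (the column names zips and dates are made up here, and the nullable date_end is folded into an open-ended range):

create table regions (
    state text not null,
    zips  int4range not null,  -- e.g. int4range(90001, 90210, '[]')
    dates daterange not null,  -- e.g. daterange('2020-01-01', null, '[]') for an open end
    data  text
);

create index on regions using gist (state, dates, zips);  -- scalar "state" still needs btree_gist (see below)

select data
from regions
where state = {state}
  and {date}::date <@ dates
  and {zip}::int   <@ zips;

The rest of this answer takes the expression-index route instead, so the table stays as it is.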
Some other complexities: simple scalars don't have built-in gist operators, so putting "state" into a gist index requires the btree_gist extension; and there is no built-in text range type. Rather than creating a custom text range, you can just cast the zip to int (which would probably be better done in the table in the first place), so something like this:
-- btree_gist lets the plain scalar "state" column participate in a gist index
create extension if not exists btree_gist;

create index on regions using gist (
    state,
    daterange(date_begin, date_end, '[]'),          -- '[]' = both ends inclusive
    int4range(zip_begin::int, zip_end::int, '[]')
);
And then the query would look like:
SELECT data
FROM regions
WHERE state = {state}
AND {date}::date <@ daterange(date_begin,date_end,'[]')
AND {zip}::int <@ int4range(zip_begin::int,zip_end::int,'[]')
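To see whether the planner actually picks the gist index for this shape of query, something along these lines should work (the literal values stand in for the {state}/{date}/{zip} parameters; 'CA', the date, and the zip are just example values):

explain (analyze, buffers)
select data
from regions
where state = 'CA'
  and date '2020-01-01' <@ daterange(date_begin, date_end, '[]')
  and 90210             <@ int4range(zip_begin::int, zip_end::int, '[]');

Note that the range expressions have to match the index expressions for the planner to consider the expression index at all.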
Building the gist index was ridiculously slow, but it has finished and I have tested the query: it runs and gives the correct answer with '0'-left-padded 5-digit zip codes. With the range sizes I used, it is sometimes faster and sometimes slower than the btree index on a 10e6-row table, but never markedly different.
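That test data isn't reproduced here, but a rough, made-up sketch of how one could generate a similar zero-padded table to experiment with might be:

-- ~1M synthetic rows with zero-padded 5-digit zip ranges and overlapping date ranges
create table regions_test as
select
    'CA'::text                                   as state,
    lpad(((g * 7) % 90000)::text, 5, '0')        as zip_begin,
    lpad(((g * 7) % 90000 + 50)::text, 5, '0')   as zip_end,
    date '2018-01-01' + (g % 1000)               as date_begin,
    date '2018-01-01' + (g % 1000) + 365         as date_end,
    md5(g::text)                                 as data
from generate_series(1, 1000000) as g;

Build both index variants on it and compare plans with explain (analyze, buffers).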
- By casting ZIP from string to a number, you alter the comparison semantics... – Laurenz Albe, Mar 18, 2021 at 16:03
- I don't think you do, as long as they were zero-left-padded to start with. And if they weren't zero-left-padded, then your comparisons seem rather hosed when done as strings, and casting does change them, but in a way that is probably for the better. – jjanes, Mar 18, 2021 at 16:11
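A tiny illustration of the point being argued here (not from the original thread): with unpadded zips, text and integer comparisons disagree, and zero-padding is what makes them agree again.

select '999'::text   > '10000'::text as text_cmp,    -- true: '9' sorts after '1', character by character
       999            > 10000        as int_cmp,     -- false: numeric comparison
       '00999'::text  > '10000'::text as padded_cmp; -- false: padding restores numeric order for text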
- (state, date_begin, date_end, zip_begin, zip_end) should be enough
- Given state 'CA' and date '2020-01-01', Postgres can quickly identify the sub-range of 'CA' rows where date_begin is less than or equal to '2020-01-01', but it then has to check every row in that range to see if date_end is greater than or equal to '2020-01-01'.
- Unless the state and date_begin filtering of the index is so unselective that there'll still be millions of rows in the subset that's left, it's probably a non-issue once it compares the date_end and zip code fields; no need to get creative before you test what commonly works. The only other suggestion I'd have (but very similar to a_horse_with_no_name's) is trying a B-Tree index on the same fields in a different order that may yield fewer results sooner, that is, define the index field list in order of uniqueness: (zip_begin, zip_end, date_begin, date_end, state) or (date_begin, date_end, zip_begin, zip_end, state), since your state field is probably the least unique of them all. This improves the selectivity of the index by reducing the results sooner, so the rest of the fields in the index are applied to a smaller subset of records upfront. This is also a micro-optimization and probably not needed. Also, the only way to actually tell which way is best is to test each definition.
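For reference, those alternative orderings are just different column lists on the same plain btree (untested sketches of the definitions described above):

create index on regions (zip_begin, zip_end, date_begin, date_end, state);
-- or
create index on regions (date_begin, date_end, zip_begin, zip_end, state);

As the comment says, the only way to tell which definition wins for the real data is to build each one and compare the plans.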