The table I need to query:
CREATE TABLE regions (
state text NOT NULL,
zip_begin text NOT NULL, -- 9-digit zip code
zip_end text NOT NULL,
date_begin date NOT NULL,
date_end date,
data ...,
)
There are ~50 states and anywhere from 0 to 5M rows per state. The zip and date ranges might overlap.
The following will be one of the most common queries in my OLTP application:
SELECT data
FROM regions
WHERE state = {state}
AND {date} BETWEEN date_begin AND date_end
AND {zip} BETWEEN zip_begin AND zip_end
This query usually yields one row, but may sometimes yield more.
From the Postgres docs, it sounds like a GiST index might do what I need, but I don't really understand how those work.
1 Answer
Don't dismiss the simple btree index until you try it. It surely won't scale well to a trillion rows, but you surely don't have a trillion rows. It might be good enough.
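For concreteness, that plain btree could be as simple as the following; the column order here is just one plausible guess (it matches the list suggested in a comment further down):

create index on regions (state, date_begin, date_end, zip_begin, zip_end);

The query from the question then works unchanged: the planner can narrow the scan on state and date_begin and check the remaining conditions row by row, which may be perfectly acceptable here.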
To use a gist index, you would need to treat the ranges explicitly as ranges, not as pairs of end points. It would probably be best to store them that way in the table, but you can instead use expression indexes to build the ranges on the fly. But then your query will have to be written to match.
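If you do reformat the table, a sketch of what that could look like (the column names zips and dates are made up here, and the nullable date_end is folded into an open-ended range):

create table regions (
    state text not null,
    zips  int4range not null,  -- e.g. int4range(90001, 90210, '[]')
    dates daterange not null,  -- e.g. daterange('2020-01-01', null, '[]') for an open end
    data  text
);

create index on regions using gist (state, dates, zips);  -- scalar "state" still needs btree_gist (see below)

select data
from regions
where state = {state}
  and {date}::date <@ dates
  and {zip}::int   <@ zips;

The rest of this answer takes the expression-index route instead, so the table stays as it is.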
Some other complexities: simple scalars don't have built-in gist operators, so putting "state" into a gist index requires the btree_gist extension; and there is no built-in text range type. Rather than creating a custom text range, you can just cast the zip to int (which would probably be better done in the table in the first place), so something like this:
-- btree_gist lets the plain scalar "state" column participate in a gist index
create extension if not exists btree_gist;

create index on regions using gist (
    state,
    daterange(date_begin, date_end, '[]'),          -- '[]' = both ends inclusive
    int4range(zip_begin::int, zip_end::int, '[]')
);
And then the query would look like:
SELECT data
FROM regions
WHERE state = {state}
AND {date}::date <@ daterange(date_begin,date_end,'[]')
AND {zip}::int <@ int4range(zip_begin::int,zip_end::int,'[]')
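To see whether the planner actually picks the gist index for this shape of query, something along these lines should work (the literal values stand in for the {state}/{date}/{zip} parameters; 'CA', the date, and the zip are just example values):

explain (analyze, buffers)
select data
from regions
where state = 'CA'
  and date '2020-01-01' <@ daterange(date_begin, date_end, '[]')
  and 90210             <@ int4range(zip_begin::int, zip_end::int, '[]');

Note that the range expressions have to match the index expressions for the planner to consider the expression index at all.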
Building the gist index was ridiculously slow, but it has finished and I have tested the query: it runs and gives the correct answer with '0'-left-padded 5-digit zip codes. With the range sizes I used, it is sometimes faster and sometimes slower than the btree index on a 10e6-row table, but never markedly different.
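That test data isn't reproduced here, but a rough, made-up sketch of how one could generate a similar zero-padded table to experiment with might be:

-- ~1M synthetic rows with zero-padded 5-digit zip ranges and overlapping date ranges
create table regions_test as
select
    'CA'::text                                   as state,
    lpad(((g * 7) % 90000)::text, 5, '0')        as zip_begin,
    lpad(((g * 7) % 90000 + 50)::text, 5, '0')   as zip_end,
    date '2018-01-01' + (g % 1000)               as date_begin,
    date '2018-01-01' + (g % 1000) + 365         as date_end,
    md5(g::text)                                 as data
from generate_series(1, 1000000) as g;

Build both index variants on it and compare plans with explain (analyze, buffers).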
- By casting ZIP from string to a number, you alter the comparison semantics... – Laurenz Albe, Mar 18, 2021 at 16:03
- I don't think you do, as long as they were zero-left-padded to start with. And if they weren't zero-left-padded, then your comparisons seem rather hosed when done as strings, and casting does change them, but in a way that is probably for the better. – jjanes, Mar 18, 2021 at 16:11
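A tiny illustration of the point being argued here (not from the original thread): with unpadded zips, text and integer comparisons disagree, and zero-padding is what makes them agree again.

select '999'::text   > '10000'::text as text_cmp,    -- true: '9' sorts after '1', character by character
       999            > 10000        as int_cmp,     -- false: numeric comparison
       '00999'::text  > '10000'::text as padded_cmp; -- false: padding restores numeric order for text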
- (state, date_begin, date_end, zip_begin, zip_end) should be enough
- Given state 'CA' and date '2020-01-01', Postgres can quickly identify the sub-range of 'CA' rows where date_begin is less than or equal to '2020-01-01', but it then has to check every row in that range to see if date_end is greater than or equal to '2020-01-01'.
- Unless the state and date_begin filtering of the index is so unselective that there'll still be millions of rows in the subset that's left, it's probably a non-issue once it compares the date_end and zip code fields; no need to get creative before you test what commonly works. The only other suggestion I'd have (but very similar to a_horse_with_no_name's) is trying a B-Tree index on the same fields in a different order that may yield fewer results sooner, that is, define the index field list in order of uniqueness: (zip_begin, zip_end, date_begin, date_end, state) or (date_begin, date_end, zip_begin, zip_end, state), since your state field is probably the least unique of them all. This improves the selectivity of the index by reducing the results sooner, so the rest of the fields in the index are applied to a smaller subset of records upfront. This is also a micro-optimization and probably not needed. Also, the only way to actually tell which way is best is to test each definition.
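For reference, those alternative orderings are just different column lists on the same plain btree (untested sketches of the definitions described above):

create index on regions (zip_begin, zip_end, date_begin, date_end, state);
-- or
create index on regions (date_begin, date_end, zip_begin, zip_end, state);

As the comment says, the only way to tell which definition wins for the real data is to build each one and compare the plans.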