1

The table I need to query:

CREATE TABLE regions (
 state text NOT NULL,
 zip_begin text NOT NULL, -- 9-digit zip code
 zip_end text NOT NULL,
 date_begin date NOT NULL,
 date_end date,
 data ...,
)

There are ~50 states and between 0-5M rows per state. The zip and date ranges might overlap.

The following will be one of the most common queries in my OLTP application:

SELECT data
FROM regions
WHERE state = {state}
 AND {date} BETWEEN date_begin AND date_end
 AND {zip} BETWEEN zip_begin AND zip_end

This query usually yields one row, but may sometimes yield more.

From the Postgres docs, it sounds like a GiST index might do what I need, but I don't really understand how those work.

asked Mar 18, 2021 at 7:59
16
  • 1
    A standard B-Tree index on (state, date_begin, date_end, zip_begin, zip_end) should be enough Commented Mar 18, 2021 at 8:02
  • 1
    A B-Tree index won't work. For example, if I query with state 'CA' and date '2020年01月01日', Postgres can quickly identify the sub-range of 'CA' rows where date_begin is less than or equal to '2020年01月01日', but it has to then check every row in that range to see if date_end is greater than or equal to '2020年01月01日'. Commented Mar 18, 2021 at 8:33
  • 1
    Did you try it? Commented Mar 18, 2021 at 8:44
  • @KannanGoudan I also agree with a_horse_with_no_name here. Unless you think after the state and date_begin filtering of the index that there'll still be millions of rows for just the subset of what's left, it's probably a non-issue once it compares the date_end and zip code fields, no need to get creative before you test what commonly works. The only other suggestion I'd have (but very similar to a_horse_with_no_name's) is trying a B-Tree index on the same fields in a different order that may yield less results sooner, that is define the index field list in order of uniqueness... Commented Mar 18, 2021 at 11:17
  • ...For example, you can try testing (zip_begin, zip_end, date_begin, date_end, state) or (date_begin, date_end, zip_begin, zip_end, state) since your state field is probably the least unique of them all. This improves the selectivity of the index by reducing the results sooner so the rest of the fields in the index are applied to a smaller subset of records upfront. This is also a micro-optimization and probably not needed. Also the only way to actually tell which way is best, is to test each definition. Commented Mar 18, 2021 at 11:19

1 Answer 1

1

Don't dismiss the simple btree index until you try it. It surely won't scale well to a trillion rows, but you surely don't have a trillion rows. It might be good enough.

To use an gist index, you would need to treat the ranges explicitly as ranges, not as end points. It would probably be best to reformat your table in that format, but you can instead use expression indexes to reformat on the fly. But then your query will have to be written to match.

Some other complexities are that simple scalars don't have built in gist operators, so to put "state" into a gist index requires you to use the btree_gist extension; and there is not a built-in text range. Rather than creating a text-range, you can just cast the zip to int (which would probably be better done in the table in the first place), so something like this:

create index on regions using gist(
 state,
 daterange(date_begin,date_end,'[]'),
 int4range(zip_begin::int,zip_end::int,'[]')
);

And then the query would look like:

SELECT data
FROM regions
WHERE state = {state}
 AND {date}::date <@ daterange(date_begin,date_end,'[]')
 AND {zip}::int <@ int4range(zip_begin::int,zip_end::int,'[]')

(削除) Now I haven't tested this query on the gist index, because gist indexes are so ridiculously slow to build that I don't have one available yet to try it on. (削除ここまで) It has finished building and I have tested it, and it runs and gives the correct answer with '0' left-padded 5 digit zip codes. With the range sizes I used, it is sometimes faster and some times slower than the btree index on a 10e6 row table, but was never markedly different.

answered Mar 18, 2021 at 15:29
2
  • By casting ZIP from string to a number, you alter the comparison semantics... Commented Mar 18, 2021 at 16:03
  • I don't think you do as long as they were zero left padded to start with. And if they weren't zero left padded, then your comparisons seems rather hosed when done as strings and casting does change it but in a way that is probably for the better. Commented Mar 18, 2021 at 16:11

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.