
I have a table with several terabytes of event data in a very simple (id, bucket_id, data, created_at) schema and there is an index like so

create index index_events_on_created_at_and_bucket_id
 on public.events (created_at desc, bucket_id asc);

Now I thought it would be fast to find the id of the most recent event in each bucket with a query like:

select max(created_at), bucket_id from events group by bucket_id;

Explain output:

HashAggregate (cost=170172168.62..170172178.41 rows=979 width=16)
 Group Key: bucket_id
 -> Index Only Scan using index_events_on_created_at_and_bucket_id on events (cost=0.70..156003994.34 rows=2833634856 width=16)

It seems to be using the index, but doing a full index scan instead of just grabbing the head value per bucket like I expected. Either way, it does not complete in a timely manner. I suppose it's a problem with using the aggregate function in the query, but I don't know how to fix it.

Is there a query that can return the most recent (i.e. first in the index) created_at timestamp for each bucket by fetching it out of this index?

Erwin Brandstetter
asked Sep 24, 2023 at 21:19

1 Answer


Better index with leading bucket_id

You want one row per bucket. An index with leading bucket_id will be much more useful.

CREATE INDEX events_bucket_id_created_at_idx ON events (bucket_id, created_at DESC);


Since you have a very small number of distinct values in bucket_id ("rows=979"), this query technique should give you dramatically faster results, based on my suggested index:

WITH RECURSIVE cte AS (
 ( -- parentheses required
 SELECT bucket_id, created_at
 FROM events
 ORDER BY bucket_id, created_at DESC
 LIMIT 1
 )
 
 UNION ALL
 SELECT e.*
 FROM cte c
 CROSS JOIN LATERAL (
 SELECT e.bucket_id, e.created_at
 FROM events e
 WHERE e.bucket_id > c.bucket_id
 ORDER BY e.bucket_id, e.created_at DESC
 LIMIT 1
 ) e
 WHERE c.bucket_id IS NOT NULL
 )
SELECT * FROM cte
WHERE bucket_id IS NOT NULL;

It emulates a "loose index scan", only picking the "first" row for every distinct bucket_id - exactly what you are looking for.

Note how the sort order in the query meticulously matches the index.
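To sanity-check the loose-index-scan logic at toy scale, here is a sketch in Python using SQLite. SQLite has no LATERAL, so the per-bucket probe is rewritten as correlated scalar subqueries; the schema and sample rows are invented for illustration, but the recursive "jump to the next bucket" idea is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY,
                     bucket_id INTEGER NOT NULL,
                     created_at TEXT NOT NULL);
CREATE INDEX events_bucket_id_created_at_idx
    ON events (bucket_id, created_at DESC);
INSERT INTO events (bucket_id, created_at) VALUES
  (1, '2023-09-24 10:00'), (1, '2023-09-24 12:00'),
  (2, '2023-09-24 09:30'), (2, '2023-09-24 11:45'),
  (3, '2023-09-24 08:15');
""")

# Loose index scan, SQLite-style: each recursion step seeks the next
# distinct bucket_id; a correlated subquery then fetches its latest row.
rows = conn.execute("""
WITH RECURSIVE cte(bucket_id) AS (
    SELECT min(bucket_id) FROM events
  UNION ALL
    SELECT (SELECT min(bucket_id) FROM events e
            WHERE e.bucket_id > cte.bucket_id)
    FROM cte
    WHERE cte.bucket_id IS NOT NULL
)
SELECT bucket_id,
       (SELECT max(created_at) FROM events e
        WHERE e.bucket_id = cte.bucket_id) AS created_at
FROM cte
WHERE bucket_id IS NOT NULL
ORDER BY bucket_id
""").fetchall()

print(rows)
# [(1, '2023-09-24 12:00'), (2, '2023-09-24 11:45'), (3, '2023-09-24 08:15')]
```

On Postgres, the recursive CTE above in the answer achieves the same result with one short index descent per distinct bucket_id instead of a full scan.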

If the visibility map of the table is up to date (i.e. the table is vacuumed often enough), you get index-only scans. That should apply here, since the slow query you demonstrated got an index-only scan, too. (Though that one scans the whole index instead of just the leading entries per bucket.)

This assumes both columns of interest are NOT NULL. Else you have to do more.

If you also have a table bucket with one row per relevant bucket_id, this is faster still:

SELECT b.bucket_id, e.created_at
FROM bucket b
CROSS JOIN LATERAL (
 SELECT e.created_at
 FROM events e
 WHERE e.bucket_id = b.bucket_id
 ORDER BY e.created_at DESC
 LIMIT 1
 ) e
ORDER BY b.bucket_id;
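The same pattern can be tried on a toy dataset. Again SQLite lacks LATERAL, so the top-1 probe per bucket becomes a correlated scalar subquery; table names and values here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bucket (bucket_id INTEGER PRIMARY KEY);
CREATE TABLE events (id INTEGER PRIMARY KEY,
                     bucket_id INTEGER NOT NULL,
                     created_at TEXT NOT NULL);
CREATE INDEX events_bucket_id_created_at_idx
    ON events (bucket_id, created_at DESC);
INSERT INTO bucket VALUES (1), (2), (3);
INSERT INTO events (bucket_id, created_at) VALUES
  (1, '2023-09-24 10:00'), (1, '2023-09-24 12:00'),
  (2, '2023-09-24 11:45'), (3, '2023-09-24 08:15');
""")

# One top-1 probe per bucket row; with the (bucket_id, created_at DESC)
# index, each probe is a single index seek.
rows = conn.execute("""
SELECT b.bucket_id,
       (SELECT e.created_at FROM events e
        WHERE e.bucket_id = b.bucket_id
        ORDER BY e.created_at DESC
        LIMIT 1) AS created_at
FROM bucket b
ORDER BY b.bucket_id
""").fetchall()

print(rows)
# [(1, '2023-09-24 12:00'), (2, '2023-09-24 11:45'), (3, '2023-09-24 08:15')]
```

This is typically the fastest variant because the small bucket table drives the loop, and no recursion is needed.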


Stuck with index on (created_at DESC, bucket_id ASC)

We can work with the additional meta information from your comments:

I know all the buckets I care about have recent events

You can enhance the queries above, but a different angle based on that should perform better:

SELECT DISTINCT ON (bucket_id)
 bucket_id, created_at
FROM events
WHERE created_at > now() - interval '15 minutes' -- adapt as needed
ORDER BY bucket_id, created_at DESC;

This should be fast when limited to the tiny (?) fraction of the most recent rows: Postgres can read just the top entries of your existing index and feed those to DISTINCT ON.
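DISTINCT ON is Postgres-specific. As a portable sketch of the same time-window idea, this SQLite example filters to recent rows first and then takes the maximum per bucket (cutoff, schema, and data are invented):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (bucket_id INTEGER, created_at TEXT)")

now = datetime(2023, 9, 25, 12, 0)
sample = [
    (1, now - timedelta(minutes=5)),
    (1, now - timedelta(hours=2)),    # outside the 15-minute window
    (2, now - timedelta(minutes=1)),
    (2, now - timedelta(minutes=10)),
]
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(b, t.isoformat(sep=' ')) for b, t in sample])

# Restrict to the recent window first, then aggregate per bucket;
# this stands in for Postgres's DISTINCT ON over the filtered rows.
cutoff = (now - timedelta(minutes=15)).isoformat(sep=' ')
rows = conn.execute("""
SELECT bucket_id, max(created_at)
FROM events
WHERE created_at > ?
GROUP BY bucket_id
ORDER BY bucket_id
""", (cutoff,)).fetchall()

print(rows)
# [(1, '2023-09-25 11:55:00'), (2, '2023-09-25 11:59:00')]
```

Note that the old event for bucket 1 falls outside the window and is never scanned, which is exactly why the WHERE clause makes the original index usable.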

answered Sep 25, 2023 at 4:23
  • I do have a buckets table, but does that final query also require the differently ordered index? The inner query is very fast if I stick in a hardcoded bucket_id, but with the cross join it does not complete. This is a read-only connection, so I'm trying to use the existing index. Commented Sep 25, 2023 at 12:12
  • To make it fast, you need an index with leading bucket_id for either query. Or any query, for that matter. Well, there are workarounds for special cases. Like, when you know a minimum (recent) timestamp for each bucket. Or you know that the latest entry for each bucket is recent. So we can find it cheaply from the top of an index with leading created_at. We don't want to traverse billions of entries for a hit, not even once. Commented Sep 25, 2023 at 12:30
  • Thank you for that last hint, because I know all the buckets I care about have recent events i was able to add where ... and e.created_at > now() - interval '15 minutes' to your final example query and got results within 5 seconds. Thanks again, very helpful. Commented Sep 25, 2023 at 13:16
  • @Segfault: Note the addendum. Commented Sep 25, 2023 at 16:34
