I am trying to determine which indexes to use for an SQL query with a WHERE condition and a GROUP BY, which is currently running very slowly.
My query:
SELECT group_id
FROM counter
WHERE ts between timestamp '2014-03-02 00:00:00.0' and timestamp '2014-03-05 12:00:00.0'
GROUP BY group_id
The table currently has 32,000,000 rows. The execution time of the query increases significantly as I widen the time frame.
The table in question looks like this:
CREATE TABLE counter (
id bigserial PRIMARY KEY
, ts timestamp NOT NULL
, group_id bigint NOT NULL
);
I currently have the following indexes, but the performance is still slow:
CREATE INDEX ts_index
ON counter
USING btree
(ts);
CREATE INDEX group_id_index
ON counter
USING btree
(group_id);
CREATE INDEX comp_1_index
ON counter
USING btree
(ts, group_id);
CREATE INDEX comp_2_index
ON counter
USING btree
(group_id, ts);
Running EXPLAIN on the query gives the following result:
"QUERY PLAN"
"HashAggregate (cost=467958.16..467958.17 rows=1 width=4)"
" -> Index Scan using ts_index on counter (cost=0.56..467470.93 rows=194892 width=4)"
" Index Cond: ((ts >= '2014-02-26 00:00:00'::timestamp without time zone) AND (ts <= '2014-02-27 23:59:00'::timestamp without time zone))"
SQL Fiddle with example data: http://sqlfiddle.com/#!15/7492b/1
The Question
Can the performance of this query be improved by adding better indexes, or must I increase the processing power?
Edit 1
PostgreSQL version 9.3.2 is used.
Edit 2
I tried @Erwin's proposal with EXISTS:
SELECT group_id
FROM groups g
WHERE EXISTS (
SELECT 1
FROM counter c
WHERE c.group_id = g.group_id
AND ts BETWEEN timestamp '2014-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
);
But unfortunately this didn't seem to improve the performance. The query plan:
"QUERY PLAN"
"Nested Loop Semi Join (cost=1607.18..371680.60 rows=113 width=4)"
" -> Seq Scan on groups g (cost=0.00..2.33 rows=133 width=4)"
" -> Bitmap Heap Scan on counter c (cost=1607.18..158895.53 rows=60641 width=4)"
" Recheck Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
" -> Bitmap Index Scan on comp_2_index (cost=0.00..1592.02 rows=60641 width=0)"
" Index Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
Edit 3
The query plan for the LATERAL query from ypercube:
"QUERY PLAN"
"Nested Loop (cost=8.98..1200.42 rows=133 width=20)"
" -> Seq Scan on groups g (cost=0.00..2.33 rows=133 width=4)"
" -> Result (cost=8.98..8.99 rows=1 width=0)"
" One-Time Filter: (1ドル IS NOT NULL)"
" InitPlan 1 (returns 1ドル)"
" -> Limit (cost=0.56..4.49 rows=1 width=8)"
" -> Index Only Scan using comp_2_index on counter c (cost=0.56..1098691.21 rows=279808 width=8)"
" Index Cond: ((group_id = 0ドル) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
" InitPlan 2 (returns 2ドル)"
" -> Limit (cost=0.56..4.49 rows=1 width=8)"
" -> Index Only Scan Backward using comp_2_index on counter c_1 (cost=0.56..1098691.21 rows=279808 width=8)"
" Index Cond: ((group_id = 0ドル) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
3 Answers
Another idea that also uses the groups table and a construct called a LATERAL join (for SQL Server fans, this is almost identical to OUTER APPLY). It has the advantage that aggregates can be calculated in the subquery:
SELECT group_id, min_ts, max_ts
FROM groups g, -- note the comma here; it is required
LATERAL
( SELECT MIN(ts) AS min_ts,
MAX(ts) AS max_ts
FROM counter c
WHERE c.group_id = g.group_id
AND c.ts BETWEEN timestamp '2011-03-02 00:00:00'
AND timestamp '2013-03-05 12:00:00'
) x
WHERE min_ts IS NOT NULL ;
A test at SQL Fiddle shows that the query does index scans on the (group_id, ts) index.
Similar plans are produced with two LATERAL joins, one for the min and one for the max, and also with two inline correlated subqueries. These could also be used if you need to show the whole counter rows besides the min and max dates:
SELECT group_id,
min_ts, min_ts_id,
max_ts, max_ts_id
FROM groups g
, LATERAL
( SELECT ts AS min_ts, c.id AS min_ts_id
FROM counter c
WHERE c.group_id = g.group_id
AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
ORDER BY ts ASC
LIMIT 1
) xmin
, LATERAL
( SELECT ts AS max_ts, c.id AS max_ts_id
FROM counter c
WHERE c.group_id = g.group_id
AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
ORDER BY ts DESC
LIMIT 1
) xmax
WHERE min_ts IS NOT NULL ;
- @ypercube I added the query plan for your query to the original question. The query runs in under 50 ms even on large time spans. (uldall, Mar 14, 2014 at 10:24)
Since you have no aggregate in the select list, the GROUP BY is pretty much the same as putting a DISTINCT in the select list, right?
If that is what you want, you might be able to get a fast index lookup on comp_2_index by rewriting this to use a recursive query, as described on the PostgreSQL wiki.
Make a view to efficiently return the distinct group_ids:
create or replace view groups as
WITH RECURSIVE t AS (
SELECT min(counter.group_id) AS group_id
FROM counter
UNION ALL
SELECT ( SELECT min(counter.group_id) AS min
FROM counter
WHERE counter.group_id > t.group_id) AS min
FROM t
WHERE t.group_id IS NOT NULL
)
SELECT t.group_id
FROM t
WHERE t.group_id IS NOT NULL
UNION ALL
SELECT NULL::bigint AS col
WHERE (EXISTS ( SELECT counter.id,
counter.ts,
counter.group_id
FROM counter
WHERE counter.group_id IS NULL));
And then use that view in place of the lookup table in Erwin's EXISTS semi-join.
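For illustration, a sketch of the combined query, using the view above as the lookup table (untested; view and column names as defined above):

SELECT g.group_id
FROM groups g -- the recursive view defined above
WHERE g.group_id IS NOT NULL -- guard against the NULL row the view can emit
AND EXISTS (
 SELECT 1
 FROM counter c
 WHERE c.group_id = g.group_id
 AND c.ts BETWEEN timestamp '2014-03-02 00:00:00'
 AND timestamp '2014-03-05 12:00:00'
 );

The view enumerates the distinct group_ids with a loose index scan, and each EXISTS probe should be able to stop at the first matching row in comp_2_index.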
For only "133 different group_id
's", you could use integer
(or even smallint
). Won't buy much, though, because padding to 8 bytes will eat the rest in your table and possible indexes. Processing of plain integer
is a bit faster, though. More on int4
vs. int2
:
CREATE TABLE counter ( id bigserial PRIMARY KEY , ts timestamp NOT NULL , group_id int NOT NULL );
@Leo: timestamps are stored as 8-byte integers in modern Postgres and can be processed perfectly fast. See:
@ypercube: The index on (group_id, ts) can't help, since there is no condition on group_id in the query.
Your main problem is the massive amount of data that has to be processed:
Index Scan using ts_index on counter (cost=0.56..467470.93 rows=194892 width=4)
You are only interested in the existence of a group_id, not the actual count. There are only 133 different group_ids, so your query can be satisfied with the first hit per group_id in the time frame. Hence my suggestion for an alternative query with an EXISTS expression:
Assuming a lookup table for groups:
SELECT group_id
FROM groups g
WHERE EXISTS (
SELECT 1
FROM counter c
WHERE c.group_id = g.group_id
AND ts BETWEEN timestamp '2014-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
);
Your index comp_2_index on (group_id, ts) becomes instrumental now.
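If no such groups lookup table exists yet, a minimal sketch of one way to create it (a one-time sequential scan over counter):

CREATE TABLE groups AS
SELECT DISTINCT group_id
FROM counter;

Alternatively, the recursive view from the other answer avoids even that sequential scan.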
fiddle
Old sqlfiddle building on ypercube's fiddle in the comments
Here, the query prefers the index on (ts, group_id), but I think that's because of the test setup with "clustered" timestamps. If you remove the indexes with leading ts (more about that), the planner will happily use the index on (group_id, ts) as well, notably in an Index Only Scan.
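For example, to try that in the test setup (only if those indexes are not needed by other queries; index names as defined in the question):

DROP INDEX ts_index;
DROP INDEX comp_1_index;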
If that works, you might not need this other possible improvement: pre-aggregate data in a materialized view to drastically reduce the number of rows. This would make sense in particular if you also need actual counts. Then you pay the cost of processing many rows once, when updating the MV. You could even combine daily and hourly aggregates (two separate tables) and adapt your query to that.
Are time frames in your queries arbitrary? Or mostly on full minutes / hours / days?
CREATE MATERIALIZED VIEW counter_mv AS
SELECT date_trunc('hour', ts) AS hour
, group_id
, count(*)::int AS ct
FROM counter
GROUP BY 1,2
ORDER BY 1,2;
Create the necessary index(es) on counter_mv and adapt your query to work with it. Like:
CREATE INDEX foo ON counter_mv (hour, group_id, ct); -- once
SELECT group_id, sum(ct) AS total_ct
FROM counter_mv
WHERE hour BETWEEN timestamp '2014-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
GROUP BY 1
ORDER BY 2;
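One caveat: a materialized view is a snapshot. To pick up rows added to counter later, it has to be refreshed, which reruns the aggregation once. (PostgreSQL 9.3 only supports the plain, locking form; REFRESH MATERIALIZED VIEW ... CONCURRENTLY was added in 9.4.)

REFRESH MATERIALIZED VIEW counter_mv;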
- I tried several similar things in SQL Fiddle, with 10k rows, but all showed some sequential scan. Does using the groups table make the difference? (ypercubeᵀᴹ, Mar 12, 2014 at 18:27)
- @ypercube: I think so. Also, ANALYZE makes a difference. But indexes on counter even get used without ANALYZE as soon as I introduce the groups table. Point is, without that table, a seqscan is needed anyway to build the set of possible group_ids. I added more to my answer. And thanks for your fiddle! (Erwin Brandstetter, Mar 12, 2014 at 18:48)
- @ErwinBrandstetter That is what I thought as well, and I was very surprised to find out otherwise. Without a LIMIT 1, it can choose a bitmap index scan, which does not benefit from early stopping and takes a lot longer. (But if the table is freshly vacuumed, it might prefer the index-only scan over the bitmap scan, so which behavior you see depends on the vacuum status of the table.) (jjanes, Mar 12, 2014 at 22:50)
- @uldall: Daily aggregates will drastically reduce the number of rows. That should do the trick. But be sure to give the EXISTS query a try. It might be surprisingly fast. It won't work for min/max additionally, though. I would be interested in the resulting performance, if you'd be so kind as to drop a line here. (Erwin Brandstetter, Mar 13, 2014 at 9:44)
- @questionto42 Yes. I added examples and improved the old answer a bit. (Erwin Brandstetter, Feb 3, 2023)
- How many different group_id values are there on the table?
- So you are only interested in the existence of a group_id and not in any count?