Multiple aggregations in select slow on postgres

Question 1

I have a table with columns: id, antenna_id, latitude, longitude. There are two composite indexes on (antenna_id, latitude) and (antenna_id, longitude). When I do a max(latitude) for a specific antenna id(s), the speed is acceptable, but doing a min and max for both latitude and longitude at the same time is very slow.

Using PostgreSQL 12.3

Query


EXPLAIN (analyze, buffers, format text) 
SELECT max(latitude) 
FROM packets 
WHERE antenna_id IN (1,2)
Finalize Aggregate (cost=443017.21..443017.22 rows=1 width=32) (actual time=4373.679..4373.679 rows=1 loops=1)
 Buffers: shared hit=10812 read=16887
 -> Gather (cost=443017.10..443017.21 rows=1 width=32) (actual time=4373.412..4389.032 rows=2 loops=1)
 Workers Planned: 1
 Workers Launched: 1
 Buffers: shared hit=10812 read=16887
 -> Partial Aggregate (cost=442017.10..442017.11 rows=1 width=32) (actual time=4313.576..4313.577 rows=1 loops=2)
 Buffers: shared hit=10809 read=16887
 -> Parallel Index Only Scan using idx_packets_antenna_id_latitude on packets (cost=0.57..433527.51 rows=3395835 width=7) (actual time=0.375..3435.488 rows=2201866 loops=2)
 Index Cond: (antenna_id = ANY ('{1,2}'::integer[]))
 Heap Fetches: 0
 Buffers: shared hit=10809 read=16887
Planning Time: 5.992 ms
JIT:
 Functions: 8
 Options: Inlining false, Optimization false, Expressions true, Deforming true
 Timing: Generation 6.236 ms, Inlining 0.000 ms, Optimization 1.549 ms, Emission 32.058 ms, Total 39.842 ms
Execution Time: 4706.406 ms

The explain looks almost identical for max(longitude), min(latitude) and min(longitude) on their own. Speed is acceptable.

But when I combine the queries

SELECT max(latitude), max(longitude), min(latitude), min(longitude) 
FROM packets 
WHERE antenna_id IN (1,2)

Duration

[2021年03月06日 09:28:30] 1 row retrieved starting from 1 in 5 m 35 s 907 ms (execution: 5 m 35 s 869 ms, fetching: 38 ms)

Explain

Finalize Aggregate (cost=3677020.18..3677020.19 rows=1 width=128)
 -> Gather (cost=3677020.06..3677020.17 rows=1 width=128)
 Workers Planned: 1
 -> Partial Aggregate (cost=3676020.06..3676020.07 rows=1 width=128)
 -> Parallel Seq Scan on packets (cost=0.00..3642080.76 rows=3393930 width=14)
 Filter: (antenna_id = ANY ('{1,2}'::integer[]))
JIT:
 Functions: 7
 Options: Inlining true, Optimization true, Expressions true, Deforming true

EXPLAIN (analyze, buffers, format text) 
SELECT max(latitude), max(longitude), min(latitude), min(longitude) 
FROM packets 
WHERE antenna_id IN (1,2)

has been running for 24h now, and isn't done yet

Indexes

create index idx_packets_antenna_id_time
 on packets (antenna_id, time);
create index idx_packets_antenna_id_longitude
 on packets (antenna_id, longitude);
create index idx_packets_device_id_time
 on packets (device_id, time);
create index idx_packets_antenna_id_latitude
 on packets (antenna_id, latitude);

Data stats

select count(*) from packets
136758098
select count(distinct (antenna_id)) from packets
17558
select antenna_id, count(*) as records 
from packets 
where antenna_id in (1,2) 
group by antenna_id 
order by records desc
1,4361049
2,42683

Question

Why does the second query which does the min and max on the latitude and longitude field not use the indexes? And how can I rewrite the query so that it is faster?

Question 2

All queries that only use max/min on a single column can be supported by one of your indexes. If you apply the aggregate on both columns, you would need an index on (antenna_id, longitude, latitude)

Question 3

The where clause seems to select a substantial percentage of all rows from that table. So the optimizer doesn't think the index will help in looking up the needed rows. That's why the second query doesn't use an index at all. How many different antenna_id values does the table have?

Question 4

Let's create some test data. Looks like your query has about 1% of rows per antenna_id so let's replicate this.

CREATE UNLOGGED TABLE foo( lat FLOAT NOT NULL, lon FLOAT NOT NULL, aid INTEGER NOT NULL );
INSERT INTO foo SELECT random(), random(), random()*100
 FROM generate_series(1,10000000) s;
CREATE INDEX foo_aid_lat ON foo( aid, lat );
CREATE INDEX foo_aid_lon ON foo( aid, lon );
VACUUM ANALYZE foo;
SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid IN (1,2);
 Finalize Aggregate (cost=71572.35..71572.36 rows=1 width=32) (actual time=119.907..125.118 rows=1 loops=1)
 -> Gather (cost=71572.12..71572.33 rows=2 width=32) (actual time=119.648..125.110 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 -> Partial Aggregate (cost=70572.12..70572.13 rows=1 width=32) (actual time=95.595..95.596 rows=1 loops=3)
 -> Parallel Bitmap Heap Scan on foo (cost=4886.47..69687.39 rows=88473 width=16) (actual time=9.532..90.336 rows=66524 loops=3)
 Recheck Cond: (aid = ANY ('{1,2}'::integer[]))
 Heap Blocks: exact=26477
 -> Bitmap Index Scan on foo_aid_lon (cost=0.00..4833.39 rows=212336 width=0) (actual time=20.022..20.023 rows=199572 loops=1)
 Index Cond: (aid = ANY ('{1,2}'::integer[]))
 Planning Time: 0.499 ms
 Execution Time: 125.202 ms

This is really slow. Let's try for one antenna_id.

SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=1;
 Result (cost=1.88..1.89 rows=1 width=32) (actual time=0.192..0.196 rows=1 loops=1)
 InitPlan 1 (returns 0ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.059..0.060 rows=1 loops=1)
 -> Index Only Scan using foo_aid_lat on foo (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.057..0.057 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lat IS NOT NULL))
 Heap Fetches: 0
 InitPlan 2 (returns 1ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.044..0.045 rows=1 loops=1)
 -> Index Only Scan Backward using foo_aid_lat on foo foo_1 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.043..0.044 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lat IS NOT NULL))
 Heap Fetches: 0
 InitPlan 3 (returns 2ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.038..0.038 rows=1 loops=1)
 -> Index Only Scan using foo_aid_lon on foo foo_2 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.037..0.037 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lon IS NOT NULL))
 Heap Fetches: 0
 InitPlan 4 (returns 3ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.042..0.042 rows=1 loops=1)
 -> Index Only Scan Backward using foo_aid_lon on foo foo_3 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.041..0.041 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lon IS NOT NULL))
 Heap Fetches: 0
 Planning Time: 0.504 ms
 Execution Time: 0.277 ms

That's the correct plan, that uses the multicolumn index to compute the max and min. This requires only 1 index lookup per min() or max(), because

SELECT max(lat) where aid=...

is equivalent to

SELECT lat WHERE aid=... ORDER BY lat DESC LIMIT 1

...which is optimizable using an index which contains the rows in pre-sorted order.

The above optimization of max() and min() is basically syntactic sugar, it turns the query into an ORDER BY+LIMIT and puts that into an InitPlan in order to use the index.

But, apparently, it doesn't do this when querying several antenna_ids using "WHERE IN()". Adding a "GROUP BY aid" at the end of the first query doesn't help.

So... Let's query the antenna_ids one at a time then.

SELECT * FROM 
(VALUES (1),(2)) AS v
CROSS JOIN LATERAL (SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=v.column1) x;

It makes a nested loop over the VALUES, and inside the nested loop is the above fast query. It returns the max() and min() for each antenna_id, so to get the global max() and min() you have to wrap that in a subquery and apply max() and min() over the result.

This shouldn't take more than a millisecond unless there is another problem.

Replacing the VALUES above with a generate_series(1,100) that gets the max for the 100 aid's in the table takes about 5ms. Doing it the old fashioned way:

select aid,min(lat),max(lat),min(lon),max(lon) FROM foo group by aid;

takes about 100x longer.

Question 5

Thanks for telling me about CROSS JOIN LATERAL. This query is as quick as what I was hoping for. I however realised I had to filter out latitude and longitudes equal to 0, changing the query to: SELECT * FROM (VALUES (1),(2)) AS v CROSS JOIN LATERAL (SELECT min(latitude),max(latitude),min(longitude),max(longitude) FROM packets WHERE antenna_id=v.column1 AND latitude!=0 AND longitude!=0) x This is however running slow again. Any recommendations?

Question 6

Is zero a special value? If 0 means "unknown" maybe it should be NULL... Can you post the execution plan of the slow query with !=0 ? I've tried it on my test data and I get the same fast plan.

bobflux bobflux 1,7761 gold badge9 silver badges7 bronze badges · Accepted Answer · 2021-03-07 17:43:32Z

Let's create some test data. Looks like your query has about 1% of rows per antenna_id so let's replicate this.

CREATE UNLOGGED TABLE foo( lat FLOAT NOT NULL, lon FLOAT NOT NULL, aid INTEGER NOT NULL );
INSERT INTO foo SELECT random(), random(), random()*100
 FROM generate_series(1,10000000) s;
CREATE INDEX foo_aid_lat ON foo( aid, lat );
CREATE INDEX foo_aid_lon ON foo( aid, lon );
VACUUM ANALYZE foo;
SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid IN (1,2);
 Finalize Aggregate (cost=71572.35..71572.36 rows=1 width=32) (actual time=119.907..125.118 rows=1 loops=1)
 -> Gather (cost=71572.12..71572.33 rows=2 width=32) (actual time=119.648..125.110 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 -> Partial Aggregate (cost=70572.12..70572.13 rows=1 width=32) (actual time=95.595..95.596 rows=1 loops=3)
 -> Parallel Bitmap Heap Scan on foo (cost=4886.47..69687.39 rows=88473 width=16) (actual time=9.532..90.336 rows=66524 loops=3)
 Recheck Cond: (aid = ANY ('{1,2}'::integer[]))
 Heap Blocks: exact=26477
 -> Bitmap Index Scan on foo_aid_lon (cost=0.00..4833.39 rows=212336 width=0) (actual time=20.022..20.023 rows=199572 loops=1)
 Index Cond: (aid = ANY ('{1,2}'::integer[]))
 Planning Time: 0.499 ms
 Execution Time: 125.202 ms

This is really slow. Let's try for one antenna_id.

SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=1;
 Result (cost=1.88..1.89 rows=1 width=32) (actual time=0.192..0.196 rows=1 loops=1)
 InitPlan 1 (returns 0ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.059..0.060 rows=1 loops=1)
 -> Index Only Scan using foo_aid_lat on foo (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.057..0.057 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lat IS NOT NULL))
 Heap Fetches: 0
 InitPlan 2 (returns 1ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.044..0.045 rows=1 loops=1)
 -> Index Only Scan Backward using foo_aid_lat on foo foo_1 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.043..0.044 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lat IS NOT NULL))
 Heap Fetches: 0
 InitPlan 3 (returns 2ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.038..0.038 rows=1 loops=1)
 -> Index Only Scan using foo_aid_lon on foo foo_2 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.037..0.037 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lon IS NOT NULL))
 Heap Fetches: 0
 InitPlan 4 (returns 3ドル)
 -> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.042..0.042 rows=1 loops=1)
 -> Index Only Scan Backward using foo_aid_lon on foo foo_3 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.041..0.041 rows=1 loops=1)
 Index Cond: ((aid = 1) AND (lon IS NOT NULL))
 Heap Fetches: 0
 Planning Time: 0.504 ms
 Execution Time: 0.277 ms

That's the correct plan, that uses the multicolumn index to compute the max and min. This requires only 1 index lookup per min() or max(), because

SELECT max(lat) where aid=...

is equivalent to

SELECT lat WHERE aid=... ORDER BY lat DESC LIMIT 1

...which is optimizable using an index which contains the rows in pre-sorted order.

The above optimization of max() and min() is basically syntactic sugar, it turns the query into an ORDER BY+LIMIT and puts that into an InitPlan in order to use the index.

But, apparently, it doesn't do this when querying several antenna_ids using "WHERE IN()". Adding a "GROUP BY aid" at the end of the first query doesn't help.

So... Let's query the antenna_ids one at a time then.

SELECT * FROM 
(VALUES (1),(2)) AS v
CROSS JOIN LATERAL (SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=v.column1) x;

It makes a nested loop over the VALUES, and inside the nested loop is the above fast query. It returns the max() and min() for each antenna_id, so to get the global max() and min() you have to wrap that in a subquery and apply max() and min() over the result.

This shouldn't take more than a millisecond unless there is another problem.

Replacing the VALUES above with a generate_series(1,100) that gets the max for the 100 aid's in the table takes about 5ms. Doing it the old fashioned way:

select aid,min(lat),max(lat),min(lon),max(lon) FROM foo group by aid;

takes about 100x longer.

Thanks for telling me about CROSS JOIN LATERAL. This query is as quick as what I was hoping for. I however realised I had to filter out latitude and longitudes equal to 0, changing the query to: SELECT * FROM (VALUES (1),(2)) AS v CROSS JOIN LATERAL (SELECT min(latitude),max(latitude),min(longitude),max(longitude) FROM packets WHERE antenna_id=v.column1 AND latitude!=0 AND longitude!=0) x This is however running slow again. Any recommendations?
Is zero a special value? If 0 means "unknown" maybe it should be NULL... Can you post the execution plan of the slow query with !=0 ? I've tried it on my test data and I get the same fast plan.

Stack Exchange Network

Multiple aggregations in select slow on postgres

Indexes

Data stats

Question

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Multiple aggregations in select slow on postgres

Indexes

Data stats

Question

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions