I have a table with columns: id, antenna_id, latitude, longitude. There are two composite indexes on (antenna_id, latitude) and (antenna_id, longitude). When I do a max(latitude) for a specific antenna id(s), the speed is acceptable, but doing a min and max for both latitude and longitude at the same time is very slow.
Using PostgreSQL 12.3
Query
EXPLAIN (analyze, buffers, format text)
SELECT max(latitude)
FROM packets
WHERE antenna_id IN (1,2)
Finalize Aggregate (cost=443017.21..443017.22 rows=1 width=32) (actual time=4373.679..4373.679 rows=1 loops=1)
Buffers: shared hit=10812 read=16887
-> Gather (cost=443017.10..443017.21 rows=1 width=32) (actual time=4373.412..4389.032 rows=2 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=10812 read=16887
-> Partial Aggregate (cost=442017.10..442017.11 rows=1 width=32) (actual time=4313.576..4313.577 rows=1 loops=2)
Buffers: shared hit=10809 read=16887
-> Parallel Index Only Scan using idx_packets_antenna_id_latitude on packets (cost=0.57..433527.51 rows=3395835 width=7) (actual time=0.375..3435.488 rows=2201866 loops=2)
Index Cond: (antenna_id = ANY ('{1,2}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=10809 read=16887
Planning Time: 5.992 ms
JIT:
Functions: 8
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 6.236 ms, Inlining 0.000 ms, Optimization 1.549 ms, Emission 32.058 ms, Total 39.842 ms
Execution Time: 4706.406 ms
The explain looks almost identical for max(longitude), min(latitude) and min(longitude) on their own. Speed is acceptable.
But when I combine the queriesSELECT max(latitude), max(longitude), min(latitude), min(longitude)
FROM packets
WHERE antenna_id IN (1,2)
Duration
[2021年03月06日 09:28:30] 1 row retrieved starting from 1 in 5 m 35 s 907 ms (execution: 5 m 35 s 869 ms, fetching: 38 ms)
Finalize Aggregate (cost=3677020.18..3677020.19 rows=1 width=128)
-> Gather (cost=3677020.06..3677020.17 rows=1 width=128)
Workers Planned: 1
-> Partial Aggregate (cost=3676020.06..3676020.07 rows=1 width=128)
-> Parallel Seq Scan on packets (cost=0.00..3642080.76 rows=3393930 width=14)
Filter: (antenna_id = ANY ('{1,2}'::integer[]))
JIT:
Functions: 7
Options: Inlining true, Optimization true, Expressions true, Deforming true
EXPLAIN (analyze, buffers, format text)
SELECT max(latitude), max(longitude), min(latitude), min(longitude)
FROM packets
WHERE antenna_id IN (1,2)
has been running for 24h now, and isn't done yet
Indexes
create index idx_packets_antenna_id_time
on packets (antenna_id, time);
create index idx_packets_antenna_id_longitude
on packets (antenna_id, longitude);
create index idx_packets_device_id_time
on packets (device_id, time);
create index idx_packets_antenna_id_latitude
on packets (antenna_id, latitude);
Data stats
select count(*) from packets
136758098
select count(distinct (antenna_id)) from packets
17558
select antenna_id, count(*) as records
from packets
where antenna_id in (1,2)
group by antenna_id
order by records desc
1,4361049
2,42683
Question
Why does the second query which does the min and max on the latitude and longitude field not use the indexes? And how can I rewrite the query so that it is faster?
1 Answer 1
Let's create some test data. Looks like your query has about 1% of rows per antenna_id so let's replicate this.
CREATE UNLOGGED TABLE foo( lat FLOAT NOT NULL, lon FLOAT NOT NULL, aid INTEGER NOT NULL );
INSERT INTO foo SELECT random(), random(), random()*100
FROM generate_series(1,10000000) s;
CREATE INDEX foo_aid_lat ON foo( aid, lat );
CREATE INDEX foo_aid_lon ON foo( aid, lon );
VACUUM ANALYZE foo;
SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid IN (1,2);
Finalize Aggregate (cost=71572.35..71572.36 rows=1 width=32) (actual time=119.907..125.118 rows=1 loops=1)
-> Gather (cost=71572.12..71572.33 rows=2 width=32) (actual time=119.648..125.110 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=70572.12..70572.13 rows=1 width=32) (actual time=95.595..95.596 rows=1 loops=3)
-> Parallel Bitmap Heap Scan on foo (cost=4886.47..69687.39 rows=88473 width=16) (actual time=9.532..90.336 rows=66524 loops=3)
Recheck Cond: (aid = ANY ('{1,2}'::integer[]))
Heap Blocks: exact=26477
-> Bitmap Index Scan on foo_aid_lon (cost=0.00..4833.39 rows=212336 width=0) (actual time=20.022..20.023 rows=199572 loops=1)
Index Cond: (aid = ANY ('{1,2}'::integer[]))
Planning Time: 0.499 ms
Execution Time: 125.202 ms
This is really slow. Let's try for one antenna_id.
SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=1;
Result (cost=1.88..1.89 rows=1 width=32) (actual time=0.192..0.196 rows=1 loops=1)
InitPlan 1 (returns 0ドル)
-> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.059..0.060 rows=1 loops=1)
-> Index Only Scan using foo_aid_lat on foo (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.057..0.057 rows=1 loops=1)
Index Cond: ((aid = 1) AND (lat IS NOT NULL))
Heap Fetches: 0
InitPlan 2 (returns 1ドル)
-> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.044..0.045 rows=1 loops=1)
-> Index Only Scan Backward using foo_aid_lat on foo foo_1 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.043..0.044 rows=1 loops=1)
Index Cond: ((aid = 1) AND (lat IS NOT NULL))
Heap Fetches: 0
InitPlan 3 (returns 2ドル)
-> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.038..0.038 rows=1 loops=1)
-> Index Only Scan using foo_aid_lon on foo foo_2 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.037..0.037 rows=1 loops=1)
Index Cond: ((aid = 1) AND (lon IS NOT NULL))
Heap Fetches: 0
InitPlan 4 (returns 3ドル)
-> Limit (cost=0.43..0.47 rows=1 width=8) (actual time=0.042..0.042 rows=1 loops=1)
-> Index Only Scan Backward using foo_aid_lon on foo foo_3 (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.041..0.041 rows=1 loops=1)
Index Cond: ((aid = 1) AND (lon IS NOT NULL))
Heap Fetches: 0
Planning Time: 0.504 ms
Execution Time: 0.277 ms
That's the correct plan, that uses the multicolumn index to compute the max and min. This requires only 1 index lookup per min() or max(), because
SELECT max(lat) where aid=...
is equivalent to
SELECT lat WHERE aid=... ORDER BY lat DESC LIMIT 1
...which is optimizable using an index which contains the rows in pre-sorted order.
The above optimization of max() and min() is basically syntactic sugar, it turns the query into an ORDER BY+LIMIT and puts that into an InitPlan in order to use the index.
But, apparently, it doesn't do this when querying several antenna_ids using "WHERE IN()". Adding a "GROUP BY aid" at the end of the first query doesn't help.
So... Let's query the antenna_ids one at a time then.
SELECT * FROM
(VALUES (1),(2)) AS v
CROSS JOIN LATERAL (SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=v.column1) x;
It makes a nested loop over the VALUES, and inside the nested loop is the above fast query. It returns the max() and min() for each antenna_id, so to get the global max() and min() you have to wrap that in a subquery and apply max() and min() over the result.
This shouldn't take more than a millisecond unless there is another problem.
Replacing the VALUES above with a generate_series(1,100) that gets the max for the 100 aid's in the table takes about 5ms. Doing it the old fashioned way:
select aid,min(lat),max(lat),min(lon),max(lon) FROM foo group by aid;
takes about 100x longer.
-
Thanks for telling me about CROSS JOIN LATERAL. This query is as quick as what I was hoping for. I however realised I had to filter out latitude and longitudes equal to 0, changing the query to: SELECT * FROM (VALUES (1),(2)) AS v CROSS JOIN LATERAL (SELECT min(latitude),max(latitude),min(longitude),max(longitude) FROM packets WHERE antenna_id=v.column1 AND latitude!=0 AND longitude!=0) x This is however running slow again. Any recommendations?JPM– JPM2021年03月07日 21:20:16 +00:00Commented Mar 7, 2021 at 21:20
-
Is zero a special value? If 0 means "unknown" maybe it should be NULL... Can you post the execution plan of the slow query with !=0 ? I've tried it on my test data and I get the same fast plan.bobflux– bobflux2021年03月07日 21:25:41 +00:00Commented Mar 7, 2021 at 21:25
(antenna_id, longitude, latitude)
antenna_id
values does the table have?