Following scenario: I have a table roads
, which basically contains the road network of a whole country extracted from the OSM line
table and a table points
containing millions of GPS tracked points which are part of tracks. Both have a id
column, a PostGIS geometry geom
column with types LineString
(roads) and Point
(points) as well a spatial index on those columns. The reference system has SRID 4326. The geometry index is clustered.
Versions: Postgres 9.5, PostGIS 2.2.2
I want to find out which road is the closest for each point. I came up with the following as proposed here: How to find the nearest point by using PostGIS function?
SELECT point.id, road.id, ST_Distance(road.geom, point.geom) AS Distance
FROM roads AS road, points AS point
WHERE road.id = (
SELECT r.id
FROM roads as r, points as p
WHERE point.id = p.id
ORDER BY ST_Distance(r.geom, p.geom) ASC LIMIT 1
);
The result of EXPLAIN ANALYZE
where the result is limited to 1 point:
"Limit (cost=2171519.51..4343038.72 rows=1 width=186) (actual time=7944.471..7944.471 rows=1 loops=1)"
" -> Nested Loop (cost=2171519.51..4578064121748.85 rows=2108230 width=186) (actual time=7944.470..7944.470 rows=1 loops=1)"
" -> Seq Scan on points point (cost=0.00..78098.30 rows=2108230 width=36) (actual time=0.006..0.006 rows=1 loops=1)"
" -> Index Scan using roads_id_index on roads road (cost=2171519.51..2171519.95 rows=1 width=150) (actual time=0.010..0.010 rows=1 loops=1)"
" Index Cond: (id = (SubPlan 1))"
" SubPlan 1"
" -> Limit (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
" -> Sort (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
" Sort Key: (st_distance(r.geom, p.geom))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Nested Loop (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
" -> Index Scan using points_pkey on points p (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
" Index Cond: (point.id = id)"
" -> Seq Scan on roads r (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
" SubPlan 1"
" -> Limit (cost=2171519.07..2171519.07 rows=1 width=182) (actual time=7944.438..7944.438 rows=1 loops=1)"
" -> Sort (cost=2171519.07..2189421.96 rows=7161155 width=182) (actual time=7944.437..7944.437 rows=1 loops=1)"
" Sort Key: (st_distance(r.geom, p.geom))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Nested Loop (cost=0.43..2135713.30 rows=7161155 width=182) (actual time=0.034..7046.628 rows=7161026 loops=1)"
" -> Index Scan using points_pkey on points p (cost=0.43..8.45 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)"
" Index Cond: (point.id = id)"
" -> Seq Scan on de_road_network r (cost=0.00..273804.55 rows=7161155 width=150) (actual time=0.003..1455.615 rows=7161026 loops=1)"
"Planning time: 0.426 ms"
"Execution time: 7944.560 ms"
This query works fine, but i have an efficiency problem.
The query above takes more than 7 seconds for one point. I think that there is no way to speed this up with the index, since the ST_Distance
function cannot use the index. Anyway, since i would like to do this for (~ 3) millions points i would like to speed up the query or to develop a faster one.
Do you have any suggestions on that ?
Some thoughts:
- Using the
&&
oeprator, which uses the index. Changing theWHERE
clause of the sub-query above to:WHERE point.id = p.id AND p.geom && r.geom
results in a massive increase in speed. See: Nearest Neighbor problem in Postgis 2.0 using GIST Index (<-> function) .
Is this solution accurate ? What happens if a point p
has a closest road r1
but p
is not in the bounding box of r1
but in the bounding box of r2
? Is this scenario feasible ?
- Using the
<->
operator: I tried to replace theST_Distance
with the<->
operator but without success. The spatial index is not involved. I guess it is because
Index only kicks in if one of the geometries is a constant (not in a subquery/cte). e.g. 'SRID=3005;POINT(1011102 450541)'::geometry instead of a.geom
Including a pre-filtering step using other postgis functions ? E.g. with the
ST_DWithin
function. (See: Find nearest neighbours faster in PostGIS). How would a query look like if i want to specify the distance in meters ?Would it probably make sense to turn this problem into a point in polygon problem by transforming each road into a polygon with a fixed width followed by a point in polygon test ? I guess a lot of points would only match one polygon. And for those which do not, i could make the distance test.
Get an approximative solution (e.g. Using a radial sweep-line as proposed here: Find Nearest Line Segments to Point)
Solution - Final Query: Using CROSS JOIN LATERAL
(Credits to Tyler)
SELECT p.id AS point_id, b.id AS road_id, ST_DISTANCE(b.geom, p.geom) AS distance
FROM points p
CROSS JOIN LATERAL (
SELECT r.id AS id, r.geom AS geom
FROM roads r
ORDER BY r.geom <-> p.geom
LIMIT 1
) b;
-
could you help me plz. I have similar problem and can't get you request work. Do you have skype or could you give me your e-mail.Dmitry Bubnenkov– Dmitry Bubnenkov2017年04月12日 08:41:00 +00:00Commented Apr 12, 2017 at 8:41
1 Answer 1
(削除) Would you be able to add the result of putting "EXPLAIN ANALYZE" before the query in your question? Then I can update my answer with suggestions. (削除ここまで)
Of course, you will need GIST indexes on your table's geometry fields and vacuum analyze first.
CREATE INDEX ON road USING gist (geom);
CREATE INDEX ON point USING gist (geom);
VACUUM ANALYZE road;
VACUUM ANALYZE point;
You could try something like this, at the very least it might give you some ideas which may result in you finding your answer.
SELECT p.id, cjl.id
FROM point p
CROSS JOIN LATERAL (
SELECT r.id
FROM road r
ORDER BY r.geom <-> p.geom
LIMIT 1
) cjl
WHERE p.id = 1
Explanation:
I personally enjoy using CROSS JOIN LATERAL because I can pass in the parent table's column to filter and limit the results. The <-> operator is used to increase performance for doing nearest neighbor approximate distance ordering - the performance gains far outweigh the "approximate" distance in terms of accuracy.
Resources:
http://postgis.net/docs/geometry_distance_knn.html http://postgis.net/docs/geometry_distance_centroid.html https://www.postgresql.org/docs/9.3/static/queries-table-expressions.html#QUERIES-LATERAL
-
Thank you very much. Your suggestion of using
CROSS JOIN LATERAL
is perfect and i am able to process 2 million points in < 9 minutes with it. What an incredibly awesome feature. Since i am using Postgres 9.5 the<->
operator should give the real distances on the geometry according to postgis.net/docs/geometry_distance_knn.htmlJohannes H– Johannes H2016年08月19日 15:10:59 +00:00Commented Aug 19, 2016 at 15:10 -
Just curious, in your first example you said the query took 7 seconds to complete. How long did a similar query take after you made these changes?Tyler– Tyler2016年08月21日 00:45:21 +00:00Commented Aug 21, 2016 at 0:45
-
Using the
CROSS JOIN LATERAL
query takes approx. 11 msec for one point (LIMIT 1
), It uses a nested loop with a sequential scan in thepoints
table and a spatial index scan on theroads
table. The inaccurate query involving the&&
operator i mentioned takes 11-12 msec for one point.Johannes H– Johannes H2016年08月21日 14:12:51 +00:00Commented Aug 21, 2016 at 14:12
Explore related questions
See similar questions with these tags.