I have two point tables in SRID 4326: Table 1. unjoined_pts & Table 2. midpts. I want to join the attributes from the unjoined_pts table to the closest midpts table based on a few conditions. The query has yet to complete after sometime and I am not sure if I have fully optimized my code or if something is missing from my SQL query. I have spatial indices, vacuumed, and performed an ST_DWithin to limit the search area to only those points that are a maximum of 848 m away. The unjoined_pts table has 9,627 points and the midpts table has 1.98 million points.
CREATE INDEX unjoined_pts_geom_idx ON public.unjoined_pts USING GIST (geom);
CREATE INDEX midpts_geom_idx ON public.midpts USING GIST (geom);
-- functional index of points
CREATE INDEX ON unjoined_pts USING gist(geography(geom));
CREATE INDEX ON midpts USING gist(geography(geom));
VACUUM ANALYZE public.midpts;
VACUUM ANALYZE public.unjoined_pts;
EXPLAIN CREATE TABLE nearest AS
WITH knn_midpoints AS (
SELECT p.lic_li_no, p.orig_licli, p.descriptor, p.attribute1, p.attribute2, ....,
p.attributeN, closest_midpoint.geom
FROM public.unjoined_pts p
CROSS JOIN LATERAL (
SELECT m.geom, m.lic_li_no, m.orig_licli, m.descriptor
FROM public.midpts m
WHERE ST_DWithin(p.geom::geography, m.geom::geography, 848)
AND p.lic_li_no = m.lic_li_no -- varchar (length=30)
OR p.lic_li_no = m.orig_licli -- varchar (length=30)
AND p.descriptor != m.descriptor -- varchar (length=50)
ORDER BY
p.geom::geography <-> m.geom::geography -- nearest midpoint for each unjoined point
LIMIT 1 -- limit to only a single midpoint location
) AS closest_midpoint
)
SELECT * FROM knn_midpoints;
EXPLAIN results from a subset of the unjoined_pts, nut I used the same number of midpts. I removed the ST_DWithin. enter image description here
1 Answer 1
First of all, why do you need to cast geoms to geography? At a distance of 848 m, there will be absolutely minuscule differences between spherical and planar systems.
Then I will try to avoid cross-join. Simple join will do just fine I thing. Something like this should return points that satisfy spatial and other conditions. It should be pretty quick to iterate over 9k or so points. Then you can grab them and left join them to your original table with 2M points (based on id or spatially...)
select p.lic_li_no, p.orig_licli, p.descriptor, p.attribute1, p.attribute2,...., p.attributeN, closest_midpoint.geom
from public.unjoined_pts p
left join lateral (
select m.geom, m.lic_li_no, m.orig_licli, m.descriptor
from public.midpts m
where ST_DWithin(p.geom, m.geom, 848)
and (p.lic_li_no = m.lic_li_no or p.lic_li_no = m.orig_licli)
and p.descriptor != m.descriptor
order by p.geom <-> m.geom
limit 1) closeset_midpoint on true
--- EDIT ---
These indexes should also help:
create index midpts_licli_ix on midpts (lic_li_no);
create index midpts_orig_licli_ix on midpts (orig_licli);
-
still doesn't return any results. I have been letting it run for more than 1/2hr.MJM– MJM2023年03月15日 17:18:17 +00:00Commented Mar 15, 2023 at 17:18
-
@MJM OK, what happens if you factor out non spatial conditions?DavidP– DavidP2023年03月16日 06:44:18 +00:00Commented Mar 16, 2023 at 6:44
-
@MJM At a closer look at where condition there should be parentheses to properly group logical predicates. Please try to put parentheses around p.lic_li_no = m.lic_li_no or p.lic_li_no = m.orig_licli.DavidP– DavidP2023年03月16日 07:16:32 +00:00Commented Mar 16, 2023 at 7:16
-
Query returned successfully in 20 min 39 secs. 9627 entries.MJM– MJM2023年03月16日 20:51:07 +00:00Commented Mar 16, 2023 at 20:51
-
@MJM it is pretty slow in my opinion. I think there is still room for optimization... But if it is god enough for you...DavidP– DavidP2023年03月16日 22:14:09 +00:00Commented Mar 16, 2023 at 22:14
GEOGRAPHY
is not covered by the index - you'd need a functional index... USING GIST ( (geom::GEOGRAPHY) );
.ST_DWithin
is just noise with a cost for the (K)NN search - better remove it.EXPLAIN
of your request? Or even better, anEXPLAIN ANALYZE
using smaller tables that you create but sampling the big ones ?