I have a very large dataset contains over 700 million points and a polygon dataset as a buffer zone.
My task is to extract all points inside the buffer zone and create a new table.
Below is my code. I test it with a small point dataset and it works fine.
create table schema1.result as
select point.* from
schema1.site as point, schema2.buffer as poly
Where ST_Intersects(point.geo_loc,poly.wkb_geometry);
Unfortunately, the query lasted for 1 day and showed no signs to finish.
Is there any advice to optimise my code to speed up the query?
Update: The output of Explain
"Nested Loop (cost=0.41..17773703.88 rows=6789472 width=208)"
" -> Seq Scan on buffer poly (cost=0.00..18.50 rows=850 width=32)"
" -> Index Scan using idx_site on site point (cost=0.41..20902.23 rows=799 width=208)"
" Index Cond: (geo_loc && poly.wkb_geometry)"
" Filter: st_intersects(geo_loc, poly.wkb_geometry)"
"JIT:"
" Functions: 6"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
2 Answers 2
I think what's going on here is that when you write:
from schema1.site as point, schema2.buffer as poly
PostgreSQL is doing a CROSS JOIN
between the two tables. When multiple tables are listed in the FROM
clause postgres uses a CROSS JOIN
source that results in a table with a number of rows equal to the Cartesian product of the two tables source.
to avoid that you can use WHERE EXISTS
as:
SELECT
*
FROM schema1.site AS point_table
WHERE EXISTS(
SELECT 1
FROM schema2.buffer AS poly_table
where ST_Intersects(point_table.geom, poly_table.geom)
);
This allows PostgreSQL to calculate only the spatial join and not have to build a temporary table of the full cross join which takes forever.
-
Note that the assumption is to create a new table; PG should handle this elegantly without writing a temporary merge table first; also, there are no hash join nodes in the plan. The
CROSS JOIN
should get translated to an (usually much faster)INNER JOIN
with the givenWHERE
filter, but if the polygons overlap, the result will have (costly) duplicates. This also invokes the subquery 700M times, run against the less effective polygon index. I do love myself anEXISTS
whenever possible, but I suspect it is not going to beat aJOIN
if OP can save theDISTINCT
.geozelot– geozelot2020年10月29日 08:58:26 +00:00Commented Oct 29, 2020 at 8:58 -
If no join qualifier is specified, the default is to compute an INNER join.dr_jts– dr_jts2020年10月29日 14:54:39 +00:00Commented Oct 29, 2020 at 14:54
-
@dr_jts is my answer above not an improvement over the original method used in the question or are you pointing out that it's not as good as the answer you posted? Also, I'm not seeing anything on google for "postgres join qualifier" is there a specific definition of this term?Hugh_Kelley– Hugh_Kelley2020年10月29日 15:10:13 +00:00Commented Oct 29, 2020 at 15:10
-
@Hugh_Kelley I was just pointing out that if no JOIN type is specified, then the default is
INNER
. Postgres reference is here: postgresql.org/docs/current/…. And actually I do suspect that using the point table as the driving table is going to be less efficient than the reverse.dr_jts– dr_jts2020年10月29日 18:47:17 +00:00Commented Oct 29, 2020 at 18:47 -
1@dr_jts, thanks, not intending to be argumentative, trying to understand what's going on with the query. I need to get a better conceptual understanding of the spatial join operations.Hugh_Kelley– Hugh_Kelley2020年10月30日 13:02:10 +00:00Commented Oct 30, 2020 at 13:02
It might be faster to use the polygons as the driving table, using the point table index to filter down the large number of records. This allows PostGIS to optimize the ST_Intersects
spatial predicate by preparing each polygon.
This can be forced using LATERAL
:
SELECT pt.* FROM schema2.buffer AS poly
JOIN LATERAL (SELECT * FROM schema1.site) AS pt
ON ST_Intersects(poly.wkb_geometry, pt.geo_loc);
If the polygons have a large number of vertices, it can be faster to use ST_Subdivide
to fragment them before doing the point query:
WITH poly AS (
SELECT ST_Subdivide(wkb_geometry) AS geom FROM schema2.buffer
)
SELECT pt.* FROM poly
JOIN LATERAL (SELECT * FROM schema1.site) AS pt
ON ST_Intersects(poly.geom, pt.geo_loc);
-
+1 for subdivision, if applicable here. At least with PG>10, the planner should choose the more effective (larger) index no matter the written
JOIN
order. That's also what OPsEXPLAIN ANALYZE
suggests.geozelot– geozelot2020年10月29日 09:02:40 +00:00Commented Oct 29, 2020 at 9:02 -
@dr_jts Thanks for your advice. I also suppose that to use the polygons as the driving table can be a bit faster, but somehow this query took a bit longer than answer from Hugh_Kelley. But anyway, LATERAL is quite a novel join method for me to learn.Scorpioooooon21– Scorpioooooon212020年11月04日 05:55:08 +00:00Commented Nov 4, 2020 at 5:55
-
Oh, interesting. Relative performance might depend on the nature of the data.dr_jts– dr_jts2020年11月04日 17:07:22 +00:00Commented Nov 4, 2020 at 17:07
point.geo_loc
and/orpoly.wkb_geometry
? Could you show the output of explain with the query?ANALYZE
d the tables after creating the indices? Also, someone else may correct me, but you could addAND ST_DWITHIN(point.geo_loc,poly.wkb_geometry, 1)
to the where and it may use the spatial indices more efficiently/at all.ST_NPoints
of your polygons? How many are there? Do they all have a regular shape?JOIN
will fetch it for each match; you'd need to add aDISTINCT
which will be significantly slower. In that case, anEXISTS
may indeed be the better choice. If, however, a point always only intersects with one polygon, aJOIN
is likely the better plan.