PostGIS Spatial Relationships query performance

Question 1

I have a very large dataset contains over 700 million points and a polygon dataset as a buffer zone.

My task is to extract all points inside the buffer zone and create a new table.

Below is my code. I test it with a small point dataset and it works fine.

create table schema1.result as
select point.* from
schema1.site as point, schema2.buffer as poly
Where ST_Intersects(point.geo_loc,poly.wkb_geometry);

Unfortunately, the query lasted for 1 day and showed no signs to finish.

Is there any advice to optimise my code to speed up the query?

Update: The output of Explain

"Nested Loop (cost=0.41..17773703.88 rows=6789472 width=208)"

" -> Seq Scan on buffer poly (cost=0.00..18.50 rows=850 width=32)"

" -> Index Scan using idx_site on site point (cost=0.41..20902.23 rows=799 width=208)"

" Index Cond: (geo_loc && poly.wkb_geometry)"

" Filter: st_intersects(geo_loc, poly.wkb_geometry)"

"JIT:"

" Functions: 6"

" Options: Inlining true, Optimization true, Expressions true, Deforming true"

Question 2

Do you have a GiST index on point.geo_loc and/or poly.wkb_geometry? Could you show the output of explain with the query?

Question 3

Hi @MikeT, Thanks for your reply. I believe I have created index on both point.geo_loc and poly.wkb_geometry. Also, I have added the Explain output to the original post.

Question 4

have you ANALYZEd the tables after creating the indices? Also, someone else may correct me, but you could add AND ST_DWITHIN(point.geo_loc,poly.wkb_geometry, 1) to the where and it may use the spatial indices more efficiently/at all.

Question 5

What is the min, max and avg ST_NPoints of your polygons? How many are there? Do they all have a regular shape?

Question 6

And most importantly, do they overlap? Since, if a point may be found in more than one poly, a JOIN will fetch it for each match; you'd need to add a DISTINCT which will be significantly slower. In that case, an EXISTS may indeed be the better choice. If, however, a point always only intersects with one polygon, a JOIN is likely the better plan.

Question 7

I think what's going on here is that when you write:

from schema1.site as point, schema2.buffer as poly

PostgreSQL is doing a CROSS JOIN between the two tables. When multiple tables are listed in the FROM clause postgres uses a CROSS JOIN source that results in a table with a number of rows equal to the Cartesian product of the two tables source.

to avoid that you can use WHERE EXISTS as:

SELECT
 *
FROM schema1.site AS point_table
WHERE EXISTS(
 SELECT 1 
 FROM schema2.buffer AS poly_table
 where ST_Intersects(point_table.geom, poly_table.geom)
);

This allows PostgreSQL to calculate only the spatial join and not have to build a temporary table of the full cross join which takes forever.

Question 8

Note that the assumption is to create a new table; PG should handle this elegantly without writing a temporary merge table first; also, there are no hash join nodes in the plan. The CROSS JOIN should get translated to an (usually much faster) INNER JOIN with the given WHERE filter, but if the polygons overlap, the result will have (costly) duplicates. This also invokes the subquery 700M times, run against the less effective polygon index. I do love myself an EXISTS whenever possible, but I suspect it is not going to beat a JOIN if OP can save the DISTINCT.

Question 9

If no join qualifier is specified, the default is to compute an INNER join.

Question 10

@dr_jts is my answer above not an improvement over the original method used in the question or are you pointing out that it's not as good as the answer you posted? Also, I'm not seeing anything on google for "postgres join qualifier" is there a specific definition of this term?

Question 11

@Hugh_Kelley I was just pointing out that if no JOIN type is specified, then the default is INNER. Postgres reference is here: postgresql.org/docs/current/…. And actually I do suspect that using the point table as the driving table is going to be less efficient than the reverse.

Question 12

@dr_jts, thanks, not intending to be argumentative, trying to understand what's going on with the query. I need to get a better conceptual understanding of the spatial join operations.

Question 13

It might be faster to use the polygons as the driving table, using the point table index to filter down the large number of records. This allows PostGIS to optimize the ST_Intersects spatial predicate by preparing each polygon.

This can be forced using LATERAL:

SELECT pt.* FROM schema2.buffer AS poly
JOIN LATERAL (SELECT * FROM schema1.site) AS pt 
 ON ST_Intersects(poly.wkb_geometry, pt.geo_loc);

If the polygons have a large number of vertices, it can be faster to use ST_Subdivide to fragment them before doing the point query:

WITH poly AS (
 SELECT ST_Subdivide(wkb_geometry) AS geom FROM schema2.buffer 
)
SELECT pt.* FROM poly
JOIN LATERAL (SELECT * FROM schema1.site) AS pt 
 ON ST_Intersects(poly.geom, pt.geo_loc);

Question 14

+1 for subdivision, if applicable here. At least with PG>10, the planner should choose the more effective (larger) index no matter the written JOIN order. That's also what OPs EXPLAIN ANALYZE suggests.

Question 15

@dr_jts Thanks for your advice. I also suppose that to use the polygons as the driving table can be a bit faster, but somehow this query took a bit longer than answer from Hugh_Kelley. But anyway, LATERAL is quite a novel join method for me to learn.

Question 16

Oh, interesting. Relative performance might depend on the nature of the data.

Hugh_Kelley Hugh_Kelley 7792 gold badges7 silver badges22 bronze badges · Accepted Answer · 2020-10-29 00:14:42Z

6

I think what's going on here is that when you write:

from schema1.site as point, schema2.buffer as poly

PostgreSQL is doing a CROSS JOIN between the two tables. When multiple tables are listed in the FROM clause postgres uses a CROSS JOIN source that results in a table with a number of rows equal to the Cartesian product of the two tables source.

to avoid that you can use WHERE EXISTS as:

SELECT
 *
FROM schema1.site AS point_table
WHERE EXISTS(
 SELECT 1 
 FROM schema2.buffer AS poly_table
 where ST_Intersects(point_table.geom, poly_table.geom)
);

This allows PostgreSQL to calculate only the spatial join and not have to build a temporary table of the full cross join which takes forever.

Share

Improve this answer

edited Oct 30, 2020 at 13:11

answered Oct 29, 2020 at 0:14

Hugh_Kelley's user avatar

Hugh_Kelley Hugh_Kelley

7792 gold badges7 silver badges22 bronze badges

6

Note that the assumption is to create a new table; PG should handle this elegantly without writing a temporary merge table first; also, there are no hash join nodes in the plan. The CROSS JOIN should get translated to an (usually much faster) INNER JOIN with the given WHERE filter, but if the polygons overlap, the result will have (costly) duplicates. This also invokes the subquery 700M times, run against the less effective polygon index. I do love myself an EXISTS whenever possible, but I suspect it is not going to beat a JOIN if OP can save the DISTINCT.

geozelot
– geozelot

2020年10月29日 08:58:26 +00:00
Commented Oct 29, 2020 at 8:58
If no join qualifier is specified, the default is to compute an INNER join.

dr_jts
– dr_jts

2020年10月29日 14:54:39 +00:00
Commented Oct 29, 2020 at 14:54
@dr_jts is my answer above not an improvement over the original method used in the question or are you pointing out that it's not as good as the answer you posted? Also, I'm not seeing anything on google for "postgres join qualifier" is there a specific definition of this term?

Hugh_Kelley
– Hugh_Kelley

2020年10月29日 15:10:13 +00:00
Commented Oct 29, 2020 at 15:10
@Hugh_Kelley I was just pointing out that if no JOIN type is specified, then the default is INNER. Postgres reference is here: postgresql.org/docs/current/…. And actually I do suspect that using the point table as the driving table is going to be less efficient than the reverse.

dr_jts
– dr_jts

2020年10月29日 18:47:17 +00:00
Commented Oct 29, 2020 at 18:47
1

@dr_jts, thanks, not intending to be argumentative, trying to understand what's going on with the query. I need to get a better conceptual understanding of the spatial join operations.

Hugh_Kelley
– Hugh_Kelley

2020年10月30日 13:02:10 +00:00
Commented Oct 30, 2020 at 13:02

| Show 1 more comment

Stack Exchange Network

PostGIS Spatial Relationships query performance

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

PostGIS Spatial Relationships query performance

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions