I'm trying to improve performance for the query below. No matter how I write the query (subquery in the FROM clause, subquery in the WHERE clause), Postgres insists on running all ~570K rows through the expensive ST_DWithin function, even though there are only 60 rows where county=24. How can I get Postgres to filter on county=24 BEFORE running through the PostGIS function, which seems to me would be much faster and far more efficient? 700 ms isn't cause for too much concern, but as this table grows to 10M+ rows I'm concerned about performance.
Also to note: p.id is a primary key, p.zipcode has a foreign-key index, z.county has a foreign-key index, and p.geom has a GiST index.
Query:
EXPLAIN ANALYZE
SELECT count(p.id)
FROM point AS p
LEFT JOIN zipcode AS z
ON p.zipcode = z.zipcode
WHERE z.county = 24
AND ST_DWithin(
p.geom,
ST_SetSRID(ST_Point(-121.479756008715,38.563236291512),4269),
16090.0,
false
)
EXPLAIN ANALYZE:
Aggregate (cost=250851.91..250851.92 rows=1 width=4) (actual time=724.007..724.007 rows=1 loops=1)
-> Hash Join (cost=152.05..250851.34 rows=228 width=4) (actual time=0.359..723.996 rows=51 loops=1)
Hash Cond: ((p.zipcode)::text = (z.zipcode)::text)
-> Seq Scan on point p (cost=0.00..250669.12 rows=7437 width=10) (actual time=0.258..723.867 rows=63 loops=1)
Filter: (((geom)::geography && '0101000020AD10000063DF8B52B45E5EC070FB752018484340'::geography) AND ('0101000020AD10000063DF8B52B45E5EC070FB752018484340'::geography && _st_expand((geom)::geography, 16090::double precision)) AND _st_dwithin((g (...)
Rows Removed by Filter: 557731
-> Hash (cost=151.38..151.38 rows=54 width=6) (actual time=0.095..0.095 rows=54 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3kB
-> Bitmap Heap Scan on zipcode z (cost=4.70..151.38 rows=54 width=6) (actual time=0.023..0.079 rows=54 loops=1)
Recheck Cond: (county = 24)
Heap Blocks: exact=39
-> Bitmap Index Scan on fki_zipcode_county_foreign_key (cost=0.00..4.68 rows=54 width=0) (actual time=0.016..0.016 rows=54 loops=1)
Index Cond: (county = 24)
Planning time: 0.504 ms
Execution time: 724.064 ms
You can see the problem in the expected vs. actual row counts: the planner thinks there are 7,437 rows, but there are only 63. The statistics are off. Interestingly enough, it's also not using a bounding-box index search with ST_DWithin.
Can you paste the result of \d point? What version of PostGIS and PostgreSQL? Try running ANALYZE point. Do you get the same plan when you move the condition up into the join?
JOIN zipcode AS z
ON p.zipcode = z.zipcode
AND z.county = 24
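For reference, a sketch of the full query with that change applied (just reusing the table and column names from the question), run after refreshing statistics:
-- refresh planner statistics so row estimates for point are current
ANALYZE point;

-- same query, with the county filter moved into the join condition
EXPLAIN ANALYZE
SELECT count(p.id)
FROM point AS p
JOIN zipcode AS z
ON p.zipcode = z.zipcode
AND z.county = 24
WHERE ST_DWithin(
p.geom,
ST_SetSRID(ST_Point(-121.479756008715,38.563236291512),4269),
16090.0,
false
);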
I did run ANALYZE and also tried the new AND condition in the ON clause, but was still getting 700 ms run times. This is PostgreSQL 9.4 and PostGIS 2.2. – Josh, May 11, 2017 at 19:27
As a side note, there is a reasonable chance that this behavior is modified in PostGIS 2.3.0 if you want to call it a bug.
From the PostgreSQL docs on the COST option for functions:
A positive number giving the estimated execution cost for the function, in units of cpu_operator_cost. If the function returns a set, this is the cost per returned row. If the cost is not specified, 1 unit is assumed for C-language and internal functions, and 100 units for functions in all other languages. Larger values cause the planner to try to avoid evaluating the function more often than necessary.
So the default cost was 1 (very cheap). ST_DWithin using a GiST index is very cheap. But that was increased to 100 (by proxy of the internal _ST_DWithin).
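On an older PostGIS you can inspect and, if needed, raise that cost yourself. A sketch (the _st_dwithin signature shown is an assumption for the geography variant; check yours with \df _st_dwithin first):
-- see the planner cost currently attached to the distance functions
SELECT proname, procost
FROM pg_proc
WHERE proname IN ('st_dwithin', '_st_dwithin');

-- raise the cost so the planner prefers to apply cheaper filters first
-- (geography signature assumed; adjust to match the output of \df _st_dwithin)
ALTER FUNCTION _st_dwithin(geography, geography, double precision, boolean) COST 100;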
I'm not a huge fan of the CTE method myself. CTEs are an optimization fence, so doing it in such a fashion removes some potential room for future optimization. If saner defaults fix it, I would rather upgrade. At the end of the day, we've got to get the job done, and that method clearly works for you.
Thanks to John Powell's hint, I revised the query to put the county-limiting condition in a WITH/CTE query, and this improved performance quite a bit: 222 ms vs. 700. That's still a far cry from the .74 ms I get when the data is in its own table. I'm still not sure why the planner doesn't limit the data set before running it through an expensive PostGIS function, and I'll have to try with larger datasets when I have them, but this appears to be a solution to this unique situation for now.
WITH points AS (
  SELECT p.id, p.geom
  FROM point p
  INNER JOIN zipcode z ON p.zipcode = z.zipcode
  WHERE z.county = 24
)
SELECT count(points.id)
FROM points
WHERE ST_DWithin(points.geom, ST_SetSRID(ST_Point(-121.479756008715,38.563236291512),4269), 16090.0, false)
We would have to see all three query plans, and the schema for the table (requested in my answer: \d point). – Evan Carroll, May 12, 2017 at 3:47
You should create an index on zipcode(county, zipcode); that should give you an index-only scan on z. You may also want to experiment with the btree_gist extension, creating either a point(zipcode, geom) index or a point(geom, zipcode) index, and a zipcode(zipcode, county) index.
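A sketch of the statements that suggestion translates to (index names are illustrative; pick the composite that matches how you filter most often):
-- composite b-tree so the county = 24 lookup can be satisfied by an index-only scan
CREATE INDEX zipcode_county_zipcode_idx ON zipcode (county, zipcode);

-- btree_gist lets a scalar column share a GiST index with the geometry
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX point_zipcode_geom_idx ON point USING gist (zipcode, geom);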
When the point rows where county=24 are in a new table all by themselves, the query takes only .453 ms compared to 724, so there is definitely a big difference.
Use count(*) as a matter of style. If id is a primary key as you say, it's NOT NULL, which means they're the same, except count(id) has the drawback that you have to ask that question: is id nullable?
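A quick illustration of that count(*) vs. count(id) point (throwaway table, values made up):
-- count(*) counts rows; count(col) skips rows where col IS NULL
CREATE TEMP TABLE t (id int);
INSERT INTO t VALUES (1), (NULL), (3);

SELECT count(*) AS all_rows,      -- returns 3
       count(id) AS non_null_ids  -- returns 2
FROM t;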