I'm trying to improve performance for the query below. No matter how I write the query (subquery in the FROM clause, subquery in the WHERE clause), Postgres insists on running all ~570K rows through the expensive ST_DWithin function, even though there are only 60 rows where county=24. How can I get Postgres to filter on county=24 BEFORE running through the PostGIS function, which seems to me would be much faster and far more efficient? 700 ms isn't cause for too much concern, but as this table grows to 10M+ rows I'm concerned about performance.
Also to note: p.id is a primary key, p.zipcode has a foreign-key index, z.county has a foreign-key index, and p.geom has a GiST index.
Query:
EXPLAIN ANALYZE
SELECT count(p.id)
FROM point AS p
LEFT JOIN zipcode AS z
ON p.zipcode = z.zipcode
WHERE z.county = 24
AND ST_DWithin(
p.geom,
ST_SetSRID(ST_Point(-121.479756008715,38.563236291512),4269),
16090.0,
false
)
EXPLAIN ANALYZE:
Aggregate (cost=250851.91..250851.92 rows=1 width=4) (actual time=724.007..724.007 rows=1 loops=1)
-> Hash Join (cost=152.05..250851.34 rows=228 width=4) (actual time=0.359..723.996 rows=51 loops=1)
Hash Cond: ((p.zipcode)::text = (z.zipcode)::text)
-> Seq Scan on point p (cost=0.00..250669.12 rows=7437 width=10) (actual time=0.258..723.867 rows=63 loops=1)
Filter: (((geom)::geography && '0101000020AD10000063DF8B52B45E5EC070FB752018484340'::geography) AND ('0101000020AD10000063DF8B52B45E5EC070FB752018484340'::geography && _st_expand((geom)::geography, 16090::double precision)) AND _st_dwithin((g (...)
Rows Removed by Filter: 557731
-> Hash (cost=151.38..151.38 rows=54 width=6) (actual time=0.095..0.095 rows=54 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3kB
-> Bitmap Heap Scan on zipcode z (cost=4.70..151.38 rows=54 width=6) (actual time=0.023..0.079 rows=54 loops=1)
Recheck Cond: (county = 24)
Heap Blocks: exact=39
-> Bitmap Index Scan on fki_zipcode_county_foreign_key (cost=0.00..4.68 rows=54 width=0) (actual time=0.016..0.016 rows=54 loops=1)
Index Cond: (county = 24)
Planning time: 0.504 ms
Execution time: 724.064 ms
You can see the problem in the expected vs. actual row counts: the planner thinks there are 7,437 rows, but there are only 63. The statistics are off. Interestingly enough, it's also not using a bounding-box index search with ST_DWithin.
Can you paste the result of \d point? What version of PostGIS and PostgreSQL? Try running ANALYZE point. Do you get the same plan when you move the condition up into the join?
JOIN zipcode AS z
ON p.zipcode = z.zipcode
AND z.county = 24
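For reference, a sketch of the full query with that change applied (just reusing the table and column names from the question), run after refreshing statistics:
-- refresh planner statistics so row estimates for point are current
ANALYZE point;

-- same query, with the county filter moved into the join condition
EXPLAIN ANALYZE
SELECT count(p.id)
FROM point AS p
JOIN zipcode AS z
ON p.zipcode = z.zipcode
AND z.county = 24
WHERE ST_DWithin(
p.geom,
ST_SetSRID(ST_Point(-121.479756008715,38.563236291512),4269),
16090.0,
false
);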
I did run ANALYZE and also tried the new AND condition in the ON clause, but was still getting 700 ms run times. This is PostgreSQL 9.4 and PostGIS 2.2. – Josh, May 11, 2017 at 19:27
As a side note, there is a reasonable chance that this behavior is modified in PostGIS 2.3.0 if you want to call it a bug.
From the PostgreSQL docs on the COST option for functions:
A positive number giving the estimated execution cost for the function, in units of cpu_operator_cost. If the function returns a set, this is the cost per returned row. If the cost is not specified, 1 unit is assumed for C-language and internal functions, and 100 units for functions in all other languages. Larger values cause the planner to try to avoid evaluating the function more often than necessary.
So the default cost was 1 (very cheap). ST_DWithin using a GiST index is very cheap. But that was increased to 100 (by proxy of the internal _ST_DWithin).
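On an older PostGIS you can inspect and, if needed, raise that cost yourself. A sketch (the _st_dwithin signature shown is an assumption for the geography variant; check yours with \df _st_dwithin first):
-- see the planner cost currently attached to the distance functions
SELECT proname, procost
FROM pg_proc
WHERE proname IN ('st_dwithin', '_st_dwithin');

-- raise the cost so the planner prefers to apply cheaper filters first
-- (geography signature assumed; adjust to match the output of \df _st_dwithin)
ALTER FUNCTION _st_dwithin(geography, geography, double precision, boolean) COST 100;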
I'm not a huge fan of the CTE method myself. CTEs are an optimization fence, so doing it in such a fashion removes some potential room for future optimization. If saner defaults fix it, I would rather upgrade. At the end of the day, we've got to get the job done, and that method clearly works for you.
Thanks to John Powell's hint, I revised the query to put the county-limiting condition in a WITH/CTE query, and this improved performance quite a bit: 222 ms vs. 700. That's still a far cry from the .74 ms I get when the data is in its own table. I'm still not sure why the planner doesn't limit the data set before running it through an expensive PostGIS function, and I'll have to try with larger datasets when I have them, but this appears to be a solution to this unique situation for now.
WITH points AS (
  SELECT p.id, p.geom
  FROM point p
  INNER JOIN zipcode z ON p.zipcode = z.zipcode
  WHERE z.county = 24
)
SELECT count(points.id)
FROM points
WHERE ST_DWithin(points.geom, ST_SetSRID(ST_Point(-121.479756008715,38.563236291512),4269), 16090.0, false)
We would have to see all three query plans, and the schema for the table (requested in my answer: \d point). – Evan Carroll, May 12, 2017 at 3:47
You should create an index on zipcode(county, zipcode); that should give you an index-only scan on z. You may also want to experiment with the btree_gist extension, creating either a point(zipcode, geom) index or a point(geom, zipcode) index, and a zipcode(zipcode, county) index.
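A sketch of the statements that suggestion translates to (index names are illustrative; pick the composite that matches how you filter most often):
-- composite b-tree so the county = 24 lookup can be satisfied by an index-only scan
CREATE INDEX zipcode_county_zipcode_idx ON zipcode (county, zipcode);

-- btree_gist lets a scalar column share a GiST index with the geometry
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX point_zipcode_geom_idx ON point USING gist (zipcode, geom);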
When the point rows where county=24 are in a new table all by themselves, the query takes only .453 ms compared to 724, so there is definitely a big difference.
Use count(*) as a matter of style. If id is a primary key as you say, it's NOT NULL, which means they're the same, except count(id) has the drawback that you have to ask that question: is id nullable?
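A quick illustration of that count(*) vs. count(id) point (throwaway table, values made up):
-- count(*) counts rows; count(col) skips rows where col IS NULL
CREATE TEMP TABLE t (id int);
INSERT INTO t VALUES (1), (NULL), (3);

SELECT count(*) AS all_rows,      -- returns 3
       count(id) AS non_null_ids  -- returns 2
FROM t;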