
I am currently using Postgres/PostGIS (a single instance) to run geo queries that find points falling within specific shapes, together with some extra data from the related columns.

However, even with indexes (and after scaling up the database), some of the queries are too slow.

I have tried GPU databases like BrytLyt, which runs smoothly on top of Postgres and solves the problem, but it is unfortunately far too expensive (AWS per-hour cost).

The total number of records is ~100M, and the queries currently take 6-7 seconds each (I need ~0.1 seconds or so).

Can you suggest any database that would suit this problem? I am trying to avoid "manual" sharding and pre-calculating values.

For example: the average price of houses with 1 bedroom in a specific area.

This is the current table schema:

 Column | Type | Collation | Nullable | Default 
-----------------------+------------------+-----------+----------+---------
 uprn | bigint | | | 
 is_address_accurate | boolean | | | 
 postcode | text | | | 
 listing_type | text | | | 
 asking_price | double precision | | | 
 bedrooms | integer | | | 
 property_type | bpchar | | | 
 description | text | | | 
 total_floor_area_sqft | double precision | | | 
 date_appeared | date | | | 
 date_removed | date | | | 
 sold_price | double precision | | | 
 date_sold | date | | | 
 keywords | text | | | 
 asking_price_sqft | double precision | | | 
 sold_price_sqft | double precision | | | 
 days_on_market | integer | | | 
 images | text[] | | | 
 latitude | double precision | | | 
 longitude | double precision | | | 
 geom | geometry | | | 
 sector_ind | text | | | 
 district_ind | text | | | 
 area_ind | text | | | 
 total_floor_area | double precision | | | 
 transaction_type | text | | | 
 old_new | character(1) | | | 
 duration | character(1) | | | 
 floor_level | text | | | 
 floor_height | double precision | | | 
 current_energy_rating | text | | | 
 flat_top_storey | text | | | 
 flat_storey_count | smallint | | | 
 lodgement_date | date | | | 
 discount | double precision | | | 
 import_date | date | | | 
 building_number | smallint | | | 
 sub_building_name | text | | | 
 building_name | text | | | 
 thoroughfare | text | | | 
 post_town | text | | | 
Indexes:
 "sales_core_area_ind_idx" btree (area_ind)
 "sales_core_area_ind_property_type_bedrooms_idx" btree (area_ind, property_type, bedrooms)
 "sales_core_geom_idx" gist (geom)

An example of a query I am running:

explain analyze 
select avg(sold_price) 
from sales_core sc 
where ST_Intersects(sc.geom, (select ST_Transform(ST_SetSRID(geometry, 4326), 2163) 
 from postcodes 
 where name = 'SW') 
 ) 
 and bedrooms = 1 
 and property_type = 'F';

This is the EXPLAIN ANALYZE output:

 QUERY PLAN 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1656833.89..1656833.90 rows=1 width=8) (actual time=141286.194..141286.194 rows=1 loops=1)
 InitPlan 1 (returns 0ドル)
 -> Seq Scan on postcodes (cost=0.00..3350.97 rows=1 width=2384) (actual time=14.079..19.155 rows=1 loops=1)
 Filter: ((name)::text = 'SW'::text)
 Rows Removed by Filter: 11836
 -> Bitmap Heap Scan on sales_core sc (cost=834695.55..1653222.45 rows=104190 width=8) (actual time=141286.187..141286.187 rows=0 loops=1)
 Recheck Cond: ((property_type = 'F'::bpchar) AND (bedrooms = 1))
 Rows Removed by Index Recheck: 4447029
 Filter: st_intersects(geom, 0ドル)
 Rows Removed by Filter: 1338789
 Heap Blocks: exact=35516 lossy=241191
 -> Bitmap Index Scan on sales_core_area_ind_property_type_bedrooms_idx (cost=0.00..834669.50 rows=312570 width=0) (actual time=2062.949..2062.949 rows=1338789 loops=1)
 Index Cond: ((property_type = 'F'::bpchar) AND (bedrooms = 1))
 Planning time: 0.192 ms
 Execution time: 141307.039 ms
(15 rows)

UPDATE

Following the suggestions in @Evan Carroll's answer, I have applied some changes. I created a new column of type geography like this:

update listings set location_2 = ST_Point(longitude, latitude)::geography
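
For reference, the column itself was added beforehand with something along these lines (a sketch; the geography(Point, 4326) type is an assumption based on the cast above):

ALTER TABLE listings ADD COLUMN location_2 geography(Point, 4326);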

then created an index and clustered on it, in two steps (no extra columns for the filters):

CREATE INDEX ON listings USING gist(location_2);
CLUSTER listings USING listings_location_2_idx;
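
(Note: the PostgreSQL documentation advises running ANALYZE on a freshly clustered table so the planner picks up the new physical ordering:)

ANALYZE listings;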

I removed all the extra predicates, since I am trying a pure lookup based on the geography alone:

select count(*) from listings l join postcodes pc on ST_Intersects (l.location_2, ST_SetSRID(pc.geometry, 4326)::geography) where pc.name = 'SW';

The final result: count = 929245, Time: 1661312.474 ms (27:41.312)

The count seems correct (~1M records), but the query is far too slow (over 27 minutes!).

I am not sure if I am doing something wrong (I went through the whole documentation, but maybe I have missed something).

Any idea?

asked Nov 13, 2017 at 15:16
  • It would help if you provided some information about what you were trying to do besides getting data back faster. Commented Nov 13, 2017 at 15:17
  • We need EXPLAIN ANALYZE on a query you find too slow, and the \d output for all involved tables. Commented Nov 13, 2017 at 15:20
  • @EvanCarrollQWERHJKL I have added as much info as possible. Commented Nov 13, 2017 at 15:49
  • Do you mean '(area_ind, property_type, bedrooms)'? It looks used, but not the geom index. Commented Nov 13, 2017 at 15:59
  • Sorry, I thought it was only 1 table involved. Please add the schema of postcodes as well. Commented Nov 13, 2017 at 16:00

1 Answer


Based on the query plan given, you need an index on postcodes.name:

 CREATE INDEX ON postcodes (name);

You can also fix your GiST index on sales_core, either with predicates (a partial index, sketched below) or by compounding it:

CREATE EXTENSION btree_gist;
CREATE INDEX ON sales_core USING gist(property_type, bedrooms, geom);
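
The predicate variant mentioned above would look roughly like this (a sketch; it assumes the workload always filters on property_type = 'F' and bedrooms = 1, otherwise the compound index is the safer choice):

-- partial GiST index covering only the rows the example query touches
CREATE INDEX ON sales_core USING gist(geom) WHERE property_type = 'F' AND bedrooms = 1;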

I would also strongly suggest storing the column as geography (SRID 4326) rather than geometry (a migration sketch follows the query below). The ST_Intersects should also be rewritten as a JOIN:

explain analyze 
select avg(sold_price) 
from sales_core sc
JOIN postcodes AS pc
 ON ST_Intersects( sc.geom, pc.geometry)
WHERE bedrooms = 1 
AND property_type = 'F'
AND pc.name = 'SW';
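
For the geometry-to-geography migration, a minimal sketch (assuming geom holds points and carries a valid SRID; if it is stored in a projected SRID such as 2163, as the original query suggests, it has to be transformed back to 4326 first):

-- rewrite the column in place; adjust the USING expression if geom is already in 4326
ALTER TABLE sales_core
 ALTER COLUMN geom TYPE geography(Point, 4326)
 USING ST_Transform(geom, 4326)::geography;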

Clustering also makes a huge difference if your primary join condition uses the GiST index (see the sketch below). You may want a PostGIS consultant; there are lots of us who can help. But I don't see why you can't get more performance out of Pg.
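
A rough sketch of that clustering step (the index name is an assumption; use whatever name the GiST index above was created with):

CLUSTER sales_core USING sales_core_property_type_bedrooms_geom_idx;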

answered Nov 13, 2017 at 16:56
  • Thank you, certainly an improvement. It is still slow, though (~1.8 secs), but I haven't switched to geography yet. Commented Nov 14, 2017 at 15:13
  • @Randomize migrate to the geography type and update the question. Commented Nov 14, 2017 at 17:39
