
I am currently using Postgres/PostGIS (a single instance) to run geo queries that find points falling within specific shapes, together with some extra data from the related columns.

However, even with indexes (and after scaling up the database), some of the queries are too slow.

I have tried GPU databases like BrytLyt, which runs smoothly on top of Postgres and solves the problem, but it is unfortunately far too expensive (AWS per-hour cost).

The total number of records is ~100M, and the queries currently take 6-7 seconds each (I need ~0.1 seconds or so).

Can you suggest any database that would suit this problem? I am trying to avoid "manual" sharding and pre-calculating values.

For example: the average price of houses with 1 bedroom in a specific area.

This is the current table schema:

 Column | Type | Collation | Nullable | Default 
-----------------------+------------------+-----------+----------+---------
 uprn | bigint | | | 
 is_address_accurate | boolean | | | 
 postcode | text | | | 
 listing_type | text | | | 
 asking_price | double precision | | | 
 bedrooms | integer | | | 
 property_type | bpchar | | | 
 description | text | | | 
 total_floor_area_sqft | double precision | | | 
 date_appeared | date | | | 
 date_removed | date | | | 
 sold_price | double precision | | | 
 date_sold | date | | | 
 keywords | text | | | 
 asking_price_sqft | double precision | | | 
 sold_price_sqft | double precision | | | 
 days_on_market | integer | | | 
 images | text[] | | | 
 latitude | double precision | | | 
 longitude | double precision | | | 
 geom | geometry | | | 
 sector_ind | text | | | 
 district_ind | text | | | 
 area_ind | text | | | 
 total_floor_area | double precision | | | 
 transaction_type | text | | | 
 old_new | character(1) | | | 
 duration | character(1) | | | 
 floor_level | text | | | 
 floor_height | double precision | | | 
 current_energy_rating | text | | | 
 flat_top_storey | text | | | 
 flat_storey_count | smallint | | | 
 lodgement_date | date | | | 
 discount | double precision | | | 
 import_date | date | | | 
 building_number | smallint | | | 
 sub_building_name | text | | | 
 building_name | text | | | 
 thoroughfare | text | | | 
 post_town | text | | | 
Indexes:
 "sales_core_area_ind_idx" btree (area_ind)
 "sales_core_area_ind_property_type_bedrooms_idx" btree (area_ind, property_type, bedrooms)
 "sales_core_geom_idx" gist (geom)

An example of a query I am running:

explain analyze 
select avg(sold_price) 
from sales_core sc 
where ST_Intersects(sc.geom, (select ST_Transform(ST_SetSRID(geometry, 4326), 2163) 
 from postcodes 
 where name = 'SW') 
 ) 
 and bedrooms = 1 
 and property_type = 'F';

This is the EXPLAIN ANALYZE output:

 QUERY PLAN 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1656833.89..1656833.90 rows=1 width=8) (actual time=141286.194..141286.194 rows=1 loops=1)
 InitPlan 1 (returns 0ドル)
 -> Seq Scan on postcodes (cost=0.00..3350.97 rows=1 width=2384) (actual time=14.079..19.155 rows=1 loops=1)
 Filter: ((name)::text = 'SW'::text)
 Rows Removed by Filter: 11836
 -> Bitmap Heap Scan on sales_core sc (cost=834695.55..1653222.45 rows=104190 width=8) (actual time=141286.187..141286.187 rows=0 loops=1)
 Recheck Cond: ((property_type = 'F'::bpchar) AND (bedrooms = 1))
 Rows Removed by Index Recheck: 4447029
 Filter: st_intersects(geom, 0ドル)
 Rows Removed by Filter: 1338789
 Heap Blocks: exact=35516 lossy=241191
 -> Bitmap Index Scan on sales_core_area_ind_property_type_bedrooms_idx (cost=0.00..834669.50 rows=312570 width=0) (actual time=2062.949..2062.949 rows=1338789 loops=1)
 Index Cond: ((property_type = 'F'::bpchar) AND (bedrooms = 1))
 Planning time: 0.192 ms
 Execution time: 141307.039 ms
(15 rows)

UPDATE

Following the suggestions in @Evan Carroll's answer, I have applied some changes. I created a new column of type geography like this:

update listings set location_2 = ST_Point(longitude, latitude)::geography
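
For reference, the column itself was added beforehand with something along these lines (a sketch; the geography(Point, 4326) type is an assumption based on the cast above):

ALTER TABLE listings ADD COLUMN location_2 geography(Point, 4326);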

then created an index and clustered on it, in two steps (no extra columns for the filters):

CREATE INDEX ON listings USING gist(location_2);
CLUSTER listings USING listings_location_2_idx;
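
(Note: the PostgreSQL documentation advises running ANALYZE on a freshly clustered table so the planner picks up the new physical ordering:)

ANALYZE listings;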

I removed all the extra predicates, since I am trying a pure lookup based on the geography alone:

select count(*) from listings l join postcodes pc on ST_Intersects (l.location_2, ST_SetSRID(pc.geometry, 4326)::geography) where pc.name = 'SW';

The final result: count = 929245, Time: 1661312.474 ms (27:41.312)

The count seems correct (~1M records), but the query is far too slow (over 27 minutes!).

I am not sure if I am doing something wrong (I went through the whole documentation, but maybe I have missed something).

Any idea?

asked Nov 13, 2017 at 15:16
  • It would help if you provided some information about what you were trying to do besides getting data back faster. Commented Nov 13, 2017 at 15:17
  • We need EXPLAIN ANALYZE on a query you find too slow, and the \d output for all involved tables. Commented Nov 13, 2017 at 15:20
  • @EvanCarrollQWERHJKL I have added as much info as possible. Commented Nov 13, 2017 at 15:49
  • Do you mean '(area_ind, property_type, bedrooms)'? It looks used, but not the geom index. Commented Nov 13, 2017 at 15:59
  • Sorry, I thought it was only 1 table involved. Please add the schema of postcodes as well. Commented Nov 13, 2017 at 16:00

1 Answer


Based on the query plan given, you need an index on postcodes.name:

 CREATE INDEX ON postcodes (name);

You can also fix your GiST index on sales_core, either with predicates (a partial index, sketched below) or by compounding it:

CREATE EXTENSION btree_gist;
CREATE INDEX ON sales_core USING gist(property_type, bedrooms, geom);
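
The predicate variant mentioned above would look roughly like this (a sketch; it assumes the workload always filters on property_type = 'F' and bedrooms = 1, otherwise the compound index is the safer choice):

-- partial GiST index covering only the rows the example query touches
CREATE INDEX ON sales_core USING gist(geom) WHERE property_type = 'F' AND bedrooms = 1;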

I would also strongly suggest storing the column as geography (SRID 4326) rather than geometry (a migration sketch follows the query below). The ST_Intersects should also be rewritten as a JOIN:

explain analyze 
select avg(sold_price) 
from sales_core sc
JOIN postcodes AS pc
 ON ST_Intersects( sc.geom, pc.geometry)
WHERE bedrooms = 1 
AND property_type = 'F'
AND pc.name = 'SW';
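
For the geometry-to-geography migration, a minimal sketch (assuming geom holds points and carries a valid SRID; if it is stored in a projected SRID such as 2163, as the original query suggests, it has to be transformed back to 4326 first):

-- rewrite the column in place; adjust the USING expression if geom is already in 4326
ALTER TABLE sales_core
 ALTER COLUMN geom TYPE geography(Point, 4326)
 USING ST_Transform(geom, 4326)::geography;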

Clustering also makes a huge difference if your primary join condition uses the GiST index (see the sketch below). You may want a PostGIS consultant; there are lots of us who can help. But I don't see why you can't get more performance out of Pg.
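
A rough sketch of that clustering step (the index name is an assumption; use whatever name the GiST index above was created with):

CLUSTER sales_core USING sales_core_property_type_bedrooms_geom_idx;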

answered Nov 13, 2017 at 16:56
  • Thank you, certainly an improvement. It is still slow, though (~1.8 secs), but I haven't switched to geography yet. Commented Nov 14, 2017 at 15:13
  • @Randomize migrate to the geography type and update the question. Commented Nov 14, 2017 at 17:39
