Making selections from two tables based on the reference table in PostgreSQL/Postgis

Question 1

Using the address locations in table ad as a reference for my region of interest, I am trying to select matching rows from two street layers, street1 and street2. This is what I have tried:

Select
 --from address table
 address_id,
 address_locations,
 --from street1
 foo,
 bar,
 --from street2
 alpha,
 beta
From
(
 Select
 ad.gid As address_id,
 ad.geom As address_locations,
 foo.street1 As foo,
 bar.street1 As bar,
 aplha.street2 As alpha,
 beta.street2 As beta, 
 St_Distance(ad.geom, street1.geom) As d 
 From
 public.ad, public.street1, public.street2 
 Where 
 ST_DWithin(ad.geom, street1.geom, 30.0)
 OR ST_DWithin(ad.geom, street1.geom, 50.0)
 AND ST_DWithin(ad.geom, street2.geom, 30.0)
 OR ST_DWithin(ad.geom, street2.geom, 50.0) 
 Order By address_id, d
) As nested_query;

With only ad and street1, the query ran in 142 milliseconds. When I also include street2, it takes much longer. The query is meant to search for streets within 30 meters and then within 50 meters of each address in street1, do the same for street2, and return the desired columns. It runs without syntax errors but takes a very long time (30 minutes and counting).

Can someone suggest how to improve the query and its execution time, or an alternative way to do the same job? The address table has 87 rows, while street1 and street2 each have 16,060 rows. All tables have spatial indexes and the same SRID, 3044. The two street tables do not overlap exactly but are more or less the same; I use both because they carry different attribute columns.

Question 2

Well, the first thing that springs to mind is that ST_DWithin(ad.geom, street1.geom, 30) and ST_DWithin(ad.geom, street1.geom, 50) overlap, which is pointless, as you will be searching within 30 metres twice. If you are trying to expand the search when nothing is found within 30 metres, there are better ways of doing this iteratively. Do you have spatial indexes on the tables? How many rows are there, and what is the execution time?

Question 3

Shapefile is a file-based data format. If you load one into a database, it is no longer a shapefile. Please edit your question to explain more clearly how your tables are defined and what indexes you have created. You should also search for existing questions about ST_Distance performance, since once you've clarified yours it may turn out to be a duplicate. Questions about performance should include the row counts of the tables and actual measured durations.

Question 4

@JohnBarça: Yes, all tables have spatial indexes. When I used only two tables (ad.shp and street1.shp), the execution time was 142 milliseconds (even with the same overlapping ST_DWithin search of 30 and then 50 meters). However, when I used both street tables (street1.shp and street2.shp) with the address table (ad.shp), the execution time is 30 minutes and counting (so it may never finish). The address table has 87 rows, while street1 and street2 each have 16,060 rows.

Question 5

Please edit the question in response to requests for clarification. It's not fair to the volunteers who would help to have to sift through comments for key information.

Question 6

@Vince: I have edited the question and added the details for clarification.

Question 7

This query can be improved in a number of ways.

As was pointed out in comments, ST_DWithin(ad.geom, street1.geom, 30.0) OR ST_DWithin(ad.geom, street1.geom, 50.0) is just going to return everything within 50m, because everything within 30m is also within 50m. If there is something you were trying to accomplish other than joining all streets within 50m, you might clarify why you thought this was necessary, but barring that you should just simplify to ST_DWithin(ad.geom, street1.geom, 50.0).

(That sidesteps another problem: you haven't included any parentheses in your WHERE condition. Because AND binds more tightly than OR, you're going to get a OR (b AND c) OR d when I think you actually want (a OR b) AND (c OR d).)

Moreover, although it is a matter of taste, your intention would be clearer if you used the modern FROM...JOIN...ON style of joining, instead of listing all tables in the FROM clause and then listing the join criteria in the WHERE clause. This yields:
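To see how an unparenthesized WHERE condition of this shape is grouped, here is a minimal sketch using plain boolean literals in place of the four ST_DWithin calls. The values (a = true, b = c = d = false) are chosen only for illustration: a row where the address is within 30m of a street1 segment but nowhere near the street2 segment.

```sql
-- a = true, b = false, c = false, d = false (illustrative values only)
SELECT
 (true OR false AND false OR false)     AS as_written,        -- true
 (true OR (false AND false) OR false)   AS how_it_is_grouped, -- true
 ((true OR false) AND (false OR false)) AS probably_intended; -- false
```

In PostgreSQL, AND binds more tightly than OR, so the first two columns agree. With the original condition, this suggests that any address/street1 pair within 30m passes the filter regardless of street2, and so gets combined with every one of the 16,060 street2 rows, which would go a long way toward explaining the runaway execution time.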

FROM ad 
 JOIN street1 ON ST_DWithin(ad.geom, street1.geom, 50)
 JOIN street2 ON ST_DWithin(ad.geom, street2.geom, 50)

The ORDER BY in the nested query is not useful. The only reason to use ORDER BY in a nested query is if you are going to use a LIMIT to restrict the rows returned to the outer query. If you want the outer query ordered, move the ORDER BY to the outer query.
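For comparison, here is a sketch of a case where ORDER BY inside a subquery does matter: picking only the single closest street1 row for each address. It assumes PostgreSQL 9.3 or later for LATERAL and the PostGIS `<->` distance operator; the alias `nearest` and the output column `d` are names made up for this example.

```sql
SELECT
 ad.gid AS address_id,
 nearest.foo,
 nearest.bar,
 nearest.d
FROM ad
 CROSS JOIN LATERAL (
  -- runs once per address; ORDER BY + LIMIT keeps only the closest street
  SELECT
   street1.foo,
   street1.bar,
   ST_Distance(ad.geom, street1.geom) AS d
  FROM street1
  WHERE ST_DWithin(ad.geom, street1.geom, 50)
  ORDER BY ad.geom <-> street1.geom
  LIMIT 1
 ) AS nearest
ORDER BY ad.gid;
```

Addresses with no street within 50m drop out of this result. Replacing CROSS JOIN LATERAL with LEFT JOIN LATERAL (...) AS nearest ON true would keep them with NULL street columns.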

Your comments are a little ambiguous, but I think you are saying that street1 and street2 are the same geometries--essentially, the same streets--but have different attribute columns. If this is the case you could improve the query by doing an attribute join (say on a name or unique ID column) between street1 and street2, avoiding one of the spatial criteria:

FROM ad 
 JOIN street1 ON ST_DWithin(ad.geom, street1.geom, 50)
 JOIN street2 ON street1.id = street2.id

If these do in fact represent the same streets, but you do not have a unique attribute that can be used to join these tables, your database is not properly normalized, and I highly recommend that you add such an attribute.

If there are some streets that are the same, but some streets in street1 that are not in street2 or vice versa, this can still be accomplished using an OUTER JOIN. Let me know in comments and we can fix the query to do that.
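As a rough sketch of the outer-join variant, assuming the same id columns as in the join above, a LEFT JOIN keeps every street1 match even when street2 has no row with that id:

```sql
SELECT
 ad.gid AS address_id,
 ad.geom AS address_locations,
 street1.foo,
 street1.bar,
 street2.alpha,  -- NULL when street2 has no matching id
 street2.beta    -- NULL when street2 has no matching id
FROM ad
 JOIN street1 ON ST_DWithin(ad.geom, street1.geom, 50)
 LEFT JOIN street2 ON street1.id = street2.id
ORDER BY ad.gid, ST_Distance(ad.geom, street1.geom);
```

This only covers streets present in street1 but missing from street2. Streets that exist only in street2 would still need their own spatial test against ad.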

I assume foo.street1 AS foo, and subsequent items in the inner query select list, should actually be street1.foo.

Possibly you intend to build a more complex query, but the nested query as it is being used is not actually doing anything, since you are just repeating all the same columns in the outer query, except for ST_Distance(ad.geom, street1.geom), which you are using in the ORDER BY. Therefore, you can eliminate the nested query structure. This won't speed up the query any, but will make clearer what you are doing.

The final version would be:

SELECT
 ad.gid AS address_id,
 ad.geom AS address_locations,
 street1.foo,
 street1.bar,
 street2.alpha,
 street2.beta
FROM ad 
 JOIN street1 ON ST_DWithin(ad.geom, street1.geom, 50)
 JOIN street2 ON street1.id = street2.id
ORDER BY ad.gid, St_Distance(ad.geom, street1.geom)
;

This is not a guarantee that your query will run faster, but we've eliminated two unnecessary spatial tests, and replaced one spatial join with a (faster) attribute join.
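One way to check whether the rewrite actually helps is to look at the plan and timings. The sketch below assumes the id columns used in the join exist, and the index names are hypothetical; CREATE INDEX IF NOT EXISTS needs PostgreSQL 9.5 or later.

```sql
-- B-tree indexes to support the attribute join (index names are made up)
CREATE INDEX IF NOT EXISTS street1_id_idx ON street1 (id);
CREATE INDEX IF NOT EXISTS street2_id_idx ON street2 (id);

-- EXPLAIN ANALYZE really executes the query and reports per-node timings
EXPLAIN (ANALYZE, BUFFERS)
SELECT
 ad.gid AS address_id,
 street1.foo,
 street2.alpha
FROM ad
 JOIN street1 ON ST_DWithin(ad.geom, street1.geom, 50)
 JOIN street2 ON street1.id = street2.id;
```

In the output, an index scan on the spatial (GiST) index of street1 is a good sign; a nested loop over sequential scans of both street tables would point back to the original problem.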

Question 8

I highly regret the ambiguous explanations but this detailed answer solved my problem. The approach adopted by me was definitely not the efficient one! Thank you!

score 1 · Accepted Answer · 2016-05-13 16:11:58Z

