Selecting N points within each geometry using SQL

Question

I have grid and point tables in my Postgres database. The points are not evenly distributed over the grid. The grid table holds 100k+ geometries and the point table holds hundreds of millions of points, so efficiency matters. I want to select N points per grid cell so that I get an evenly distributed sample of points over the grid. Eventually, I would also like to sample N points per grid cell with distinct device_id values (to keep a stationary GPS device from dominating a sample group).

I've looked at the Stack Overflow question "Taking N-samples from each group in PostgreSQL", but it seems to focus on forcing randomness into the data (at computational expense), which I'm not that interested in.

I understand that selecting 50 points within a specific geometry can be achieved with:

SELECT 
 point_table.id,
 point_table.device_id,
 point_table.geom,
 point_table.some_data
FROM point_table
JOIN grid_table ON ST_Within(point_table.geom, grid_table.geom)
WHERE grid_table.id = 123
LIMIT 50

but I'm not sure how to get 50 samples for each grid ID.

Answer

You want a correlated sub-query for this - a query that essentially does a for-each loop over the running table - and I would suggest a LATERAL join for flexibility.

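Here we find (up to) 50 points from points_table inside the bounding box of each grid_table.geom. Without an ORDER BY the selection is arbitrary; uncomment the ORDER BY random() lines to make it pseudo-random, at the cost of sorting every matching point per cell. For regular (planar, rectangular) cells, the && operator, which compares bounding boxes only, is most efficient; use ST_Intersects otherwise: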

SELECT
    grd.id,
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            id,
            device_id,
            geom,
            some_data
        FROM
            points_table AS _pts
        WHERE
            grd.geom && _pts.geom
        -- ORDER BY
        --     random()
        LIMIT
            50
    ) AS pts ON TRUE
;
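
For cells that are not axis-aligned rectangles, the same LATERAL pattern could be written with ST_Intersects, as mentioned above. This is only a sketch under the answer's table and column names; the grid_id alias is an addition here to keep the two id columns apart in the output:

```sql
-- Sketch: same per-cell LATERAL sampling, but with an exact geometry test
-- instead of the bounding-box-only && comparison.
SELECT
    grd.id AS grid_id,  -- alias added for illustration
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            _pts.id,
            _pts.device_id,
            _pts.geom,
            _pts.some_data
        FROM
            points_table AS _pts
        WHERE
            ST_Intersects(grd.geom, _pts.geom)
        LIMIT
            50
    ) AS pts ON TRUE
;
```

ST_Intersects still uses the spatial index for a bounding-box pre-filter and then checks the actual geometries, so it costs somewhat more per candidate point than && alone.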

Update:

I just read that you eventually want unique device_id samples; use a rank-and-fetch query:

SELECT
    grd.id,
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            id,
            device_id,
            geom
        FROM (
            SELECT
                id,
                device_id,
                geom,
                ROW_NUMBER() OVER(PARTITION BY device_id) AS _rank
            FROM
                points_table AS __pts
            WHERE
                grd.geom && __pts.geom
        ) AS _pts
        WHERE
            _rank = 1
        -- ORDER BY
        --     random()
        LIMIT
            50
    ) AS pts ON TRUE
;

You can influence the ranking, and thereby which point is kept for each device_id, with an additional ORDER BY in the OVER() clause.
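
As a sketch of that ORDER BY, the variant below keeps the point with the lowest id per device_id within each cell. Ordering by id is only an assumption for illustration; any column that expresses your preference would do:

```sql
-- Sketch: one point per device_id per grid cell, choosing the row that
-- ranks first under an explicit ordering inside the window.
SELECT
    grd.id AS grid_id,  -- alias added for illustration
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            _pts.id,
            _pts.device_id,
            _pts.geom
        FROM (
            SELECT
                __pts.id,
                __pts.device_id,
                __pts.geom,
                ROW_NUMBER() OVER (
                    PARTITION BY __pts.device_id
                    ORDER BY __pts.id  -- assumption: lowest id wins
                ) AS _rank
            FROM
                points_table AS __pts
            WHERE
                grd.geom && __pts.geom
        ) AS _pts
        WHERE
            _pts._rank = 1
        LIMIT
            50
    ) AS pts ON TRUE
;
```

Replacing the inner ORDER BY with random() would pick a pseudo-random point per device instead, at the price of a sort within each partition.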

Note that it is mandatory to have a spatial index on points_table.geom for this to be efficient!
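
If that index does not exist yet, it could be created roughly like this. The index name is hypothetical; the table and column names follow the queries above:

```sql
-- Sketch: GiST spatial index on the point geometries.
-- points_table_geom_idx is a made-up name; pick whatever fits your schema.
CREATE INDEX IF NOT EXISTS points_table_geom_idx
    ON points_table
    USING GIST (geom);

-- Refresh planner statistics so the new index is considered.
ANALYZE points_table;
```

On a table with hundreds of millions of rows the build can take a while, so it may be worth running it outside peak hours.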

Comment

I've always wondered about lateral joins... I'll give it a bash. Thanks

Comment

@RedM here it behaves similarly to an inlined, correlated SELECT *, ( SELECT <correlated_stuff> ... ) AS <alias> FROM <running_table> ... subquery; it is executed for every row in the running table and has access to that row's values. However, unlike inline subqueries, LATERAL queries are FROM-level members, enabling access to (sets of) composite return types (think: a table) rather than just a single value. And it has more to offer.
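
To make that contrast concrete, here is a small sketch using the table names from the answer. The count, the n_points and point_id aliases, and the limit of 3 are arbitrary choices for illustration:

```sql
-- Inline correlated subquery: evaluated per grid row,
-- but it must return a single value.
SELECT
    grd.id,
    (
        SELECT count(*)
        FROM points_table AS _pts
        WHERE grd.geom && _pts.geom
    ) AS n_points
FROM
    grid_table AS grd
;

-- LATERAL subquery: also evaluated per grid row,
-- but it may return several columns and several rows.
SELECT
    grd.id,
    pts.id AS point_id,
    pts.device_id
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            _pts.id,
            _pts.device_id
        FROM
            points_table AS _pts
        WHERE
            grd.geom && _pts.geom
        LIMIT
            3
    ) AS pts ON TRUE
;
```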

geozelot · Accepted Answer · 2023-05-30 14:51:58Z
