I have grid and point tables in my Postgres database. The points are not evenly distributed over the grid. The grid table holds 100k+ geometries and the point table holds hundreds of millions of rows, so efficiency matters. I want to select N points per grid cell so that I get an evenly distributed sample of points over the grid. Eventually, I'd also like to sample N points per grid cell with distinct device_id values (to avoid one stationary GPS device dominating a sample group).

I've had a look at the Stack Overflow question "Taking N-samples from each group in PostgreSQL", but it seems to focus on forcing randomness into the sample (at computational expense), which I'm not that interested in.

I understand that selecting 50 points within a specific geometry can be achieved with:

SELECT
    point_table.id,
    point_table.device_id,
    point_table.geom,
    point_table.some_data
FROM point_table
JOIN grid_table ON ST_Within(point_table.geom, grid_table.geom)
WHERE grid_table.id = 123
LIMIT 50

but I'm not sure how to get 50 samples for every grid ID.

Taras
asked May 30, 2023 at 14:08

1 Answer

You want a correlated subquery for this, that is, a query that is evaluated once for every row of the outer table, much like a for-each loop. I would suggest a LATERAL join for flexibility.

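The query below finds up to 50 points from point_table that fall inside the bounding box of each grid_table.geom. Without an ORDER BY, the LIMIT returns arbitrary rows rather than a truly random sample; uncomment ORDER BY random() if you need randomness, at the cost of sorting every matching point per cell. For regular (planar, axis-aligned rectangular) cells the bounding box equals the cell itself, so the && operator is the most efficient filter; use ST_Intersects otherwise: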

SELECT
    grd.id,
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            id,
            device_id,
            geom,
            some_data
        FROM
            point_table AS _pts  -- table name as given in the question
        WHERE
            grd.geom && _pts.geom  -- bounding-box overlap, served by the spatial index
        -- ORDER BY
        --     random()
        LIMIT
            50
    ) AS pts ON TRUE
;
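
For cells that are not axis-aligned rectangles, the ST_Intersects variant mentioned above could look roughly like the sketch below. It assumes the table and column names from the question (point_table, grid_table, some_data); the grid_id alias and the p alias are illustrative additions, not part of the original query.

```sql
-- Sketch: same LATERAL pattern, but with an exact geometry test
-- instead of the bounding-box-only && operator.
SELECT
    grd.id AS grid_id,  -- aliased so the result does not carry two columns named "id"
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            p.id,
            p.device_id,
            p.geom,
            p.some_data
        FROM
            point_table AS p
        WHERE
            -- ST_Intersects first applies an index-assisted bounding-box
            -- filter, then checks the actual geometries
            ST_Intersects(grd.geom, p.geom)
        LIMIT
            50
    ) AS pts ON TRUE
;
```

With either && or ST_Intersects, a point lying exactly on a shared cell edge matches both neighbouring cells, so it can show up in more than one sample group.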

Update:

I just read that you eventually want samples with unique device_id values; use a rank-and-filter query that numbers the points per device_id within each cell and keeps only the first one:

SELECT
    grd.id,
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            id,
            device_id,
            geom
        FROM (
            SELECT
                id,
                device_id,
                geom,
                -- number the points of each device within the current cell
                ROW_NUMBER() OVER (PARTITION BY device_id) AS _rank
            FROM
                point_table AS __pts  -- table name as given in the question
            WHERE
                grd.geom && __pts.geom
        ) AS _pts
        WHERE
            _rank = 1  -- keep one point per device
        -- ORDER BY
        --     random()
        LIMIT
            50
    ) AS pts ON TRUE
;

You can control which point is kept for each device by adding an ORDER BY inside the OVER() clause.
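
As a rough illustration of ordering inside the window, the sketch below keeps the point with the highest id per device in each cell. It assumes the table names from the question (point_table, grid_table), and it assumes that a higher id means a more recent point, which the question does not state; substitute a timestamp column if the table has one.

```sql
-- Sketch: rank-and-filter with an explicit ordering inside the window.
SELECT
    grd.id AS grid_id,  -- illustrative alias, avoids two columns named "id"
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            ranked.id,
            ranked.device_id,
            ranked.geom
        FROM (
            SELECT
                p.id,
                p.device_id,
                p.geom,
                ROW_NUMBER() OVER (
                    PARTITION BY p.device_id
                    ORDER BY p.id DESC  -- assumption: higher id = newer point
                ) AS _rank
            FROM
                point_table AS p
            WHERE
                grd.geom && p.geom
        ) AS ranked
        WHERE
            ranked._rank = 1  -- one point per device, chosen by the ordering above
        LIMIT
            50
    ) AS pts ON TRUE
;
```

Using ORDER BY random() inside OVER() instead would pick an arbitrary point per device on each run, at the cost of an extra sort per cell.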

Note that a spatial (GiST) index on point_table.geom is essential for this to be efficient!
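
If the index does not exist yet, it could be created along these lines. The index name point_table_geom_idx is a made-up example, and the table name follows the question.

```sql
-- Sketch: GiST index on the point geometries so that && and
-- ST_Intersects can use an index scan for each grid cell.
CREATE INDEX IF NOT EXISTS point_table_geom_idx
    ON point_table
    USING GIST (geom);

-- Refresh planner statistics after building the index.
ANALYZE point_table;
```

On a table with hundreds of millions of rows the build can take a while; running EXPLAIN on the LATERAL query afterwards should show whether the planner actually picks the index.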

answered May 30, 2023 at 14:51
  • I've always wondered about lateral joins... I'll give it a bash. Thanks Commented May 30, 2023 at 15:44
  • @RedM here it behaves similarly to an inlined, correlated SELECT *, ( SELECT <correlated_stuff> ... ) AS <alias> FROM <running_table> ... subquery; it is executed for every row in the running table and has access to that row's values. However, unlike inline subqueries, LATERAL queries are FROM-level members, which gives access to (sets of) composite return types (think: a table) rather than just a single value. And it has more to offer. Commented May 30, 2023 at 15:58
