I have grid and point tables in my Postgres database. The points are not evenly distributed over the grid. The grid table holds 100k+ geometries and the point table holds hundreds of millions of points, so efficiency matters. I want to select N points per grid cell so that I get an evenly distributed sample of points over the grid. Eventually, I'd also like to sample N points per grid cell with distinct device_ids (to avoid having a single stationary GPS device dominate a sample group).
I've had a look at the Stack Overflow question "Taking N-samples from each group in PostgreSQL", but it seems to focus more on forcing randomness into the data (at computational expense), which I'm not that interested in.
I understand that selecting 50 points within a specific geom can be achieved with:
SELECT
    point_table.id,
    point_table.device_id,
    point_table.geom,
    point_table.some_data
FROM point_table
JOIN grid_table ON ST_Within(point_table.geom, grid_table.geom)
WHERE grid_table.id = 123
LIMIT 50
but I'm not sure how to get 50 samples for all grid IDs.
1 Answer
You want a correlated sub-query for this - a query that essentially does a for-each loop over the running table - and I would suggest a LATERAL join for flexibility.
Here we find (up to) 50 arbitrary points from points_table inside the bounding box of each grid_table.geom (uncomment the ORDER BY random() for a randomized pick, at extra cost). For regular (planar, axis-aligned rectangular) cells, using the && operator is most efficient; use ST_Intersects otherwise:
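SELECT
    grd.id AS grid_id,  -- aliased so it does not clash with pts.id
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        -- runs once per grid cell, with access to grd.geom
        SELECT
            id,
            device_id,
            geom,
            some_data
        FROM
            points_table AS _pts
        WHERE
            grd.geom && _pts.geom  -- bounding-box overlap, served by the spatial index
        -- ORDER BY
        --     random()
        LIMIT
            50
    ) AS pts ON TRUE  -- LEFT JOIN keeps cells without any points (NULL columns)
;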
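For cells that are not axis-aligned rectangles (hexagons, rotated or irregular polygons), the bounding-box test also picks up points that lie outside the cell itself. A minimal sketch of the same query with ST_Intersects, assuming the table and column names used in this answer:

```sql
SELECT
    grd.id AS grid_id,
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            _pts.id,
            _pts.device_id,
            _pts.geom,
            _pts.some_data
        FROM
            points_table AS _pts
        WHERE
            -- exact geometry test; PostGIS still pre-filters by bounding box,
            -- so the spatial index on points_table.geom is used
            ST_Intersects(grd.geom, _pts.geom)
        LIMIT
            50
    ) AS pts ON TRUE
;
```

With either predicate, a point lying exactly on an edge shared by two cells matches both of them.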
Update:
I just read that you eventually want unique device_id samples; use a rank-and-fetch query:
SELECT
    grd.id AS grid_id,  -- aliased so it does not clash with pts.id
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            id,
            device_id,
            geom
        FROM (
            -- number the points of each device within the current cell
            SELECT
                id,
                device_id,
                geom,
                ROW_NUMBER() OVER(PARTITION BY device_id) AS _rank
            FROM
                points_table AS __pts
            WHERE
                grd.geom && __pts.geom
        ) AS _pts
        WHERE
            _rank = 1  -- keep one point per device_id
        -- ORDER BY
        --     random()
        LIMIT
            50
    ) AS pts ON TRUE
;
You can influence ranking with an additional ORDER BY in the OVER() clause.
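As a sketch of that, the window below picks a random point per device rather than an arbitrary one. Ordering by a timestamp column would instead pick the earliest or latest fix, but no such column appears in the question, so random() is used here for illustration:

```sql
SELECT
    grd.id AS grid_id,
    pts.*
FROM
    grid_table AS grd
    LEFT JOIN LATERAL (
        SELECT
            id,
            device_id,
            geom
        FROM (
            SELECT
                __pts.id,
                __pts.device_id,
                __pts.geom,
                -- ORDER BY inside OVER() decides which row of each device gets rank 1
                ROW_NUMBER() OVER (
                    PARTITION BY __pts.device_id
                    ORDER BY random()
                ) AS _rank
            FROM
                points_table AS __pts
            WHERE
                grd.geom && __pts.geom
        ) AS _pts
        WHERE
            _rank = 1
        LIMIT
            50
    ) AS pts ON TRUE
;
```

Note that this only randomizes which point represents each device; which 50 devices survive the LIMIT is still arbitrary unless the outer ORDER BY random() is enabled as well.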
Note that it is mandatory to have a spatial index on points_table.geom for this to be efficient!
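A minimal sketch of creating that index, assuming points_table.geom is a PostGIS geometry column; the index name is a placeholder and not taken from the question:

```sql
-- GiST index on the point geometries; the name is hypothetical.
CREATE INDEX IF NOT EXISTS points_table_geom_idx
    ON points_table
    USING GIST (geom);

-- Refresh planner statistics so the new index is taken into account.
ANALYZE points_table;
```

Running EXPLAIN on the LATERAL query afterwards is a quick way to check whether the planner actually uses the index for the per-cell lookup.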
I've always wondered about lateral joins... I'll give it a bash. Thanks. – RedM, commented May 30, 2023 at 15:44
@RedM here it behaves similarly to an inlined, correlated SELECT *, ( SELECT <correlated_stuff> ... ) AS <alias> FROM <running_table> ... subquery; it is executed on every row in the running table, having access to the row values. However, unlike inline subqueries, LATERAL queries are FROM level members, enabling access to (sets of) composite return types (think: a table) rather than just a single value. And it has more to offer. – geozelot, commented May 30, 2023 at 15:58
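To illustrate the difference described in the comment above, here is a sketch contrasting the two forms, reusing the answer's table names. The count and the limit of 3 are arbitrary choices for illustration:

```sql
-- Inline correlated subquery: evaluated per grid row,
-- but it may only return a single value.
SELECT
    grd.id AS grid_id,
    (
        SELECT count(*)
        FROM points_table AS _pts
        WHERE grd.geom && _pts.geom
    ) AS n_points
FROM
    grid_table AS grd
;

-- LATERAL subquery: also evaluated per grid row,
-- but it may return several columns and several rows.
SELECT
    grd.id AS grid_id,
    pts.id AS point_id,
    pts.device_id
FROM
    grid_table AS grd
    CROSS JOIN LATERAL (
        SELECT
            _pts.id,
            _pts.device_id
        FROM
            points_table AS _pts
        WHERE
            grd.geom && _pts.geom
        LIMIT
            3
    ) AS pts
;
```

CROSS JOIN LATERAL drops grid cells without any matching points, whereas the LEFT JOIN LATERAL ... ON TRUE form used in the answer keeps them with NULL columns.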