PostgreSQL randomising combinations with LATERAL

Question 1

In the following example I have a table foo from which I'd like to pick out at random a row per group.

CREATE TABLE foo (
 line INT
);
INSERT INTO foo (line)
SELECT generate_series(0, 999, 1);

Let's say that I'd like to group by line % 10. I could do this with:

SELECT DISTINCT ON (bin) bin, line
FROM (
 SELECT line, line % 10 AS bin, random() x
 FROM foo
 ORDER BY x
) X

What I'd like to do is get random picks from each bin several times. I had thought I'd be able to do this with generate_series() and LATERAL

SELECT i, line, bin
FROM
(
 SELECT generate_series(1,3) i
) m,
LATERAL
(SELECT DISTINCT ON (bin) bin, line
FROM (
 SELECT line, line % 10 bin, random() x
 FROM foo
 ORDER BY x
) X
ORDER BY bin) Q
ORDER BY bin, i;

However, when I do this in PostgreSQL 9.5 I find I get the same line for a given bin for each iteration i, e.g.,

i;line;bin
1;530;0
2;530;0
3;530;0
1;611;1
2;611;1
3;611;1
...

I'm confused, as I thought the subquery containing the random() would be run differently for each line from the generate_series().

EDIT: I realised that I can achieve the same objective by generating more combinations and choosing from these with

SELECT DISTINCT ON (bin, round) round, bin, line
FROM (
 SELECT line, line % 10 as bin, round
 FROM foo, generate_series(1,3) round
 ORDER BY bin, random()
) X;

So my question is simply why didn't the first way work?

EDIT: The problem appears to be that LATERAL only acts like a for-loop if the subqueries are correlated in some way (thanks to @ypercube's comment). Hence my original approach can be fixed by adding the following small change

SELECT i, line, bin
FROM
(
 SELECT generate_series(1,3) i
) m,
LATERAL
(
SELECT DISTINCT ON (bin) bin, line
FROM (
 SELECT line, line % 10 bin, m.i, random() x -- <NOTE m.i HERE
 FROM foo
 ORDER BY x
) X
ORDER BY bin
LIMIT 3
) Q
ORDER BY bin, i;

Question 2

I've deleted my answer because, even the result is the same, it is not correct in this case.

Question 3

Comment here as the other answer was deleted: LATERAL and CROSS JOIN are equivalent if there is no correlation (inside the LATERAL subquery) ...

Question 4

... as , LATERAL is equivalent to CROSS JOIN LATERAL (if the precedence differences between , and CROSS JOIN are irrelevant).

Question 5

do you want this to be statistically true or actually true?

Question 6

@EvanCarroll I'm generating example data for an assignment for my students, so the sampling doesn't need to be very good. But I very much like your TABLESAMPLE approaches.

Question 7

I'd write the query like this, using LIMIT (3) instead of DISTINCT ON.

The generate_series(0, 9) is used to get all the distinct bins. You could use (SELECT DISTINCT line % 10 FROM foo) AS g (bin) instead, if the "bins" are not all the integers from 0 up to 9:

SELECT 
 g.bin, 
 ROW_NUMBER() OVER (PARTITION BY g.bin ORDER BY d.x) AS i,
 d.* 
FROM 
 generate_series(0, 9) AS g (bin), 
 LATERAL 
 ( SELECT f.*, random() x 
 FROM foo AS f 
 WHERE f.line % 10 = g.bin 
 ORDER BY x 
 LIMIT 3
 ) AS d
ORDER BY 
 bin, x ;

Also, if you don't need the random() number in the output, you could use ORDER BY random() in the subquery and remove x from the select and order by clauses - or replace ORDER BY d.x with ORDER BY d.line.

Question 8

Good idea about LIMIT, but I'm struggling to compare the results to the output from my original query (mostly because I've been up way too long...). Any chance you can edit it to give columns bin, i, line? Would I need something like row_number() for i?

Question 9

Yes, ROW_NUMBER() could be use. Editing the answer

Question 10

Your comment to my answer was the thing I really needed. I'll add a note in my question body. Thanks.

Question 11

What I'd like to do is get random picks from each bin several times.

There are lots of ways you can solve this problem. Each one introduces more randomness and takes more time.

TABLESAMPLE SYSTEM and tsm_system_rows
TABLESAMPLE BERNOULLI
Creating ad-hoc bins and rolling the dice with statistics.
Creating ad-hoc bins and ordering them randomly and picking.

In most circumstances, TABLEAMPLE SYSTEM and tsm_system_rows is plenty to get a "fair" sampling of the table. It has the added advantage of not having to visit the whole table.

In the event you need a more evenly spaced out sample, TABLESAMPLE BERNOULLI will visit the whole table and select from all of the pages inside.

In the event you want to proceed going ad-hoc, I think this will do you want too.

SELECT *
FROM (
 SELECT dense_rank() OVER (PARTITION BY bin ORDER BY random()), *
 FROM (
 SELECT line % 10 AS bin, line
 FROM foo 
 ) AS t
) AS t 
WHERE dense_rank <= 3
ORDER BY line;

ypercubeTM ypercubeTM 99.7k13 gold badges217 silver badges306 bronze badges · Accepted Answer · 2017-05-15 09:45:48Z

I'd write the query like this, using LIMIT (3) instead of DISTINCT ON.

The generate_series(0, 9) is used to get all the distinct bins. You could use (SELECT DISTINCT line % 10 FROM foo) AS g (bin) instead, if the "bins" are not all the integers from 0 up to 9:

SELECT 
 g.bin, 
 ROW_NUMBER() OVER (PARTITION BY g.bin ORDER BY d.x) AS i,
 d.* 
FROM 
 generate_series(0, 9) AS g (bin), 
 LATERAL 
 ( SELECT f.*, random() x 
 FROM foo AS f 
 WHERE f.line % 10 = g.bin 
 ORDER BY x 
 LIMIT 3
 ) AS d
ORDER BY 
 bin, x ;

Also, if you don't need the random() number in the output, you could use ORDER BY random() in the subquery and remove x from the select and order by clauses - or replace ORDER BY d.x with ORDER BY d.line.

Good idea about LIMIT, but I'm struggling to compare the results to the output from my original query (mostly because I've been up way too long...). Any chance you can edit it to give columns bin, i, line? Would I need something like row_number() for i?
Your comment to my answer was the thing I really needed. I'll add a note in my question body. Thanks.

Stack Exchange Network

PostgreSQL randomising combinations with LATERAL

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

PostgreSQL randomising combinations with LATERAL

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions