Deterministic order when using ORDER BY RANDOM() with PostgreSQL

Question 1

I'm using a query like:

SELECT * FROM items ORDER BY RANDOM()

All is well if the number of rows is low. In my tests however, I would like to have something reproducible to verify. This is why I'm seeding the random number generator:

SELECT setseed(0.123);
SELECT * FROM items ORDER BY RANDOM();

This works well: the order looks the same on each execution. Except that it's not completely reproducible. In some runs the test succeeds and I get the expected order and result; in other runs of the same test, I don't. Why is that?

Question 2

You might have better luck with a subquery:

SELECT setseed(0.123);
SELECT *
FROM (SELECT i.*, RANDOM() AS rand
 FROM items i
 ) i
ORDER BY rand;

The reason is that RANDOM() is called many times during the sorting. Some sorting algorithms are non-deterministic, and that affects which value each downstream row receives.

This isn't 100% guaranteed, because the rows of the subquery might still not be processed in the same order on every run (although they should be on a single-processor system). You can fix this by ordering on a hash rather than a random value. So:

order by md5(item_id || '0.123')

The item_id is assumed to be different on each row. The '0.123' is added as a salt so that you can easily change the ordering.
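
For reference, the hash-based ordering might look like this as a complete query. This is only a sketch: it assumes the table is named items and has a unique item_id column, as in the question. The explicit cast to text is added here for safety and is not part of the original snippet.

```sql
-- Sketch: repeatable pseudo-random order without random() or setseed().
-- Assumes items.item_id is unique; '0.123' is an arbitrary salt.
SELECT *
FROM items
ORDER BY md5(item_id::text || '0.123');
```

Because the sort key depends only on the row's own item_id and the salt, the result does not depend on the session, the seed, or the order in which rows are read. Changing the salt to another string gives a different, equally repeatable order.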

Question 3

The problem is linked to the fact that rows are first fetched in an unspecified order (if no ORDER BY clause is specified), and only then is the RANDOM() function called for each row. This means that the unspecified order will impact the row order after the ORDER BY RANDOM() is applied.

Example, using the same seed in both cases:

case 1

SELECT * FROM items
returns
item_1
item_2
item_3
item_4
SELECT * FROM items ORDER BY RANDOM();
may return
item_3
item_4
item_1
item_2

case 2

SELECT * FROM items
returns
item_4
item_3
item_2
item_1
SELECT * FROM items ORDER BY RANDOM();
may return
item_2
item_1
item_4
item_3

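The solution is then to sort the rows by a unique key before ordering them by RANDOM(), so that RANDOM() is always called for the rows in the same sequence. With the same seed, the end result is then deterministic.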
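
A minimal sketch of that idea, assuming item_id is a unique column of items (the alias sorted_items is made up for the example):

```sql
-- Seed the generator; this must run in the same session as the query below.
SELECT setseed(0.123);

-- Fix the input order first, then shuffle.
-- Assumption: the planner evaluates random() on the rows in the order
-- the subquery returns them, which is what happens in practice.
SELECT *
FROM (
    SELECT * FROM items ORDER BY item_id
) AS sorted_items
ORDER BY random();
```

Both statements have to run on the same connection, because the seed set by setseed() only applies to the current session.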

Question 4

You seem to want a repeatable random sort.

setseed() is the correct approach; however, you need to call it within the query itself, so that it applies to all further invocations of random().

Here is one solution using union all:

select item_id
from (
 select setseed(0.5), null item_id
 union all
 select null, item_id from items
 offset 1
) s
order by random()

This demonstrates how to proceed with a table that has only one column. You can extend this for more columns by adding more null columns to the first subquery (and accordingly listing the corresponding columns in the other union all member and in the outer query).
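
As an illustration of that extension, here is a sketch for two columns. The name column is hypothetical and only stands in for whatever second column your table has:

```sql
-- Sketch: same union all trick, extended to two columns.
-- "name" is a hypothetical second column of items.
select item_id, name
from (
    -- first member: seeds the generator and pads the other columns with nulls
    select setseed(0.5), null as item_id, null as name
    union all
    -- second member: the real rows, padded in the setseed() position
    select null, item_id, name from items
    offset 1   -- drop the seeding row from the result
) s
order by random();
```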

score 6 · Accepted Answer · 2020-06-30 17:43:27Z

