I have a table in Postgres that contains a couple of million rows. I checked on the internet and found the following:
SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;
It works, but it's really slow... is there another way to write that query, or a direct way to select a random row without reading the whole table? By the way, 'myid' is an integer, but it can be an empty field.
If you want to select multiple random rows, see this question: stackoverflow.com/q/8674718/247696 – Flimm, May 30, 2018
8 Answers
You might want to experiment with OFFSET, as in
SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;
Here N is the number of rows in mytable. You may need to first do a SELECT COUNT(*) to figure out the value of N.
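For example, a minimal two-step sketch (the literal 2000000 below just stands in for whatever the COUNT actually returns):
-- step 1: get the row count
SELECT COUNT(*) FROM mytable;  -- suppose this returns 2000000
-- step 2: use it as N
SELECT myid FROM mytable OFFSET floor(random() * 2000000) LIMIT 1;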
Update (by Antony Hatchkins)
You must use floor here:
SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;
Consider a table of 2 rows: random() * N generates 0 <= x < 2, and, for example, SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows because of implicit rounding to the nearest int.
PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE.
The syntax is
SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);
This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.
To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
...
This might not look so elegant, but it is probably faster than any of the other answers.
To decide whether you want to use BERNOULLI or SYSTEM, read about the difference at https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/
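If an exact COUNT is too slow, a rough sketch is to use the planner's estimate from the system catalog instead (pg_class.reltuples is standard PostgreSQL, but it is only an estimate and can be stale):
-- fast, approximate row count from the catalog
SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'my_table';
-- if approx_rows is around 10 million, SYSTEM(0.001) samples ~100 rows,
-- which is plenty for LIMIT 1:
SELECT * FROM my_table TABLESAMPLE SYSTEM (0.001) LIMIT 1;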
I tried this with a subquery and it worked fine. OFFSET, at least in PostgreSQL v8.4.4, works fine.
select * from mytable offset random() * (select count(*) from mytable) limit 1;
You need to use floor:
SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
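Putting the two together (the subquery count from the answer above plus the floor() fix) gives one self-contained statement; this is just a restatement, not a new technique:
SELECT myid FROM mytable
OFFSET floor(random() * (SELECT COUNT(*) FROM mytable))
LIMIT 1;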
Check this link out for some different options: http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
Update: (A.Hatchkins)
The summary of the (very) long article is as follows.
The author lists four approaches:
1) ORDER BY random() LIMIT 1;
-- slow
2) ORDER BY id where id>=random()*N LIMIT 1
-- nonuniform if there are gaps
3) random column -- needs to be updated every now and then
4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times
and suggests improving method #2 by using
5) ORDER BY id where id=random()*N LIMIT 1
with subsequent requeries if the result is empty.
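A sketch of method #5 as a PL/pgSQL function (the function name is hypothetical; it assumes mytable has a mostly dense integer primary key id in 1..N, as the methods above do):
CREATE OR REPLACE FUNCTION random_row_id(n bigint) RETURNS bigint AS $$
DECLARE
    result bigint;
BEGIN
    LOOP
        -- probe one random id; if it falls into a gap, just requery
        SELECT id INTO result FROM mytable
        WHERE id = (floor(random() * n) + 1)::bigint;
        IF FOUND THEN
            RETURN result;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;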
The easiest and fastest way to fetch a random row is to use the tsm_system_rows extension:
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
Then you can select the exact number of rows you want:
SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(1);
This is available with PostgreSQL 9.5 and later.
See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html
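The argument is simply the number of rows to return, so a usage sketch for a sample of ten rows is:
SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(10);
Note that, like SYSTEM, this method samples at the block level, so the result is not completely random and may show clustering effects, especially for small samples.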
I've come up with a very fast solution without TABLESAMPLE. Much faster than OFFSET random()*N LIMIT 1. It doesn't even require the table count.
The idea is to create an expression index with random but predictable data, for example md5(primary key).
Here is a test with 1M rows sample data:
create table randtest (id serial primary key, data int not null);
insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);
create index randtest_md5_id_idx on randtest (md5(id::text));
explain analyze
select * from randtest where md5(id::text)>md5(random()::text)
order by md5(id::text) limit 1;
Result:
Limit (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
-> Index Scan using randtest_md5_id_idx on randtest (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
Filter: (md5((id)::text) > md5((random())::text))
Rows Removed by Filter: 1831
Total runtime: 6.245 ms
This query can sometimes (with about 1/number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also, the probabilities aren't exactly uniform; some rows are more probable than others.
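A sketch of the check-and-rerun step as an anonymous PL/pgSQL block (the DO wrapper is my addition; the table and column names are from the test above):
DO $$
DECLARE
    picked randtest%ROWTYPE;
BEGIN
    LOOP
        SELECT * INTO picked FROM randtest
        WHERE md5(id::text) > md5(random()::text)
        ORDER BY md5(id::text) LIMIT 1;
        EXIT WHEN FOUND;  -- the rare empty result: retry with a fresh random()
    END LOOP;
    RAISE NOTICE 'picked id = %', picked.id;
END $$;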
For comparison:
explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;
Results vary widely, but can be pretty bad:
Limit (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
-> Seq Scan on randtest (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
Total runtime: 179.211 ms
(3 rows)
I added a randomly generated number to each row, generated in my programming language. When calling, I pass a random number to the query (in this case 0.27):
SELECT * FROM
(
(SELECT id, random FROM t where <condition> and random >= 0.27 ORDER BY random LIMIT 1)
UNION ALL
(SELECT id, random FROM t where <condition> and random < 0.27 ORDER BY random DESC LIMIT 1)
) as results
ORDER BY abs(0.27-random) LIMIT 1;
(Query taken from here)
If you have an index on the columns in your condition plus the random column (containing the random numbers), I get a result in 6 ms on my 8.5-million-row table. This is orders of magnitude faster than using anything like ORDER BY random().
To improve randomness, you can also generate a new random number for each row you have hit (see the sketch below). Without this, some rows will occur more often than others.
Unlike TABLESAMPLE, this approach also supports conditions.
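A setup sketch for this scheme (the table name t and the <condition> placeholder come from the query above; the column name random and the index definition are assumptions, since the answer doesn't show its schema):
-- add and populate the random column
ALTER TABLE t ADD COLUMN random double precision;
UPDATE t SET random = random();
-- the answer recommends indexing the condition columns together with the
-- random column; with the condition unknown, index at least random:
CREATE INDEX t_random_idx ON t (random);
-- after serving a hit, refresh that row's value so it isn't favored again
-- (:picked_id is a placeholder for the id you just returned):
UPDATE t SET random = random() WHERE id = :picked_id;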