I have a table in Postgres that contains a couple of million rows. I checked on the internet and found the following:
SELECT myid FROM mytable ORDER BY RANDOM() LIMIT 1;
It works, but it's really slow... is there another way to write that query, or a direct way to select a random row without reading the whole table? By the way, 'myid' is an integer, but it can be an empty field.
If you want to select multiple random rows, see this question: stackoverflow.com/q/8674718/247696 – Flimm, May 30, 2018
8 Answers
You might want to experiment with OFFSET, as in
SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;
Here N is the number of rows in mytable. You may need to first do a SELECT COUNT(*) to figure out the value of N.
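For example, a minimal two-step sketch (the literal 2000000 below just stands in for whatever the COUNT actually returns):
-- step 1: get the row count
SELECT COUNT(*) FROM mytable;  -- suppose this returns 2000000
-- step 2: use it as N
SELECT myid FROM mytable OFFSET floor(random() * 2000000) LIMIT 1;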
Update (by Antony Hatchkins)
You must use floor here:
SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;
Consider a table of 2 rows: random() * N generates 0 <= x < 2, and, for example, SELECT myid FROM mytable OFFSET 1.7 LIMIT 1; returns 0 rows because of implicit rounding to the nearest int.
PostgreSQL 9.5 introduced a new approach for much faster sample selection: TABLESAMPLE.
The syntax is
SELECT * FROM my_table TABLESAMPLE BERNOULLI(percentage);
SELECT * FROM my_table TABLESAMPLE SYSTEM(percentage);
This is not the optimal solution if you want only one row selected, because you need to know the COUNT of the table to calculate the exact percentage.
To avoid a slow COUNT and use fast TABLESAMPLE for tables from 1 row to billions of rows, you can do:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.000001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.00001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.0001) LIMIT 1;
-- if you got no result:
SELECT * FROM my_table TABLESAMPLE SYSTEM(0.001) LIMIT 1;
...
This might not look so elegant, but it is probably faster than any of the other answers.
To decide whether you want to use BERNOULLI or SYSTEM, read about the difference at https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/
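If an exact COUNT is too slow, a rough sketch is to use the planner's estimate from the system catalog instead (pg_class.reltuples is standard PostgreSQL, but it is only an estimate and can be stale):
-- fast, approximate row count from the catalog
SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'my_table';
-- if approx_rows is around 10 million, SYSTEM(0.001) samples ~100 rows,
-- which is plenty for LIMIT 1:
SELECT * FROM my_table TABLESAMPLE SYSTEM (0.001) LIMIT 1;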
I tried this with a subquery and it worked fine. OFFSET, at least in PostgreSQL v8.4.4, works fine.
select * from mytable offset random() * (select count(*) from mytable) limit 1;
You need to use floor:
SELECT myid FROM mytable OFFSET floor(random()*N) LIMIT 1;
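Putting the two together (the subquery count from the answer above plus the floor() fix) gives one self-contained statement; this is just a restatement, not a new technique:
SELECT myid FROM mytable
OFFSET floor(random() * (SELECT COUNT(*) FROM mytable))
LIMIT 1;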
Check this link out for some different options: http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
Update: (A.Hatchkins)
The summary of the (very) long article is as follows.
The author lists four approaches:
1) ORDER BY random() LIMIT 1;
-- slow
2) ORDER BY id where id>=random()*N LIMIT 1
-- nonuniform if there are gaps
3) random column -- needs to be updated every now and then
4) custom random aggregate -- cunning method, could be slow: random() needs to be generated N times
and suggests improving method #2 by using
5) ORDER BY id where id=random()*N LIMIT 1
with subsequent requeries if the result is empty.
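A sketch of method #5 as a PL/pgSQL function (the function name is hypothetical; it assumes mytable has a mostly dense integer primary key id in 1..N, as the methods above do):
CREATE OR REPLACE FUNCTION random_row_id(n bigint) RETURNS bigint AS $$
DECLARE
    result bigint;
BEGIN
    LOOP
        -- probe one random id; if it falls into a gap, just requery
        SELECT id INTO result FROM mytable
        WHERE id = (floor(random() * n) + 1)::bigint;
        IF FOUND THEN
            RETURN result;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;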
The easiest and fastest way to fetch a random row is to use the tsm_system_rows extension:
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
Then you can select the exact number of rows you want:
SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(1);
This is available with PostgreSQL 9.5 and later.
See: https://www.postgresql.org/docs/current/static/tsm-system-rows.html
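The argument is simply the number of rows to return, so a usage sketch for a sample of ten rows is:
SELECT myid FROM mytable TABLESAMPLE SYSTEM_ROWS(10);
Note that, like SYSTEM, this method samples at the block level, so the result is not completely random and may show clustering effects, especially for small samples.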
I've come up with a very fast solution without TABLESAMPLE. Much faster than OFFSET random()*N LIMIT 1. It doesn't even require the table count.
The idea is to create an expression index with random but predictable data, for example md5(primary key).
Here is a test with 1M rows sample data:
create table randtest (id serial primary key, data int not null);
insert into randtest (data) select (random()*1000000)::int from generate_series(1,1000000);
create index randtest_md5_id_idx on randtest (md5(id::text));
explain analyze
select * from randtest where md5(id::text)>md5(random()::text)
order by md5(id::text) limit 1;
Result:
Limit (cost=0.42..0.68 rows=1 width=8) (actual time=6.219..6.220 rows=1 loops=1)
-> Index Scan using randtest_md5_id_idx on randtest (cost=0.42..84040.42 rows=333333 width=8) (actual time=6.217..6.217 rows=1 loops=1)
Filter: (md5((id)::text) > md5((random())::text))
Rows Removed by Filter: 1831
Total runtime: 6.245 ms
This query can sometimes (with about 1/number_of_rows probability) return 0 rows, so it needs to be checked and rerun. Also, the probabilities aren't exactly uniform; some rows are more probable than others.
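A sketch of the check-and-rerun step as an anonymous PL/pgSQL block (the DO wrapper is my addition; the table and column names are from the test above):
DO $$
DECLARE
    picked randtest%ROWTYPE;
BEGIN
    LOOP
        SELECT * INTO picked FROM randtest
        WHERE md5(id::text) > md5(random()::text)
        ORDER BY md5(id::text) LIMIT 1;
        EXIT WHEN FOUND;  -- the rare empty result: retry with a fresh random()
    END LOOP;
    RAISE NOTICE 'picked id = %', picked.id;
END $$;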
For comparison:
explain analyze SELECT id FROM randtest OFFSET random()*1000000 LIMIT 1;
Results vary widely, but can be pretty bad:
Limit (cost=1442.50..1442.51 rows=1 width=4) (actual time=179.183..179.184 rows=1 loops=1)
-> Seq Scan on randtest (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.016..134.835 rows=915702 loops=1)
Total runtime: 179.211 ms
(3 rows)
I added a randomly generated number to each row, generated in my programming language. When calling, I pass a random number to the query (in this case 0.27):
SELECT * FROM
(
(SELECT id, random FROM t where <condition> and random >= 0.27 ORDER BY random LIMIT 1)
UNION ALL
(SELECT id, random FROM t where <condition> and random < 0.27 ORDER BY random DESC LIMIT 1)
) as results
ORDER BY abs(0.27-random) LIMIT 1;
(Query taken from here)
If you have an index on the columns in your condition plus the random column (containing the random numbers), I get a result in 6 ms on my 8.5-million-row table. This is orders of magnitude faster than using anything like ORDER BY random().
To improve randomness, you can also generate a new random number for each row you have hit (see the sketch below). Without this, some rows will occur more often than others.
Unlike TABLESAMPLE, this approach also supports conditions.
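A setup sketch for this scheme (the table name t and the <condition> placeholder come from the query above; the column name random and the index definition are assumptions, since the answer doesn't show its schema):
-- add and populate the random column
ALTER TABLE t ADD COLUMN random double precision;
UPDATE t SET random = random();
-- the answer recommends indexing the condition columns together with the
-- random column; with the condition unknown, index at least random:
CREATE INDEX t_random_idx ON t (random);
-- after serving a hit, refresh that row's value so it isn't favored again
-- (:picked_id is a placeholder for the id you just returned):
UPDATE t SET random = random() WHERE id = :picked_id;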