I'm familiar with selecting random rows from a table in PostgreSQL, but it's really slow. I've looked into some other databases and they all seem to be bad at it, mostly because it's not a normal use case for any of them.
Is there any type of database that can select a random row quickly? The only hard requirement is that it be installable on Linux.
So far I have a table with 500 million rows, and I anticipate it growing to a few billion in the near future. I'd like to be able to select a random row, and also to delete that same row a short time later, so it will need some sort of key/index for that.
Right now my PostgreSQL table has 3 columns: a text field, a date, and an integer sequence id. The date is not important and doesn't need to be in the new database.
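Roughly, the table looks like this (the column names are just placeholders, not my exact ones, and bigserial is an assumption based on the row counts):

CREATE TABLE mytable (
    id      bigserial PRIMARY KEY,  -- the integer sequence id
    body    text,                   -- the text field
    created date                    -- the date (not important going forward)
);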
- Keep the min/max sequence ids somewhere, and keep generating a random number between them until it hits a row? – Mat, Jul 31, 2015 at 8:11
- So how do you select the random row? Show us your query, table definitions and the explain plan. – user1822, Jul 31, 2015 at 8:29
1 Answer
I was using one of two ways to get a random row:
SELECT * FROM mytable ORDER BY RANDOM();
and
SELECT * FROM mytable LIMIT 1 OFFSET <random number>;
The offset method was fast for small offsets but very slow otherwise: an offset of 1 million took ~600 msec, an offset of 100 million took 60 seconds, and the average query time around the table's midpoint was about 2.5 minutes. With ORDER BY RANDOM(), every query took roughly 5 minutes.
I ended up finding a good solution, similar to Mat's comment, but I don't need to keep track of how many rows there are.
I added an integer column called "randval" with a btree index. When a record is saved, I store a random number between 1 and 2 billion in it. It took a while to backfill the existing data (about half a day on slow hardware), but now random selects are super fast, typically around 1 millisecond, using this query:
SELECT * FROM mytable WHERE randval >= <random number> ORDER BY randval LIMIT 1;
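For reference, the schema change behind that query looks roughly like this (one big UPDATE is shown for simplicity; on 500 million rows you'd likely backfill in batches, and the column default is just one way to handle new rows if you don't set randval in the application):

ALTER TABLE mytable ADD COLUMN randval integer;

-- Backfill existing rows with a random value between 1 and 2 billion.
-- (This is the slow, half-a-day part on a table this size.)
UPDATE mytable SET randval = 1 + floor(random() * 2000000000)::integer;

-- Btree is the default index type.
CREATE INDEX mytable_randval_idx ON mytable (randval);

-- New rows need a value too; a volatile column default is one option.
ALTER TABLE mytable ALTER COLUMN randval SET DEFAULT 1 + floor(random() * 2000000000)::integer;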
With 500 million rows, not every random number is actually present, so the ORDER BY with LIMIT returns the next-closest value and the index keeps the lookup fast: roughly a 150,000x speedup over what had been an average 2.5-minute query.
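Two follow-up details worth noting (the id column name here is a placeholder; the question mentions an integer sequence id but not what it's called): the drawn number can land above the highest stored randval, in which case the query above returns nothing and needs a wrap-around fallback, and deleting the chosen row afterwards just keys on its id.

-- If no randval is >= the drawn number, fall back to the lowest one (wrap around):
SELECT * FROM mytable ORDER BY randval LIMIT 1;

-- Deleting the selected row a short time later, keyed on the sequence id:
DELETE FROM mytable WHERE id = <id of the selected row>;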
Databases don't like doing random things, but they sure do well at specific things that just happen to have random values.