I'm familiar with selecting random rows from a table in PostgreSQL, but it's really slow. I've looked into some other databases and they all seem to be bad at it, mostly because it's not a normal use case for any of them.
Is there any type of database that can select a random row quickly? The only hard requirement is that it be installable on Linux.
So far I have a table with 500 million rows, and I anticipate it growing to a few billion in the near future. I'd like to be able to select a random row, and also to delete that same row a short time later, so it will need some sort of key/index for that.
Right now my PostgreSQL table has 3 columns: a text field, a date, and an integer sequence id. The date is not important and doesn't need to be in the new database.
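Roughly, the table looks like this (the column names are just placeholders, not my exact ones, and bigserial is an assumption based on the row counts):

CREATE TABLE mytable (
    id      bigserial PRIMARY KEY,  -- the integer sequence id
    body    text,                   -- the text field
    created date                    -- the date (not important going forward)
);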
- Keep the min/max sequence ids somewhere, and keep generating a random number between them until it hits a row? – Mat, Jul 31, 2015 at 8:11
- So how do you select the random row? Show us your query, table definitions and the explain plan. – user1822, Jul 31, 2015 at 8:29
1 Answer
I was using one of two ways to get a random row:
SELECT * FROM mytable ORDER BY RANDOM();
and
SELECT * FROM mytable LIMIT 1 OFFSET <random number>;
The offset method was fast for small offsets but very slow otherwise: an offset of 1 million took ~600 msec, an offset of 100 million took 60 seconds, and the average query time around the table's midpoint was about 2.5 minutes. With ORDER BY RANDOM(), every query took roughly 5 minutes.
I ended up finding a good solution, similar to Mat's comment, but I don't need to keep track of how many rows there are.
I added an integer column called "randval" with a btree index. When a record is saved, I store a random number between 1 and 2 billion in it. It took a while to backfill the existing data (about half a day on slow hardware), but now random selects are super fast, typically around 1 millisecond, using this query:
SELECT * FROM mytable WHERE randval >= <random number> ORDER BY randval LIMIT 1;
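For reference, the schema change behind that query looks roughly like this (one big UPDATE is shown for simplicity; on 500 million rows you'd likely backfill in batches, and the column default is just one way to handle new rows if you don't set randval in the application):

ALTER TABLE mytable ADD COLUMN randval integer;

-- Backfill existing rows with a random value between 1 and 2 billion.
-- (This is the slow, half-a-day part on a table this size.)
UPDATE mytable SET randval = 1 + floor(random() * 2000000000)::integer;

-- Btree is the default index type.
CREATE INDEX mytable_randval_idx ON mytable (randval);

-- New rows need a value too; a volatile column default is one option.
ALTER TABLE mytable ALTER COLUMN randval SET DEFAULT 1 + floor(random() * 2000000000)::integer;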
With 500 million rows, not every random number is actually present, so the ORDER BY with LIMIT returns the next-closest value and the index keeps the lookup fast: roughly a 150,000x speedup over what had been an average 2.5-minute query.
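Two follow-up details worth noting (the id column name here is a placeholder; the question mentions an integer sequence id but not what it's called): the drawn number can land above the highest stored randval, in which case the query above returns nothing and needs a wrap-around fallback, and deleting the chosen row afterwards just keys on its id.

-- If no randval is >= the drawn number, fall back to the lowest one (wrap around):
SELECT * FROM mytable ORDER BY randval LIMIT 1;

-- Deleting the selected row a short time later, keyed on the sequence id:
DELETE FROM mytable WHERE id = <id of the selected row>;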
Databases don't like doing random things, but they sure do well at specific things that just happen to have random values.