
I'm familiar with selecting random rows from a table in PostgreSQL, and it's really slow. I've looked into some other databases and they all seem to be bad at it, mostly because it's not a normal use case for any of them.

Is there any type of database that can select a random row quickly? The only hard requirement is that it be installable on Linux.

So far I have a table with 500 million rows, which I anticipate growing to a few billion in the near future, and I would like to be able to select a random row. I'd also like to be able to delete that same row a short time later, so it'll need some sort of key/index for that.

Right now my PostgreSQL schema has three columns: a text field, a date, and an integer sequence id. The date is not important and doesn't need to be carried over to the new database.
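
Roughly, the table looks like this (column names here are illustrative; only the overall shape is fixed):

CREATE TABLE mytable (
    id      bigserial PRIMARY KEY,   -- the integer sequence id
    body    text NOT NULL,           -- the text field
    created date                     -- not important; won't be migrated
);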

asked Jul 31, 2015 at 7:52
  • Keep the min/max sequence ids somewhere, and keep generating a random number between them until it hits a row? Commented Jul 31, 2015 at 8:11
  • So how do you select the random row? Show us your query, table definitions and the explain plan. Commented Jul 31, 2015 at 8:29

1 Answer


I was using one of two ways to get a random row:

SELECT * FROM mytable ORDER BY RANDOM();

and

SELECT * FROM mytable LIMIT 1 OFFSET <random number>;

The offset method was fast for small offsets but very slow otherwise: an offset of 1 million took ~600 ms, an offset of 100 million took about 60 seconds, and the average query time for an offset near the table's midpoint was 2.5 minutes. With ORDER BY RANDOM(), every query took roughly 5 minutes.

I ended up finding a good solution, similar to Mat's comment, but I don't need to keep track of how many rows there are.

I added an integer column called "randval" with a btree index. When a record is saved, I generate a random number between 1 and 2 billion and store it in that column. It took a while to backfill the existing data (about half a day on slow hardware), but random selects are now super fast, typically around 1 millisecond with this query:

SELECT * FROM mytable WHERE randval >= <random number> ORDER BY randval LIMIT 1;
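
The schema change looked roughly like this (the DEFAULT shown here is just one way to generate the value at save time, and the index name is made up):

-- Add the column and the btree index.
ALTER TABLE mytable ADD COLUMN randval integer;
CREATE INDEX mytable_randval_idx ON mytable (randval);

-- Give new rows a random value between 1 and 2 billion.
ALTER TABLE mytable
    ALTER COLUMN randval SET DEFAULT (1 + floor(random() * 2000000000))::integer;

-- Backfill existing rows (this was the slow migration step).
UPDATE mytable SET randval = (1 + floor(random() * 2000000000))::integer
WHERE randval IS NULL;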

With 500 million rows, not every possible random number is present in the column, so the ORDER BY with LIMIT returns the row with the next closest randval, and the index makes that lookup fast: roughly a 150,000x speedup over what had been an average 2.5-minute query.
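
Since I also need to delete the selected row shortly afterwards, the pick and the delete can be combined into one statement keyed on the sequence id (assumed here to be called "id"):

DELETE FROM mytable
WHERE id = (
    SELECT id FROM mytable
    WHERE randval >= <random number>
    ORDER BY randval
    LIMIT 1
)
RETURNING *;

If the random number happens to be larger than the biggest randval in the table, the subquery finds nothing and no row is deleted; retrying with a new number covers that edge case.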

Databases don't like doing random things, but they sure do well at specific things that just happen to have random values.

answered Aug 11, 2015 at 18:08
