I'm about to create a Postgres table with 1.5 billion rows. The table will just have a single TEXT column.
The table is effectively a blacklist. When a user of my software saves certain data, this table is looked-up to make sure the value they're saving doesn't exist in it.
What can I do to optimise Postgres or that table to make that "read" as fast as possible? The table will only ever be written to approximately once a year.
2 Answers
Add a second column that holds a hash of the text value, and create the index on the hash. Even if there is a hash collision, there will only be a few rows to read and compare in full against the text values.
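As a rough illustration of the idea (in plain Python rather than SQL, with made-up values), the hash narrows the search to a handful of candidates, which are then compared in full:

```python
import hashlib
from collections import defaultdict

def md5_hex(value: str) -> str:
    """Return the hex MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Simulated "index": hash -> list of full values sharing that hash.
# In Postgres, the index on the hash column plays this role.
index = defaultdict(list)
for value in ["banned-word", "another-entry", "something-else"]:
    index[md5_hex(value)].append(value)

def is_blacklisted(candidate: str) -> bool:
    # Narrow by hash first, then confirm with a full comparison
    # to guard against (rare) hash collisions.
    return candidate in index[md5_hex(candidate)]

print(is_blacklisted("banned-word"))     # True
print(is_blacklisted("harmless-value"))  # False
```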
-
You can skip the extra column and create a functional index: CREATE INDEX foo ON t1 ((md5(c1))); Use WHERE md5(c1) = md5('content') in your SELECT and you'll be fine. Check EXPLAIN. – Frank Heikens, Jan 13, 2015 at 12:12
-
Hashing rang a bell, but I had to go check how it made things faster. This explanation helped. Is my understanding right — that I'd want to hash with MD5 so there are collisions, index the hash, and then the number of places to "look" for the data is far smaller than just indexing the column with the raw data in it? – Turgs, Jan 13, 2015 at 13:00
-
Yes – by hashing you can efficiently find a small number of text values to compare fully. – Michael Green, Jan 13, 2015 at 20:38
-
Using md5 will result in an index that is, in practice, unique. You cannot constrain it to be unique, because you might find a hash collision – more likely than you'd think due to the birthday paradox, but still quite unlikely. In practice, though, the md5 will uniquely identify a value, so an index lookup for it will find just the page containing that value. – Craig Ringer, Jan 17, 2015 at 4:41
For what it's worth, to get the read speed I needed I ended up taking a different approach (dare I say it on a database Q&A site!).
Instead of using a database table to store the data, I created a text file with one line per value, sorted alphabetically.
Whenever I need to check whether a given value exists, I use a binary search over the file. I haven't gathered any performance metrics beyond observation, but it was clearly faster for my needs.
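A minimal sketch of that lookup in Python. It is simplified: the sorted values are held in memory and searched with the stdlib bisect module, whereas a 1.5-billion-line file would require seeking within the file itself; the sample values are made up:

```python
import bisect

# Stand-in for the sorted text file: one blacklisted value per "line",
# sorted alphabetically so binary search applies.
blacklist = sorted(["apple", "banned-word", "zebra"])

def is_blacklisted(candidate: str) -> bool:
    # bisect_left performs a binary search: O(log n) comparisons.
    i = bisect.bisect_left(blacklist, candidate)
    return i < len(blacklist) and blacklist[i] == candidate

print(is_blacklisted("banned-word"))  # True
print(is_blacklisted("orange"))       # False
```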
-
Which is essentially the same thing as creating the index on the MD5 sum suggested by Frank. A lookup on an index is a binary search as well. – user1822, Jul 31, 2015 at 10:54
-
@a_horse_with_no_name Oh, OK, fair enough. I didn't realise that! Is there a link somewhere that explains this for indexes in Postgres? I've primarily worked with other databases that take a different approach, and I always wondered how Postgres did it but couldn't really find anything. – Turgs, Jul 31, 2015 at 11:00
-
Essentially all DBMSs use some kind of tree to represent an index (a "B-Tree", i.e. balanced tree). I'm not aware of any DBMS that does this differently for standard indexes. Postgres' indexes are explained in the manual: postgresql.org/docs/current/static/indexes.html – user1822, Jul 31, 2015 at 11:09