I'm about to create a Postgres table with 1.5 billion rows. The table will just have a single TEXT column.
The table is effectively a blacklist. When a user of my software saves certain data, this table is looked-up to make sure the value they're saving doesn't exist in it.
What can I do to optimise Postgres or that table to make that "read" as fast as possible? The table will only ever be written to approximately once a year.
2 Answers
Add a second column that holds a hash of the text value, and create the index on the hash. Even if there is a hash collision, there will only be a few rows to read and compare in full against the text values.
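As a rough illustration of the idea (in plain Python rather than SQL, with made-up values), the hash narrows the search to a handful of candidates, which are then compared in full:

```python
import hashlib
from collections import defaultdict

def md5_hex(value: str) -> str:
    """Return the hex MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Simulated "index": hash -> list of full values sharing that hash.
# In Postgres, the index on the hash column plays this role.
index = defaultdict(list)
for value in ["banned-word", "another-entry", "something-else"]:
    index[md5_hex(value)].append(value)

def is_blacklisted(candidate: str) -> bool:
    # Narrow by hash first, then confirm with a full comparison
    # to guard against (rare) hash collisions.
    return candidate in index[md5_hex(candidate)]

print(is_blacklisted("banned-word"))     # True
print(is_blacklisted("harmless-value"))  # False
```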
-
You can skip the extra column and create a functional index: CREATE INDEX foo ON t1 ((md5(c1))); Use WHERE md5(c1) = md5('content') in your SELECT and you'll be fine. Check EXPLAIN. – Frank Heikens, Jan 13, 2015 at 12:12
-
Hashing rang a bell, but I had to go check how it made things faster. This explanation helped. Is my understanding right — that I'd want to hash with MD5 so there are collisions, index the hash, and then the number of places to "look" for the data is far smaller than just indexing the column with the raw data in it? – Turgs, Jan 13, 2015 at 13:00
-
Yes – by hashing you can efficiently find a small number of text values to compare fully. – Michael Green, Jan 13, 2015 at 20:38
-
Using md5 will result in an index that is, in practice, unique. You cannot constrain it to be unique, because you might find a hash collision – more likely than you'd think due to the birthday paradox, but still quite unlikely. In practice, though, the md5 will uniquely identify a value, so an index lookup for it will find just the page containing that value. – Craig Ringer, Jan 17, 2015 at 4:41
For what it's worth, to get the read speed I needed I ended up taking a different approach (dare I say it on a database Q&A site!).
Instead of using a database table to store the data, I created a text file with one line per value, sorted alphabetically.
Whenever I need to check whether a given value exists, I use a binary search over the file. I haven't gathered any performance metrics beyond observation, but it was clearly faster for my needs.
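A minimal sketch of that lookup in Python. It is simplified: the sorted values are held in memory and searched with the stdlib bisect module, whereas a 1.5-billion-line file would require seeking within the file itself; the sample values are made up:

```python
import bisect

# Stand-in for the sorted text file: one blacklisted value per "line",
# sorted alphabetically so binary search applies.
blacklist = sorted(["apple", "banned-word", "zebra"])

def is_blacklisted(candidate: str) -> bool:
    # bisect_left performs a binary search: O(log n) comparisons.
    i = bisect.bisect_left(blacklist, candidate)
    return i < len(blacklist) and blacklist[i] == candidate

print(is_blacklisted("banned-word"))  # True
print(is_blacklisted("orange"))       # False
```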
-
Which is essentially the same thing as creating the index on the MD5 sum suggested by Frank. A lookup on an index is a binary search as well. – user1822, Jul 31, 2015 at 10:54
-
@a_horse_with_no_name Oh, OK, fair enough. I didn't realise that! Is there a link somewhere that explains this for indexes in Postgres? I've primarily worked with other databases that take a different approach, and I always wondered how Postgres did it but couldn't really find anything. – Turgs, Jul 31, 2015 at 11:00
-
Essentially all DBMSs use some kind of tree to represent an index (a "B-Tree", i.e. balanced tree). I'm not aware of any DBMS that does this differently for standard indexes. Postgres' indexes are explained in the manual: postgresql.org/docs/current/static/indexes.html – user1822, Jul 31, 2015 at 11:09