I'm new to PostgreSQL and somewhat new to databases in general. Is there an established way to index UUID values in Postgres? I'm split between using hashing and using a trie, unless there's already something built-in that it uses automatically. Whatever I use is going to be handling huge amounts of data.
The SP-GiST operator family "text_ops" indexes using a trie. Because UUIDs are quite long and very dissimilar, these sound appealing even though I would only ever do full match searches.
There's also a hash option. Hashing is O(1), and I won't need to do any comparisons besides equality of course, but because UUIDs are quite long, I'm afraid that generating hashes from them would waste a lot of time.
Or is this something that depends too much on system and use specifics?
I'd rather use bigserial in most cases, but I've been told to use uuid for this. We need uuid because we might have multiple servers using different databases, so there isn't a guarantee that we'll have unique bigints. We could use a different sequence (and seed) for each server, but it's still not as flexible as UUIDs. For example, we wouldn't be able to migrate database entries from one server to another without converting the IDs and their references everywhere.
- I believe "federated database" is the buzzword for your situation. And, yes, UUIDs are the solution for that. That was the very reason UUIDs were invented decades ago: for sharing data amongst distributed systems without centralized coordination. – Basil Bourque, Aug 18, 2015
- Months later: Indeed, the "federated database" Basil Bourque brought up is what we're going for. Not only do we have multiple servers, but we have clients (which can be thought of as more parts of the federated DB) creating IDs while offline, too. That's why we use UUIDs. – sudo, Aug 19, 2015
- Years later: I'm cringing remembering what I did here. All we needed was bigserial primary keys for relations, not uuids, and the uuids were much slower. Uuids can be useful to store with a secondary index (default btree) purely as a key to communicate with the client instead of exposing the sequential one, for example as a post ID on a social app. If we ever sharded the DB, the internal IDs might change; that's fine. On the algos side, btree makes no sense here. And hashmaps aren't really constant-time unless you're assuming everything is in uniform-speed RAM, which it's not here. – sudo, Aug 6, 2022
6 Answers
Use PostgreSQL's built-in uuid data type, and create a regular b-tree index on it.
There is no need to do anything special. This will result in an optimal index, and will also store the uuid field in as compact a form as is currently practical.
(Hash indexes in PostgreSQL prior to version 10 were not crash-safe and were really a historical relic that tended to perform no better than a b-tree anyway. Avoid them. On PostgreSQL 10 they've been made crash-safe and received some performance improvements, so you may wish to consider them.)
If for some reason you could not use the uuid type, you would generally create a b-tree index on the text representation or, preferably, a bytea representation of the uuid.
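A minimal sketch of this approach (the table and column names are hypothetical; gen_random_uuid() is built in since PostgreSQL 13, and available from the pgcrypto extension before that):

```sql
-- Native uuid column as primary key; the PRIMARY KEY constraint
-- creates a b-tree index on it automatically.
CREATE TABLE items (
    id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    name text NOT NULL
);

-- Equality lookups go through the b-tree index:
SELECT * FROM items WHERE id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';
```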
- A couple of years later, in my experience, hash hasn't been much faster than b-tree, even in Postgres 10. But since hash indexes take so much less disk space than b-tree, they might be faster in a setup where big indexes become a problem, which hasn't been the case for me. Well, I'll keep an eye out now that I can actually use them safely in v10. – sudo, Jan 30, 2018
A BRIN index? If you use time-based (version 1) UUIDs, they are generated so that their values increase over time, in which case BRIN is suitable.
https://www.postgresql.org/docs/9.5/brin-intro.html :
BRIN stands for Block Range Index. BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table. A block range is a group of pages that are physically adjacent in the table; for each block range, some summary info is stored by the index. For example, a table storing a store's sale orders might have a date column on which each order was placed, and most of the time the entries for earlier orders will appear earlier in the table as well; a table storing a ZIP code column might have all codes for a city grouped together naturally.
BRIN indexes can satisfy queries via regular bitmap index scans, and will return all tuples in all pages within each range if the summary info stored by the index is consistent with the query conditions. The query executor is in charge of rechecking these tuples and discarding those that do not match the query conditions — in other words, these indexes are lossy. Because a BRIN index is very small, scanning the index adds little overhead compared to a sequential scan, but may avoid scanning large parts of the table that are known not to contain matching tuples.
The specific data that a BRIN index will store, as well as the specific queries that the index will be able to satisfy, depend on the operator class selected for each column of the index. Data types having a linear sort order can have operator classes that store the minimum and maximum value within each block range, for instance; geometrical types might store the bounding box for all the objects in the block range.
The size of the block range is determined at index creation time by the pages_per_range storage parameter. The number of index entries will be equal to the size of the relation in pages divided by the selected value for pages_per_range. Therefore, the smaller the number, the larger the index becomes (because of the need to store more index entries), but at the same time the summary data stored can be more precise and more data blocks can be skipped during an index scan.
Perfect for huge and "mostly" ordered data.
See this post for some benchmarks:
https://www.percona.com/blog/2019/07/16/brin-index-for-postgresql-dont-forget-the-benefits/
They generated a 1.3 GB table of naturally ordered data (incrementing timestamps). Then they created a BRIN index (with pages_per_range = 32) and a B-Tree index on this table, and compared the SELECT execution time and the size of the indexes. What they got:
Index type | Planning time | Execution time | Index size
---|---|---|---
B-Tree (testtab_date_idx) | 22.225 ms | 2.657 ms | 171 MB
BRIN (testtab_date_brin_idx) | 0.272 ms | 87.703 ms | 64 kB
No index | 0.296 ms | 1766.454 ms | n/a

Just to give a sense of the orders of magnitude.
It is also worth discussing the complexity of updating each index after an INSERT. For BRIN it is O(1), since rows are written sequentially to the next free space in the heap and new BRIN entries are created accordingly; for a B-Tree, as we well know, it is O(log N): the taller the tree, the longer an insert takes.
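A sketch of how such a BRIN index might be created (the events table is hypothetical; as far as I know uuid_minmax_ops is the default BRIN operator class for uuid, so naming it explicitly is optional):

```sql
-- BRIN index on a time-ordered uuid column. pages_per_range trades
-- index size against precision: a smaller value means a larger but
-- more precise index that can skip more heap blocks.
CREATE INDEX events_id_brin
    ON events USING brin (id uuid_minmax_ops)
    WITH (pages_per_range = 32);
```

Remember this only pays off if the uuid values correlate with physical row order, e.g. sequential/time-based UUIDs inserted in creation order.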
- Unfortunately, none of the RFC 4122 compliant UUID versions/variants are lexicographically sortable. Even the time-based ones: you can see this if you go to any online generator and generate a few. The first hex segment varies quickly, while the more stable hex segments come after. This is the opposite of what you want for a lexicographically sortable value. This is why SQL Server has the noncompliant NEWSEQUENTIALID. If one is willing to go non-RFC for Postgres, there are ULIDs, which are made to be sortable. – DharmaTurtle, Aug 22, 2020
- It is worth noting that as of Postgres v12, only b-trees support unique indexes. This means one cannot use a BRIN index as a PK. One could use an integer or bigint as the PK and then have a ULID column that's BRIN indexed, but I would question whether the complexity is worth the benefits. B-trees work well with ULIDs since they're lexicographically sortable, i.e. new rows will only append to the tree rather than being inserted at random positions. – DharmaTurtle, Aug 22, 2020
Hash indexes are missing in action in PostgreSQL. PostgreSQL knows it needs hash indexes, and that its code for hash indexes is old and moldy, but they don't remove it because they are waiting for someone to come along and overhaul hash indexing. See this thread:
- Someone has rescued hash indexes, I'm guessing because they play a critical role in data partitioning, which Pg10 has been focusing on: wiki.postgresql.org/wiki/… But they still don't give you everything I've seen theoretically useful in college database class ;) – sudo, Jan 30, 2018
Use the default index type (i.e. B-Tree). Although there's not much in it, hash was not faster. hash doesn't support unique either (if you need that).
Results obtained given:

- a column of type uuid plus about 150 bytes' worth of other columns
- 10M rows
- creating each index type, then running analyze mytable
- running select * from mytable where myuuid = '<some uuid>'::uuid
- taking the middle timing value (including tool overhead; all results were within a narrowish range)
| Index type | Milliseconds |
|---|---|
| No index | 3600 |
| btree (default) | 90 |
| unique btree | 100 |
| hash | 100 |
| unique hash | not supported |
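For reference, a sketch of how the compared index variants might be created, assuming the mytable/myuuid names used above (in the benchmark each index would be created and dropped in turn):

```sql
-- The four variants from the table above:
CREATE INDEX mytable_myuuid_btree        ON mytable USING btree (myuuid);
CREATE UNIQUE INDEX mytable_myuuid_ubtree ON mytable (myuuid);
CREATE INDEX mytable_myuuid_hash         ON mytable USING hash (myuuid);
-- "unique hash" would fail: CREATE UNIQUE INDEX ... USING hash is not supported.

ANALYZE mytable;
```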
The solution as of 2023 is to use a sequential UUID algorithm that is uuid compatible. There are quite a few sequential (sortable by byte sequence) UUID implementations for PostgreSQL now. My favorite is the pg_uuidv7 extension by Florian Boulnois.
UUIDv7 is one of three new sequential UUID variants proposed as an extension to RFC 4122.
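A sketch of how the extension might be used; uuid_generate_v7() is, to the best of my knowledge, the function pg_uuidv7 exposes, but check the extension's README for your version:

```sql
-- Assumes the pg_uuidv7 extension is installed on the server.
CREATE EXTENSION IF NOT EXISTS pg_uuidv7;

-- Time-ordered, uuid-typed keys: new rows append to the right-hand
-- edge of the b-tree instead of landing on random pages.
CREATE TABLE events (
    id      uuid PRIMARY KEY DEFAULT uuid_generate_v7(),
    payload text
);
```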
- It looks like "Expires: 28 October 2021" means it wasn't approved? – MirroredFate, Mar 27, 2024
- @MirroredFate, I dunno. Might be. UUIDv7 is increasingly used in the wild, though, and I'd say that an expired standard proposal is still better than a number of the less formal variants of sortable, time-based UUIDs that I discuss in my blog post. – BigSmoke, Apr 2, 2024
You should consider ULIDs - they are monotonically increasing UUIDs that will be better for indexing than the current (as of v. 17) method of producing UUIDs.
The gen_random_uuid() function is not monotonically increasing, which means that, when you are INSERTing a large number of records, many blocks in the index will rapidly become "hot".
You can see this in the following code (fiddle here):
SELECT
x AS monotonic,
gen_random_uuid() AS gruuid
FROM
GENERATE_SERIES(1, 1000) AS t(x)
ORDER BY
gruuid
LIMIT 5;
Result (will vary from run to run!):
monotonic gruuid
657 00803ed4-3ee6-480d-8442-f1cbe5ba4a73
962 009e9b16-f3d1-4f50-b918-f6b5acfd542c
780 00d55e6c-df95-4e9c-86a7-1e330b007216
323 00f0a536-e57b-44dd-a214-64a0734bb228
617 01a4eda5-5eaf-4388-9854-e19032880a6f
So, it can be seen that the UUID-s are not monotonic.
With a monotonic index (much as with an IDENTITY column), only the new blocks will be hot, reducing your write load, both for the table itself and for WAL.
There's a good discussion here as well as an accompanying extension in Rust. The same site also has links to other ULID implementations for PostgreSQL (in C, C++, PL/pgSQL x 2, Go and another 2 Rust ones). The PL/pgSQL ones could be of use if installing extensions isn't permitted, either because of company policy or non-implementation by a cloud provider.
Hot off the presses!
A recent commit (~27 days ago - discussion here) means that V7 UUID-s (also monotonic) will be natively available shortly. See here for a discussion of the difference between V7 and ULID-s. Not sure how much of it applies to the PostgreSQL implementation.
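Based on the committed patch, the new generator function is expected to be called uuidv7(); treat this sketch as an assumption until the release actually ships:

```sql
-- Sketch, assuming the committed uuidv7() function name survives to release:
SELECT uuidv7();

-- Values generated in sequence should sort in generation order,
-- unlike the gen_random_uuid() example above:
SELECT x AS monotonic, uuidv7() AS u
FROM generate_series(1, 5) AS t(x)
ORDER BY u;
```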