Having md5(NULL) return a non-NULL value in PostgreSQL

Question 1

I have a table with the structure (content_md5 UUID, content TEXT), where I'd like to use content_md5 (which is the value of md5(content)) as the primary key, and use it as a foreign key in other tables.

This is a "static" table, where the content (some largish documents) would be referred to by their md5 value for simplicity, and to prevent duplication in the table (which wouldn't be a given with a simple SERIAL PKEY).

However, content can be NULL (as opposed to an empty string) to indicate that the referencing row has no content at all.
Since md5(NULL) returns NULL, and NULL is not allowed in a primary key constraint, I'd like to have a way of having md5(NULL) return all zeros instead of NULL.

Example:

-- setup
CREATE TABLE example (content_md5 UUID PRIMARY KEY, content TEXT);
CREATE TABLE test (id SERIAL PRIMARY KEY, tags TEXT, content_md5 UUID REFERENCES example(content_md5) ON DELETE RESTRICT);
INSERT INTO example VALUES ('00000000000000000000000000000000'::uuid, NULL);
-- usage
INSERT INTO example VALUES (md5('some text')::uuid, 'some text');
INSERT INTO test (tags, content_md5) VALUES ('some content defining tags', md5('some text')::uuid);
SELECT tags, content FROM test LEFT JOIN example USING (content_md5);
-- QUESTION: Having an md5-like function to return zero-filled "md5"/uuid?
INSERT INTO example VALUES (md5(NULL)::uuid, NULL); -- ignored, because already existing record
INSERT INTO test (tags, content_md5) VALUES ('non-existing-document', md5(NULL)::uuid);

Is it possible to somehow cast the returned value to a zero-filled string, create a custom function based on md5() which replaces NULL with 00000000000000000000000000000000, or some other way to achieve this result?

/edit: Or perhaps I don't need any NULL values in this table, and can just set the referencing foreign key column to NULL to achieve the same result?

Question 2

But won't 00000000000000000000000000000000 become a duplicate after the second NULL value - and therefore won't be able to serve as a PRIMARY KEY?

Question 3

@Vérace, as the PKEY, my intent is for it to always be the only zero-filled row to reference a content value of NULL. (The PKEY column is UNIQUE).

Question 4

Why would you have a user passing in a null value to this function and expect an md5 hash to be returned? If you have an undefined value, shouldn't the resulting hash also be undefined? Also, it should be noted that the returned value is a text string, not a uuid.

Question 5

Yes to the last sentence. In a normal db design, the "no content" would be represented by having no corresponding row in the contents table, and the referencing column being null. OTOH having a single dummy row to represent lack of contents is unusual and would lead to cumbersome queries too.
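
A minimal sketch of that conventional design, assuming hypothetical table names (contents, documents) in place of the question's example and test tables. The foreign key column is simply left nullable, so no dummy row is needed:

```sql
-- Hypothetical names; the question calls these tables "example" and "test".
CREATE TABLE contents (
    content_md5 uuid PRIMARY KEY,
    content     text NOT NULL
);

CREATE TABLE documents (
    id          serial PRIMARY KEY,
    tags        text,
    -- Nullable on purpose: NULL means "this row has no content".
    content_md5 uuid REFERENCES contents (content_md5) ON DELETE RESTRICT
);

INSERT INTO contents VALUES (md5('some text')::uuid, 'some text');

INSERT INTO documents (tags, content_md5) VALUES
    ('some content defining tags', md5('some text')::uuid),
    ('non-existing-document', NULL);   -- a NULL foreign key is not checked

-- LEFT JOIN keeps the row without content; its content column comes back NULL.
SELECT tags, content
FROM   documents
LEFT   JOIN contents USING (content_md5);
```

With this layout, md5() is only ever called on real content, so the md5(NULL) problem does not arise.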

Question 6

@DanielVérité, thanks! That's probably the better solution to my case, and eliminates the need for any "NULL" rows in the contents table, custom functions and the like.

Question 7

I suggest this alternative design:

-- setup
CREATE TABLE example (content_id serial PRIMARY KEY, content text);
CREATE TABLE test (id serial PRIMARY KEY, tags TEXT, content_id int REFERENCES example);
CREATE UNIQUE INDEX ON example ((md5(content)::uuid)) INCLUDE (content_id); -- !
-- usage
INSERT INTO example(content) VALUES (NULL); -- allowed multiple times
INSERT INTO example(content) VALUES ('some text');
INSERT INTO test (tags, content_id)
SELECT 'some content defining tags', content_id
FROM example
WHERE md5(content)::uuid = md5('some text')::uuid;

db<>fiddle here

Major points

Use a serial column (content_id) as surrogate PK of table example - and as FK reference everywhere. 4 bytes instead of 16.

Enforce uniqueness with a unique index on the expression md5(content)::uuid. Be aware that hash collisions are possible (even if very unlikely while your table isn't huge).

While you are at it, add the serial PK column to the index with an INCLUDE clause (Postgres 11 or later) to make it a covering index for fast index-only lookup.

As opposed to a PK column, this allows NULL, and NULL is not considered to be a duplicate of NULL, which should cover your use case. See:

Allow null in unique column
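
A small demonstration of that behavior, using a hypothetical temporary table (example_demo) so that it does not interfere with the tables above. The index is created without INCLUDE so the sketch should also run on older versions:

```sql
-- Hypothetical scratch table mirroring the layout of "example".
CREATE TEMP TABLE example_demo (content_id serial PRIMARY KEY, content text);
CREATE UNIQUE INDEX ON example_demo ((md5(content)::uuid));

INSERT INTO example_demo (content) VALUES (NULL);         -- succeeds
INSERT INTO example_demo (content) VALUES (NULL);         -- succeeds again: NULL keys never conflict
INSERT INTO example_demo (content) VALUES ('some text');  -- succeeds
INSERT INTO example_demo (content) VALUES ('some text');  -- fails with a unique violation
```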

In Postgres 10 or older, leave content_id out of the index. You then don't get index-only scans, of course:

CREATE UNIQUE INDEX ON example ((md5(content)::uuid));

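If you want to allow only a single instance of NULL, that could be enforced with a function like the one you posted (introducing the risk of a collision, even if unlikely) or with a tiny partial index in addition to the one above. The partial index has to be on a constant expression, because content_id is already unique for every row and indexing it would not reject a second NULL:

CREATE UNIQUE INDEX ON example ((content IS NULL))
WHERE content IS NULL;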

See:

Create unique constraint with null columns
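
To illustrate the single-NULL rule in isolation, here is a self-contained sketch on a hypothetical temporary table (example_single_null). It assumes the partial index is built on the constant expression (content IS NULL): every row matching the predicate then produces the same key, so a second NULL row conflicts with the first. An already-unique column such as content_id would give each row a different key and reject nothing.

```sql
-- Hypothetical scratch table mirroring the layout of "example".
CREATE TEMP TABLE example_single_null (content_id serial PRIMARY KEY, content text);

-- Complete index: rejects duplicate non-NULL content (by hash).
CREATE UNIQUE INDEX ON example_single_null ((md5(content)::uuid));

-- Tiny partial index: all NULL rows share the key "true", so only one fits.
CREATE UNIQUE INDEX ON example_single_null ((content IS NULL))
WHERE  content IS NULL;

INSERT INTO example_single_null (content) VALUES (NULL);         -- succeeds
INSERT INTO example_single_null (content) VALUES ('some text');  -- succeeds
INSERT INTO example_single_null (content) VALUES (NULL);         -- fails with a unique violation
```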

Do not store the md5 value as table column (redundantly) at all.

If you want to keep using the function you posted in your answer, consider optimizing it:

CREATE OR REPLACE FUNCTION pg_temp.md5zero(data text)
  RETURNS uuid PARALLEL SAFE IMMUTABLE LANGUAGE sql AS
$func$
-- md5() returns NULL for NULL input; COALESCE then substitutes the zero UUID
SELECT COALESCE(md5(data)::uuid, '00000000000000000000000000000000')
$func$;

Faster, and can be inlined. See:

Query slow when using function in the WHERE clause

Question 8

I can't use the INCLUDE clause until I manage to upgrade the cluster (still 9.6), but this looks great. Also many thanks for optimizing my function, I learned a lot from this.

Question 9

@nyov: I addressed that above. And I improved my index suggestion. Better keep a complete index in addition to the tiny index enforcing unique null.

Question 10

Your request sort of breaks the concept of primary keys--you want your primary key to be dependent upon another column (why not make that other column--content in your case--the primary key?), and yet at the same time you want that derivative column to be unique. It's possible to have this setup, but the design lends itself to confusion (i.e., future DBAs/developers will need to try to decipher what your design decisions were).

Also, md5() doesn't return a UUID type (though I suppose you intend to cast into UUID).

That said, I think you can use COALESCE() along with a sequence:

edb=# create sequence abc_seq;
CREATE SEQUENCE
edb=# create table abc (content_md5 text primary key, content text);
CREATE TABLE
edb=# insert into abc values (md5(coalesce('mycontent',nextval('abc_seq')::text)),'mycontent');
INSERT 0 1
edb=# insert into abc values (md5(coalesce(null,nextval('abc_seq')::text)),null);
INSERT 0 1
edb=# select * from abc;
           content_md5            |  content  
----------------------------------+-----------
 c8afdb36c52cf4727836669019e69222 | mycontent
 c4ca4238a0b923820dcc509a6f75849b | 
(2 rows)

Please also be aware that you can't set a DEFAULT on content_md5 because of the following:

edb=# create table abc (content_md5 text primary key default md5(coalesce(content,nextval('abc_seq')::text)), content text);
ERROR: cannot use column references in default expression
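
Since a DEFAULT cannot read another column, one possible workaround within the same COALESCE-plus-sequence approach is a BEFORE INSERT trigger. This is only a sketch: the function and trigger names (abc_fill_md5, abc_fill_md5_trg) are made up for illustration, and EXECUTE PROCEDURE is used so that it should also run on versions before 11.

```sql
CREATE SEQUENCE IF NOT EXISTS abc_seq;
CREATE TABLE IF NOT EXISTS abc (content_md5 text PRIMARY KEY, content text);

-- Hypothetical trigger function: derives content_md5 from the incoming row.
CREATE OR REPLACE FUNCTION abc_fill_md5() RETURNS trigger
  LANGUAGE plpgsql AS
$$
BEGIN
  -- COALESCE only evaluates nextval() when NEW.content is NULL.
  NEW.content_md5 := md5(COALESCE(NEW.content, nextval('abc_seq')::text));
  RETURN NEW;
END;
$$;

-- BEFORE ROW triggers run before the NOT NULL / primary key checks.
CREATE TRIGGER abc_fill_md5_trg
BEFORE INSERT ON abc
FOR EACH ROW EXECUTE PROCEDURE abc_fill_md5();

-- The key column can now be omitted from the INSERT.
INSERT INTO abc (content) VALUES ('some other content'), (NULL);
```

Note that the trigger overwrites any content_md5 value supplied by the caller, which keeps the key consistent with the rule but is a design choice of this sketch.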

Question 11

md5() doesn't return a UUID, but it can be cast to it, since it's a 16 byte data type: See dba.stackexchange.com/a/115316/187993
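
A quick way to see the two types side by side, as a sketch using the built-in pg_typeof() function (the column aliases are arbitrary):

```sql
SELECT md5('some text')                   AS hex_text,   -- 32 hex characters
       pg_typeof(md5('some text'))        AS hex_type,   -- text
       md5('some text')::uuid             AS as_uuid,    -- same digits, hyphenated
       pg_typeof(md5('some text')::uuid)  AS uuid_type;  -- uuid
```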

Question 12

sure--my suggestion to use COALESCE() and a sequence still stands

Question 13

The content_md5 column should always be the md5(content) value. Your example would create rows where the md5-value is based on the sequence number, which could clash with a content-value of the same value which is non-NULL. I must have explained this wrong, I'll update the Q.

Question 14

If that's the case, it's not possible: md5(NULL) is always going to be the same value, and you can't have a unique constraint/primary key and duplicate values at the same time. A round square doesn't exist. Ideally, you should have a not-null constraint on your content column.

Question 15

I've found a way to create a custom function which does what I want.
I'm not sure this is the best way to solve it, but it works for me, so here goes:

CREATE OR REPLACE FUNCTION md5zero(data text) RETURNS text AS $$
BEGIN
  IF data IS NULL
  THEN
    RETURN '00000000000000000000000000000000';
  ELSE
    RETURN md5(data);
  END IF;
END;
$$ LANGUAGE plpgsql;
-- TEST:
INSERT INTO example VALUES (md5zero(NULL)::uuid, NULL); -- rejected, because the zero-filled record already exists
-- ERROR: duplicate key value violates unique constraint "example_pkey"
-- DETAIL: Key (content_md5)=(00000000-0000-0000-0000-000000000000) already exists.
-- Time: 0.477 ms
INSERT INTO test (tags, content_md5) VALUES ('non-existing-document', md5zero(NULL)::uuid);
--  id | tags                  | content_md5
-- ----+-----------------------+--------------------------------------
--   4 | non-existing-document | 00000000-0000-0000-0000-000000000000

score 2 · Accepted Answer · 2019-11-13 03:32:47Z

