I have a table with the structure (content_md5 UUID, content TEXT), where I'd like to use content_md5 (which is the value of md5(content)) as the primary key, and use it as a foreign key in other tables.
This is a "static" table, where the content (some largish documents) would be referred to by their md5 value for simplicity, and to prevent duplication in the table (which wouldn't be a given with a simple SERIAL PKEY).
However, content can be NULL, which is distinct from an empty value: it declares that a row in the referencing table has no content at all.
Since md5(NULL) returns NULL, and NULL is not allowed in a primary key constraint, I'd like to have a way of having md5(NULL) return all zeros instead of NULL.
Example:
-- setup
CREATE TABLE example (content_md5 UUID PRIMARY KEY, content TEXT);
CREATE TABLE test (id SERIAL PRIMARY KEY, tags TEXT, content_md5 UUID REFERENCES example(content_md5) ON DELETE RESTRICT);
INSERT INTO example VALUES ('00000000000000000000000000000000'::uuid, NULL);
-- usage
INSERT INTO example VALUES (md5('some text')::uuid, 'some text');
INSERT INTO test (tags, content_md5) VALUES ('some content defining tags', md5('some text')::uuid);
SELECT tags, content FROM test LEFT JOIN example USING (content_md5);
-- QUESTION: Having an md5-like function to return zero-filled "md5"/uuid?
INSERT INTO example VALUES (md5(NULL)::uuid, NULL); -- ignored, because already existing record
INSERT INTO test (tags, content_md5) VALUES ('non-existing-document', md5(NULL)::uuid);
Is it possible to somehow cast the returned value to a zero-filled string, create a custom function based on md5() which replaces NULL with 00000000000000000000000000000000, or some other way to achieve this result?
/edit: Or perhaps I don't need any NULL values in this table, and can just set the referencing foreign key column to NULL to achieve the same result?
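The idea in the edit above can be tried out with a minimal sketch. It assumes the default foreign key match behaviour, where a NULL in the referencing column is not checked against the referenced table, and it uses temporary copies of the tables from the setup so nothing permanent is created:

```sql
-- Sketch: temporary copies of the tables from the question's setup.
CREATE TEMP TABLE example (content_md5 uuid PRIMARY KEY, content text);
CREATE TEMP TABLE test (
   id          serial PRIMARY KEY
 , tags        text
 , content_md5 uuid REFERENCES example(content_md5) ON DELETE RESTRICT
);

INSERT INTO example VALUES (md5('some text')::uuid, 'some text');

INSERT INTO test (tags, content_md5) VALUES
   ('some content defining tags', md5('some text')::uuid)
 , ('non-existing-document', NULL);  -- NULL FK: no matching row in example is required

-- The LEFT JOIN keeps the second row and reports its content as NULL.
SELECT tags, content
FROM   test
LEFT   JOIN example USING (content_md5);
```

With this layout no all-zero placeholder row is needed in example, at the cost of not being able to tell "no document" apart from "document with NULL content".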
3 Answers
I suggest this alternative design:
-- setup
CREATE TABLE example (content_id serial PRIMARY KEY, content text);
CREATE TABLE test (id serial PRIMARY KEY, tags TEXT, content_id int REFERENCES example);
CREATE UNIQUE INDEX ON example ((md5(content)::uuid)) INCLUDE (content_id); -- !
-- usage
INSERT INTO example(content) VALUES (NULL); -- allowed multiple times
INSERT INTO example(content) VALUES ('some text');
INSERT INTO test (tags, content_id)
SELECT 'some content defining tags', content_id
FROM example
WHERE md5(content)::uuid = md5('some text')::uuid;
Major points
Use a serial column (content_id) as surrogate PK of table example, and as FK reference everywhere. 4 bytes instead of 16.
Enforce uniqueness with a unique index on the expression md5(content)::uuid. Be aware that hash collisions are possible (even if very unlikely while your table isn't huge).
While at it, add the serial PK column to the index with an INCLUDE clause (Postgres 11 or later) to make it a covering index for fast index-only lookups.
As opposed to a PK column, this allows NULL, and NULL is not considered to be a duplicate of NULL, which should cover your use case.
In Postgres 10 or older, don't add content_id to the index. Then you don't get index-only scans, of course:
CREATE UNIQUE INDEX ON example ((md5(content)::uuid));
Unless you want to allow only a single instance of NULL, which could be enforced with a function like you posted (introducing the risk of a collision, even if unlikely) or a tiny partial index in addition to the one above:
-- every indexed row has the same key (true), so at most one row with NULL content fits
CREATE UNIQUE INDEX ON example ((content IS NULL))
WHERE content IS NULL;
Do not store the md5 value as table column (redundantly) at all.
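As a hedged sketch of how the unique expression index could be used when loading documents: ON CONFLICT DO NOTHING (Postgres 9.5 or later) skips an insert that would violate any unique index, so repeated inserts of the same content are harmless, and the id can be looked up afterwards through the same expression. The temporary table is only there to keep the example self-contained:

```sql
-- Sketch: temporary stand-in for the example table, without the INCLUDE clause.
CREATE TEMP TABLE example (content_id serial PRIMARY KEY, content text);
CREATE UNIQUE INDEX ON example ((md5(content)::uuid));

INSERT INTO example(content) VALUES ('some text') ON CONFLICT DO NOTHING;
INSERT INTO example(content) VALUES ('some text') ON CONFLICT DO NOTHING;  -- skipped, no error

-- Look up the surrogate key through the indexed expression.
SELECT content_id
FROM   example
WHERE  md5(content)::uuid = md5('some text')::uuid;
```

NULL content is never treated as a conflict here, so each such insert still adds a new row.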
If you want to keep using the function you posted in your answer, consider optimizing it:
CREATE OR REPLACE FUNCTION pg_temp.md5zero(data text)
RETURNS uuid PARALLEL SAFE IMMUTABLE LANGUAGE sql AS
$func$
SELECT COALESCE(md5(data)::uuid, '00000000000000000000000000000000')
$func$;
Faster, and can be inlined.
I can't use the INCLUDE clause until I manage to upgrade the cluster (still 9.6), but this looks great. Also many thanks for optimizing my function, I learned a lot from this. – nyov, Nov 14, 2019 at 22:10
@nyov: I addressed that above. And I improved my index suggestion. Better keep a complete index in addition to the tiny index enforcing unique null. – Erwin Brandstetter, Nov 14, 2019 at 22:39
Your request sort of breaks the concept of primary keys: you want your primary key to be dependent upon another column (why not make that other column, content in your case, the primary key?), and yet at the same time you want that derivative column to be unique. It's possible to have this setup, but the design lends itself to confusion (i.e., future DBAs/developers will need to try to decipher what your design decisions were).
Also, md5() doesn't return a UUID type (though I suppose you intend to cast into UUID).
That said, I think you can use COALESCE() along with a sequence:
edb=# create sequence abc_seq;
CREATE SEQUENCE
edb=# create table abc (content_md5 text primary key, content text);
CREATE TABLE
edb=# insert into abc values (md5(coalesce('mycontent',nextval('abc_seq')::text)),'mycontent');
INSERT 0 1
edb=# insert into abc values (md5(coalesce(null,nextval('abc_seq')::text)),null);
INSERT 0 1
edb=# select * from abc;
           content_md5            |  content
----------------------------------+-----------
 c8afdb36c52cf4727836669019e69222 | mycontent
 c4ca4238a0b923820dcc509a6f75849b |
(2 rows)
Please also be aware that you can't set a DEFAULT on content_md5 because of the following:
edb=# create table abc (content_md5 text primary key default md5(coalesce(content,nextval('abc_seq')::text)), content text);
ERROR: cannot use column references in default expression
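One possible way around that restriction, offered only as a sketch and assuming Postgres 12 or later, is a stored generated column, which may reference other columns of the same row. The table name abc_generated is hypothetical and not part of the original example:

```sql
-- Sketch, assuming Postgres 12+; abc_generated is a hypothetical table name.
CREATE TABLE abc_generated (
   content     text
 , content_md5 uuid GENERATED ALWAYS AS (md5(content)::uuid) STORED
);

-- Uniqueness of the derived value; NULLs are not considered duplicates.
CREATE UNIQUE INDEX ON abc_generated (content_md5);

INSERT INTO abc_generated (content) VALUES ('mycontent'), (NULL);

-- content_md5 is computed by Postgres and is NULL where content is NULL.
SELECT content_md5, content FROM abc_generated;
```

Because the derived value is NULL for NULL content, this does not by itself make the column usable as a primary key.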
md5() doesn't return a UUID, but it can be cast to it, since it's a 16 byte data type: See dba.stackexchange.com/a/115316/187993 – nyov, Nov 12, 2019 at 20:11
sure, my suggestion to use COALESCE() and a sequence still stands – richyen, Nov 12, 2019 at 20:27
The content_md5 column should always be the md5(content) value. Your example would create rows where the md5-value is based on the sequence number, which could clash with a content-value of the same value which is non-NULL. I must have explained this wrong, I'll update the Q. – nyov, Nov 12, 2019 at 20:58
If that's the case, it's not possible: md5(NULL) is always going to be the same value, and you can't have a unique constraint/primary key and duplicate values -- round square doesn't exist. Ideally, you should have a not-null constraint on your content column – richyen, Nov 12, 2019 at 21:16
I've found a way to create a custom function which does what I want.
I'm not sure this is the best way to solve it, but it works for me, so here goes:
CREATE OR REPLACE FUNCTION md5zero(data text) RETURNS text AS $$
BEGIN
IF data IS NULL
THEN
RETURN '00000000000000000000000000000000';
ELSE
RETURN md5(data);
END IF;
END;
$$ LANGUAGE plpgsql;
-- TEST:
INSERT INTO example VALUES (md5zero(NULL)::uuid, NULL); -- ignored, because already existing record
-- ERROR: duplicate key value violates unique constraint "example_pkey"
-- DETAIL: Key (content_md5)=(00000000-0000-0000-0000-000000000000) already exists.
-- Time: 0.477 ms
INSERT INTO test (tags, content_md5) VALUES ('non-existing-document', md5zero(NULL)::uuid);
-- id | tags | content_md5
------+-----------------------+--------------------------------------
-- 4 | non-existing-document | 00000000-0000-0000-0000-000000000000