I am keeping a database of books I've read, using the following two tables in PostgreSQL:
CREATE TABLE authors (
 id SERIAL PRIMARY KEY,
 name text
);

CREATE TABLE books (
 id SERIAL PRIMARY KEY,
 title text,
 author_id integer REFERENCES authors(id) ON UPDATE CASCADE ON DELETE CASCADE,
 UNIQUE (title, author_id)
);
Now when going through my list of authors, I found the following two entries:
 id | name
----+----------------
 1 | Mark Twain
 2 | Samuel Clemens
What I'd like to do is delete the "Mark Twain" entry and effectively update all books referencing "Mark Twain" to reference "Samuel Clemens" instead. I know I could do this manually, but I want a solution that works regardless of which tables reference authors(id).
I thought about doing it like this (within a transaction):

- Change Mark Twain's id to 2, letting ON UPDATE CASCADE take care of changing the references.
- Delete the Mark Twain entry.

But this runs into a few problems, mainly:

- The first step creates a duplicate primary key.
- I'm not sure how to reference the right row to delete, once they both have the same ID!
- The ON DELETE CASCADE worries me for the second step.
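For concreteness, here is a sketch of that two-step idea as SQL; it already fails at the first statement:

```sql
BEGIN;

-- Step 1: rejected with a duplicate-key error on authors_pkey,
-- so ON UPDATE CASCADE never gets a chance to fire.
UPDATE authors SET id = 2 WHERE id = 1;

-- Step 2 would be ambiguous anyway: both rows would have id = 2,
-- and ON DELETE CASCADE would wipe the books of whichever row goes.
-- DELETE FROM authors WHERE id = 2; -- which one?

ROLLBACK;
```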
There's also a subtler problem, which can be illustrated with a portion of my (poorly curated) books table:

 id | title | author_id
----+--------------------+-----------
 1 | "Huckleberry Finn" | 1
 2 | "Huckleberry Finn" | 2

Here, even if my two-step process succeeded, I would be violating the UNIQUE constraint on books.
Is there a way to do this that works around most or all of these issues? I'm using Postgres 9.4.
Assuming you just want to delete duplicates in books after merging duplicate authors.
BEGIN;
LOCK books, authors;
CREATE TEMP TABLE dupes ON COMMIT DROP AS (SELECT 1 AS dupe, 2 AS org); -- drop author 1 (Mark Twain), keep author 2 (Samuel Clemens)
DELETE FROM books b -- delete duplicate books
USING dupes d
WHERE b.author_id = d.dupe
AND EXISTS (
SELECT 1
FROM books
WHERE title = b.title
AND author_id = d.org
);
UPDATE books b -- now we relink all remaining books
SET author_id = d.org
FROM dupes d
WHERE b.author_id = d.dupe;
DELETE FROM authors a -- now we can delete all dupes
USING dupes d
WHERE a.id = d.dupe;
COMMIT;
The temp table could hold many rows to remove many dupes at once.
Repeat the first two steps for every table referencing authors.id. If there are many, I would create and execute the statements dynamically ...
I lock the tables explicitly to avoid concurrent disturbances.
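The referencing tables don't have to be listed by hand: they can be read from the catalogs, which is the same lookup the function in the Automation section performs. A sketch, assuming single-column foreign keys:

```sql
-- List every foreign-key column that references authors(id)
SELECT c.conrelid::regclass AS referencing_table
 , f.attname AS referencing_column
FROM pg_constraint c
JOIN pg_attribute f ON f.attrelid = c.conrelid
 AND f.attnum = c.conkey[1] -- first (only) FK column
WHERE c.confrelid = 'authors'::regclass
 AND c.contype = 'f';
```

With the question's schema this returns one row: books / author_id.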
Automation
A basic function could look like this:
CREATE OR REPLACE FUNCTION f_remove_dupe(_tbl text, _col text, _dupe int, _org int)
 RETURNS void
 LANGUAGE plpgsql AS
$func$
DECLARE
 _ftbl text;
 _fcol text;
BEGIN
 FOR _ftbl, _fcol IN
 -- table and column name behind all referencing FKs
 SELECT c.conrelid::regclass::text, f.attname
 FROM pg_attribute a
 JOIN pg_constraint c ON a.attrelid = c.confrelid AND a.attnum = c.confkey[1]
 JOIN pg_attribute f ON f.attrelid = c.conrelid AND f.attnum = c.conkey[1]
 WHERE a.attrelid = _tbl::regclass
 AND a.attname = _col
 AND c.contype = 'f'
 LOOP
 EXIT WHEN _ftbl IS NULL; -- skip if not found
 EXECUTE format('
 UPDATE %1$s
 SET %2$I = 2ドル
 WHERE %2$I = 1ドル'
 , _ftbl, _fcol)
 USING _dupe, _org;
 END LOOP;

 EXECUTE format('
 DELETE FROM %I WHERE %I = 1ドル'
 , _tbl, _col)
 USING _dupe;
END
$func$;
Call:
SELECT f_remove_dupe('authors', 'id', 1, 2); -- dupe = 1, org = 2
This simple version ...

- ... only works for a single dupe.
- ... ignores UNIQUE constraints in referencing tables.
- ... assumes all FK constraints only use the one column, ignoring multi-column FKs.
- ... ignores possible interference from concurrent transactions.

Adapt to your requirements.
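The second limitation can also be tackled via the catalogs: before the relinking UPDATE, look up any UNIQUE or PRIMARY KEY constraints on each referencing table and delete rows that would collide, as the static example above does for books. A sketch of the lookup, shown here for books (unnest ... WITH ORDINALITY requires 9.4+):

```sql
-- Per referencing table, list UNIQUE / PRIMARY KEY constraints and their
-- columns; rows that would collide under these need a DELETE before the
-- relinking UPDATE.
SELECT c.conname, c.contype
 , array_agg(a.attname ORDER BY k.ord) AS cols
FROM pg_constraint c
CROSS JOIN LATERAL unnest(c.conkey) WITH ORDINALITY AS k(attnum, ord)
JOIN pg_attribute a ON a.attrelid = c.conrelid
 AND a.attnum = k.attnum
WHERE c.conrelid = 'books'::regclass -- one referencing table at a time
 AND c.contype IN ('u', 'p')
GROUP BY c.conname, c.contype;
```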
- Thanks for the answer. It does suffer from what I was trying to avoid, however: explicitly listing out tables that reference authors.id. I guess the only way to handle that dynamically is to look directly at the references Postgres keeps (I assume it stores them in some meta table). Also note this fails to deal with the case where books.id itself is referenced in another table, but that requires a recursive procedure to handle. In the end, I'll probably try to write some general stored procedure, and if it works, I'll be sure to post it! Thanks again. – Steve D, Nov 28, 2015 at 20:21
- @SteveD: Full automation may prove difficult. I added a template basic function for the job. – Erwin Brandstetter, Nov 29, 2015 at 5:21
- Thanks for the update! I'm curious about your comment about full automation. Everything is perfect except having to worry about the case where a UNIQUE constraint fails in a referencing table. Do you think a recursive (depth-first search) solution could handle this? Again, thanks for the code! – Steve D, Nov 29, 2015 at 5:30
- @SteveD: You could check for possible UNIQUE or PRIMARY KEY constraints for each table inside the loop with a similar query on catalog tables, and run a nested loop on the results in a similar fashion - with a dynamic DELETE statement like in my static example. This will hardly be perfect; there could be CHECK constraints or triggers or rules - unless you know better for your DB, of course. – Erwin Brandstetter, Nov 29, 2015 at 5:37
I needed this, and I came up with the following method to do all the merging in one query with an audit table of what changed.
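A minimal sketch of such a single-query merge, using data-modifying CTEs with RETURNING (available since PostgreSQL 9.1). The merge_audit table is a hypothetical name, not part of the question's schema; the author row is deleted in a separate statement so the ON DELETE CASCADE cannot collide with rows modified in the same statement:

```sql
BEGIN;

-- Hypothetical audit table (an assumption for this sketch)
CREATE TABLE IF NOT EXISTS merge_audit (
 action text,
 book_id int,
 title text
);

WITH deduped AS ( -- books that would collide after the merge
 DELETE FROM books b
 WHERE b.author_id = 1 -- dupe
 AND EXISTS (
 SELECT 1 FROM books
 WHERE title = b.title AND author_id = 2 -- org
 )
 RETURNING b.id, b.title
), relinked AS ( -- repoint the rest; CTEs share one snapshot, so exclude collisions
 UPDATE books b
 SET author_id = 2
 WHERE b.author_id = 1
 AND NOT EXISTS (
 SELECT 1 FROM books
 WHERE title = b.title AND author_id = 2
 )
 RETURNING b.id, b.title
)
INSERT INTO merge_audit (action, book_id, title)
SELECT 'deleted', id, title FROM deduped
UNION ALL
SELECT 'relinked', id, title FROM relinked;

DELETE FROM authors WHERE id = 1; -- now unreferenced

COMMIT;
```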
Comments on the question:

- ... an Aliases or AlternateNames table to maintain the alternate names. You can choose 'Samuel Clemens' as your preferred name, but the Aliases would allow the other names (and some writers have several aliases) to be connected to the preferred name. And, of course, deleting 'Mark Twain' would just pop up again if a new book by Mark Twain makes it into your data. Keeping the name as an Alias leaves you room to recognize the name and keep it out of the preferred name.
- ... authors table, and he is linked to books in the books table, I still need to change those references in books. Also note that this is a toy example, whereas the real authors table has 25+ columns. Keeping an 'alternate' table for each column would get unwieldy fast!
- ... authors. The Alias table (or a bit more complex set of tables than just Alias) would allow the system to manage names more automatically. In part because every name is potentially an alias.