I am keeping a database of books I've read, using the following two tables in PostgreSQL:
CREATE TABLE authors (
 id SERIAL PRIMARY KEY,
 name text
);

CREATE TABLE books (
 id SERIAL PRIMARY KEY,
 title text,
 author_id integer REFERENCES authors(id) ON UPDATE CASCADE ON DELETE CASCADE,
 UNIQUE (title, author_id)
);
Now when going through my list of authors, I found the following two entries:
 id | name
----+----------------
 1 | Mark Twain
 2 | Samuel Clemens
What I'd like to do is delete the "Mark Twain" entry and effectively update all books referencing "Mark Twain" to reference "Samuel Clemens" instead. I know I could do this manually, but I want a solution that works regardless of which tables reference authors(id).
I thought about doing it like this (within a transaction):

- Change Mark Twain's id to 2, letting ON UPDATE CASCADE take care of changing the references.
- Delete the Mark Twain entry.

But this runs into a few problems, mainly:

- The first step creates a duplicate primary key.
- I'm not sure how to reference the right row to delete, once they both have the same ID!
- The ON DELETE CASCADE worries me for the second step.
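For concreteness, here is a sketch of that two-step idea as SQL; it already fails at the first statement:

```sql
BEGIN;

-- Step 1: rejected with a duplicate-key error on authors_pkey,
-- so ON UPDATE CASCADE never gets a chance to fire.
UPDATE authors SET id = 2 WHERE id = 1;

-- Step 2 would be ambiguous anyway: both rows would have id = 2,
-- and ON DELETE CASCADE would wipe the books of whichever row goes.
-- DELETE FROM authors WHERE id = 2; -- which one?

ROLLBACK;
```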
There's also a subtler problem, which can be illustrated with a portion of my (poorly curated) books table:

 id | title | author_id
----+--------------------+-----------
 1 | "Huckleberry Finn" | 1
 2 | "Huckleberry Finn" | 2

Here, even if my two-step process succeeded, I would be violating the UNIQUE constraint on books.
Is there a way to do this that works around most or all of these issues? I'm using Postgres 9.4.
Assuming you just want to delete duplicates in books after merging duplicate authors.
BEGIN;
LOCK books, authors;
CREATE TEMP TABLE dupes ON COMMIT DROP AS (SELECT 1 AS dupe, 2 AS org); -- drop author 1 (Mark Twain), keep author 2 (Samuel Clemens)
DELETE FROM books b -- delete duplicate books
USING dupes d
WHERE b.author_id = d.dupe
AND EXISTS (
SELECT 1
FROM books
WHERE title = b.title
AND author_id = d.org
);
UPDATE books b -- now we relink all remaining books
SET author_id = d.org
FROM dupes d
WHERE b.author_id = d.dupe;
DELETE FROM authors a -- now we can delete all dupes
USING dupes d
WHERE a.id = d.dupe;
COMMIT;
The temp table could hold many rows to remove many dupes at once.
Repeat the first two steps for every table referencing authors.id. If there are many, I would create and execute the statements dynamically ...
I lock the tables explicitly to avoid concurrent disturbances.
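The referencing tables don't have to be listed by hand: they can be read from the catalogs, which is the same lookup the function in the Automation section performs. A sketch, assuming single-column foreign keys:

```sql
-- List every foreign-key column that references authors(id)
SELECT c.conrelid::regclass AS referencing_table
 , f.attname AS referencing_column
FROM pg_constraint c
JOIN pg_attribute f ON f.attrelid = c.conrelid
 AND f.attnum = c.conkey[1] -- first (only) FK column
WHERE c.confrelid = 'authors'::regclass
 AND c.contype = 'f';
```

With the question's schema this returns one row: books / author_id.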
Automation
A basic function could look like this:
CREATE OR REPLACE FUNCTION f_remove_dupe(_tbl text, _col text, _dupe int, _org int)
 RETURNS void
 LANGUAGE plpgsql AS
$func$
DECLARE
 _ftbl text;
 _fcol text;
BEGIN
 FOR _ftbl, _fcol IN
 -- table and column name behind all referencing FKs
 SELECT c.conrelid::regclass::text, f.attname
 FROM pg_attribute a
 JOIN pg_constraint c ON a.attrelid = c.confrelid AND a.attnum = c.confkey[1]
 JOIN pg_attribute f ON f.attrelid = c.conrelid AND f.attnum = c.conkey[1]
 WHERE a.attrelid = _tbl::regclass
 AND a.attname = _col
 AND c.contype = 'f'
 LOOP
 EXIT WHEN _ftbl IS NULL; -- skip if not found
 EXECUTE format('
 UPDATE %1$s
 SET %2$I = 2ドル
 WHERE %2$I = 1ドル'
 , _ftbl, _fcol)
 USING _dupe, _org;
 END LOOP;

 EXECUTE format('
 DELETE FROM %I WHERE %I = 1ドル'
 , _tbl, _col)
 USING _dupe;
END
$func$;
Call:
SELECT f_remove_dupe('authors', 'id', 1, 2); -- dupe = 1, org = 2
This simple version ...

- ... only works for a single dupe.
- ... ignores UNIQUE constraints in referencing tables.
- ... assumes all FK constraints only use the one column, ignoring multi-column FKs.
- ... ignores possible interference from concurrent transactions.

Adapt to your requirements.
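The second limitation can also be tackled via the catalogs: before the relinking UPDATE, look up any UNIQUE or PRIMARY KEY constraints on each referencing table and delete rows that would collide, as the static example above does for books. A sketch of the lookup, shown here for books (unnest ... WITH ORDINALITY requires 9.4+):

```sql
-- Per referencing table, list UNIQUE / PRIMARY KEY constraints and their
-- columns; rows that would collide under these need a DELETE before the
-- relinking UPDATE.
SELECT c.conname, c.contype
 , array_agg(a.attname ORDER BY k.ord) AS cols
FROM pg_constraint c
CROSS JOIN LATERAL unnest(c.conkey) WITH ORDINALITY AS k(attnum, ord)
JOIN pg_attribute a ON a.attrelid = c.conrelid
 AND a.attnum = k.attnum
WHERE c.conrelid = 'books'::regclass -- one referencing table at a time
 AND c.contype IN ('u', 'p')
GROUP BY c.conname, c.contype;
```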
- Thanks for the answer. It does suffer from what I was trying to avoid, however: explicitly listing out tables that reference authors.id. I guess the only way to handle that dynamically is to look directly at the references Postgres keeps (I assume it stores them in some meta table). Also note this fails to deal with the case where books.id itself is referenced in another table, but that requires a recursive procedure to handle. In the end, I'll probably try to write some general stored procedure, and if it works, I'll be sure to post it! Thanks again. – Steve D, Nov 28, 2015 at 20:21
- @SteveD: Full automation may prove difficult. I added a template basic function for the job. – Erwin Brandstetter, Nov 29, 2015 at 5:21
- Thanks for the update! I'm curious about your comment about full automation. Everything is perfect except having to worry about the case where a UNIQUE constraint fails in a referencing table. Do you think a recursive (depth-first search) solution could handle this? Again, thanks for the code! – Steve D, Nov 29, 2015 at 5:30
- @SteveD: You could check for possible UNIQUE or PRIMARY KEY constraints for each table inside the loop with a similar query on catalog tables, and run a nested loop on the results in a similar fashion - with a dynamic DELETE statement like in my static example. This will hardly be perfect; there could be CHECK constraints or triggers or rules - unless you know better for your DB, of course. – Erwin Brandstetter, Nov 29, 2015 at 5:37
I needed this, and I came up with the following method to do all the merging in one query with an audit table of what changed.
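A minimal sketch of such a single-query merge, using data-modifying CTEs with RETURNING (available since PostgreSQL 9.1). The merge_audit table is a hypothetical name, not part of the question's schema; the author row is deleted in a separate statement so the ON DELETE CASCADE cannot collide with rows modified in the same statement:

```sql
BEGIN;

-- Hypothetical audit table (an assumption for this sketch)
CREATE TABLE IF NOT EXISTS merge_audit (
 action text,
 book_id int,
 title text
);

WITH deduped AS ( -- books that would collide after the merge
 DELETE FROM books b
 WHERE b.author_id = 1 -- dupe
 AND EXISTS (
 SELECT 1 FROM books
 WHERE title = b.title AND author_id = 2 -- org
 )
 RETURNING b.id, b.title
), relinked AS ( -- repoint the rest; CTEs share one snapshot, so exclude collisions
 UPDATE books b
 SET author_id = 2
 WHERE b.author_id = 1
 AND NOT EXISTS (
 SELECT 1 FROM books
 WHERE title = b.title AND author_id = 2
 )
 RETURNING b.id, b.title
)
INSERT INTO merge_audit (action, book_id, title)
SELECT 'deleted', id, title FROM deduped
UNION ALL
SELECT 'relinked', id, title FROM relinked;

DELETE FROM authors WHERE id = 1; -- now unreferenced

COMMIT;
```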
Comments on the question:

- ... an Aliases or AlternateNames table to maintain the alternate names. You can choose 'Samuel Clemens' as your preferred name, but the Aliases would allow the other names (and some writers have several aliases) to be connected to the preferred name. And, of course, deleting 'Mark Twain' would just pop up again if a new book by Mark Twain makes it into your data. Keeping the name as an Alias leaves you room to recognize the name and keep it out of the preferred name.
- ... authors table, and he is linked to books in the books table, I still need to change those references in books. Also note that this is a toy example, whereas the real authors table has 25+ columns. Keeping an 'alternate' table for each column would get unwieldy fast!
- ... authors. The Alias table (or a bit more complex set of tables than just Alias) would allow the system to manage names more automatically. In part because every name is potentially an alias.