15

I need to calculate the depth of a descendant from it's ancestor. When a record has object_id = parent_id = ancestor_id, it is considered a root node (the ancestor). I have been trying to get a WITH RECURSIVE query running with PostgreSQL 9.4.

I do not control the data or the columns. The data and table schema comes from an external source. The table is growing continuously. Right now by about 30k records per day. Any node in the tree can be missing and they will be pulled from an external source at some point. They are usually pulled in created_at DESC order but the data is pulled with asynchronous background jobs.

We initially had a code solution to this problem, but now having 5M+ rows, it takes almost 30 minutes to complete.

Example table definition and test data:

CREATE TABLE objects (
 id serial NOT NULL PRIMARY KEY,
 customer_id integer NOT NULL,
 object_id integer NOT NULL,
 parent_id integer,
 ancestor_id integer,
 generation integer NOT NULL DEFAULT 0
);
INSERT INTO objects(id, customer_id , object_id, parent_id, ancestor_id, generation)
VALUES (2, 1, 2, 1, 1, -1), --no parent yet
 (3, 2, 3, 3, 3, -1), --root node
 (4, 2, 4, 3, 3, -1), --depth 1
 (5, 2, 5, 4, 3, -1), --depth 2
 (6, 2, 6, 5, 3, -1), --depth 3
 (7, 1, 7, 7, 7, -1), --root node
 (8, 1, 8, 7, 7, -1), --depth 1
 (9, 1, 9, 8, 7, -1); --depth 2

Note that object_id is not unique, but the combination (customer_id, object_id) is unique.
Running a query like this:

WITH RECURSIVE descendants(id, customer_id, object_id, parent_id, ancestor_id, depth) AS (
 SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
 FROM objects
 WHERE object_id = parent_id
 UNION
 SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
 FROM objects o
 INNER JOIN descendants d ON d.parent_id = o.object_id
 WHERE
 d.id <> o.id
 AND
 d.customer_id = o.customer_id
) SELECT * FROM descendants d;

I would like the generation column to be set as the depth that was calculated. When a new record is added, the generation column is set as -1. There are some cases where a parent_id may not have been pulled yet. If the parent_id does not exist, it should leave the generation column set to -1.

The final data should look like:

id | customer_id | object_id | parent_id | ancestor_id | generation
2 1 2 1 1 -1
3 2 3 3 3 0
4 2 4 3 3 1
5 2 5 4 3 2
6 2 6 5 3 3
7 1 7 7 7 0
8 1 8 7 7 1
9 1 9 8 7 2

The result of the query should be to update the generation column to the correct depth.

I started working from the answers to this related question on SO.

asked Jan 12, 2016 at 5:08
8
  • So you want to update the table with the result of your recursive CTE? Commented Jan 12, 2016 at 7:38
  • Yes, I would like the generation column to be UPDATE'd to what it's depth is. If there is no parent (objects.parent_id doesn't match any objects.object_id) the generation would remain as -1. Commented Jan 12, 2016 at 14:57
  • So the ancestor_id is already set, so you only need to assign the generation from the CTE.depth ? Commented Jan 12, 2016 at 16:22
  • Yes, object_id, parent_id, and ancestor_id are already set from the data we get from the API. I'd like to set the generation column to whatever the depth is. One other note, the object_id is not unique, as customer_id 1 could have object_id 1, and customer_id 2 could have object_id 1. The primary id on the table is unique. Commented Jan 12, 2016 at 19:36
  • Is this a one-time update or are you continuously adding to a growing table? Seems like the latter case. Makes a big difference. And can only root nodes be missing (yet) or any node in the tree? Commented Jan 15, 2016 at 1:26

2 Answers 2

15

The query you have is basically correct. The only mistake is in the second (recursive) part of the CTE where you have:

INNER JOIN descendants d ON d.parent_id = o.object_id

It should be the other way around:

INNER JOIN descendants d ON d.object_id = o.parent_id 

You want to join the objects with their parents (that have already been found).

So the query that calculates depth can be written (nothing else changed, only formatting):

-- calculate generation / depth, no updates
WITH RECURSIVE descendants
 (id, customer_id, object_id, parent_id, ancestor_id, depth) AS
 AS ( SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
 FROM objects
 WHERE object_id = parent_id
 UNION ALL
 SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
 FROM objects o
 INNER JOIN descendants d ON d.customer_id = o.customer_id
 AND d.object_id = o.parent_id 
 WHERE d.id <> o.id
 ) 
SELECT * 
FROM descendants d
ORDER BY id ;

For the update, you simply replace the last SELECT, with the UPDATE, joining the result of the cte, back to the table:

-- update nodes
WITH RECURSIVE descendants
 -- nothing changes here except
 -- ancestor_id and parent_id 
 -- which can be omitted form the select lists
 ) 
UPDATE objects o 
SET generation = d.depth 
FROM descendants d
WHERE o.id = d.id 
 AND o.generation = -1 ; -- skip unnecessary updates

Tested on SQLfiddle

Additional comments:

  • the ancestor_id and the parent_id are not needed to be in the select list (ancestor is obvious, parent a bit tricky to figure out why), so you can keep them in the SELECT query if you want but you can safely remove them from the UPDATE.
  • the (customer_id, object_id) seems like a candidate for a UNIQUE constraint. If your data comply with this, add such a constraint. The joins performed in the recursive CTE would not make sense if it wasn't unique (a node could have 2 parents otherwise).
  • if you add that constraint, the (customer_id, parent_id) would be a candidate for a FOREIGN KEY constraint that REFERENCES the (unique) (customer_id, object_id). You most probably do not want to add that FK constraint though, since by your description, you are adding new rows and some rows can reference others that haven't been yet added.
  • There are certainly problems with the efficiency of the query, if it's going to be performed in a big table. Not in the first run, as almost the whole table will be updated anyway. But the second time, you'll want only new rows (and those that were not touched by the 1st run) to be considered for update. The CTE as it is will have to build a big result.
    The AND o.generation = -1 in the final update will make sure that the rows that were updated in the 1st run will not be updated again but the CTE is still an expensive part.

The following is an attempt to address these issues: improve the CTE as to consider as few rows as possible and use (customer_id, obejct_id) instead of (id) to identify rows (so id is completely removed from the query. It can be used as the 1st update or a subsequent:

WITH RECURSIVE descendants 
 (customer_id, object_id, depth) 
 AS ( SELECT customer_id, object_id, 0
 FROM objects
 WHERE object_id = parent_id
 AND generation = -1
 UNION ALL
 SELECT o.customer_id, o.object_id, p.generation + 1
 FROM objects o
 JOIN objects p ON p.customer_id = o.customer_id
 AND p.object_id = o.parent_id
 AND p.generation > -1
 WHERE o.generation = -1
 UNION ALL
 SELECT o.customer_id, o.object_id, d.depth + 1
 FROM objects o
 INNER JOIN descendants d ON o.customer_id = d.customer_id
 AND o.parent_id = d.object_id
 WHERE o.parent_id <> o.object_id
 AND o.generation = -1
 )
UPDATE objects o 
SET generation = d.depth 
FROM descendants d
WHERE o.customer_id = d.customer_id
 AND o.object_id = d.object_id
 AND o.generation = -1 -- this is not really needed

Note how the CTE has 3 parts. The first two are the stable parts. The 1st part find the root nodes that haven't been updated before and have still generation=-1 so they must be newly added nodes. The 2nd part finds children (with generation=-1) of parent nodes that have previously been updated.
The 3rd, recursive part, finds all the descendants of the first two parts, as before.

Tested on SQLfiddle-2

answered Jan 13, 2016 at 22:58
3

@ypercube already provides ample explanation, so I'll cut to the chase what I have to add.

If the parent_id does not exist, it should leave the generation column set to -1.

I assume this is supposed to apply recursively, i.e. the rest of the tree always has generation = -1 after any missing node.

If any node in the tree can be missing (yet) we need to find rows with generation = -1 that ...
... are root nodes
... or have a parent with generation > -1.
And traverse the tree from there. Child nodes of this selection must have generation = -1 as well.

Take the generation of the parent incremented by one or fall back to 0 for root nodes:

WITH RECURSIVE tree AS (
 SELECT c.customer_id, c.object_id, COALESCE(p.generation + 1, 0) AS depth
 FROM objects c
 LEFT JOIN objects p ON c.customer_id = p.customer_id
 AND c.parent_id = p.object_id
 AND p.generation > -1
 WHERE c.generation = -1
 AND (c.parent_id = c.object_id OR p.generation > -1)
 -- root node ... or parent with generation > -1
 UNION ALL
 SELECT customer_id, c.object_id, p.depth + 1
 FROM objects c
 JOIN tree p USING (customer_id)
 WHERE c.parent_id = p.object_id
 AND c.parent_id <> c.object_id -- exclude root nodes
 AND c.generation = -1 -- logically redundant, but see below!
 )
UPDATE objects o 
SET generation = t.depth
FROM tree t
WHERE o.customer_id = t.customer_id
AND o.object_id = t.object_id;

The non-recursive part is a single SELECT this way, but logically equivalent to @ypercube's two union'ed SELECT. Not sure which is faster, you'll have to test.
The much more important point for performance is:

Index!

If you repeatedly add rows to a big table this way, add a partial index:

CREATE INDEX objects_your_name_idx ON objects (customer_id, parent_id, object_id)
WHERE generation = -1;

This will achieve more for performance than all other improvements discussed so far - for repeated small additions to a big table.

I added the index condition to the recursive part of the CTE (even though logically redundant) to help the query planner understand that the partial index is applicable.

In addition you should probably also have the UNIQUE constraint on (object_id, customer_id) that @ypercube already mentioned. Or, if you cannot impose uniqueness for some reason (why?) add a plain index instead. The order of index columns matters, btw:

answered Jan 15, 2016 at 2:59
4
  • 1
    I will add the indexes and constraints suggested by you and @ypercube. Looking through the data, I don't see any reason that they couldn't happen (other than the foreign key as sometimes the parent_id is not set yet). I will also set the generation column be nullable and the default set as NULL instead of -1. Then I won't have a lot of "-1" filters and the partial indexes can be WHERE generation IS NULL, etc. Commented Jan 15, 2016 at 3:11
  • @Diggity: NULL should work just fine if you adapt the rest, yes. Commented Jan 15, 2016 at 3:16
  • @Erwin nice. I originally thought similar as you. An index ON objects (customer_id, parent_id, object_id) WHERE generation = -1; and perhaps another ON objects (customer_id, object_id) WHERE generation > -1;. The update will also have to "switch" all updated rows from one index to another, so not sure if this is a good idea for the initial run of the UPDATE. Commented Jan 15, 2016 at 7:07
  • Indexing for recursive queries can be really hard. Commented Jan 15, 2016 at 7:09

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.