Grouping by sequential relationships with Postgres

Question 1

This is quite similar to this question about continuous ranges, and this question about grouping by sequential numbers, but differs in that the sequences are not numeric. Given the following relationships as key pairs

a -- b -- c e -- f -- g
 | /
 | /
 d

This is the table with example data (also on SQLFiddle):

CREATE TABLE relationships (
 name varchar(1),
 related varchar(1)
);
INSERT INTO relationships (name, related) VALUES 
('a', 'a'),
('a', 'b'),
('b', 'b'),
('b', 'a'),
('b', 'c'),
('b', 'd'),
('c', 'c'),
('c', 'b'),
('c', 'd'),
('d', 'd'),
('d', 'c'),
('d', 'b'),
('e', 'e'),
('e', 'f'),
('f', 'f'),
('f', 'e'),
('f', 'g'),
('g', 'g');

What is the most efficient way to produce an output that looks like this:

| group | members |
------------------------|
| 1 | {a, b, c, d} |
| 2 | {e, f, g} |

or this:

| name | group |
-----------------
| a | 1 |
| b | 1 |
| c | 1 |
| d | 1 |
| e | 2 |
| f | 2 |
| g | 2 |

I have thought about doing this operation outside of Postgres, but it seems like there must be a way to achieve this result with either a window function or PL/pgSQL.

Question 2

Sorry, the "group" is what I'm trying to create. I have an SQLFiddle with an example schema - sqlfiddle.com/#!15/ef46b

Question 3

How do you find if a and b are related (member of the same group)? Alternatively, why is there a split between d and e?

Question 4

And how do you know what the starting node of the graph is?

Question 5

@dezso - As mentioned above, it's in this schema - sqlfiddle.com/#!15/ef46b. Basically, 'name' is an entity, and 'related' is an entity that it relates to.

Question 6

@a_horse_with_no_name - you don't ;-). That's part of the problem.

Question 7

Ugly, but it works!

First, define array_agg_mult from this question

CREATE AGGREGATE array_agg_mult (anyarray) (
 SFUNC = array_cat
 ,STYPE = anyarray
 ,INITCOND = '{}'
);

and then run the query

WITH summary AS (
 SELECT name, array_agg(related) AS touches
 FROM relationships
 GROUP BY name
), 
grouped AS (
 SELECT name, (
 SELECT array_agg(uniques) FROM (
 select distinct unnest(array_agg_mult(sub.touches)) AS uniques 
 ORDER BY uniques
 ) x
 ) my_group
 FROM summary LEFT JOIN LATERAL (
 SELECT touches
 FROM summary r
 WHERE summary.touches && r.touches
 GROUP BY name, touches
 ) sub ON true
 GROUP BY summary.name
 ORDER BY summary.name
) 
SELECT DISTINCT my_group, row_number() over() as group_id 
FROM grouped 
GROUP BY my_group;

Which produces the following:

| my_group | group_id |
| {a,b,c,d} | 2 |
| {e,f,g} | 1 |

SQLFiddle here - http://sqlfiddle.com/#!15/c8a5b/20. I'm a novice with this kind of query, so please let me know if there is a more efficient way to do this!

Question 8

This is actually a classic problem in computer science: given a graph, determine all of the connected components.

This is what is called a "linear-time" problem. In other words, it can be solved by examining each of the input data elements exactly once. The simplest way to do this is by using either a depth-first or a breadth-first search:

Set id to 1
For each x in the set of nodes:
- If x has not already been assigned a group id:
  - Do either a depth-first or breadth-first search (your choice) using the "relationships" table starting with x and assign group id id to all nodes found.
  - Increment id

If I understand the PostgreSQL code above, this is basically what you are already doing! I think the combination of array_agg and LEFT JOIN LATERAL is essentially doing a breadth-first search. If you are doing this operation only once, then this seems like a reasonable way to do it in PostgreSQL.

However, if you are going to be repeating this operation often, you will probably get a more efficient execution by creating a second table that stores group_id:

CREATE TABLE groups (
 name varchar(1) primary key,
 group_id integer
);

Then you can store this information as you determine it, and print it out very quickly with a simple query:

SELECT group_id, array_agg(name)::text FROM groups GROUP BY group_id

In fact, you can build this table iteratively as new relationships are added:

For each new relationship (a, b):
- SELECT distinct group_id FROM groups WHERE name in (a, b);
- If the result set is empty:
  - Set new = SELECT max(group_id)+1 FROM groups
  - INSERT INTO groups VALUES (a, new);
  - If a <> b:
    - INSERT INTO groups VALUES (b, new);
- Else if the first row of the result is g:
  - If a is not already in the table:
    - INSERT INTO groups VALUES (a, g);
  - If a <> b and b is not already in the table:
    - INSERT INTO groups VALUES (b, g);
  - If the result set has two rows and the second row is h:
    - UPDATE groups SET group_id = g WHERE group_id = h;
  - (* This last condition handles the case where a new relationship joins two existing groups *)

Is this more efficient? That will depend greatly on the number of edges (relationships) in your data set and upon whether you are adding them all at once or a few at a time.

jczaplew jczaplew 2432 silver badges6 bronze badges · Answer 1 · 2015-07-16 19:54:11Z

Ugly, but it works!

First, define array_agg_mult from this question

CREATE AGGREGATE array_agg_mult (anyarray) (
 SFUNC = array_cat
 ,STYPE = anyarray
 ,INITCOND = '{}'
);

and then run the query

WITH summary AS (
 SELECT name, array_agg(related) AS touches
 FROM relationships
 GROUP BY name
), 
grouped AS (
 SELECT name, (
 SELECT array_agg(uniques) FROM (
 select distinct unnest(array_agg_mult(sub.touches)) AS uniques 
 ORDER BY uniques
 ) x
 ) my_group
 FROM summary LEFT JOIN LATERAL (
 SELECT touches
 FROM summary r
 WHERE summary.touches && r.touches
 GROUP BY name, touches
 ) sub ON true
 GROUP BY summary.name
 ORDER BY summary.name
) 
SELECT DISTINCT my_group, row_number() over() as group_id 
FROM grouped 
GROUP BY my_group;

Which produces the following:

| my_group | group_id |
| {a,b,c,d} | 2 |
| {e,f,g} | 1 |

SQLFiddle here - http://sqlfiddle.com/#!15/c8a5b/20. I'm a novice with this kind of query, so please let me know if there is a more efficient way to do this!

Michael McClennen Michael McClennen 1111 bronze badge · Answer 2 · 2015-07-16 23:34:01Z

This is actually a classic problem in computer science: given a graph, determine all of the connected components.

This is what is called a "linear-time" problem. In other words, it can be solved by examining each of the input data elements exactly once. The simplest way to do this is by using either a depth-first or a breadth-first search:

Set id to 1
For each x in the set of nodes:
- If x has not already been assigned a group id:
  - Do either a depth-first or breadth-first search (your choice) using the "relationships" table starting with x and assign group id id to all nodes found.
  - Increment id

If I understand the PostgreSQL code above, this is basically what you are already doing! I think the combination of array_agg and LEFT JOIN LATERAL is essentially doing a breadth-first search. If you are doing this operation only once, then this seems like a reasonable way to do it in PostgreSQL.

However, if you are going to be repeating this operation often, you will probably get a more efficient execution by creating a second table that stores group_id:

CREATE TABLE groups (
 name varchar(1) primary key,
 group_id integer
);

Then you can store this information as you determine it, and print it out very quickly with a simple query:

SELECT group_id, array_agg(name)::text FROM groups GROUP BY group_id

In fact, you can build this table iteratively as new relationships are added:

For each new relationship (a, b):
- SELECT distinct group_id FROM groups WHERE name in (a, b);
- If the result set is empty:
  - Set new = SELECT max(group_id)+1 FROM groups
  - INSERT INTO groups VALUES (a, new);
  - If a <> b:
    - INSERT INTO groups VALUES (b, new);
- Else if the first row of the result is g:
  - If a is not already in the table:
    - INSERT INTO groups VALUES (a, g);
  - If a <> b and b is not already in the table:
    - INSERT INTO groups VALUES (b, g);
  - If the result set has two rows and the second row is h:
    - UPDATE groups SET group_id = g WHERE group_id = h;
  - (* This last condition handles the case where a new relationship joins two existing groups *)

Is this more efficient? That will depend greatly on the number of edges (relationships) in your data set and upon whether you are adding them all at once or a few at a time.

Stack Exchange Network

Grouping by sequential relationships with Postgres

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Grouping by sequential relationships with Postgres

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions