This is quite similar to this question about continuous ranges, and this question about grouping by sequential numbers, but differs in that the sequences are not numeric. Given the following relationships as key pairs
a -- b -- c e -- f -- g
| /
| /
d
This is the table with example data (also on SQLFiddle):
CREATE TABLE relationships (
name varchar(1),
related varchar(1)
);
INSERT INTO relationships (name, related) VALUES
('a', 'a'),
('a', 'b'),
('b', 'b'),
('b', 'a'),
('b', 'c'),
('b', 'd'),
('c', 'c'),
('c', 'b'),
('c', 'd'),
('d', 'd'),
('d', 'c'),
('d', 'b'),
('e', 'e'),
('e', 'f'),
('f', 'f'),
('f', 'e'),
('f', 'g'),
('g', 'g');
What is the most efficient way to produce an output that looks like this:
| group | members |
------------------------|
| 1 | {a, b, c, d} |
| 2 | {e, f, g} |
or this:
| name | group |
-----------------
| a | 1 |
| b | 1 |
| c | 1 |
| d | 1 |
| e | 2 |
| f | 2 |
| g | 2 |
I have thought about doing this operation outside of Postgres, but it seems like there must be a way to achieve this result with either a window function or PL/pgSQL.
-
Sorry, the "group" is what I'm trying to create. I have an SQLFiddle with an example schema - sqlfiddle.com/#!15/ef46bjczaplew– jczaplew2015年07月16日 14:16:53 +00:00Commented Jul 16, 2015 at 14:16
-
1How do you find if a and b are related (member of the same group)? Alternatively, why is there a split between d and e?András Váczi– András Váczi2015年07月16日 14:29:21 +00:00Commented Jul 16, 2015 at 14:29
-
And how do you know what the starting node of the graph is?user1822– user18222015年07月16日 14:33:28 +00:00Commented Jul 16, 2015 at 14:33
-
@dezso - As mentioned above, it's in this schema - sqlfiddle.com/#!15/ef46b. Basically, 'name' is an entity, and 'related' is an entity that it relates to.jczaplew– jczaplew2015年07月16日 14:44:40 +00:00Commented Jul 16, 2015 at 14:44
-
@a_horse_with_no_name - you don't ;-). That's part of the problem.jczaplew– jczaplew2015年07月16日 14:44:47 +00:00Commented Jul 16, 2015 at 14:44
2 Answers 2
Ugly, but it works!
First, define array_agg_mult
from this question
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
,STYPE = anyarray
,INITCOND = '{}'
);
and then run the query
WITH summary AS (
SELECT name, array_agg(related) AS touches
FROM relationships
GROUP BY name
),
grouped AS (
SELECT name, (
SELECT array_agg(uniques) FROM (
select distinct unnest(array_agg_mult(sub.touches)) AS uniques
ORDER BY uniques
) x
) my_group
FROM summary LEFT JOIN LATERAL (
SELECT touches
FROM summary r
WHERE summary.touches && r.touches
GROUP BY name, touches
) sub ON true
GROUP BY summary.name
ORDER BY summary.name
)
SELECT DISTINCT my_group, row_number() over() as group_id
FROM grouped
GROUP BY my_group;
Which produces the following:
| my_group | group_id |
| {a,b,c,d} | 2 |
| {e,f,g} | 1 |
SQLFiddle here - http://sqlfiddle.com/#!15/c8a5b/20. I'm a novice with this kind of query, so please let me know if there is a more efficient way to do this!
This is actually a classic problem in computer science: given a graph, determine all of the connected components.
This is what is called a "linear-time" problem. In other words, it can be solved by examining each of the input data elements exactly once. The simplest way to do this is by using either a depth-first or a breadth-first search:
- Set id to 1
- For each x in the set of nodes:
- If x has not already been assigned a group id:
- Do either a depth-first or breadth-first search (your choice) using the "relationships" table starting with x and assign group id id to all nodes found.
- Increment id
- If x has not already been assigned a group id:
If I understand the PostgreSQL code above, this is basically what you are already doing! I think the combination of array_agg and LEFT JOIN LATERAL is essentially doing a breadth-first search. If you are doing this operation only once, then this seems like a reasonable way to do it in PostgreSQL.
However, if you are going to be repeating this operation often, you will probably get a more efficient execution by creating a second table that stores group_id:
CREATE TABLE groups (
name varchar(1) primary key,
group_id integer
);
Then you can store this information as you determine it, and print it out very quickly with a simple query:
SELECT group_id, array_agg(name)::text FROM groups GROUP BY group_id
In fact, you can build this table iteratively as new relationships are added:
- For each new relationship (a, b):
- SELECT distinct group_id FROM groups WHERE name in (a, b);
- If the result set is empty:
- Set new = SELECT max(group_id)+1 FROM groups
- INSERT INTO groups VALUES (a, new);
- If a <> b:
- INSERT INTO groups VALUES (b, new);
- Else if the first row of the result is g:
- If a is not already in the table:
- INSERT INTO groups VALUES (a, g);
- If a <> b and b is not already in the table:
- INSERT INTO groups VALUES (b, g);
- If the result set has two rows and the second row is h:
- UPDATE groups SET group_id = g WHERE group_id = h;
- (* This last condition handles the case where a new relationship joins two existing groups *)
- If a is not already in the table:
Is this more efficient? That will depend greatly on the number of edges (relationships) in your data set and upon whether you are adding them all at once or a few at a time.