Graph and minimum spanning tree in Python

Question 1

Here is my Graph class that implements a graph and has nice a method to generate its spanning tree using Kruskal's algorithm.

I want to:

Make it pythonic
Improve readability
Improve the abstraction (but not changing the use of outer and inner dicts to represent the graph)

Performance is not a concern.

Code:

class Graph(object):
 def __init__(self):
 self.g = {}
 def add(self, vertex1, vertex2, weight):
 if vertex1 not in self.g:
 self.g[vertex1] = {}
 if vertex2 not in self.g:
 self.g[vertex2] = {}
 self.g[vertex1][vertex2] = weight
 self.g[vertex2][vertex1] = weight
 def has_link(self, v1, v2):
 return v2 in self[v1] or v1 in self[v2]
 def edges(self):
 data = []
 for from_vertex, destinations in self.g.items():
 for to_vertex, weight in destinations.items():
 if (to_vertex, from_vertex, weight) not in data:
 data.append((from_vertex, to_vertex, weight))
 return data
 def sorted_by_weight(self, desc=False):
 return sorted(self.edges(), key=lambda x: x[2], reverse=desc)
 def spanning_tree(self, minimum=True):
 mst = Graph()
 parent = {}
 rank = {}
 def find_parent(vertex):
 while parent[vertex] != vertex:
 vertex = parent[vertex]
 return vertex
 def union(root1, root2):
 if rank[root1] > rank[root2]:
 parent[root2] = root1
 else:
 parent[root2] = root1
 if rank[root2] == rank[root1]:
 rank[root2] += 1
 for vertex in self.g:
 parent[vertex] = vertex
 rank[vertex] = 0
 for v1, v2, weight in self.sorted_by_weight(not minimum):
 parent1 = find_parent(v1)
 parent2 = find_parent(v2)
 if parent1 != parent2:
 mst.add(v1, v2, weight)
 union(parent1, parent2)
 if len(self) == len(mst):
 break
 return mst
 def __len__(self):
 return len(self.g.keys())
 def __getitem__(self, node):
 return self.g[node]
 def __iter__(self):
 for edge in self.edges():
 yield edge
 def __str__(self):
 return "\n".join('from %s to %s: %d' % edge for edge in self.edges())
graph = Graph()
graph.add('a', 'b', 2)
graph.add('a', 'd', 3)
graph.add('a', 'c', 3)
graph.add('b', 'a', 2)
graph.add('b', 'c', 4)
graph.add('b', 'e', 3)
graph.add('c', 'a', 3)
graph.add('c', 'b', 4)
graph.add('c', 'd', 5)
graph.add('c', 'e', 1)
graph.add('d', 'a', 3)
graph.add('d', 'c', 5)
graph.add('d', 'f', 7)
graph.add('f', 'd', 7)
graph.add('f', 'e', 8)
graph.add('f', 'g', 9)
print(graph.spanning_tree())
print()
print(graph.spanning_tree(False))

Question 2

There are no docstrings. How do I use this class? What arguments should I pass to the methods and what do they return?
There's no way to add a vertex with no edges to a graph.
The test code shows that it is quite laborious to initialize a graph. It would make sense for the constructor to take an iterable of edges.
The attribute g is not intended for use outside the class (callers should use the public methods). It is conventional to name internal attributes with names starting with one underscore.

The code in add could be simplified by making use of collections.defaultdict:

def __init__(self, edges=()):
 """Construct a graph containing the edges from the given iterable.
 An edge between vertices v1 and v2 with weight w should be
 specified as a tuple (v1, v2, w).
 """
 self._g = defaultdict(dict)
 for edge in edges:
 self.add(*edge)
def add(self, v1, v2, w):
 """Add an edge between vertices v1 and v2 with weight w.
 If an edge already exists between these vertices, set its
 weight to w.
 """
 self._g[v1][v2] = self._g[v2][v1] = w

The has_link method uses "link", but the rest of the code uses "edge". It would be better to be consistent.
The __getitem__ method uses "node", but the rest of the code uses "vertex". It would be better to be consistent.

Since add ensures that edges are stored in both directed, it's only necessary to test one direction:

def has_edge(self, v1, v2):
 """Return True if the graph has an edge between vertices v1 and v2."""
 return v2 in self._g[v1]

The edges method accumulates a list of edges. To avoid an edge appearing twice, the code checks each edge to see if it is already in the list. But lists do not have an efficient membership test, so the runtime is quadratic in the number of edges. It would be more efficient to make a set of edges:
```
def edges(self):
 """Return the edges in the graph as a set of tuples (v1, v2, w)."""
 edges = set()
 for v1, destinations in self._g.items():
 for v2, w in destinations.items():
 if (v2, v1, w) not in edges:
 edges.add((v1, v2, w))
 return edges
```
In the sorted_by_weight method, it would be better to name the keyword argument reverse, for consistency with Python's built-in sorted.
In spanning_tree, the parent and rank dictionaries can be initialized directly:
```
parent = {v: v for v in self._g}
rank = {v: 0 for v in self._g}
```
The parent data structure is known as a disjoint-set forest. An important optimization on this data structure is path compression: that is, when you search the parent data structure for the root of the tree containing a vertex, you update the parent data structure so that future searches run quickly:
```
def find_root(v):
 if parent[v] != v:
 parent[v] = find_root(parent[v])
 return parent[v]
```
In the union function, you always attach the tree for root2 to the tree for root1:
```
if rank[root1] > rank[root2]:
 parent[root2] = root1
else:
 parent[root2] = root1
```
It's more efficient to always attach the smaller tree to the bigger tree, so that the length of the searches in find_root doesn't increase:
```
if rank[root1] > rank[root2]:
 parent[root2] = root1
else:
 parent[root1] = root2
```
The __len__ and __getitem__ methods treat the graph as a collection of vertices, but __iter__ treats it as a collection of edges. This seems likely to lead to confusion. It would be better to be consistent and have all special methods treat the graph as a collection in the same way.

Question 3

great suggestions! In (11) your code looks a bit neater at a cost of two loops. Gabriel has a single loop though ;)

Question 4

@OlegMelnikov: The idea behind (11) is to improve the clarity of the code — in my version you can see immediately that every vertex is its own parent, and every rank is zero. The extra loop doesn't affect the asymptotic complexity, so I'm not inclined to worry about it.

Question 5

@ Gareth: Fair enough :)

Question 6

Maybe you can use a class to represent edges, like this:

Since Kruskal would sort the edges, this would be more convenient:

class Edge(object):
 def __init__(self, u, v, w):
 self.u = u
 self.v = v
 self.weight = w
 def __repr__(self):
 return '%s -> %s : %s ' %(self.u, self.v, self.weight)

There's a logic hole in this implementation.

You can try this graph:

g = Graph()
g.add('a','b',6)
g.add('a','c',8)
g.add('a','d',8)
g.add('b','c',7)
g.add('b','d',9)
g.add('c','d',4)
print(g.spanning_tree())

The result is:

from a to b: 6
from c to d: 4

It's not right since this is not a valid tree.

Question 7

I also found this graph code didn't produce a valid tree

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2016-05-18 11:41:14Z

There are no docstrings. How do I use this class? What arguments should I pass to the methods and what do they return?
There's no way to add a vertex with no edges to a graph.
The test code shows that it is quite laborious to initialize a graph. It would make sense for the constructor to take an iterable of edges.
The attribute g is not intended for use outside the class (callers should use the public methods). It is conventional to name internal attributes with names starting with one underscore.

The code in add could be simplified by making use of collections.defaultdict:

def __init__(self, edges=()):
 """Construct a graph containing the edges from the given iterable.
 An edge between vertices v1 and v2 with weight w should be
 specified as a tuple (v1, v2, w).
 """
 self._g = defaultdict(dict)
 for edge in edges:
 self.add(*edge)
def add(self, v1, v2, w):
 """Add an edge between vertices v1 and v2 with weight w.
 If an edge already exists between these vertices, set its
 weight to w.
 """
 self._g[v1][v2] = self._g[v2][v1] = w

The has_link method uses "link", but the rest of the code uses "edge". It would be better to be consistent.
The __getitem__ method uses "node", but the rest of the code uses "vertex". It would be better to be consistent.

Since add ensures that edges are stored in both directed, it's only necessary to test one direction:

def has_edge(self, v1, v2):
 """Return True if the graph has an edge between vertices v1 and v2."""
 return v2 in self._g[v1]

The edges method accumulates a list of edges. To avoid an edge appearing twice, the code checks each edge to see if it is already in the list. But lists do not have an efficient membership test, so the runtime is quadratic in the number of edges. It would be more efficient to make a set of edges:
```
def edges(self):
 """Return the edges in the graph as a set of tuples (v1, v2, w)."""
 edges = set()
 for v1, destinations in self._g.items():
 for v2, w in destinations.items():
 if (v2, v1, w) not in edges:
 edges.add((v1, v2, w))
 return edges
```
In the sorted_by_weight method, it would be better to name the keyword argument reverse, for consistency with Python's built-in sorted.
In spanning_tree, the parent and rank dictionaries can be initialized directly:
```
parent = {v: v for v in self._g}
rank = {v: 0 for v in self._g}
```
The parent data structure is known as a disjoint-set forest. An important optimization on this data structure is path compression: that is, when you search the parent data structure for the root of the tree containing a vertex, you update the parent data structure so that future searches run quickly:
```
def find_root(v):
 if parent[v] != v:
 parent[v] = find_root(parent[v])
 return parent[v]
```
In the union function, you always attach the tree for root2 to the tree for root1:
```
if rank[root1] > rank[root2]:
 parent[root2] = root1
else:
 parent[root2] = root1
```
It's more efficient to always attach the smaller tree to the bigger tree, so that the length of the searches in find_root doesn't increase:
```
if rank[root1] > rank[root2]:
 parent[root2] = root1
else:
 parent[root1] = root2
```
The __len__ and __getitem__ methods treat the graph as a collection of vertices, but __iter__ treats it as a collection of edges. This seems likely to lead to confusion. It would be better to be consistent and have all special methods treat the graph as a collection in the same way.

great suggestions! In (11) your code looks a bit neater at a cost of two loops. Gabriel has a single loop though ;)
@OlegMelnikov: The idea behind (11) is to improve the clarity of the code — in my version you can see immediately that every vertex is its own parent, and every rank is zero. The extra loop doesn't affect the asymptotic complexity, so I'm not inclined to worry about it.

Stack Exchange Network

Graph and minimum spanning tree in Python

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Graph and minimum spanning tree in Python

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions