I wrote an algorithm for finding the connected components in a 2d-matrix in Python 2.x. I am looking for comments on the quality of my code, organization, formatting/following conventions, etc. For example, do the two static functions nodify
and denodify
follow the rules? I had a couple of problems with dealing with sets and removing duplicate networks; my solution works, but it is not elegant.
The network-finding works recursively in connected_values
. I wonder if the passing around of the matrix
is done in a way that makes sense; initially it was a property of Node
, but I figured that (a) it was weird for a child to reference its parent which references it, and (b) it is bad for memory management (I do not know how to do weak references in Python like in Swift).
Thoughts on the efficiency of the algorithm and ways to not get into infinite loops would be much appreciated.
The code below has comments. More can be added if something is unclear.
# A simple class (struct-like) describing the location of an item in a matrix
class Position:
def __init__(self, r, c):
self.row = r
self.column = c
# For comparison
def __eq__(self, other):
if isinstance(other, self.__class__):
return self.row == other.row and self.column == other.column
# So that it can go in sets
def __hash__(self):
return hash((self.row, self.column))
# An object describing an item in a matrix that contains some helpful methods
class Node:
def __init__(self, row, column, val):
self.pos = Position(row, column)
self.value = val
# Returns the nodes adjacent (left, right, top, bottom) to self in a given matrix
def neighbors(self, matrix):
nodes = []
y = self.pos.row
x = self.pos.column
try:
if x - 1 >= 0:
to_left = matrix[y][x-1].value
nodes.append(Node(y, x-1, to_left))
except: pass
try:
if x + 1 <= len(matrix[y]) - 1:
to_right = matrix[y][x+1].value
nodes.append(Node(y, x+1, to_right))
except: pass
try:
if y - 1 >= 0:
to_top = matrix[y-1][x].value
nodes.append(Node(y-1, x, to_top))
except: pass
try:
if y + 1 <= len(matrix) - 1:
to_bottom = matrix[y+1][x].value
nodes.append(Node(y+1, x, to_bottom))
except: pass
return nodes
# Returns the nodes with the same value as self of self's neighbors in a given matrix
def value_neighbors(self, matrix):
return [node for node in self.neighbors(matrix) if node.value == self.value]
# Looks prettier when printing
def __str__(self):
return `self.value, (self.pos.column, self.pos.row)`[1:-1]
# So that Nodes can go in sets
def __hash__(self):
return hash((self.pos, self.value))
# Turns a matrix into one containing Nodes
@staticmethod
def nodify(matrix):
for y, row in enumerate(matrix):
for x, item in enumerate(row):
node = Node(y, x, item)
matrix[y][x] = node
return matrix
# Takes apart a matrix with Nodes to just contain the values
@staticmethod
def denodify(matrix):
for y, row in enumerate(matrix):
for x, node in enumerate(row):
matrix[y][x] = node.value
return matrix
# For comparison
def __eq__(self, other):
if isinstance(other, self.__class__):
return self.value == other.value and self.pos == other.pos
return False
# A global set containing nodes already visited so that infinite loops do not occur
already_checked = set()
# A recursive method returning the continuous network of value_neighbors in a matrix starting from a node
def connected_values(node, matrix):
global already_checked
already_checked.add(node)
nodes = []
for value_neighbor in node.value_neighbors(matrix):
nodes.append(value_neighbor)
if value_neighbor not in already_checked:
nodes += connected_values(value_neighbor, matrix)
return nodes
# A method that gets all of the connected values networks in a matrix
def all_connected_values(matrix):
global already_checked
groups = []
for row in matrix:
for node in row:
already_checked = set()
values = set(connected_values(node, matrix))
values.add(node)
groups.append(values)
# Formats the networks and prints them out. A set is used so that duplicate networks are not shown
print '\n'.join({str(group[0].value) + ": " + ', '.join(map(str, sorted([(node.pos.column, node.pos.row) for node in group], key=lambda t:[t[0],t[1]]))) for group in map(list, groups)})
print
# Prints out the original matrix
print '\n'.join(map(str, Node.denodify(matrix)))
# Example matrix, doesn't necessarily have to be rectangular
example_matrix = [
[1, 1, 1, 2],
[2, 1, 2, 2],
[4, 1, 1, 0],
[4, 4, 1, 0]
]
Node.nodify(example_matrix)
all_connected_values(example_matrix)
Here is an example of a non-rectangular matrix:
[
[1, 1, 1, 2],
[2, 1, 2, 2],
[4, 1, 1, 0],
[4, 4, 1, 0],
[9, 0, 9, 9, 8, 1],
[5, 0, 0, 5, 4],
[5, 5, 7, 2, 9],
[],
[5, 5, 7, 2, 9]
]
Edit: explanation of program purpose
Here is an image that describes what the program is finding in the first example. The output of the program is comprised of lines starting with the value, then a colon, and then the coordinates of the set of connected nodes with that value.
This program can be thought of as finding the ‘islands’ in a matrix. For example, this ‘map’ has the blue and red islands with water in between. This program would tell you which areas are whose, and where the water is:
~BB~~~~RRR
~~BB~~~~~~
~~~~~~R~~~
~~B~~RR~~~
~~~~~~~~~~
This is really what my program is for: finding the ‘islands’ that are marked with the same value. By the way, this map can be put through my program if you format it as a 2d array first.
3 Answers 3
Position
can simply be a namedtuple
. That will take care of everything in your current class definition. It also allows tuple unpacking such as y, x = self.pos
. But actually you never use row
and column
separately so plain tuples would also work.
except: pass
will lead to very frustrating debugging if something changes and goes wrong. Don't do it. You don't need to catch exceptions here because you can use conditions. If they don't work you can know you've done something wrong.
Rather than repeating the same code 4 times, define a (possibly inner) function which checks both candidate values of x
and y
and appends a node if possible.
You could inherit Node
from a namedtuple
to handle the boilerplate magic method definitions.
In Python it's atypical for a function to both modify the argument and return it. Returning something usually indicates a new, independent version with the original unchanged.
This is horrifying:
print '\n'.join({str(group[0].value) + ": " + ', '.join(map(str, sorted([(node.pos.column, node.pos.row) for node in group], key=lambda t:[t[0],t[1]]))) for group in map(list, groups)})
At the very least you could break it up into multiple lines and use indentation properly, e.g.
print('\n'.join({
str(group[0].value) + ": " + ', '.join(map(str,
sorted([(node.pos.column, node.pos.row)
for node in group],
key=lambda t: [t[0], t[1]])))
for group in map(list, groups)}))
But really you should use intermediate variables and functions to make it manageable.
key=lambda t: [t[0], t[1]]
is redundant, that's the default sort order.
Use pprint
to print things nicely, it's much easier.
A few more points to extend Alex Hall's great answer:
global already_checked
is not a great sign. Even thoughall_connected_values
resets it, it complicates reusability and testing. I believe that defining it locally in the initial recursive call should work:def _connected_values(node, matrix, already_checked=None): already_checked = already_checked or set() ... # recur? _connected_values(node, matrix, already_checked) def all_connected_values(matrix): ... _connected_values(node, matrix)
It's a little awkward that you need to
nodify
before callingall_connected_values
, but then itdenodify
s before returning. (And in a print statement, no less, which a reader probably won't expect to be mutating things.)If the extra memory usage is ok, maybe consider having
all_connected_values
take in the matrix as-is, and internally create a nodified copy for its usage.This:
example_matrix = [ [1, 1, 1, 2], [2, 1, 2, 2], [4, 1, 1, 0], [4, 4, 1, 0] ]
would more conventionally be left-indented:
example_matrix = [ [1, 1, 1, 1], ... ]
Seconding the recommendation to use
namedtuple
s.
API
It's important to think about how somebody can use a program. This program is only useful for printing connected positions. It cannot be used to answer questions like:
- How many connected components are there?
- How many nodes are in component X?
Users also don't have any control over how the network is printed. And since sets are unordered, the output may be different for the same input, which is usually undesirable.
I suggest to rework the program so that it can answer the above questions, and that it produces consistent output for identical inputs.
global
You should avoid using global
as much as possible.
And in this task, it's certainly possible.
You can for example pass the already_checked
around as an additional parameter to functions that need it,
or it could be a property of a class.
Whichever alternative would be better than using a global variable.
Incidentally, even without a substantial rewrite,
you don't actually need the global
keyword in this program.
The global
keyword is useful when you want to reassign a global value.
For example in connected_values
you don't reassign the value of already_checked
, so you can simply drop the line global already_checked
, it's absolutely pointless.
In all_connected_values
you reassign it, but you don't need to.
You could replace already_checked = set()
with already_checked.clear()
.
It's important to understand the difference of these alternatives.
Program organization
The program is poorly organized.
Some functionality is in Node
,
other in free functions,
and there are calls between one another.
I suggest to think carefully about the responsibilities that each class and function should have, ideally just one each, and rework the design.