Size of the largest connected component in a grid in Python

Question 1

Given an R x C grid of 1s and 0s (or True and False values), I need a function that can find the size of the largest connected component of 1s. For example, for the following grid,

grid = [[0, 1, 0, 1],
 [1, 1, 1, 0],
 [0, 1, 0, 0],
 [0, 0, 0, 1]]

The answer is 5.

Here is my implementation:

def largest_connected_component(nrows, ncols, grid):
 """Find largest connected component of 1s on a grid."""
 def traverse_component(i, j):
 """Returns no. of unseen elements connected to (i,j)."""
 seen[i][j] = True
 result = 1
 # Check all four neighbours
 if i > 0 and grid[i-1][j] and not seen[i-1][j]:
 result += traverse_component(i-1, j)
 if j > 0 and grid[i][j-1] and not seen[i][j-1]:
 result += traverse_component(i, j-1)
 if i < len(grid)-1 and grid[i+1][j] and not seen[i+1][j]:
 result += traverse_component(i+1, j)
 if j < len(grid[0])-1 and grid[i][j+1] and not seen[i][j+1]:
 result += traverse_component(i, j+1)
 return result
 seen = [[False] * ncols for _ in range(nrows)]
 # Tracks size of largest connected component found
 component_size = 0
 for i in range(nrows):
 for j in range(ncols):
 if grid[i][j] and not seen[i][j]:
 temp = traverse_component(i, j)
 if temp > component_size:
 component_size = temp
 return component_size

Feel free to use the following code to generate random grids to test the function,

from random import randint
N = 20
grid = [[randint(0,1) for _ in range(N)] for _ in range(N)]

Problem: My implementation runs too slow (by about a factor of 3). Since I wrote this as a naive approach by myself, I am guessing there are clever optimizations that can be made.

Context: This is for solving the Gridception problem from Round 2 of Google Codejam 2018. My goal is to solve the problem in Python 3. As a result, there is a hard constraint of using only the Python 3 standard library.

I have figured out that this particular portion of the full solution is my performance bottleneck and thus, my solution fails to clear the Large Input due to being too slow.

Thank you so much!

Edit: Adding some timeit benchmarks

For a randomly generated 20 x 20 grid, my implementation takes 219 +/- 41 μs (a grid of 0s takes 30 μs, and a grid of 1s takes 380 μs).

Question 2

This question is very similar to counting the connected components. Maybe the approach using a set can help speed it up a little.

Question 3

In programming challenges, proper performances improvement usually come from a smarter algorithm. Unfortunately, I have no algorithm better than the one you have implemented.

I have found a single trick to shave off some time.

Remove all the logic around seen

In all places where you access elements from grid and seen, we have basically: if grid[pos] and not seen[pos].

An idea could be to update grid in place to remove seen elements from it. From an engineering point of view, it is not very nice: I would not expect an function computing the size of the biggest connected components to update the provided input. For a programming challenge, we can probably accept such a thing...

We'd get:

def largest_connected_component(nrows, ncols, grid):
 """Find largest connected component of 1s on a grid."""
 def traverse_component(i, j):
 """Returns no. of unseen elements connected to (i,j)."""
 grid[i][j] = False
 result = 1
 # Check all four neighbours
 if i > 0 and grid[i-1][j]:
 result += traverse_component(i-1, j)
 if j > 0 and grid[i][j-1]:
 result += traverse_component(i, j-1)
 if i < len(grid)-1 and grid[i+1][j]:
 result += traverse_component(i+1, j)
 if j < len(grid[0])-1 and grid[i][j+1]:
 result += traverse_component(i, j+1)
 return result
 # Tracks size of largest connected component found
 component_size = 0
 for i in range(nrows):
 for j in range(ncols):
 if grid[i][j]:
 temp = traverse_component(i, j)
 if temp > component_size:
 component_size = temp
 return component_size

Another idea in order to do the same type of things without changing grid could be to store "positive" elements in a set. This also remove the need to check for edge cases of the grid. The great thing is that we can populate that set with less array accesses. This is still pretty hackish:

def largest_connected_component(nrows, ncols, grid):
 """Find largest connected component of 1s on a grid."""
 def traverse_component(pos):
 """Returns no. of unseen elements connected to (i,j)."""
 elements.remove(pos)
 i, j = pos
 result = 1
 # Check all four neighbours
 for new_pos in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]:
 if new_pos in elements:
 result += traverse_component(new_pos)
 return result
 # Tracks size of largest connected component found
 elements = set()
 for i, line in enumerate(grid):
 for j, cell in enumerate(line):
 if cell:
 elements.add((i, j))
 return max(traverse_component(pos) for pos in set(elements) if pos in elements)

Edit: rewriting the solution to avoid the copy of elements, we have a slightly faster solution:

def largest_connected_component(nrows, ncols, grid):
 """Find largest connected component of 1s on a grid."""
 def traverse_component(pos):
 """Returns no. of unseen elements connected to (i,j)."""
 i, j = pos
 result = 1
 # Check all four neighbours
 for new_pos in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]:
 if new_pos in elements:
 elements.remove(new_pos)
 result += traverse_component(new_pos)
 return result
 # Tracks size of largest connected component found
 elements = set((i, j) for i, line in enumerate(grid) for j, cell in enumerate(line) if cell)
 largest = 0
 while elements:
 pos = elements.pop()
 largest = max(largest, traverse_component(pos))
 return largest

Question 4

Clever! Regarding the thought about a smarter algorithm, I think there may be a smarter algorithm for the overall program, given that this is only one part of a larger program. Speaking of which, you should also note that the list will need to be copied when passed to the function if the list is going to be used later, because of the way lists references work in Python.

Question 5

@Graham yes, that's the whole issue with updating the structure in place. I tried to add a call to copy.deepcopy but we (obviously) lose all the benefits we had from not defining a seen matrix. Also I've updated my answer to add a quick alternative but I didn't get a chance to perform full benchmarks before I left my computer. We'll see next year! Have a good evening!

Question 6

I can assure you, given the rest of my code, that none of the mutable variables here are used again, so they can be sacrificed in place if needed!

Question 7

@Josay Is there a particular reason you import itertools in the last code snippet?

Question 8

@XYZT just a leftover from things I tried. I've edited my code.

Question 9

Are you absolutely sure this is the bottleneck? Looking at the linked problem and the solution analysis, I'm not sure if this component is even needed to solve the given problem. It's possible your overall algorithm is inefficient in some way, but obviously I couldn't really tell unless I saw the whole program that this is part of.

@Josay's already given a good improvement, but in the grand scheme of things, this doesn't really shave off that much measurable time for larger grids. The original solution was a pretty good algorithm for solving the problem of largest connected subsections.

General comments

Having three lines here is unnecessary:

temp = traverse_component(i, j)
if temp > component_size:
 component_size = temp

because of one nice Python built-in, max:

component_size = max(component_size, traverse_component(i,j))

component_size could be named more descriptively as largest_size.

Question 10

The analysis for the problem says, "For each quadrant center and combination of colors, we want to get the largest connected component where each cell in this connected component has the same color as the color assigned to the quadrant it belongs to." Is this not what I am trying to do? Or did I misunderstand?

Question 11

@XYZT How do you connect the four quadrants to measure the overall component size?

Question 12

I hope it's appropriate to link my whole code here: gist.github.com/theXYZT/f13ac490e842552a2f470afac1001b7b

Question 13

@XYZT: For that it would probably be better to ask a new question with the whole code.

Question 14

As an alternative, I will suggest a solution that instead of iterating through all of the grid's nodes in search of a component that was not visited in the main loop, it avoids visiting a visited node in this loop via a modified BFS.

The gist of this algorithm is to perform an explorative BFS, which triggers a DFS (like your algorithm) to measure the size of the group.

Difference being that during the DFS step, it adds empty (non-visited) slots to the BFS queue, thus avoiding the need to re-check the component group.

The time complexity for this algorithm is O(nm), but it comes at an added space complexity cost of O(nm), due to the BFS queue growth on the regular BFS algorithm and the added DFS items

from collections import deque
EMPTY = 0
FILLED = 1
VISITED = 2
def largest_connected_component(grid):
 def is_valid(x, y):
 return (0 <= x < len(grid) and 0 <= y < len(grid[0]) and
 grid[x][y] != VISITED)
 def traverse_component(x, y):
 grid[x][y] = VISITED
 result = 1
 for adjacent in ((x + 1, y),
 (x - 1, y),
 (x, y + 1),
 (x, y - 1)):
 if (is_valid(*adjacent)):
 if (grid[adjacent[0]][adjacent[1]] == EMPTY):
 q.append(adjacent)
 grid[adjacent[0]][adjacent[1]] = VISITED
 else:
 result += traverse_component(*adjacent)
 return result
 max_filled_size = 0
 q = deque()
 if (grid[0][0] == EMPTY):
 q.append((0, 0))
 grid[0][0] = VISITED
 else:
 max_filled_size = max(max_filled_size, traverse_component(0, 0))
 while q:
 x, y = q.popleft()
 for adjacent in ((x + 1, y),
 (x - 1, y),
 (x, y + 1),
 (x, y - 1)):
 if (is_valid(*adjacent)):
 if (grid[adjacent[0]][adjacent[1]] == EMPTY):
 q.append(adjacent)
 grid[adjacent[0]][adjacent[1]] = VISITED
 else: # FILLED
 max_filled_size = max(max_filled_size, traverse_component(*adjacent))
 return max_filled_size
# Examples
print(largest_connected_component([[0, 1, 0], [1, 0, 1], [0, 1, 1]]))
print(largest_connected_component([[1, 1, 1], [0, 1, 0], [1, 0, 1]]))
print(largest_connected_component([[1, 0, 0, 1, 1, 1, 1, 0],
 [1, 0, 1, 0, 0, 0, 0, 1],
 [1, 0, 1, 0, 0, 1, 0, 1],
 [1, 0, 0, 0, 1, 0, 1, 0],
 [0, 0, 0, 0, 1, 1, 1, 0],
 [1, 0, 1, 1, 0, 0, 1, 0],
 ]))

SylvainD 29.8k1 gold badge49 silver badges93 bronze badges · Answer 1 · 2018-12-31 16:00:42Z

In programming challenges, proper performances improvement usually come from a smarter algorithm. Unfortunately, I have no algorithm better than the one you have implemented.

I have found a single trick to shave off some time.

Remove all the logic around seen

In all places where you access elements from grid and seen, we have basically: if grid[pos] and not seen[pos].

An idea could be to update grid in place to remove seen elements from it. From an engineering point of view, it is not very nice: I would not expect an function computing the size of the biggest connected components to update the provided input. For a programming challenge, we can probably accept such a thing...

We'd get:

def largest_connected_component(nrows, ncols, grid):
 """Find largest connected component of 1s on a grid."""
 def traverse_component(i, j):
 """Returns no. of unseen elements connected to (i,j)."""
 grid[i][j] = False
 result = 1
 # Check all four neighbours
 if i > 0 and grid[i-1][j]:
 result += traverse_component(i-1, j)
 if j > 0 and grid[i][j-1]:
 result += traverse_component(i, j-1)
 if i < len(grid)-1 and grid[i+1][j]:
 result += traverse_component(i+1, j)
 if j < len(grid[0])-1 and grid[i][j+1]:
 result += traverse_component(i, j+1)
 return result
 # Tracks size of largest connected component found
 component_size = 0
 for i in range(nrows):
 for j in range(ncols):
 if grid[i][j]:
 temp = traverse_component(i, j)
 if temp > component_size:
 component_size = temp
 return component_size

Another idea in order to do the same type of things without changing grid could be to store "positive" elements in a set. This also remove the need to check for edge cases of the grid. The great thing is that we can populate that set with less array accesses. This is still pretty hackish:

def largest_connected_component(nrows, ncols, grid):
 """Find largest connected component of 1s on a grid."""
 def traverse_component(pos):
 """Returns no. of unseen elements connected to (i,j)."""
 elements.remove(pos)
 i, j = pos
 result = 1
 # Check all four neighbours
 for new_pos in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]:
 if new_pos in elements:
 result += traverse_component(new_pos)
 return result
 # Tracks size of largest connected component found
 elements = set()
 for i, line in enumerate(grid):
 for j, cell in enumerate(line):
 if cell:
 elements.add((i, j))
 return max(traverse_component(pos) for pos in set(elements) if pos in elements)

Edit: rewriting the solution to avoid the copy of elements, we have a slightly faster solution:

def largest_connected_component(nrows, ncols, grid):
 """Find largest connected component of 1s on a grid."""
 def traverse_component(pos):
 """Returns no. of unseen elements connected to (i,j)."""
 i, j = pos
 result = 1
 # Check all four neighbours
 for new_pos in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]:
 if new_pos in elements:
 elements.remove(new_pos)
 result += traverse_component(new_pos)
 return result
 # Tracks size of largest connected component found
 elements = set((i, j) for i, line in enumerate(grid) for j, cell in enumerate(line) if cell)
 largest = 0
 while elements:
 pos = elements.pop()
 largest = max(largest, traverse_component(pos))
 return largest

Clever! Regarding the thought about a smarter algorithm, I think there may be a smarter algorithm for the overall program, given that this is only one part of a larger program. Speaking of which, you should also note that the list will need to be copied when passed to the function if the list is going to be used later, because of the way lists references work in Python.
@Graham yes, that's the whole issue with updating the structure in place. I tried to add a call to copy.deepcopy but we (obviously) lose all the benefits we had from not defining a seen matrix. Also I've updated my answer to add a quick alternative but I didn't get a chance to perform full benchmarks before I left my computer. We'll see next year! Have a good evening!
I can assure you, given the rest of my code, that none of the mutable variables here are used again, so they can be sacrificed in place if needed!
@Josay Is there a particular reason you import itertools in the last code snippet?
@XYZT just a leftover from things I tried. I've edited my code.

Graham 1,4199 silver badges16 bronze badges · Answer 2 · 2018-12-31 16:49:14Z

Are you absolutely sure this is the bottleneck? Looking at the linked problem and the solution analysis, I'm not sure if this component is even needed to solve the given problem. It's possible your overall algorithm is inefficient in some way, but obviously I couldn't really tell unless I saw the whole program that this is part of.

@Josay's already given a good improvement, but in the grand scheme of things, this doesn't really shave off that much measurable time for larger grids. The original solution was a pretty good algorithm for solving the problem of largest connected subsections.

General comments

Having three lines here is unnecessary:

temp = traverse_component(i, j)
if temp > component_size:
 component_size = temp

because of one nice Python built-in, max:

component_size = max(component_size, traverse_component(i,j))

component_size could be named more descriptively as largest_size.

The analysis for the problem says, "For each quadrant center and combination of colors, we want to get the largest connected component where each cell in this connected component has the same color as the color assigned to the quadrant it belongs to." Is this not what I am trying to do? Or did I misunderstand?
@XYZT How do you connect the four quadrants to measure the overall component size?
I hope it's appropriate to link my whole code here: gist.github.com/theXYZT/f13ac490e842552a2f470afac1001b7b
@XYZT: For that it would probably be better to ask a new question with the whole code.

Pedro Lopes 1115 bronze badges · Answer 3 · 2019-05-23 18:54:26Z

As an alternative, I will suggest a solution that instead of iterating through all of the grid's nodes in search of a component that was not visited in the main loop, it avoids visiting a visited node in this loop via a modified BFS.

The gist of this algorithm is to perform an explorative BFS, which triggers a DFS (like your algorithm) to measure the size of the group.

Difference being that during the DFS step, it adds empty (non-visited) slots to the BFS queue, thus avoiding the need to re-check the component group.

The time complexity for this algorithm is O(nm), but it comes at an added space complexity cost of O(nm), due to the BFS queue growth on the regular BFS algorithm and the added DFS items

from collections import deque
EMPTY = 0
FILLED = 1
VISITED = 2
def largest_connected_component(grid):
 def is_valid(x, y):
 return (0 <= x < len(grid) and 0 <= y < len(grid[0]) and
 grid[x][y] != VISITED)
 def traverse_component(x, y):
 grid[x][y] = VISITED
 result = 1
 for adjacent in ((x + 1, y),
 (x - 1, y),
 (x, y + 1),
 (x, y - 1)):
 if (is_valid(*adjacent)):
 if (grid[adjacent[0]][adjacent[1]] == EMPTY):
 q.append(adjacent)
 grid[adjacent[0]][adjacent[1]] = VISITED
 else:
 result += traverse_component(*adjacent)
 return result
 max_filled_size = 0
 q = deque()
 if (grid[0][0] == EMPTY):
 q.append((0, 0))
 grid[0][0] = VISITED
 else:
 max_filled_size = max(max_filled_size, traverse_component(0, 0))
 while q:
 x, y = q.popleft()
 for adjacent in ((x + 1, y),
 (x - 1, y),
 (x, y + 1),
 (x, y - 1)):
 if (is_valid(*adjacent)):
 if (grid[adjacent[0]][adjacent[1]] == EMPTY):
 q.append(adjacent)
 grid[adjacent[0]][adjacent[1]] = VISITED
 else: # FILLED
 max_filled_size = max(max_filled_size, traverse_component(*adjacent))
 return max_filled_size
# Examples
print(largest_connected_component([[0, 1, 0], [1, 0, 1], [0, 1, 1]]))
print(largest_connected_component([[1, 1, 1], [0, 1, 0], [1, 0, 1]]))
print(largest_connected_component([[1, 0, 0, 1, 1, 1, 1, 0],
 [1, 0, 1, 0, 0, 0, 0, 1],
 [1, 0, 1, 0, 0, 1, 0, 1],
 [1, 0, 0, 0, 1, 0, 1, 0],
 [0, 0, 0, 0, 1, 1, 1, 0],
 [1, 0, 1, 1, 0, 0, 1, 0],
 ]))

Stack Exchange Network

Size of the largest connected component in a grid in Python

3 Answers 3

General comments

You must log in to answer this question.

Linked

Hot Network Questions

Size of the largest connected component in a grid in Python

3 Answers 3

General comments

You must log in to answer this question.

Linked

Related

Hot Network Questions