I solved this problem: A. Ice Skating, Codeforces
My approach was to create a grid from the given snow drift coordinates, convert that grid into a graph, and then count how many disconnected components there are. The key observation for this problem was that any disconnected component can be connected to another by adding one snow drift, so we can assume that we connect everything to the component that contains node 0.
#include <iostream>
#include <vector>
#include <list>

using namespace std;

void dfs(vector<list<int>> &adj_list, int node, vector<bool> &visited)
{
    visited[node] = true;
    for (const int ngbr : adj_list[node])
    {
        if (visited[ngbr])
            continue;
        dfs(adj_list, ngbr, visited);
    }
}

int main()
{
    ios_base::sync_with_stdio(false);
    cin.tie(nullptr);
    cout.tie(nullptr);
    int n;
    cin >> n;
    vector<vector<int>> grid(1000, vector<int>(1000, -1));
    int node = 0;
    for (int i = 0; i < n; ++i)
    {
        int x, y;
        cin >> x >> y;
        --x;
        --y;
        grid[y][x] = node;
        ++node;
    }
    vector<list<int>> adj_list(n, list<int>());
    for (int i = 0; i < 1000; ++i)
    {
        for (int j = 0; j < 1000; ++j)
        {
            if (grid[i][j] == -1)
                continue;
            for (int k = 0; k < 1000; ++k)
            {
                if (k != j && grid[i][k] != -1)
                    adj_list[grid[i][j]].push_back(grid[i][k]);
                if (k != i && grid[k][j] != -1)
                    adj_list[grid[i][j]].push_back(grid[k][j]);
            }
        }
    }
    vector<bool> visited(n);
    dfs(adj_list, 0, visited);
    int ans = 0;
    for (int i = 1; i < n; ++i)
    {
        if (!visited[i])
        {
            ++ans;
            dfs(adj_list, i, visited);
        }
    }
    cout << ans << '\n';
}
All test cases passed, but I'm not fully convinced about the complexity. I think it's O(10^6 + 10^5) (or maybe 10^8?): scanning the grid costs 10^6, and the innermost loop only executes for occupied cells, of which there are at most N (<= 100), each costing 10^3 iterations, so roughly 10^5 on top of the scan.
It's not very performant, so I wonder how I could improve the time complexity of this?
1 Answer
meaningful identifier
The name dfs is great, but main is rather long, and generically named. Consider having it delegate tasks to appropriately named helper function(s), such as one that's responsible for reading in the problem setup.
magic numbers
for (int j = 0; j < 1000; ++j)
{
...
for (int k = 0; k < 1000; ++k)
Those come straight from the problem statement. Still, they deserve a constexpr, or at least a #define.
sparsity
We have a hundred drifts on a grid of a million cells, which is to say it's a sparse setup, with 0.01% cell occupancy.
grid[y][x] = node;
That's a somewhat expensive assignment in terms of memory, and the time to scan a raster for node occupancy is especially expensive.
When thinking about this, you may as well assume the world_size parameter can be bigger than \10ドル^3\$, and might be e.g. \10ドル^5\$ or other arbitrarily large numbers. Armed with a solution for such a world, you can easily conquer the stated problem.
So rather than testing individual cell occupancy, we will want a pair of datastructures that can be queried by \$x\$ or \$y\$ coordinate. Sorting might prove useful, but for now a hash map should suffice. For a given coordinate it may need to produce the set of all matching drifts.
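A minimal sketch of such an index (the struct and member names are mine; an unordered_map keyed by coordinate, holding the ids of the matching drifts):

```cpp
#include <unordered_map>
#include <vector>

using namespace std;

// Query-by-coordinate index: by_x[x] holds the ids of all drifts with
// that x coordinate, by_y[y] those with that y coordinate.
struct DriftIndex
{
    unordered_map<int, vector<int>> by_x, by_y;

    void add(int id, int x, int y)
    {
        by_x[x].push_back(id);
        by_y[y].push_back(id);
    }

    // All drifts sharing the given x coordinate (empty if none).
    const vector<int> &in_column(int x) { return by_x[x]; }
    const vector<int> &in_row(int y) { return by_y[y]; }
};
```

Memory is proportional to the number of drifts, independent of world_size.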
components
Trivially we could consider each drift a connected component of size \1ドル\$, and then greedily connect them. But given the cost of one drift per connection, would we be better off first producing connected components, labeling the biggest one "Giant", and then connecting the remaining components to it in descending size order? Let's see.
Would it make a difference? Well, consider a world of four drifts containing an N-S component and a disconnected E-W component. Clearly we should add a drift. But notice that it doesn't matter where exactly we put it, since only connectivity matters. To minimize travel times we would place it in the middle, but since that's not part of our cost function, any of the four other (corner) locations would do just as well.
Now imagine that we had a pair of "L"-shaped disconnected components, each of size \3ドル\$. Does it matter where we place that one additional drift? No. It would only make a difference if placing a single drift could connect three disconnected components, and that can't happen: all drifts sharing a row already belong to a single component, as do all drifts sharing a column, so a new drift's row contributes at most one existing component and its column at most one more. So we can rest easy that the details of new drift coordinates don't really matter -- we only care that the drift we've placed induces a new edge in the connectivity graph.
algorithm
- Arbitrarily put the first drift in the "Giant" connected component.
- Pick an arbitrary disconnected (unclassified) node. Assess whether it's already connected to Giant and skip it if it is. (It might already be connected to some other component -- we don't care.) Use its \$x\$ coordinate to connect it to Giant via a new drift, arbitrarily ignoring \$y\$.
- Lather, rinse, return to step 2.
Hmmm, turns out a hash map suffices for this. And actually, we don't even need to query it for all matching drifts. It's enough to always use the \$y\$ coordinate of Giant's first node when constructing a new drift.
A hash map is going to be the winning datastructure here, for memory cost proportional to the number of drifts. But interestingly, a still fairly modest allocation of \2ドル \times 10^3\$ elements lets us efficiently solve the problem using just array references, without hashing, perhaps in a more cache-friendly and performant manner. Each coordinate axis needs just 125 bytes, a pair of cache lines, for a bit vector. Or keep it simple and use byte vectors.
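For completeness, a sketch of that array-backed variant (names and sizing are my assumptions; like the original code, it expects coordinates already shifted to be 0-based). It records only the first drift per coordinate value, which is all the merging step needs:

```cpp
#include <array>

using namespace std;

constexpr int world_size = 1000;  // grid bound from the problem statement

// Array-backed alternative to hashing: one slot per coordinate value,
// -1 meaning "no drift seen here yet", otherwise the first drift's id.
struct CoordIndex
{
    array<int, world_size> first_by_x, first_by_y;

    CoordIndex()
    {
        first_by_x.fill(-1);
        first_by_y.fill(-1);
    }

    // Returns an earlier drift sharing column x (or -1), then records id.
    int claim_x(int x, int id)
    {
        int prev = first_by_x[x];
        if (prev == -1)
            first_by_x[x] = id;
        return prev;
    }

    // Same for row y.
    int claim_y(int y, int id)
    {
        int prev = first_by_y[y];
        if (prev == -1)
            first_by_y[y] = id;
        return prev;
    }
};
```

Whenever claim_x or claim_y returns a non-negative id, that's a union-find merge; the rest of the component counting is unchanged.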