I'm trying to optimize a function in a process which is going to be run millions of times. The function partitions the vertices of a graph, and the partition is then used to build a quotient graph. Theoretically, the process is:

- Input a graph, G
- For each vertex v in G, choose a random edge incident to v (the same edge can be chosen for two different vertices)
- Form the subgraph S on the vertices of G, keeping only the edges chosen in the previous step
- Partition the vertices by which connected component of S they lie in

Initially, I simply created S in NetworkX, but found that the process of creating a proper graph object was unnecessarily slow. Next, I tried to recreate the bare bones of the process: I built an adjacency dictionary locally and ran the connected-components routine from the NetworkX documentation on it. Finally, I found it fastest to simply keep a dictionary recording which connected component each vertex is in, yet that still seems a bit redundant, as for each edge chosen I need to update the dictionary for the entire connected component.
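For context, the partition returned by the functions below plugs straight into something like NetworkX's `nx.quotient_graph`; this is just a sketch of the surrounding use, not the hot loop itself:

```python
import networkx as nx

# Sketch: build a quotient graph from one random partition.
G = nx.grid_2d_graph(5, 5)
P = sp3(G)                   # list of frozensets of vertices
Q = nx.quotient_graph(G, P)  # one node per block of the partition
```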
My fastest version:
```python
def sp3(G):
    nodes = list(G.nodes)
    P = []
    dic = {}
    # Start with every vertex in its own singleton component.
    for v in nodes:
        dic[v] = frozenset([v])
    for v in nodes:
        # Choose a random incident edge and merge the components of
        # its endpoints; every member of the merged component has to
        # be repointed at the new frozenset.
        nb = random.choice(list(G.neighbors(v)))
        U = frozenset.union(dic[v], dic[nb])
        for a in U:
            dic[a] = U
    # Collect the components; each appears once per member vertex,
    # so deduplicate through a set.
    for v in nodes:
        P.append(dic[v])
    P = list(set(P))
    return P
```
Slower version:
```python
def sp1(G):
    nodes = list(G.nodes)
    # Adjacency dict of the subgraph S: only the randomly chosen edges.
    dic = {v: set() for v in nodes}
    for v in nodes:
        nb = random.choice(list(G.neighbors(v)))
        dic[v].add(nb)
        dic[nb].add(v)
    # The partition is the set of connected components of S.
    P = list(CCbfs(dic))
    return P

def CCbfs(dic):
    # Yield each connected component exactly once, as a frozenset.
    seen = set()
    for v in dic.keys():
        if v not in seen:
            c = frozenset(bfs(dic, v))
            yield c
            seen.update(c)

def bfs(dic, n):
    # Plain breadth-first search over an adjacency dict, yielding
    # every vertex reachable from n.
    d = dic
    seen = set()
    nextlevel = {n}
    while nextlevel:
        thislevel = nextlevel
        nextlevel = set()
        for v in thislevel:
            if v not in seen:
                yield v
                seen.add(v)
                nextlevel.update(d[v])
```
I've been using cProfile and got the following on a 2690-vertex, nearly maximal planar graph (in reality I'm working with ~600-vertex subgraphs of this graph, but for convenience I just ran it on the whole thing):
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   99.996   99.996 {built-in method builtins.exec}
        1    0.000    0.000   99.996   99.996 <string>:1(<module>)
        1    1.083    1.083   99.996   99.996 Imp2.py:1201(SPit)
     2500   12.531    0.005   52.974    0.021 Imp2.py:1108(sp1)
     2500   22.591    0.009   45.939    0.018 Imp2.py:1085(sp3)
 13450000    9.168    0.000   29.084    0.000 random.py:256(choice)
 13450000   12.218    0.000   18.138    0.000 random.py:224(_randbelow)
   768750    2.602    0.000   13.118    0.000 Imp2.py:149(CCbfs)
  7491250    7.274    0.000    9.910    0.000 Imp2.py:156(bfs)
 13450000    6.418    0.000    9.225    0.000 graph.py:1209(neighbors)
     2500    6.728    0.003    6.728    0.003 Imp2.py:1110(<dictcomp>)
 20988370    4.325    0.000    4.325    0.000 {method 'getrandbits' of '_random.Random' objects}
  6725000    3.312    0.000    3.312    0.000 {method 'union' of 'frozenset' objects}
 13455000    2.810    0.000    2.810    0.000 {built-in method builtins.iter}
 20175000    2.541    0.000    2.541    0.000 {method 'add' of 'set' objects}
  7491250    2.329    0.000    2.329    0.000 {method 'update' of 'set' objects}
 13455000    1.780    0.000    1.780    0.000 {built-in method builtins.len}
 13450000    1.595    0.000    1.595    0.000 {method 'bit_length' of 'int' objects}
  6725000    0.640    0.000    0.640    0.000 {method 'append' of 'list' objects}
     5000    0.029    0.000    0.041    0.000 graph.py:646(nodes)
     5000    0.012    0.000    0.012    0.000 reportviews.py:167(__init__)
     5000    0.006    0.000    0.009    0.000 reportviews.py:174(__iter__)
     5000    0.003    0.000    0.006    0.000 reportviews.py:171(__len__)
     2500    0.001    0.000    0.001    0.000 {method 'keys' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
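For reference, a run like the one above could be produced along these lines, using the `SPit` harness from the edit below (the exact invocation here is my assumption, chosen to match the 2500 `ncalls` for `sp1` and `sp3`):

```python
import cProfile

# Hypothetical driver: profile 2500 rounds of sp1 and sp3.
cProfile.run('SPit(2500)', sort='cumulative')
```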
It seems like I spend so much time just going through the for loops in `sp3`... is there some way to fix this? (Also, I've been told to check if there's a faster implementation than `random.choice` when I'm using really small lists, usually of size 6, give or take, so any insight on that is welcome.)
Edit: here's something to test the code. In reality I'm using an adjacency matrix created from some text files, but it's really messy, and this should mostly capture the relevant properties:
```python
import networkx as nx
import random

def SPit(n, m=25):  # the original `def SPit(m=25,n)` is a syntax error:
                    # a non-default argument cannot follow a default one
    G = nx.grid_2d_graph(m, m)
    i = 0
    while i < n:
        x1 = sp1(G)
        x3 = sp3(G)
        i += 1

def sp3(G):
    nodes = list(G.nodes)
    P = []
    dic = {}
    for v in nodes:
        dic[v] = frozenset([v])
    for v in nodes:
        nb = random.choice(list(G.neighbors(v)))
        U = frozenset.union(dic[v], dic[nb])
        for a in U:
            dic[a] = U
    for v in nodes:
        P.append(dic[v])
    P = list(set(P))
    return P

def sp1(G):
    nodes = list(G.nodes)
    dic = {v: set() for v in nodes}
    for v in nodes:
        nb = random.choice(list(G.neighbors(v)))
        dic[v].add(nb)
        dic[nb].add(v)
    P = list(CCbfs(dic))
    return P

def CCbfs(dic):
    seen = set()
    for v in dic.keys():
        if v not in seen:
            c = frozenset(bfs(dic, v))
            yield c
            seen.update(c)

def bfs(dic, n):
    d = dic
    seen = set()
    nextlevel = {n}
    while nextlevel:
        thislevel = nextlevel
        nextlevel = set()
        for v in thislevel:
            if v not in seen:
                yield v
                seen.add(v)
                nextlevel.update(d[v])
```
1 Answer
I'm just going to review `sp3`.
```
P = []
<snip 8 lines>
for v in nodes:
    P.append(dic[v])
```
It's generally preferable to declare your variables as near as possible to the first use. That way someone reading the code has to keep less information in their memory.
`P` is not an informative name. I would usually guess that it's short for `Point`, but that doesn't seem to fit.
```python
for v in nodes:
    P.append(dic[v])
P = list(set(P))
return P
```
This could be shortened to

```python
return list(set(dic.values()))
```

But it's not obvious why it's important to return a `list` rather than a `set`.
```python
dic = {}
for v in nodes:
    dic[v] = frozenset([v])
for v in nodes:
    nb = random.choice(list(G.neighbors(v)))
    U = frozenset.union(dic[v],dic[nb])
    for a in U:
        dic[a] = U
```
Again, `dic` is not a very informative name. I can see it's a dictionary from the declaration, but what does it mean? Perhaps `components` would be a better name?
Again, I don't understand the point of the conversion to `list`. `random.choice` is documented as accepting any sequence. If you've profiled and discovered that it's faster when passed a list, that's the kind of valuable information which should be captured in a comment.
This algorithm is not very efficient. Wikipedia's article Disjoint-set data structure contains pseudocode for a more efficient approach: basically, you replace the mapping from vertex to set with a mapping from vertex to another vertex, building a tree. There are various optimisations which are simple to implement and harder to analyse, but which give an essentially linear-time algorithm.
NB if you can arrange for the vertices to be numbers from 0 to n-1, then you can replace the `dict` with a list, which should be slightly faster.
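For concreteness, here is a minimal sketch of that disjoint-set approach (union by size plus path compression), assuming a NetworkX-style graph with `G.nodes` and `G.neighbors`; the names `sp_uf`, `parent`, `size` and `find` are illustrative, not from the question:

```python
import random

def sp_uf(G):
    # Each vertex starts as the root of its own one-element tree.
    parent = {v: v for v in G.nodes}
    size = {v: 1 for v in G.nodes}

    def find(v):
        root = v
        while parent[root] != root:
            root = parent[root]
        # Path compression: repoint everything on the path at the root.
        while parent[v] != root:
            parent[v], v = root, parent[v]
        return root

    for v in G.nodes:
        nb = random.choice(list(G.neighbors(v)))
        rv, rn = find(v), find(nb)
        if rv != rn:
            # Union by size: hang the smaller tree under the larger.
            if size[rv] < size[rn]:
                rv, rn = rn, rv
            parent[rn] = rv
            size[rv] += size[rn]

    # Group vertices by root to recover the partition.
    components = {}
    for v in G.nodes:
        components.setdefault(find(v), []).append(v)
    return [frozenset(c) for c in components.values()]
```

With integer vertex labels 0 to n-1, `parent` and `size` could be plain lists instead of dicts, as noted above.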
> Also, I've been told to check if there's a faster implementation than `random.choice` when I'm using really small lists, usually of size 6, give or take, so any insight on that is welcome.
I was expecting the answer to be no, but looking at the implementation of `random.choice` (and, in particular, `_randbelow_with_getrandbits`) I think that you probably can do better. Certainly it's worth profiling against an implementation rather more like Java's `nextInt(int)`.
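As an illustration of that profiling (my sketch, not from the original answer), a micro-benchmark for the size-6 case might compare `random.choice` against an inlined rejection loop:

```python
import random
import timeit

seq = list(range(6))

def via_choice():
    return random.choice(seq)

def via_getrandbits():
    # The same rejection scheme random.choice uses internally for
    # n = 6 (draw 3 bits, retry on 6 or 7), without the method-call
    # and bit_length overhead.
    while True:
        r = random.getrandbits(3)
        if r < 6:
            return seq[r]

for f in (via_choice, via_getrandbits):
    print(f.__name__, timeit.timeit(f, number=10**6))
```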
Basically, if you want a random number from `0` inclusive to `n` exclusive, Python finds the smallest power of two which is at least `n`, takes that many random bits, and tries again if the result is `n` or greater. When `n` is one more than a power of two it takes two tries on average. Java, on the other hand, always takes 31 random bits and calculates \2ドル^{31} \bmod n\$ to figure out the threshold at which to discard. So for `n=5` it does one expensive operation (`%`) but has an extremely low probability of needing to try again.

Which approach is fastest probably depends on your CPU. Given that you know the statistics of your list sizes, you could also consider taking the Java approach but pre-calculating a lookup table to do the `%` operations just once at start-up.
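A sketch of that lookup-table idea (the class name, `BITS` constant, and `max_size` parameter are all illustrative assumptions, not code from this answer):

```python
import random

class SmallChoice:
    # Java-style rejection sampling, with the per-size thresholds
    # precomputed once at start-up.
    BITS = 31

    def __init__(self, max_size=16):
        # threshold[n] is the largest multiple of n that fits in
        # 2**BITS; draws at or above it are rejected, keeping the
        # result unbiased. n = 1 always accepts.
        self.threshold = [0, 1 << self.BITS] + [
            (1 << self.BITS) - ((1 << self.BITS) % n)
            for n in range(2, max_size + 1)
        ]

    def choice(self, seq):
        n = len(seq)
        while True:
            r = random.getrandbits(self.BITS)
            if r < self.threshold[n]:
                return seq[r % n]
```

In the hot loop this would be used as `pick = SmallChoice().choice` and then `nb = pick(list(G.neighbors(v)))`.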
Comments:

- `type(G.neighbors)` is `dict_keyiterator`, which is why I make it a list. But otherwise your smaller comments definitely helped. Currently I'm struggling to implement union-find faster; I think it's because I need to keep track of the names of my nodes and I'm handling that wrong. Once I've finally mustered up my best efforts on doing UF and the randomness, if I'm still having trouble, is it recommended I simply make an edit or a new post? – Zach Hunter, Apr 24, 2019
- Don't edit the question to update the code. Making a new post is one of your options. – Peter Taylor, Apr 24, 2019
- The reason Python does random choice that way is that, unlike Java's implementation, it is unbiased. For small numbers it doesn't matter much, but if for example you wanted to choose a random number between 1 and 1.5*2^30, Java's approach would do a very bad job of evenly distributing the results. – Oscar Smith, Apr 25, 2019
- @OscarSmith, no. Java's approach is also unbiased; it uses a different technique to Python to achieve that, which is arguably better for weak PRNGs because it uses all of the bits output by the PRNG. At a guess, the reason for Python's approach is that it doesn't require special-casing for big integers. – Peter Taylor, Apr 25, 2019
- I guess I don't understand what you mean by "calculates \2ドル^{31} \bmod n\$ to figure out the threshold at which to discard". Can you clarify what that means? – Oscar Smith, Apr 25, 2019