I wrote an algorithm that builds 1000 different random graphs, given the number of nodes and each node's degree. The node degrees are preserved across the random graphs. I am building these random graphs as input to a statistical test that asks: given a certain graph, what are the chances of finding certain edges by chance? My current algorithm is quite slow. Pseudocode below:
Data structures input to the algorithm:
- node_id : node_degree (dictionary), e.g. node1: 122, node2: 33, node3: 66
- list of edges (list), e.g. edge1, edge2, edge3, ..., edgen
- For edge1 in ListOfEdges:
    - Pick 2 random nodes, node1 (start node) and node10 (end node), to assign edge1 to
    - If node1 has not reached its node_degree:
        - add to dict: random_graph1[node1] : edge1
    - If node10 has not reached its node_degree:
        - add to dict: random_graph1[node10] : edge1
    - If edge1 has been assigned to both nodes, move on to the next edge
    - If not (one or both nodes were already full), discard the partial assignment and pick 2 new random nodes for edge1
While running this simple pseudocode I encounter conditions which I resolve as follows:
Sometimes only 1 node is left at the end while picking 2 random nodes, and this single node still has unassigned node degree. But I do not want self-loops in my graph, so I do the following:
- If only 1 node is left (node_z):
    - Look at the current full graph and select 2 nodes at random (node_x, node_y)
    - If these nodes have a connected edge between them:
        - break this edge (between node_x and node_y), then connect node_x to node_z and node_y to node_z
    - If they do not, select another random pair (node_x, node_y) and check again
    - Keep breaking edges at random like this until all nodes have reached their node-degree capacity.
Also, every time a node is connected, I check the original dictionary to see whether its node degree has been reached. At the moment, to generate one random graph given 11,000 nodes and 136 million edges, my code takes 5 hours, and I need a significant speed-up because I need to build 1000 such graphs. The code is also single-processor; I could not figure out how to multiprocess it, as many shared data structures would need to be synchronised between processes.
Here is just the part of my code that checks the initial conditions:
success_dict = defaultdict(list)
popped_dict = defaultdict(dict)
grid_dens_dict = defaultdict(set)
count = 0
all_itr_list = []
for iteration in range(1000):
    all_stop = False
    all_stop_2 = False
    count = count +1
    success_dict = defaultdict(list)
    popped_dict = defaultdict(dict)
    grid_dens_dict = defaultdict(set)
    with open(filename2,'rb') as handle:
        pfxlinks = pickle.load(handle,encoding ='latin-1')
    with open(filename3,'rb') as handle:
        grid_ids_withlinks = pickle.load(handle,encoding ='latin-1')
    for pfxpair in pfxlinks:
        start_put = False
        end_put = False
        if all_stop_2 == True:
            break
        while True:
            if all_stop == True:
                all_stop_2 = True
                break
            try:
                grid_start_id, grid_end_id = (np.random.choice(list(grid_ids_withlinks.keys()),size = 2, replace = False)) # get 2 random node ids
                grid_start_id = int(grid_start_id)
                grid_end_id = int(grid_end_id)
                if start_put == False:
                    start_value = grid_dens_dict.get(grid_start_id) # if node id exists in my dict and node degree hasnt reached capacity
                    start_process_condition = (not start_value) or ( (start_value) and (len(grid_dens_dict[grid_start_id]) < grid_ids_withlinks[grid_start_id]) )
                    if start_process_condition:
                        grid_dens_dict[grid_start_id].add(pfxpair)
                        start_put = True
                        if len(grid_dens_dict[grid_start_id]) == grid_ids_withlinks[grid_start_id]: # node degree for node full, remove from dict
                            try:
                                #print('deleted key: ',grid_start_id, 'with size:',grid_dens_dict[grid_start_id],'Capacity:',grid_ids_withlinks[grid_start_id])
                                popped_dict[grid_start_id] = {'orig_capacity': grid_ids_withlinks[grid_start_id],'size':len(grid_dens_dict[grid_start_id]) }
                                grid_ids_withlinks.pop(grid_start_id)
                            except:
                                print('already popped')
                    else:
                        print('check')
                if end_put == False:
                    end_value = grid_dens_dict.get(grid_end_id)
                    if (not end_value) or (end_value and (len(grid_dens_dict[grid_end_id]) < grid_ids_withlinks[grid_end_id])):
                        grid_dens_dict[grid_end_id].add(pfxpair)
                        end_put = True
                        if len(grid_dens_dict[grid_end_id]) == grid_ids_withlinks[grid_end_id]:
                            try:
                                #print('deleted key: ',grid_end_id, 'with size:',grid_dens_dict[grid_end_id],'Capacity:',grid_ids_withlinks[grid_end_id])
                                popped_dict[grid_end_id] = {'orig_capacity': grid_ids_withlinks[grid_end_id],'size':len(grid_dens_dict[grid_end_id]) }
                                grid_ids_withlinks.pop(grid_end_id)
                            except:
                                print('already popped')
                    else:
                        print('check')
                if (start_put == False and end_put == True): # only end while when both nodes have been assigned a link
                    grid_dens_dict[grid_end_id].discard(pfxpair)
                    end_put = False
                if (start_put == True and end_put == False):
                    grid_dens_dict[grid_start_id].discard(pfxpair)
                    start_put = False
                if start_put == True and end_put == True:
                    success_dict[pfxpair].append((grid_start_id,grid_end_id))
                    break
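# Note: the except clause for the outer try above is not part of this excerpt;
# presumably it handles the case where fewer than two unfilled nodes remain,
# i.e. the edge-rewiring step described earlier.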
1 Answer
"To generate a random graph given 11,000 nodes and 136 million edges, my code takes 5 hours. I need a significant speed up as I need to build 1000 such graphs."
This sounds pretty straightforward, given that you already have a working solution.
Choose a thousand random seeds, and rent a bunch of VMs from a cloud provider. Start crunching through 5-hour jobs. They're all independent of one another. If you have a thousand VMs, you'll be done in about five hours.
"I could not figure out how to multiprocess as many shared data structures would need to be synchronised between processes."
Replicate the original graph a thousand times, and work on those copies. Each task is independent of the other ones. All that matters is that you've given them distinct PRNG seeds.
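On a single machine, the same idea looks roughly like the sketch below. It assumes a hypothetical build_one_graph(seed) wrapper around your existing graph-building code; each worker loads its own copies of the inputs and uses its own seeded PRNG, so nothing needs to be shared or synchronised.

from multiprocessing import Pool

import numpy as np


def build_one_graph(seed):
    # Hypothetical wrapper: load/copy the inputs, build one random graph with a
    # private PRNG, and write the result to disk. No shared state between workers.
    rng = np.random.default_rng(seed)
    ...  # existing graph-building logic, driven by rng


if __name__ == '__main__':
    seeds = range(1000)        # one distinct seed per graph
    with Pool() as pool:       # defaults to one worker per CPU core
        pool.map(build_one_graph, seeds)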
imports
The OP submission omitted this (along with the import pickle and import numpy as np that the code also relies on):
from collections import defaultdict
It's unclear what the first inits accomplish, as they are immediately overwritten.
success_dict = defaultdict(list)
popped_dict = defaultdict(dict)
grid_dens_dict = defaultdict(set)
...
for iteration in range(1000):
...
success_dict = defaultdict(list)
popped_dict = defaultdict(dict)
grid_dens_dict = defaultdict(set)
Consider rewriting your pickle files so that the default encoding will work. ISO 8859-1 offered yeoman's service, but it's a brand new century now.
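A one-off conversion along these lines would do it (the file names here are placeholders, not the OP's):

import pickle

old_path = 'pfxlinks_py2.pickle'   # placeholder for the existing file
new_path = 'pfxlinks_py3.pickle'   # placeholder for the rewritten file

with open(old_path, 'rb') as handle:
    pfxlinks = pickle.load(handle, encoding='latin-1')   # the last time this override is needed
with open(new_path, 'wb') as handle:
    pickle.dump(pfxlinks, handle, protocol=pickle.HIGHEST_PROTOCOL)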
meaningful identifier
all_stop = False
all_stop_2 = False
Pick a better name, please. Or minimally, offer a # comment explaining how the second flag differs from the first.
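For example (the semantics here are guessed from how the flags are used, so confirm them before renaming):

stop_requested = False    # was all_stop: no further placements are possible
abort_edge_loop = False   # was all_stop_2: propagate the stop out of the for-pfxpair loop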
if all_stop_2 == True:
...
if all_stop == True:
It suffices to just test the flag:
if all_stop_2:
...
if all_stop:
Or negate the flag, turning
if end_put == False:
into
if not end_put:
extract helper
This whole thing is up at module top level, and it is entirely Too Long. Learn to use def, so you can write the top-level code at a higher level of abstraction, closer to your business use case. Plus, you can conveniently test each helper.
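A rough shape for that decomposition might look like this; the function names and signatures are illustrative, not taken from the OP's code.

import numpy as np


def load_inputs(edges_path, degrees_path):
    """Read the pickled edge list and the node-id -> degree dict."""
    ...


def build_random_graph(edges, degree_by_node, rng):
    """Return one random graph that preserves the given degrees."""
    ...


def main(edges_path, degrees_path, n_graphs=1000):
    edges, degree_by_node = load_inputs(edges_path, degrees_path)
    for seed in range(n_graphs):
        rng = np.random.default_rng(seed)
        graph = build_random_graph(edges, degree_by_node, rng)
        ...  # save or score `graph`

Each helper is then small enough to exercise with a unit test on a toy graph.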
Also, when testing a boolean condition c, there is no need for if (c):. A simple if c:, without the parentheses, will suffice.
Ask black to tidy up this and other aspects of your source: $ black *.py
pop()
You assign
... = {'...': grid_ids_withlinks[grid_start_id], '...': len(grid_dens_dict[grid_start_id])}
and then ignore the return value from
grid_ids_withlinks.pop(grid_start_id)
It turns out that .pop() conveniently returns the value it popped. You could assign that to a temp variable, which you throw into the dict along with its length.
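Concretely, the suggestion amounts to something like:

size = len(grid_dens_dict[grid_start_id])
capacity = grid_ids_withlinks.pop(grid_start_id)    # .pop() returns the removed value
popped_dict[grid_start_id] = {'orig_capacity': capacity, 'size': size}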
Oh, dear! I see we've copy-n-pasted, so this logic appears twice twice in the code code.
bare except
except:
No. Write except Exception:, or something more specific if you want to give the Gentle Reader a hint about what trouble you've anticipated. As written, a bare except interferes with important functionality such as CTRL-C handling.
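Judging by the 'already popped' message, the anticipated trouble is a missing dict key, so a narrower handler would be:

try:
    capacity = grid_ids_withlinks.pop(grid_start_id)
except KeyError:
    print('already popped')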
representation
Your square adjacency matrix would be ballpark 10% full, it sounds like. So, not all that sparse. Consider switching from dict to an ndarray. Consider asking Rust (or C++ or Zig) to randomly manipulate that array. Asking Numba to randomly move edges around might even be a win.
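As a sketch of that representation (assuming node ids are first remapped to 0..n-1, an extra step the OP would have to add):

import numpy as np

n = 11_000
adj = np.zeros((n, n), dtype=np.uint8)   # dense adjacency matrix, roughly 121 MB
i, j = 4, 7                              # example remapped node indices
adj[i, j] = adj[j, i] = 1                # add an undirected edge
degrees = adj.sum(axis=1)                # per-node degree, for capacity checks

Random rewiring then becomes index arithmetic on a flat array, which is exactly the kind of tight loop Numba or a Rust/C++ extension handles well.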
Comment: what would type(edge1) give? And why do you even need to pass in edges at all? You know how many edges must exist from your number of nodes and their degrees. Can't you simplify your inputs by removing the need to specify edges?