I wrote an algorithm that builds 1000 different random graphs, given the number of nodes and each node's degree. The node degrees are preserved across the random graphs. I am building these random graphs as input to a statistical test that asks: given a certain graph, what are the chances of finding certain edges by chance? My current algorithm is quite slow. Pseudocode below:
Data structures input to the algorithm:
- node_id : node_degree (dictionary), e.g. node1: 122, node2: 33, node3: 66
- list of edges (list), e.g. edge1, edge2, edge3, ..., edgen
- For edge1 in ListOfEdges:
    - Pick 2 random nodes, node1 (start node) and node10 (end node), to assign edge1 to
    - If node1 has not reached its node_degree:
        - add to dict: random_graph1[node1] : edge1
    - If node10 has not reached its node_degree:
        - add to dict: random_graph1[node10] : edge1
    - If edge1 has been assigned to both nodes, move on to the next edge
    - If not (one or both nodes were already full), discard the partial assignment and pick 2 new random nodes for edge1
While running this simple pseudocode I encounter conditions which I resolve as follows:
Sometimes only 1 node is left at the end while picking 2 random nodes, and this single node still has unassigned node degree. But I do not want self-loops in my graph, so I do the following:
- If only 1 node is left (node_z):
    - Look at the current full graph and select 2 nodes at random (node_x, node_y)
    - If these nodes have a connected edge between them:
        - break this edge (between node_x and node_y), then connect node_x to node_z and node_y to node_z
    - If they do not, select another random pair (node_x, node_y) and check again
    - Keep breaking edges at random like this until all nodes have reached their node-degree capacity.
Also, every time a node is connected, I check the original dictionary to see whether its node degree has been reached. At the moment, to generate one random graph given 11,000 nodes and 136 million edges, my code takes 5 hours, and I need a significant speed-up because I need to build 1000 such graphs. The code is also single-processor; I could not figure out how to multiprocess it, as many shared data structures would need to be synchronised between processes.
Here is just the part of my code that checks the initial conditions:
success_dict = defaultdict(list)
popped_dict = defaultdict(dict)
grid_dens_dict = defaultdict(set)
count = 0
all_itr_list = []
for iteration in range(1000):
    all_stop = False
    all_stop_2 = False
    count = count +1
    success_dict = defaultdict(list)
    popped_dict = defaultdict(dict)
    grid_dens_dict = defaultdict(set)
    with open(filename2,'rb') as handle:
        pfxlinks = pickle.load(handle,encoding ='latin-1')
    with open(filename3,'rb') as handle:
        grid_ids_withlinks = pickle.load(handle,encoding ='latin-1')
    for pfxpair in pfxlinks:
        start_put = False
        end_put = False
        if all_stop_2 == True:
            break
        while True:
            if all_stop == True:
                all_stop_2 = True
                break
            try:
                grid_start_id, grid_end_id = (np.random.choice(list(grid_ids_withlinks.keys()),size = 2, replace = False)) # get 2 random node ids
                grid_start_id = int(grid_start_id)
                grid_end_id = int(grid_end_id)
                if start_put == False:
                    start_value = grid_dens_dict.get(grid_start_id) # if node id exists in my dict and node degree hasnt reached capacity
                    start_process_condition = (not start_value) or ( (start_value) and (len(grid_dens_dict[grid_start_id]) < grid_ids_withlinks[grid_start_id]) )
                    if start_process_condition:
                        grid_dens_dict[grid_start_id].add(pfxpair)
                        start_put = True
                        if len(grid_dens_dict[grid_start_id]) == grid_ids_withlinks[grid_start_id]: # node degree for node full, remove from dict
                            try:
                                #print('deleted key: ',grid_start_id, 'with size:',grid_dens_dict[grid_start_id],'Capacity:',grid_ids_withlinks[grid_start_id])
                                popped_dict[grid_start_id] = {'orig_capacity': grid_ids_withlinks[grid_start_id],'size':len(grid_dens_dict[grid_start_id]) }
                                grid_ids_withlinks.pop(grid_start_id)
                            except:
                                print('already popped')
                    else:
                        print('check')
                if end_put == False:
                    end_value = grid_dens_dict.get(grid_end_id)
                    if (not end_value) or (end_value and (len(grid_dens_dict[grid_end_id]) < grid_ids_withlinks[grid_end_id])):
                        grid_dens_dict[grid_end_id].add(pfxpair)
                        end_put = True
                        if len(grid_dens_dict[grid_end_id]) == grid_ids_withlinks[grid_end_id]:
                            try:
                                #print('deleted key: ',grid_end_id, 'with size:',grid_dens_dict[grid_end_id],'Capacity:',grid_ids_withlinks[grid_end_id])
                                popped_dict[grid_end_id] = {'orig_capacity': grid_ids_withlinks[grid_end_id],'size':len(grid_dens_dict[grid_end_id]) }
                                grid_ids_withlinks.pop(grid_end_id)
                            except:
                                print('already popped')
                    else:
                        print('check')
                if (start_put == False and end_put == True): # only end while when both nodes have been assigned a link
                    grid_dens_dict[grid_end_id].discard(pfxpair)
                    end_put = False
                if (start_put == True and end_put == False):
                    grid_dens_dict[grid_start_id].discard(pfxpair)
                    start_put = False
                if start_put == True and end_put == True:
                    success_dict[pfxpair].append((grid_start_id,grid_end_id))
                    break
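# Note: the except clause for the outer try above is not part of this excerpt;
# presumably it handles the case where fewer than two unfilled nodes remain,
# i.e. the edge-rewiring step described earlier.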
1 Answer
"To generate a random graph given 11,000 nodes and 136 million edges, my code takes 5 hours. I need a significant speed up as I need to build 1000 such graphs."
This sounds pretty straightforward, given that you already have a working solution.
Choose a thousand random seeds, and rent a bunch of VMs from a cloud provider. Start crunching through 5-hour jobs. They're all independent of one another. If you have a thousand VMs, you'll be done in about five hours.
"I could not figure out how to multiprocess as many shared data structures would need to be synchronised between processes."
Replicate the original graph a thousand times, and work on those copies. Each task is independent of the other ones. All that matters is that you've given them distinct PRNG seeds.
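On a single machine, the same idea looks roughly like the sketch below. It assumes a hypothetical build_one_graph(seed) wrapper around your existing graph-building code; each worker loads its own copies of the inputs and uses its own seeded PRNG, so nothing needs to be shared or synchronised.

from multiprocessing import Pool

import numpy as np


def build_one_graph(seed):
    # Hypothetical wrapper: load/copy the inputs, build one random graph with a
    # private PRNG, and write the result to disk. No shared state between workers.
    rng = np.random.default_rng(seed)
    ...  # existing graph-building logic, driven by rng


if __name__ == '__main__':
    seeds = range(1000)        # one distinct seed per graph
    with Pool() as pool:       # defaults to one worker per CPU core
        pool.map(build_one_graph, seeds)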
imports
The OP submission omitted this (along with the import pickle and import numpy as np that the code also relies on):
from collections import defaultdict
It's unclear what the first inits accomplish, as they are immediately overwritten.
success_dict = defaultdict(list)
popped_dict = defaultdict(dict)
grid_dens_dict = defaultdict(set)
...
for iteration in range(1000):
...
success_dict = defaultdict(list)
popped_dict = defaultdict(dict)
grid_dens_dict = defaultdict(set)
Consider rewriting your pickle files so that the default encoding will work. ISO 8859-1 offered yeoman's service, but it's a brand new century now.
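A one-off conversion along these lines would do it (the file names here are placeholders, not the OP's):

import pickle

old_path = 'pfxlinks_py2.pickle'   # placeholder for the existing file
new_path = 'pfxlinks_py3.pickle'   # placeholder for the rewritten file

with open(old_path, 'rb') as handle:
    pfxlinks = pickle.load(handle, encoding='latin-1')   # the last time this override is needed
with open(new_path, 'wb') as handle:
    pickle.dump(pfxlinks, handle, protocol=pickle.HIGHEST_PROTOCOL)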
meaningful identifier
all_stop = False
all_stop_2 = False
Pick a better name, please. Or minimally, offer a # comment explaining how the second flag differs from the first.
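For example (the semantics here are guessed from how the flags are used, so confirm them before renaming):

stop_requested = False    # was all_stop: no further placements are possible
abort_edge_loop = False   # was all_stop_2: propagate the stop out of the for-pfxpair loop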
if all_stop_2 == True:
...
if all_stop == True:
It suffices to just test the flag:
if all_stop_2:
...
if all_stop:
Or negate the flag, turning
if end_put == False:
into
if not end_put:
extract helper
This whole thing is up at module top level, and it is entirely Too Long. Learn to use def, so you can write the top-level code at a higher level of abstraction, closer to your business use case. Plus, you can conveniently test each helper.
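A rough shape for that decomposition might look like this; the function names and signatures are illustrative, not taken from the OP's code.

import numpy as np


def load_inputs(edges_path, degrees_path):
    """Read the pickled edge list and the node-id -> degree dict."""
    ...


def build_random_graph(edges, degree_by_node, rng):
    """Return one random graph that preserves the given degrees."""
    ...


def main(edges_path, degrees_path, n_graphs=1000):
    edges, degree_by_node = load_inputs(edges_path, degrees_path)
    for seed in range(n_graphs):
        rng = np.random.default_rng(seed)
        graph = build_random_graph(edges, degree_by_node, rng)
        ...  # save or score `graph`

Each helper is then small enough to exercise with a unit test on a toy graph.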
Also, when testing a boolean condition c, there is no need for if (c):. A simple if c:, without the parentheses, will suffice.
Ask black to tidy up this and other aspects of your source: $ black *.py
pop()
You assign
... = {'...': grid_ids_withlinks[grid_start_id], '...': len(grid_dens_dict[grid_start_id])}
and then ignore the return value from
grid_ids_withlinks.pop(grid_start_id)
It turns out that .pop() conveniently returns the value it popped. You could assign that to a temp variable, which you throw into the dict along with its length.
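Concretely, the suggestion amounts to something like:

size = len(grid_dens_dict[grid_start_id])
capacity = grid_ids_withlinks.pop(grid_start_id)    # .pop() returns the removed value
popped_dict[grid_start_id] = {'orig_capacity': capacity, 'size': size}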
Oh, dear! I see we've copy-n-pasted, so this logic appears twice twice in the code code.
bare except
except:
No. Write except Exception:, or something more specific if you want to give the Gentle Reader a hint about what trouble you've anticipated. As written, a bare except interferes with important functionality such as CTRL-C handling.
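Judging by the 'already popped' message, the anticipated trouble is a missing dict key, so a narrower handler would be:

try:
    capacity = grid_ids_withlinks.pop(grid_start_id)
except KeyError:
    print('already popped')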
representation
Your square adjacency matrix would be ballpark 10% full, it sounds like. So, not all that sparse. Consider switching from dict to an ndarray. Consider asking Rust (or C++ or Zig) to randomly manipulate that array. Asking Numba to randomly move edges around might even be a win.
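As a sketch of that representation (assuming node ids are first remapped to 0..n-1, an extra step the OP would have to add):

import numpy as np

n = 11_000
adj = np.zeros((n, n), dtype=np.uint8)   # dense adjacency matrix, roughly 121 MB
i, j = 4, 7                              # example remapped node indices
adj[i, j] = adj[j, i] = 1                # add an undirected edge
degrees = adj.sum(axis=1)                # per-node degree, for capacity checks

Random rewiring then becomes index arithmetic on a flat array, which is exactly the kind of tight loop Numba or a Rust/C++ extension handles well.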
Comment: what would type(edge1) give? And why do you even need to pass in edges at all? You know how many edges must exist from your number of nodes and their degrees. Can't you simplify your inputs by removing the need to specify edges?