Given a huge graph, I want to traverse it, in the fastest way possible, and extract a subgraph with the following properties:
the starting node is specified
the maximum number of elements and max depth can be specified
if there are more links than the number specified, I want to save the information about the original number of links
To save all information I am using a class:
class Link:
def __init__(self, node, sons = None ,weight = None, depth = None):
# name of the node
self.node = node
# number of links in the original graph
self.weight = weight if weight is not None else 0
# list of links in the reduced graph
self.sons = sons if sons is not None else []
# depth of the node with respect to the starting node
self.depth = depth if depth is not None else 0
Note that the original graph is a dictionary, which can have loops.
def reduce_graph(graph, start, max_depth, max_links = None):
if start not in graph:
return None
to_visit = [Link(start, depth = 0)]
visited = {}
while to_visit:
page = to_visit.pop(0)
if page.depth >= max_depth:
# we don't need to traverse this node, just save the number of links
# in the original graph, if any
try:
page.weight = len(graph[page.node])
except KeyError:
page.weight = 0
# let's make sure we will not visit this node again
visited[page.node] = page
continue
else:
# we are visiting a new node
try:
# str(l) to ensure compatibility with imported json files
# saving at most max_links links
links = [Link(str(l), depth = page.depth + 1) for l in graph[page.node][:max_links]]
# number of links in the original graph
page.weight = len(graph[page.node])
page.sons = links
visited[page.node] = page
to_visit.extend([l for l in links if l.node not in visited])
except KeyError:
# this node in not present in the original graph
page.weight = 0
visited[page.node] = page
return visited
1 Answer 1
- You may be visiting nodes more than once, because you only check
if l.node not in visited
when adding nodes toto_visit
. The node may be into_visit
already. The correct place for the check is right afterto_visit.pop(0)
. to_visit
should perhaps be acollections.deque
to benefit from the efficientpopleft()
function- The comment says you use
str(l)
to ensure compatibility somehow. I'm not sure what you mean. I'm concerned becausestr(l)
becomespage.node
which is used to indexgraph
again, which may not work depending on the original type ofl
.
-
\$\begingroup\$ Thanks for the useful comments; performance-wise, do you think it makes sense to keep
if l.node not in visited
even after adding the check afterto_visit.pop(0)
? Regarding the last point: I have a JSON file of the form'1' : [2,3,4,5]
. I am converting each element of the list so that I can use them as keys of the JSON itself \$\endgroup\$meto– meto2014年11月24日 13:14:54 +00:00Commented Nov 24, 2014 at 13:14 -
\$\begingroup\$ @meto Keeping the other check means checking twice for every new node, which sounds bad, but if you really want to know which is fastest you need to take some timings with real data. \$\endgroup\$Janne Karila– Janne Karila2014年11月24日 14:13:30 +00:00Commented Nov 24, 2014 at 14:13
Explore related questions
See similar questions with these tags.