2
\$\begingroup\$

Given a huge graph, I want to traverse it, in the fastest way possible, and extract a subgraph with the following properties:

  • the starting node is specified

  • the maximum number of elements and max depth can be specified

  • if there are more links than the number specified, I want to save the information about the original number of links

To save all information I am using a class:

 class Link:
 def __init__(self, node, sons = None ,weight = None, depth = None):
 # name of the node
 self.node = node
 # number of links in the original graph
 self.weight = weight if weight is not None else 0
 # list of links in the reduced graph
 self.sons = sons if sons is not None else []
 # depth of the node with respect to the starting node
 self.depth = depth if depth is not None else 0

Note that the original graph is a dictionary, which can have loops.

 def reduce_graph(graph, start, max_depth, max_links = None):
 if start not in graph:
 return None
 to_visit = [Link(start, depth = 0)]
 visited = {}
 while to_visit:
 page = to_visit.pop(0)
 if page.depth >= max_depth:
 # we don't need to traverse this node, just save the number of links
 # in the original graph, if any
 try:
 page.weight = len(graph[page.node])
 except KeyError:
 page.weight = 0
 # let's make sure we will not visit this node again
 visited[page.node] = page
 continue
 else:
 # we are visiting a new node
 try:
 # str(l) to ensure compatibility with imported json files
 # saving at most max_links links
 links = [Link(str(l), depth = page.depth + 1) for l in graph[page.node][:max_links]]
 # number of links in the original graph
 page.weight = len(graph[page.node])
 page.sons = links
 visited[page.node] = page
 to_visit.extend([l for l in links if l.node not in visited])
 except KeyError:
 # this node in not present in the original graph
 page.weight = 0
 visited[page.node] = page
 return visited
JaDogg
4,5513 gold badges29 silver badges65 bronze badges
asked Nov 20, 2014 at 13:51
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$
  • You may be visiting nodes more than once, because you only check if l.node not in visited when adding nodes to to_visit. The node may be in to_visit already. The correct place for the check is right after to_visit.pop(0).
  • to_visit should perhaps be a collections.deque to benefit from the efficient popleft() function
  • The comment says you use str(l) to ensure compatibility somehow. I'm not sure what you mean. I'm concerned because str(l) becomes page.node which is used to index graph again, which may not work depending on the original type of l.
answered Nov 21, 2014 at 13:55
\$\endgroup\$
2
  • \$\begingroup\$ Thanks for the useful comments; performance-wise, do you think it makes sense to keep if l.node not in visited even after adding the check after to_visit.pop(0)? Regarding the last point: I have a JSON file of the form '1' : [2,3,4,5]. I am converting each element of the list so that I can use them as keys of the JSON itself \$\endgroup\$ Commented Nov 24, 2014 at 13:14
  • \$\begingroup\$ @meto Keeping the other check means checking twice for every new node, which sounds bad, but if you really want to know which is fastest you need to take some timings with real data. \$\endgroup\$ Commented Nov 24, 2014 at 14:13

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.