Extracting a subgraph

Question 1

Given a huge graph, I want to traverse it, in the fastest way possible, and extract a subgraph with the following properties:

the starting node is specified
the maximum number of elements and max depth can be specified
if there are more links than the number specified, I want to save the information about the original number of links

To save all information I am using a class:

 class Link:
 def __init__(self, node, sons = None ,weight = None, depth = None):
 # name of the node
 self.node = node
 # number of links in the original graph
 self.weight = weight if weight is not None else 0
 # list of links in the reduced graph
 self.sons = sons if sons is not None else []
 # depth of the node with respect to the starting node
 self.depth = depth if depth is not None else 0

Note that the original graph is a dictionary, which can have loops.

 def reduce_graph(graph, start, max_depth, max_links = None):
 if start not in graph:
 return None
 to_visit = [Link(start, depth = 0)]
 visited = {}
 while to_visit:
 page = to_visit.pop(0)
 if page.depth >= max_depth:
 # we don't need to traverse this node, just save the number of links
 # in the original graph, if any
 try:
 page.weight = len(graph[page.node])
 except KeyError:
 page.weight = 0
 # let's make sure we will not visit this node again
 visited[page.node] = page
 continue
 else:
 # we are visiting a new node
 try:
 # str(l) to ensure compatibility with imported json files
 # saving at most max_links links
 links = [Link(str(l), depth = page.depth + 1) for l in graph[page.node][:max_links]]
 # number of links in the original graph
 page.weight = len(graph[page.node])
 page.sons = links
 visited[page.node] = page
 to_visit.extend([l for l in links if l.node not in visited])
 except KeyError:
 # this node in not present in the original graph
 page.weight = 0
 visited[page.node] = page
 return visited

Question 2

You may be visiting nodes more than once, because you only check if l.node not in visited when adding nodes to to_visit. The node may be in to_visit already. The correct place for the check is right after to_visit.pop(0).
to_visit should perhaps be a collections.deque to benefit from the efficient popleft() function
The comment says you use str(l) to ensure compatibility somehow. I'm not sure what you mean. I'm concerned because str(l) becomes page.node which is used to index graph again, which may not work depending on the original type of l.

Question 3

Thanks for the useful comments; performance-wise, do you think it makes sense to keep if l.node not in visited even after adding the check after to_visit.pop(0)? Regarding the last point: I have a JSON file of the form '1' : [2,3,4,5]. I am converting each element of the list so that I can use them as keys of the JSON itself

Question 4

@meto Keeping the other check means checking twice for every new node, which sounds bad, but if you really want to know which is fastest you need to take some timings with real data.

Janne Karila Janne Karila 10.6k21 silver badges34 bronze badges · Accepted Answer · 2014-11-21 13:55:47Z

You may be visiting nodes more than once, because you only check if l.node not in visited when adding nodes to to_visit. The node may be in to_visit already. The correct place for the check is right after to_visit.pop(0).
to_visit should perhaps be a collections.deque to benefit from the efficient popleft() function
The comment says you use str(l) to ensure compatibility somehow. I'm not sure what you mean. I'm concerned because str(l) becomes page.node which is used to index graph again, which may not work depending on the original type of l.

Thanks for the useful comments; performance-wise, do you think it makes sense to keep if l.node not in visited even after adding the check after to_visit.pop(0)? Regarding the last point: I have a JSON file of the form '1' : [2,3,4,5]. I am converting each element of the list so that I can use them as keys of the JSON itself
@meto Keeping the other check means checking twice for every new node, which sounds bad, but if you really want to know which is fastest you need to take some timings with real data.

Stack Exchange Network

Extracting a subgraph

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Extracting a subgraph

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions