My attempt at Dijkstra's Algorithm in Python 3

Question 1

I am practicing my coding at HackerRank. The exact problem statement can be found here, but to summarize, it is finding the distances of the shortest path from a starting node to every other node in a non-directional graph where every edge is weighted 6. If a node is unreachable, the distance is -1. The nodes are named 1 to n. If you want an example, please see the provided one in the link.

To solve this, I googled an explanation of Dijkstra's Algorithm and tried my best to implement it (I am new to graph problems). My program passes 6/7 of the test cases, but it times out on the 7th. I improved some small inefficiencies, but it wasn't enough. Please advise me on how I may make the following faster:

def shortest (start, nodes, adj):
 unique = [] #this part is pretty standard for this Algorithm I think
 distance = [float('inf')]*nodes
 for i in range(1, nodes+1):
 if (len(adj[i-1])>0):
 unique.append(i)
 distance [start-1] = 0
 while (len(unique)>0):
 minDist = float('inf')
 for i in unique:
 if (distance[i-1] <= minDist):
 minDist = distance[i-1]
 distIndex = i
 current = distIndex #sets the current node as the closest unvisited node
 unique.remove(current)
 for i in adj[current-1]: #updates the distance of each neighbour of the current node
 neigh = i
 temp = 6 + distance[current-1] #edges are 6
 if temp < distance[neigh-1]:
 distance[neigh-1] = temp
 return distance 
q = int(input()) #each test case contains multiple problems, so just consider one iteration
for i in range(q):
 a = [] #for input purposes
 b = [] #for output purposes
 node = 0
 edge = 0
 #start of input
 node, edge = list(map(int, input().strip().split())) #the input begins with the number of nodes and edges in the graph
 for j in range(edge):
 a.append(list(map(int, input().strip().split()))) #reading the edges, which are all inputted next
 starting = int(input()) #reading the starting node
 #end of input
 adj = [[] for x in range(node)] #the adjacency matrix
 for j in a: #making the adjacency matrix from the inputted edges
 adj[j[0]-1].append(j[1]) 
 adj[j[1]-1].append(j[0])
 for j in adj:
 j = list(set(j)) #someone warned that the problematic test case gives the same edge many times, so this is to remove duplicates
 b = shortest(starting, node, adj)
 #formatting the output to what the question required
 b.remove(0)
 for i in range(len(b)):
 if b[i] == float('inf'):
 b[i] = -1
 print (" ".join(str(e) for e in b))

If it helps, I 'purchased' the test case which is timing out.

Question 2

Make sure you copied the code correctly. I get NameError: name 'node' is not defined.

Question 3

1. Sorry, I did it in the provided terminal at HackerRank and it compiled there. It seems that the node, edge line is not valid syntax elsewhere. When I change it to read to a list first and then assign node and edge, the next error is for mixing spaces and tabs (it didn't matter in the HackerRank terminal again, sorry). I have to go right now, but I will try to clean it as soon as I get back. 2. The first link to HackerRank has a Sample Input, Sample Output, and Explanation on the page. The second link is just the input of the problematic test case.

Question 4

Never mind, I'll just be a little late. I needed to declare node and edge first (not what I originally thought), and I used a text editor to change the indentation to be consistent with spaces. Please try again.

Question 5

Now it breaks because adj is length 0 on the first iteration.

Question 6

You probably just needed to swap the adj = and the node, edge = lines.

Question 7

Those comments are very difficult to read.

j = list(set(j)) #someone warned that the problematic test case gives the same edge many times, so this is to remove duplicates

Format this properly:

# The problematic test case gives the same edge many times,
# so this is to remove duplicates
j = list(set(j))

Then remove fluff:

# Remove duplicates. Helps some test cases.
j = list(set(j))

In this case just remove the first comment, because list(set(...)) is already known to mean deduplicate. And remove the second since it's implied by the fact you've written the code.

j = list(set(j))

Another pair is

a = [] # for input purposes
b = [] # for output purposes

Don't do this. Just call them

descriptive_name_1 = []
descriptive_name_2 = []

so you don't need the comment in the first place.

Then there's q = .... So I wrack my brain thinking about what word contains q. But it turns out you just meant to write num_test_cases. What does that have to do with the letter q?

Worse than that is when you outright lie. What would you expect the variable node to contain? A node, right? Nope - it contains the total number of nodes.

And you write

node = 0

so we expect the total number of nodes to be 0 at that point, right? Nope, you immediately change your mind and write

node, edge = list(map(int, input().strip().split()))

But OK, fine. At least we know what nodes contains, then: multiple nodes, so a list (or other collection) of lengths of nodes.

Nope, it actually means exactly the same thing as node.

A similar comment goes for edge, which should be num_edges.

Then there's adj, which you label "adjacency matrix". Why not just call it adjacency_matrix then? But it's not even a dense matrix like that would imply, it's actually a list of per-node adjacencies, so call it adjacencies.

Then there's starting and start, which are the same thing named differently. Call them start_node or at least be consistent.

Going back to b, you have

b = []

but again this is a lie, because actually

b = shortest(start_node, num_nodes, adjacencies)

So just write that.

distances = shortest(start_node, num_nodes, adjacencies)

For a, initialize it as close to usage as possible and call it edges.

If you write

for j in range(num_edges):

you're misleading me into thinking you're going to use the index. When you're not going to, write

for _ in range(num_edges):

And what's up with that, followed by

for j in edges:

followed by

for j in adjacencies:

Is j just your general "this is a thing" variable? Most people use x, elem and item for that. Instead, write

for _ in range(num_edges):
for edge in edges:
for adjacent in adjacencies:

The middle can be

for start, end in edges:

The comment

#edges are 6

Isn't coherent. Try

# Edges are length 6

Instead, though, don't put this information in shortest - use 1 and multiply out when using the results.

Don't call variables temp. It's a poor name.

You've double-indented the part where you update the neighbours.

Note that

num_nodes, num_edges = list(map(int, input().strip().split()))

is just

num_nodes, num_edges = map(int, input().strip().split())

Why do you do this?

for i in adjacencies[current-1]:
 neigh = i

Just write

for neighbour in adjacencies[current-1]:

And the content can be

for neighbour in adjacencies[current-1]:
 distance[neighbour-1] = min(distance[neighbour-1], distance[current-1] + 1)

Do you not think this whole -1 thing is getting a bit silly? When reading, just decrement the edge values and starting edge correctly:

edges = []
for _ in range(num_edges):
 start, end = map(int, input().strip().split())
 edges.append((start - 1, end - 1))
start_node = int(input()) - 1

while loops and if statements don't need their condition parenthesized, and operators should be spaced properly:

while (len(unique)>0): # before
while len(unique) > 0: # after

Then this is better as just while unique.

Don't put spaces before the parentheses in function calls or indexing.

minDist should be min_dist. You can also start with it set to num_nodes, and the same for distance's initialization.

unique is a horrible name, and has nothing to do with the variable's purpose. Try unvisited instead. It can be initialized as

unvisited = [i for i, adj in enumerate(adjacencies) if adj]

distance, unsurprisingly, does not mean distance but instead distances. Fix that.

This code:

min_dist = num_nodes
for i in unvisited:
 if distances[i] <= min_dist:
 min_dist = distances[i]
 distIndex = i
current = distIndex

is just

min_dist = num_nodes
for i in unvisited:
 if distances[i] <= min_dist:
 min_dist = distances[i]
 current = i

and current should be called closest. It is simplifiable to

closest = min(unvisited, key=lambda x: distances[x])

This is faster if you write

closest = min(unvisited, key=distances.__getitem__)

but the difference isn't vital.

One last thing with shortest - rename it to shortest_distances.

Put the rest of the code in a main function.

If you write

def read_pair(decrement):
 x, _, y = input().partition(" ")
 return int(x) - decrement, int(y) - decrement

then you can initialize num_nodes, num_edges and edges with

num_nodes, num_edges = read_pair(0)
edges = [read_pair(1) for i in range(num_edges)]

Note my use of partition instead of split as IMO it's a better description of the operation here.

Then

for adjacent in adjacencies:
 adjacent = list(set(adjacent))

doesn't actually work! adjacent = only affects the local scope! Instead, you want

adjacencies = [list(set(adj)) for adj in adjacencies]

which is even better as

adjacencies = [set() for _ in range(num_nodes)]
for start, end in edges:
 adjacencies[start].add(end)
 adjacencies[end].add(start)

as you never have the large intermediates then, and there's no real point converting back from a set.

This stuff:

distances.remove(0)
for i in range(len(distances)):
 if distances[i] == num_nodes:
 distances[i] = -1
 else:
 distances[i] *= 6
print(" ".join(str(e) for e in distances))

is just

print(*(-1 if i == num_nodes else i * 6 for i in distances if i))

Nice, that seems to have made it fast enough, but more important the code is more readable.

We can improve speed further by avoiding input, using sys.stdin directly to get buffering.

def main(lines):
 def read_pair(decrement):
 x, _, y = next(lines).partition(" ")
 return int(x) - decrement, int(y) - decrement
 ...
main(sys.stdin)

Note that this change is sufficient on its own to beat the time limit: you can apply it to your original code to pass the test. Note also that you shouldn't mix input with next; at least on Python 2 this throws an error saying

ValueError: Mixing iteration and read methods would lose data

Though this error is gone in Python 3, that doesn't leave me feeling good about it.

I'd finish off with

from __future__ import print_function

to make it Python 2 compatible.

from __future__ import print_function
import sys
def shortest_distances(start_node, num_nodes, adjacencies):
 distances = [num_nodes] * num_nodes
 unvisited = [i for i, adj in enumerate(adjacencies) if adj]
 distances[start_node] = 0
 while unvisited:
 closest = min(unvisited, key=distances.__getitem__)
 unvisited.remove(closest)
 # Update the distances of each neighbour
 for neighbour in adjacencies[closest]:
 distances[neighbour] = min(distances[neighbour], distances[closest] + 1)
 return distances
def main(lines):
 def read_pair(decrement):
 x, _, y = next(lines).partition(" ")
 return int(x) - decrement, int(y) - decrement
 num_test_cases = int(next(lines))
 for i in range(num_test_cases):
 num_nodes, num_edges = read_pair(0)
 edges = [read_pair(1) for i in range(num_edges)]
 start_node = int(next(lines)) - 1
 adjacencies = [set() for _ in range(num_nodes)]
 for start, end in edges:
 adjacencies[start].add(end)
 adjacencies[end].add(start)
 distances = shortest_distances(start_node, num_nodes, adjacencies)
 print(*(-1 if i == num_nodes else i * 6 for i in distances if i))
main(sys.stdin)

Question 8

Wow, thank you very much. This is very thorough and helpful.

Question 9

Veedrac's answer has a lot of good points about the quality of the code, so in this answer I'm just going to discuss the performance.

Let's figure out where the time goes by profiling the program using Python's built-in profiler:

$ python -m cProfile -s tottime cr138989.py < testcase.txt
...
 1492053 function calls in 3.915 seconds
 Ordered by: internal time
 ncalls tottime percall cumtime percall filename:lineno(function)
 1 1.767 1.767 3.915 3.915 cr138989.py:1(<module>)
 246223 1.346 0.000 1.348 0.000 {built-in method builtins.input}
 6 0.343 0.057 0.364 0.061 cr138989.py:1(shortest)
 246216 0.221 0.000 0.221 0.000 {method 'split' of 'str' objects}
 741011 0.106 0.000 0.106 0.000 {method 'append' of 'list' objects}
 246216 0.062 0.000 0.062 0.000 {method 'strip' of 'str' objects}
 6 0.043 0.007 0.043 0.007 {built-in method builtins.print}
 2387 0.020 0.000 0.020 0.000 {method 'remove' of 'list' objects}
 3403 0.002 0.000 0.002 0.000 cr138989.py:53(<genexpr>)
 6 0.001 0.000 0.001 0.000 cr138989.py:39(<listcomp>)
 258 0.001 0.000 0.001 0.000 {built-in method _codecs.ascii_decode}
 6 0.001 0.000 0.002 0.000 {method 'join' of 'str' objects}
 5796 0.001 0.000 0.001 0.000 {built-in method builtins.len}
 258 0.001 0.000 0.002 0.000 ascii.py:25(decode)
 258 0.000 0.000 0.000 0.000 codecs.py:280(getstate)
 1 0.000 0.000 3.915 3.915 {built-in method builtins.exec}
 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

It takes some practice to interpret the results, but basically this is telling you that the total runtime was 3.9 seconds, of which 1.7 seconds was spent in the code at the top level of the module, 1.3 seconds was spent in input, 0.3 in shortest, and a small amount in other functions.

So the obvious performance problem here is input. If you read the documentation, you'll see that this function has several features that you don't use (prompting, interactivity, newline stripping, EOF handling), and all this can easily be avoided by replacing input with sys.stdin.readline:

import sys
input = sys.stdin.readline

Now the profile results look like this:

 ncalls tottime percall cumtime percall filename:lineno(function)
 1 1.500 1.500 2.387 2.387 cr138989.py:1(<module>)
 6 0.357 0.059 0.378 0.063 cr138989.py:1(shortest)
 246216 0.151 0.000 0.151 0.000 {method 'split' of 'str' objects}
 246223 0.140 0.000 0.142 0.000 {method 'readline' of '_io.TextIOWrapper' objects}
 741011 0.103 0.000 0.103 0.000 {method 'append' of 'list' objects}
 246216 0.058 0.000 0.058 0.000 {method 'strip' of 'str' objects}
 6 0.052 0.009 0.052 0.009 {built-in method builtins.print}
 2387 0.020 0.000 0.020 0.000 {method 'remove' of 'list' objects}

This is already fast enough to pass the HackerRank time limit. But you can save another 0.6 seconds or so by making these changes:

Drop the calls to strip: there's no need for them, because split with no arguments has the same effect.
There's no need to construct the list a of edges: it's simpler just to construct the adjacency matrix directly from the input.
There's no need to deduplicate the edges (duplicate edges don't affect the correctness of Dikjstra's algorithm), but if you are going to do it, it's simpler to make adj into a list of sets, and call the add method instead of append so that the deduplication happens as you go along.

Veedrac Veedrac 9,77323 silver badges38 bronze badges · Accepted Answer · 2016-08-18 05:33:46Z

Those comments are very difficult to read.

j = list(set(j)) #someone warned that the problematic test case gives the same edge many times, so this is to remove duplicates

Format this properly:

# The problematic test case gives the same edge many times,
# so this is to remove duplicates
j = list(set(j))

Then remove fluff:

# Remove duplicates. Helps some test cases.
j = list(set(j))

In this case just remove the first comment, because list(set(...)) is already known to mean deduplicate. And remove the second since it's implied by the fact you've written the code.

j = list(set(j))

Another pair is

a = [] # for input purposes
b = [] # for output purposes

Don't do this. Just call them

descriptive_name_1 = []
descriptive_name_2 = []

so you don't need the comment in the first place.

Then there's q = .... So I wrack my brain thinking about what word contains q. But it turns out you just meant to write num_test_cases. What does that have to do with the letter q?

Worse than that is when you outright lie. What would you expect the variable node to contain? A node, right? Nope - it contains the total number of nodes.

And you write

node = 0

so we expect the total number of nodes to be 0 at that point, right? Nope, you immediately change your mind and write

node, edge = list(map(int, input().strip().split()))

But OK, fine. At least we know what nodes contains, then: multiple nodes, so a list (or other collection) of lengths of nodes.

Nope, it actually means exactly the same thing as node.

A similar comment goes for edge, which should be num_edges.

Then there's adj, which you label "adjacency matrix". Why not just call it adjacency_matrix then? But it's not even a dense matrix like that would imply, it's actually a list of per-node adjacencies, so call it adjacencies.

Then there's starting and start, which are the same thing named differently. Call them start_node or at least be consistent.

Going back to b, you have

b = []

but again this is a lie, because actually

b = shortest(start_node, num_nodes, adjacencies)

So just write that.

distances = shortest(start_node, num_nodes, adjacencies)

For a, initialize it as close to usage as possible and call it edges.

If you write

for j in range(num_edges):

you're misleading me into thinking you're going to use the index. When you're not going to, write

for _ in range(num_edges):

And what's up with that, followed by

for j in edges:

followed by

for j in adjacencies:

Is j just your general "this is a thing" variable? Most people use x, elem and item for that. Instead, write

for _ in range(num_edges):
for edge in edges:
for adjacent in adjacencies:

The middle can be

for start, end in edges:

The comment

#edges are 6

Isn't coherent. Try

# Edges are length 6

Instead, though, don't put this information in shortest - use 1 and multiply out when using the results.

Don't call variables temp. It's a poor name.

You've double-indented the part where you update the neighbours.

Note that

num_nodes, num_edges = list(map(int, input().strip().split()))

is just

num_nodes, num_edges = map(int, input().strip().split())

Why do you do this?

for i in adjacencies[current-1]:
 neigh = i

Just write

for neighbour in adjacencies[current-1]:

And the content can be

for neighbour in adjacencies[current-1]:
 distance[neighbour-1] = min(distance[neighbour-1], distance[current-1] + 1)

Do you not think this whole -1 thing is getting a bit silly? When reading, just decrement the edge values and starting edge correctly:

edges = []
for _ in range(num_edges):
 start, end = map(int, input().strip().split())
 edges.append((start - 1, end - 1))
start_node = int(input()) - 1

while loops and if statements don't need their condition parenthesized, and operators should be spaced properly:

while (len(unique)>0): # before
while len(unique) > 0: # after

Then this is better as just while unique.

Don't put spaces before the parentheses in function calls or indexing.

minDist should be min_dist. You can also start with it set to num_nodes, and the same for distance's initialization.

unique is a horrible name, and has nothing to do with the variable's purpose. Try unvisited instead. It can be initialized as

unvisited = [i for i, adj in enumerate(adjacencies) if adj]

distance, unsurprisingly, does not mean distance but instead distances. Fix that.

This code:

min_dist = num_nodes
for i in unvisited:
 if distances[i] <= min_dist:
 min_dist = distances[i]
 distIndex = i
current = distIndex

is just

min_dist = num_nodes
for i in unvisited:
 if distances[i] <= min_dist:
 min_dist = distances[i]
 current = i

and current should be called closest. It is simplifiable to

closest = min(unvisited, key=lambda x: distances[x])

This is faster if you write

closest = min(unvisited, key=distances.__getitem__)

but the difference isn't vital.

One last thing with shortest - rename it to shortest_distances.

Put the rest of the code in a main function.

If you write

def read_pair(decrement):
 x, _, y = input().partition(" ")
 return int(x) - decrement, int(y) - decrement

then you can initialize num_nodes, num_edges and edges with

num_nodes, num_edges = read_pair(0)
edges = [read_pair(1) for i in range(num_edges)]

Note my use of partition instead of split as IMO it's a better description of the operation here.

Then

for adjacent in adjacencies:
 adjacent = list(set(adjacent))

doesn't actually work! adjacent = only affects the local scope! Instead, you want

adjacencies = [list(set(adj)) for adj in adjacencies]

which is even better as

adjacencies = [set() for _ in range(num_nodes)]
for start, end in edges:
 adjacencies[start].add(end)
 adjacencies[end].add(start)

as you never have the large intermediates then, and there's no real point converting back from a set.

This stuff:

distances.remove(0)
for i in range(len(distances)):
 if distances[i] == num_nodes:
 distances[i] = -1
 else:
 distances[i] *= 6
print(" ".join(str(e) for e in distances))

is just

print(*(-1 if i == num_nodes else i * 6 for i in distances if i))

Nice, that seems to have made it fast enough, but more important the code is more readable.

We can improve speed further by avoiding input, using sys.stdin directly to get buffering.

def main(lines):
 def read_pair(decrement):
 x, _, y = next(lines).partition(" ")
 return int(x) - decrement, int(y) - decrement
 ...
main(sys.stdin)

Note that this change is sufficient on its own to beat the time limit: you can apply it to your original code to pass the test. Note also that you shouldn't mix input with next; at least on Python 2 this throws an error saying

ValueError: Mixing iteration and read methods would lose data

Though this error is gone in Python 3, that doesn't leave me feeling good about it.

I'd finish off with

from __future__ import print_function

to make it Python 2 compatible.

from __future__ import print_function
import sys
def shortest_distances(start_node, num_nodes, adjacencies):
 distances = [num_nodes] * num_nodes
 unvisited = [i for i, adj in enumerate(adjacencies) if adj]
 distances[start_node] = 0
 while unvisited:
 closest = min(unvisited, key=distances.__getitem__)
 unvisited.remove(closest)
 # Update the distances of each neighbour
 for neighbour in adjacencies[closest]:
 distances[neighbour] = min(distances[neighbour], distances[closest] + 1)
 return distances
def main(lines):
 def read_pair(decrement):
 x, _, y = next(lines).partition(" ")
 return int(x) - decrement, int(y) - decrement
 num_test_cases = int(next(lines))
 for i in range(num_test_cases):
 num_nodes, num_edges = read_pair(0)
 edges = [read_pair(1) for i in range(num_edges)]
 start_node = int(next(lines)) - 1
 adjacencies = [set() for _ in range(num_nodes)]
 for start, end in edges:
 adjacencies[start].add(end)
 adjacencies[end].add(start)
 distances = shortest_distances(start_node, num_nodes, adjacencies)
 print(*(-1 if i == num_nodes else i * 6 for i in distances if i))
main(sys.stdin)

Wow, thank you very much. This is very thorough and helpful.

Stack Exchange Network

My attempt at Dijkstra's Algorithm in Python 3

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

My attempt at Dijkstra's Algorithm in Python 3

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions