I have implemented Dijkstra's algorithm for my research on an economic model, using Python.
In my research I am investigating two functions and the differences between them. Each function takes the same two parameters: F(a, b) and Z(a, b).
Every cell of the matrix is defined as: $$M[a][b]=|F(a,b)-Z(a,b)|$$
The purpose of this is to find the path of minimal difference between the two functions, one that holds for every input a.
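For illustration, the matrix could be built like this (F and Z below are stand-in placeholders, not my actual model functions):

```python
# Placeholder functions standing in for the real economic model.
def F(a, b):
    return 0.5 * a + b

def Z(a, b):
    return a + 0.25 * b

size = 4
# Each cell holds the absolute difference between the two functions.
M = [[abs(F(a, b) - Z(a, b)) for b in range(size)] for a in range(size)]
print(M[2][3])  # |4.0 - 2.75| = 1.25
```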
Online implementations of Dijkstra's algorithm all use weighted edges, whereas I have weighted vertices.
Pseudo-code:
    function Dijkstra(Graph, source):
        create vertex set Q
        for each vertex v in Graph:
            dist[v] ← INFINITY
            prev[v] ← UNDEFINED
            add v to Q
        dist[source] ← 0

        while Q is not empty:
            u ← vertex in Q with min dist[u]
            remove u from Q
            for each neighbor v of u:   // only v that are still in Q
                alt ← dist[u] + length(u, v)
                if alt < dist[v]:
                    dist[v] ← alt
                    prev[v] ← u

        return dist[], prev[]
Input:
- 2D array where each cell's value is its weight
- source tuple (x, y)
Output:
- distance matrix where each cell contains the distance from the source to vertex (i, j)
- prev matrix where each cell contains its parent. By backtracking from (98, 98) I can find the shortest path.
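For example, the traceback could be done like this (a minimal sketch using a hand-made 2x2 prev matrix, not my real output):

```python
def traceback(prev, source, target):
    # Walk parent pointers back from the target until the source is reached.
    path = [target]
    while path[-1] != source:
        x, y = path[-1]
        path.append(tuple(prev[x][y]))
    path.reverse()
    return path

# Toy prev matrix: (1, 1)'s parent is (0, 1), whose parent is (0, 0).
prev = [[(0, 0), (0, 0)],
        [(0, 0), (0, 1)]]
print(traceback(prev, (0, 0), (1, 1)))  # [(0, 0), (0, 1), (1, 1)]
```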
Implementation:
    import numpy as np

    MAX_DISTANCE = 99999
    RANGE_ARR = [x for x in range(1, 1001)]

    def dijkstra_get_min(Q, dist):
        min = MAX_DISTANCE + 1
        u = None
        for vertex in Q:
            if dist[vertex[0], vertex[1]] <= min:
                min = dist[vertex[0], vertex[1]]
                u = vertex
        return u

    def dijkstra(graph, src=(0, 0)):
        dist = np.array([np.array([0 for x in RANGE_ARR], dtype=float) for y in RANGE_ARR])
        prev = np.array([np.array([(0, 0) for x in RANGE_ARR], dtype='i,i') for y in RANGE_ARR])
        Q = []
        for i in RANGE_ARR_0:
            for j in RANGE_ARR_0:
                dist[i, j] = MAX_DISTANCE
                prev[i, j] = (0, 0)
                Q.append((i, j))
        dist[0][0] = 0
        while Q:
            u = dijkstra_get_min(Q, dist)
            Q.remove(u)
            moves = [x for x in ((u[0], u[1] + 1), (u[0] + 1, u[1]), (u[0] + 1, u[1] + 1)) if x in Q]
            for v in moves:
                alt = dist[u[0]][u[1]] + graph[v[0]][v[1]]
                if alt < dist[v[0]][v[1]]:
                    dist[v[0], v[1]] = alt
                    prev[v[0], v[1]] = u
        return dist, prev
Any opinions about its correctness?
2 Answers
Your code looks generally correct, but it ignores src and only searches in the positive direction. In addition, it can be cleaned up and optimised significantly.
Some general comments first:
- Use full variable names that express meaning/purpose. There is no significant cost to meaningful names, but they can make code much easier to digest.
- Be aware of the host language's features and standards. Avoid re-using the names of builtins (e.g. min) and try to adhere to coding style standards.
- Avoid numpy unless actually using its inbuilt features. Using numpy.array for direct element access is usually slower than list/set/... because values are converted to full Python objects on each access.
Do not make assumptions about the features of your data. Specifically, avoid these:

    MAX_DISTANCE = 99999
    RANGE_ARR = [x for x in range(1, 1001)]

These fail for graphs with distances greater than 99999 or more than 1000 elements per side. Either compute the bounds from your input, or use true upper bounds.
Since floating-point numbers have a well-defined "infinity", we can use it safely:

    INFINITY = float('inf')
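As a quick sanity check that the float infinity behaves as a safe initial distance:

```python
INFINITY = float('inf')

# Infinity compares greater than any finite number, so no real weight
# can accidentally exceed the "unreached" marker.
assert 99999 < INFINITY
assert 10 ** 100 < INFINITY
assert min(INFINITY, 3.5) == 3.5
```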
Since the input graph is an nxn matrix, we can just query its size:

    # inside `def dijkstra(graph, source):`
    indices = range(len(graph))
Let us start with vertex in Q with min dist[u] / dijkstra_get_min. Your algorithm is correct, but we can exploit that Python's builtin min already accepts a custom key. The for vertex in Q: becomes the primary argument to min, and the if dist[vertex[0], vertex[1]] <= min: comparison becomes the key.

    def dijkstra_get_min(vertices, distances):
        return min(vertices, key=lambda vertex: distances[vertex[0]][vertex[1]])
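A small standalone demonstration of min with a key (the table and candidate vertices below are made up):

```python
# Pick the candidate vertex with the smallest entry in a distance table.
distances = [[7.0, 2.0],
             [5.0, 9.0]]
vertices = [(0, 0), (0, 1), (1, 0)]
closest = min(vertices, key=lambda v: distances[v[0]][v[1]])
print(closest)  # (0, 1), since distances[0][1] == 2.0 is the smallest
```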
The Dijkstra algorithm consists of two parts – initialisation and search. Your code becomes clearer if we split these two parts – your line dist[0][0] = 0 is the transition from one to the other.

    def dijkstra(graph, src=(0, 0)):
        # dist, prev, Q
        distances, prev_nodes, unvisited = dijkstra_initial(len(graph))
        # set starting point
        distances[src[0]][src[1]] = 0
        dijkstra_search(graph, distances, prev_nodes, unvisited)
        return distances, prev_nodes
The purpose of initialisation is that every point starts with the same value. This means we can directly create the matrices with their final value. Also, since the algorithm does not use the "previous node" at this stage, we can initialise it to a cheap placeholder.

    def dijkstra_initial(size):
        distances = [[INFINITY] * size for _ in range(size)]
        prev_nodes = [[None] * size for _ in range(size)]
        unvisited = {(x, y) for x in range(size) for y in range(size)}
        # dist, prev, Q
        return distances, prev_nodes, unvisited
Instead of tracking unvisited nodes as a list ([..., ...]) we use a set ({..., ...}). A set is unordered and supports O(1) membership tests, compared to a list's O(n) membership tests. This makes it better suited for bookkeeping of visited/unvisited nodes.
To search through the graph, we will be visiting the neighbours repeatedly. This is a key part that can easily be done wrong – unless the Graph implementation provides it, it can be worthwhile to implement it explicitly.

    def neighbours(node):
        x, y = node
        return [
            (x + x_offset, y + y_offset)
            for x_offset in (-1, 0, 1)
            for y_offset in (-1, 0, 1)
            if not (x_offset == y_offset == 0)  # reject node itself
        ]
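A standalone check of this helper (repeating its definition so the snippet runs on its own). Note that out-of-range coordinates such as (-1, -1) are not filtered here, which is fine because the search loop skips anything not in unvisited:

```python
def neighbours(node):
    # The 8 surrounding offsets, excluding the node itself.
    x, y = node
    return [
        (x + dx, y + dy)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if not (dx == dy == 0)
    ]

print(neighbours((1, 1)))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1), (2, 2)]
```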
The core of the algorithm stays logically the same. We adjust some names to be more descriptive (e.g. u -> node, v -> neighbour) and use the prepared neighbours helper instead of the lengthy inline expression.
    def dijkstra_search(graph, distances, prev_nodes, unvisited):
        while unvisited:
            node = dijkstra_get_min(unvisited, distances)
            unvisited.remove(node)
            for neighbour in neighbours(node):
                if neighbour not in unvisited:
                    continue
                alt = distances[node[0]][node[1]] + graph[neighbour[0]][neighbour[1]]
                if alt < distances[neighbour[0]][neighbour[1]]:
                    distances[neighbour[0]][neighbour[1]] = alt
                    prev_nodes[neighbour[0]][neighbour[1]] = node
At this point, the code should be both faster and easier to maintain. The most glaring flaw we still have is the explicit handling of dimensions. Instead of manually accessing each dimension, it would be better if we could directly access points.
# currently
distances[neighbour[0]][neighbour[1]]
# desirable
distances[neighbour]
This can be "fixed" by using dictionaries ({point: value, ...}) instead of nested lists ([[value, ...], ...]). An immediate downside is that this trades memory for simplicity.
However, it can also be used to actually reduce memory usage – dictionaries can be naturally sparse, allowing us to simply not store undetermined fields. Since any visited node becomes irrelevant for distances, we can even clear the distances of nodes that are already processed.
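A minimal sketch of the dictionary-based layout (sizes and values made up):

```python
INFINITY = float('inf')
size = 3

# Dense dict: point tuples index values directly, no nested lookups.
distances = {(x, y): INFINITY for x in range(size) for y in range(size)}
distances[(0, 0)] = 0
assert distances[(0, 0)] == 0
assert distances[(2, 2)] == INFINITY

# Sparse variant: only store determined distances; default on read.
sparse = {(0, 0): 0}
assert sparse.get((2, 2), INFINITY) == INFINITY
```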
- \$\begingroup\$ Thank you so much for your detailed comment! Regarding the usage of numpy, its only purpose is that it provides easy debugging in PyCharm. But I guess that after debugging I should have reverted to normal lists. I have also changed my global input variables as you suggested, which allows easier flexibility for different inputs of data. The usage of the min function is good to know, as I implemented another version of this in another section of my analysis and this saves some hassle (took notes for the future). \$\endgroup\$ – O.B., Sep 7, 2020 at 17:00
- \$\begingroup\$ I will implement the rest of the changes you suggest, as I am getting very long run times that could use some optimisation. Thank you so much! \$\endgroup\$ – O.B., Sep 7, 2020 at 17:08
[x for x in range(1, 1001)] can be written as just list(range(1, 1001)). It would be good to give that 1001 a name too.
Similarly, [0 for x in RANGE_ARR] can be written as [0] * len(RANGE_ARR). Multiplying any sequence type by an integer repeats the elements within the sequence. As a bonus, from some quick benchmarking that I just did, it's also more than 10x faster:
from timeit import timeit
N = int(1e6)
TRIALS = int(1e3)
print(timeit(lambda: [0] * N, number=TRIALS), "seconds")
print(timeit(lambda: [0 for x in range(N)], number=TRIALS), "seconds")
2.9889957 seconds
38.1463017 seconds
Be aware though that you should not use this when the element type is mutable (like [[0]] * 5). Multiplying a sequence creates multiple references to the same object; it doesn't make copies.
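The pitfall is easy to demonstrate:

```python
# Multiplying a list of lists copies references: all rows are the SAME list.
rows = [[0]] * 3
rows[0].append(1)
print(rows)  # [[0, 1], [0, 1], [0, 1]]

# Safe alternative: build a fresh inner list per row.
rows = [[0] for _ in range(3)]
rows[0].append(1)
print(rows)  # [[0, 1], [0], [0]]
```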
It looks like Q should be a set. You don't care about order, and the only thing you use it for is to track membership. Sets will be significantly faster here. The only two changes needed are:

    Q = set()
    . . .
    Q.add((i, j))

The only behavioural change I can see is that dijkstra_get_min technically relies on the order of Q: if two elements share the minimum value, your algorithm picks the last instance. Since sets may iterate in a different order, this may change which vertex gets returned.
It looks like MAX_DISTANCE is meant to be some arbitrarily large number that everything else will be less than. You may want to try using np.inf for that. By hardcoding the upper limit, you risk the problem "growing" later and potentially exceeding that max, causing erroneous behavior.
- \$\begingroup\$ Thanks for your input! I guess I should have made Q a set and not a list, but I have already started execution of my run and expect it to take about 3 days (I only save data to file when the execution is complete, not every iteration), so I am not sure if it's worth restarting for this change. Hopefully the algorithm is also correct for that matter. Regarding [0] * size, I didn't know there is such a big difference in run time. However it looks dangerous for lists in lists. \$\endgroup\$ – O.B., Sep 6, 2020 at 23:41
- \$\begingroup\$ @O.B. Honestly, if I were you, I'd make the changes, then see if you can run another instance of the program (I have no idea what environment you're running this in, so I don't know if that's feasible). The changes I suggested have the potential for massive speed increases. The * suggestion likely won't matter unless the dijkstra function is being called repeatedly with large inputs. The set change will likely be where the majority of the speedup happens. remove on a list is an expensive operation: it's O(n) on lists, but effectively O(1) with sets. How big of Qs are you working with? \$\endgroup\$ – Carcigenicate, Sep 6, 2020 at 23:46
- \$\begingroup\$ If restarting is the only option though, I wouldn't say to restart, since I obviously have little background about the size of data you're working with and the external constraints you're under. \$\endgroup\$ – Carcigenicate, Sep 6, 2020 at 23:48
- \$\begingroup\$ Actually, remove on lists might be worse than I originally thought. It needs to search through the list to find the element, and then after removing it, it needs to shift all the elements after the removed element to the left to fill in the space (lists don't support efficient random removals). That means even if the first element in the list is the one to be removed, every element after it will need to be shifted. So every element in the list will need to be "visited" regardless of which element is removed. \$\endgroup\$ – Carcigenicate, Sep 6, 2020 at 23:57
- \$\begingroup\$ I see your point. I have restarted it, as it was running for only a few hours... Hopefully it will be faster. My matrix is 1,000x1,000 cells at this stage. \$\endgroup\$ – O.B., Sep 7, 2020 at 00:03
- \$\begingroup\$ RANGE_ARR_0? \$\endgroup\$