I have implemented Dijkstra's algorithm for my research on an economic model, using Python.
In my research I am investigating two functions and the differences between them. Each function takes the same two parameters: F(a, b) and Z(a, b).
Every cell of the matrix is defined as: $$M[a][b]=|F(a,b)-Z(a,b)|$$
The purpose of this is to find the path of minimal difference between the two functions, one that holds for every input a.
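For illustration, the matrix could be built like this (F and Z below are stand-in placeholders, not my actual model functions):

```python
# Placeholder functions standing in for the real economic model.
def F(a, b):
    return 0.5 * a + b

def Z(a, b):
    return a + 0.25 * b

size = 4
# Each cell holds the absolute difference between the two functions.
M = [[abs(F(a, b) - Z(a, b)) for b in range(size)] for a in range(size)]
print(M[2][3])  # |4.0 - 2.75| = 1.25
```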
Online implementations of Dijkstra's algorithm all use weighted edges, whereas I have weighted vertices.
Pseudo-code:
    function Dijkstra(Graph, source):
        create vertex set Q
        for each vertex v in Graph:
            dist[v] ← INFINITY
            prev[v] ← UNDEFINED
            add v to Q
        dist[source] ← 0

        while Q is not empty:
            u ← vertex in Q with min dist[u]
            remove u from Q
            for each neighbor v of u:   // only v that are still in Q
                alt ← dist[u] + length(u, v)
                if alt < dist[v]:
                    dist[v] ← alt
                    prev[v] ← u

        return dist[], prev[]
Input:
- 2D array where each cell's value is its weight
- source tuple (x, y)
Output:
- distance matrix where each cell contains the distance from the source to vertex (i, j)
- prev matrix where each cell contains its parent. By backtracking from (98, 98) I can find the shortest path.
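For example, the traceback could be done like this (a minimal sketch using a hand-made 2x2 prev matrix, not my real output):

```python
def traceback(prev, source, target):
    # Walk parent pointers back from the target until the source is reached.
    path = [target]
    while path[-1] != source:
        x, y = path[-1]
        path.append(tuple(prev[x][y]))
    path.reverse()
    return path

# Toy prev matrix: (1, 1)'s parent is (0, 1), whose parent is (0, 0).
prev = [[(0, 0), (0, 0)],
        [(0, 0), (0, 1)]]
print(traceback(prev, (0, 0), (1, 1)))  # [(0, 0), (0, 1), (1, 1)]
```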
Implementation:
    import numpy as np

    MAX_DISTANCE = 99999
    RANGE_ARR = [x for x in range(1, 1001)]

    def dijkstra_get_min(Q, dist):
        min = MAX_DISTANCE + 1
        u = None
        for vertex in Q:
            if dist[vertex[0], vertex[1]] <= min:
                min = dist[vertex[0], vertex[1]]
                u = vertex
        return u

    def dijkstra(graph, src=(0, 0)):
        dist = np.array([np.array([0 for x in RANGE_ARR], dtype=float) for y in RANGE_ARR])
        prev = np.array([np.array([(0, 0) for x in RANGE_ARR], dtype='i,i') for y in RANGE_ARR])
        Q = []
        for i in RANGE_ARR_0:
            for j in RANGE_ARR_0:
                dist[i, j] = MAX_DISTANCE
                prev[i, j] = (0, 0)
                Q.append((i, j))
        dist[0][0] = 0
        while Q:
            u = dijkstra_get_min(Q, dist)
            Q.remove(u)
            moves = [x for x in ((u[0], u[1] + 1), (u[0] + 1, u[1]), (u[0] + 1, u[1] + 1)) if x in Q]
            for v in moves:
                alt = dist[u[0]][u[1]] + graph[v[0]][v[1]]
                if alt < dist[v[0]][v[1]]:
                    dist[v[0], v[1]] = alt
                    prev[v[0], v[1]] = u
        return dist, prev
Any opinions about its correctness?
2 Answers
Your code looks generally correct, but it ignores src and only searches in the positive direction. In addition, it can be cleaned up and optimised significantly.
Some general comments first:
- Use full variable names that express meaning/purpose. There is no significant cost to meaningful names, but they can make code much easier to digest.
- Be aware of the host language's features and standards. Avoid re-using the names of builtins (e.g. min) and try to adhere to coding style standards.
- Avoid numpy unless actually using its inbuilt features. Using numpy.array for direct element access is usually slower than list/set/... because values are converted to full Python objects on each access.
Do not make assumptions about the features of your data. Specifically, avoid these:

    MAX_DISTANCE = 99999
    RANGE_ARR = [x for x in range(1, 1001)]

These fail for graphs with distances greater than 99999 or more than 1000 elements per side. Either compute the bounds from your input, or use true upper bounds.
Since floating-point numbers have a well-defined "infinity", we can use it safely:

    INFINITY = float('inf')
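As a quick sanity check that the float infinity behaves as a safe initial distance:

```python
INFINITY = float('inf')

# Infinity compares greater than any finite number, so no real weight
# can accidentally exceed the "unreached" marker.
assert 99999 < INFINITY
assert 10 ** 100 < INFINITY
assert min(INFINITY, 3.5) == 3.5
```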
Since the input graph is an nxn matrix, we can just query its size:

    # inside `def dijkstra(graph, source):`
    indices = range(len(graph))
Let us start with vertex in Q with min dist[u] / dijkstra_get_min. Your algorithm is correct, but we can exploit that Python's builtin min already accepts a custom key. The for vertex in Q: becomes the primary argument to min, and the if dist[vertex[0], vertex[1]] <= min: comparison becomes the key.

    def dijkstra_get_min(vertices, distances):
        return min(vertices, key=lambda vertex: distances[vertex[0]][vertex[1]])
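A small standalone demonstration of min with a key (the table and candidate vertices below are made up):

```python
# Pick the candidate vertex with the smallest entry in a distance table.
distances = [[7.0, 2.0],
             [5.0, 9.0]]
vertices = [(0, 0), (0, 1), (1, 0)]
closest = min(vertices, key=lambda v: distances[v[0]][v[1]])
print(closest)  # (0, 1), since distances[0][1] == 2.0 is the smallest
```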
The Dijkstra algorithm consists of two parts – initialisation and search. Your code becomes clearer if we split these two parts – your line dist[0][0] = 0 is the transition from one to the other.

    def dijkstra(graph, src=(0, 0)):
        # dist, prev, Q
        distances, prev_nodes, unvisited = dijkstra_initial(len(graph))
        # set starting point
        distances[src[0]][src[1]] = 0
        dijkstra_search(graph, distances, prev_nodes, unvisited)
        return distances, prev_nodes
The purpose of initialisation is that every point starts with the same value. This means we can directly create the matrices with their final value. Also, since the algorithm does not use the "previous node" at this stage, we can initialise it to a cheap placeholder.

    def dijkstra_initial(size):
        distances = [[INFINITY] * size for _ in range(size)]
        prev_nodes = [[None] * size for _ in range(size)]
        unvisited = {(x, y) for x in range(size) for y in range(size)}
        # dist, prev, Q
        return distances, prev_nodes, unvisited
Instead of tracking unvisited nodes as a list ([..., ...]) we use a set ({..., ...}). A set is unordered and supports O(1) membership tests, compared to a list's O(n) membership tests. This makes it better suited for bookkeeping of visited/unvisited nodes.
To search through the graph, we will be visiting the neighbours repeatedly. This is a key part that can easily be done wrong – unless the Graph implementation provides it, it can be worthwhile to implement it explicitly.

    def neighbours(node):
        x, y = node
        return [
            (x + x_offset, y + y_offset)
            for x_offset in (-1, 0, 1)
            for y_offset in (-1, 0, 1)
            if not (x_offset == y_offset == 0)  # reject node itself
        ]
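A standalone check of this helper (repeating its definition so the snippet runs on its own). Note that out-of-range coordinates such as (-1, -1) are not filtered here, which is fine because the search loop skips anything not in unvisited:

```python
def neighbours(node):
    # The 8 surrounding offsets, excluding the node itself.
    x, y = node
    return [
        (x + dx, y + dy)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if not (dx == dy == 0)
    ]

print(neighbours((1, 1)))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1), (2, 2)]
```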
The core of the algorithm stays logically the same. We adjust some names to be more descriptive (e.g. u -> node, v -> neighbour) and use the prepared neighbours helper instead of the lengthy inline expression.
    def dijkstra_search(graph, distances, prev_nodes, unvisited):
        while unvisited:
            node = dijkstra_get_min(unvisited, distances)
            unvisited.remove(node)
            for neighbour in neighbours(node):
                if neighbour not in unvisited:
                    continue
                alt = distances[node[0]][node[1]] + graph[neighbour[0]][neighbour[1]]
                if alt < distances[neighbour[0]][neighbour[1]]:
                    distances[neighbour[0]][neighbour[1]] = alt
                    prev_nodes[neighbour[0]][neighbour[1]] = node
At this point, the code should be both faster and easier to maintain. The most glaring flaw we still have is the explicit handling of dimensions. Instead of manually accessing each dimension, it would be better if we could directly access points.
# currently
distances[neighbour[0]][neighbour[1]]
# desirable
distances[neighbour]
This can be "fixed" by using dictionaries ({point: value, ...}) instead of nested lists ([[value, ...], ...]). An immediate downside is that this trades memory for simplicity.
However, it can also be used to actually reduce memory usage – dictionaries can be naturally sparse, allowing us to simply not store undetermined fields. Since any visited node becomes irrelevant for distances, we can even clear the distances of nodes that are already processed.
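A minimal sketch of the dictionary-based layout (sizes and values made up):

```python
INFINITY = float('inf')
size = 3

# Dense dict: point tuples index values directly, no nested lookups.
distances = {(x, y): INFINITY for x in range(size) for y in range(size)}
distances[(0, 0)] = 0
assert distances[(0, 0)] == 0
assert distances[(2, 2)] == INFINITY

# Sparse variant: only store determined distances; default on read.
sparse = {(0, 0): 0}
assert sparse.get((2, 2), INFINITY) == INFINITY
```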
- \$\begingroup\$ Thank you so much for your detailed comment! Regarding the usage of numpy, its only purpose is that it provides easy debugging in PyCharm. But I guess that after debugging I should have reverted to normal lists. I have also changed my global input variables as you suggested, which allows easier flexibility for different inputs of data. The usage of the min function is good to know, as I implemented another version of this in another section of my analysis and this saves some hassle (took notes for the future). \$\endgroup\$ – O.B., Sep 7, 2020 at 17:00
- \$\begingroup\$ I will implement the rest of the changes you suggest, as I am getting very long run times that could use some optimisation. Thank you so much! \$\endgroup\$ – O.B., Sep 7, 2020 at 17:08
[x for x in range(1, 1001)] can be written as just list(range(1, 1001)). It would be good to give that 1001 a name too.
Similarly, [0 for x in RANGE_ARR] can be written as [0] * len(RANGE_ARR). Multiplying any sequence type by an integer repeats the elements within the sequence. As a bonus, from some quick benchmarking that I just did, it's also more than 10x faster:
from timeit import timeit
N = int(1e6)
TRIALS = int(1e3)
print(timeit(lambda: [0] * N, number=TRIALS), "seconds")
print(timeit(lambda: [0 for x in range(N)], number=TRIALS), "seconds")
2.9889957 seconds
38.1463017 seconds
Be aware though that you should not use this when the element type is mutable (like [[0]] * 5). Multiplying a sequence creates multiple references to the same object; it doesn't make copies.
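The pitfall is easy to demonstrate:

```python
# Multiplying a list of lists copies references: all rows are the SAME list.
rows = [[0]] * 3
rows[0].append(1)
print(rows)  # [[0, 1], [0, 1], [0, 1]]

# Safe alternative: build a fresh inner list per row.
rows = [[0] for _ in range(3)]
rows[0].append(1)
print(rows)  # [[0, 1], [0], [0]]
```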
It looks like Q should be a set. You don't care about order, and the only thing you use it for is to track membership. Sets will be significantly faster here. The only two changes needed are:

    Q = set()
    . . .
    Q.add((i, j))

The only behavioural change I can see is that dijkstra_get_min technically relies on the order of Q: if two elements share the minimum value, your algorithm picks the last instance. Since sets may iterate in a different order, this may change which vertex gets returned.
It looks like MAX_DISTANCE is meant to be some arbitrarily large number that everything else will be less than. You may want to try using np.inf for that. By hardcoding the upper limit, you risk the problem "growing" later and potentially exceeding that max, causing erroneous behavior.
- \$\begingroup\$ Thanks for your input! I guess I should have made Q a set and not a list, but I have already started execution of my run and expect it to take about 3 days (I only save data to file when the execution is complete, not every iteration), so I am not sure if it's worth restarting for this change. Hopefully the algorithm is also correct for that matter. Regarding [0] * size, I didn't know there is such a big difference in run time. However it looks dangerous for lists in lists. \$\endgroup\$ – O.B., Sep 6, 2020 at 23:41
- \$\begingroup\$ @O.B. Honestly, if I were you, I'd make the changes, then see if you can run another instance of the program (I have no idea what environment you're running this in, so I don't know if that's feasible). The changes I suggested have the potential for massive speed increases. The * suggestion likely won't matter unless the dijkstra function is being called repeatedly with large inputs. The set change will likely be where the majority of the speedup happens. remove on a list is an expensive operation: it's O(n) on lists, but effectively O(1) with sets. How big of Qs are you working with? \$\endgroup\$ – Carcigenicate, Sep 6, 2020 at 23:46
- \$\begingroup\$ If restarting is the only option though, I wouldn't say to restart, since I obviously have little background about the size of data you're working with and the external constraints you're under. \$\endgroup\$ – Carcigenicate, Sep 6, 2020 at 23:48
- \$\begingroup\$ Actually, remove on lists might be worse than I originally thought. It needs to search through the list to find the element, and then after removing it, it needs to shift all the elements after the removed element to the left to fill in the space (lists don't support efficient random removals). That means even if the first element in the list is the one to be removed, every element after it will need to be shifted. So every element in the list will need to be "visited" regardless of which element is removed. \$\endgroup\$ – Carcigenicate, Sep 6, 2020 at 23:57
- \$\begingroup\$ I see your point. I have restarted it, as it was running for only a few hours... Hopefully it will be faster. My matrix is 1,000x1,000 cells at this stage. \$\endgroup\$ – O.B., Sep 7, 2020 at 00:03
- \$\begingroup\$ RANGE_ARR_0? \$\endgroup\$