I am implementing a DP-based range minimum query data structure and algorithm in Python 2.7. Any advice on algorithm time complexity improvements, code bugs or code style issues are appreciated.
More information about range minimum query.
from collections import defaultdict
import random
def build_rmq(numbers):
rmq_map = defaultdict(int) # key: tuple of (start, end) index, value: min value
j = 0
while 2 ** j - 1 < len(numbers):
j += 1
j -= 1
# print j
for step in range(0,j+1):
for i in range(0, len(numbers)):
if step == 0:
rmq_map[(i,step)] = numbers[i]
continue
if i + 2 ** (step-1) < len(numbers):
rmq_map[(i,step)] = min(rmq_map[(i, step-1)], rmq_map[(i + 2 ** (step-1), step-1)])
else:
rmq_map[(i,step)] = rmq_map[(i, step-1)]
return rmq_map
def rmq_query(start, end, rmq_map):
j = 0
while start + 2 ** j - 1 <= end:
j += 1
j -= 1
x = rmq_map[(start, j)]
y = rmq_map[(end + 1 - 2 ** j, j)]
return min(x,y)
if __name__ == "__main__":
numbers = [0,5,2,5,4,3,1,6,3]
rmq_map = build_rmq(numbers)
#print rmq_map
start = random.randint(0, len(numbers)-1)
end = random.randint(0, len(numbers)-1)
if start > end:
start, end = end, start
print start, end, numbers[start:end+1]
print rmq_query(start, end, rmq_map)
3 Answers 3
1. Review
There are no docstrings. What do these functions do? What do they return? How do I use them?
The function
build_rmq
constructs a persistent data structure which can then be queried by calling the functionrmq_query
. When you have a persistent data structure and functions that operate on it, this is good candidate for making into a class.The comment on this line is not quite right:
rmq_map = defaultdict(int) # key: tuple of (start, end) index, value: min value
the second element of the key is a step, not an end index.
There is no need to use a
defaultdict
here, since the code always fills in the entry before looking it up. An ordinary dictionary would do fine.This code:
j = 0 while 2 ** j - 1 < len(numbers): j += 1
computes the smallest power of 2 that is greater than
len(numbers)
. It would be simpler to useint.bit_length
:j = len(numbers).bit_length()
(In ChatterOne's answer it is suggested that you use
int(log(len(numbers), 2))
. I think this is a dangerous suggestion, becauselog
makes approximations that can lead to errors. For example,int(log(2**48 - 1, 2))
evaluates to 48, but the correct result is 47. In order to check that these errors can't actually cause your code to fail, you need to do numerical analysis. This seems complicated and error-prone to me, compared to sticking with integer arithmetic where you know the results are exact.)
Similarly, this loop:
j = 0 while start + 2 ** j - 1 <= end: j += 1
becomes:
j = (end - start + 1).bit_length()
This code subtracts 1 from
j
and then adds it back again, which is not necessary:j -= 1 for step in range(0,j+1):
range(0, x)
can be abbreviated torange(x)
.rmq_map[(i,step)]
does not need the parentheses:rmq_map[i, step]
is the same.The two loops over
step
andi
can be combined into a single loop usingitertools.product
I would avoid the special case
if step == 0
by having a separate loop to handle that case:rmq_map = {(i, 0): m for i, m in enumerate(numbers)}
and then looping over
step
starting from 1.In Python it is usual to represent a range in half-open fashion (exclusive at the upper end). It would be convenient if
rmq_query
took its arguments in the same way.The top-level code runs a single range query but does not check that the result is correct. It would be easy to check the result of
rmq_query(start, end)
againstmin(numbers[start:end+1])
. It would then be easy to turn this into a unit test using the features in theunittest
module.
2. Revised code
from itertools import product
class RangeMinimum(object):
"Data structure providing efficient range-minimum queries."
def __init__(self, numbers):
"Build a RangeMinimum object for the given sequence of numbers."
# Mapping from (start, step) to min(numbers[start:start + 2**step])
self._rmq = rmq = {(i, 0): m for i, m in enumerate(numbers)}
n = len(numbers)
for step, i in product(range(1, n.bit_length()), range(n)):
j = i + 2 ** (step-1)
if j < n:
rmq[i, step] = min(rmq[i, step-1], rmq[j, step-1])
else:
rmq[i, step] = rmq[i, step-1]
def query(self, start, stop):
"Return min(numbers[start:stop])."
j = (stop - start).bit_length() - 1
x = self._rmq[start, j]
y = self._rmq[stop - 2 ** j, j]
return min(x, y)
import unittest
import random
class TestRangeMinimum(unittest.TestCase):
def test_range_minimum(self):
n = 1000
numbers = [random.randrange(n) for _ in range(n)]
rmq = RangeMinimum(numbers)
for _ in range(n):
start, stop = sorted(random.sample(range(n + 1), 2))
self.assertEqual(rmq.query(start, stop), min(numbers[start:stop]))
3. Generalization
There's nothing special about the minimum here — exactly the same algorithm can be used for range-maximum queries, or for querying other kinds of statistic on a range (such as the longest common prefix, or the union of sets).
So that suggests this kind of generalization:
from itertools import product
class RangeQuery(object):
"Data structure providing efficient range queries."
def __init__(self, items, fn):
"""Build a RangeQuery object for a sequence of items.
fn -- function taking two items and returning their query
result, for example "min" to query the range-minimum. It must be
associative, commutative, and idempotent.
"""
# Mapping from (start, step) to reduce(fn, items[start:start + 2**step])
self._rq = rq = {(i, 0): item for i, item in enumerate(items)}
self._fn = fn
n = len(items)
for step, i in product(range(1, n.bit_length()), range(n)):
j = i + 2 ** (step-1)
if j < n:
rq[i, step] = fn(rq[i, step-1], rq[j, step-1])
else:
rq[i, step] = rq[i, step-1]
def query(self, start, stop):
"Return reduce(fn, items[start:stop])."
j = (stop - start).bit_length() - 1
x = self._rq[start, j]
y = self._rq[stop - 2 ** j, j]
return self._fn(x, y)
import random
import unittest
class TestRangeQuery(unittest.TestCase):
def test_range_minmax(self):
n = 1000
items = [random.randrange(n) for _ in range(n)]
rqmin = RangeQuery(items, min)
rqmax = RangeQuery(items, max)
for _ in range(n):
start, stop = sorted(random.sample(range(n + 1), 2))
self.assertEqual(rqmin.query(start, stop), min(items[start:stop]))
self.assertEqual(rqmax.query(start, stop), max(items[start:stop]))
-
\$\begingroup\$ Thanks for the great comments, Gareth and vote up. 1. 'because log makes approximations that can lead to errors' -- agree. 2. 'the second element of the key is a step, not an end index', agree. \$\endgroup\$Lin Ma– Lin Ma2017年02月01日 21:05:41 +00:00Commented Feb 1, 2017 at 21:05
-
\$\begingroup\$ BTW, do you think there are any bugs (I mean functional) in our logic/code (either my original code or your revised)? For functional bug, I mean in any (could be corner) case, our code does not return the right minimal value for a range query. \$\endgroup\$Lin Ma– Lin Ma2017年02月01日 22:10:15 +00:00Commented Feb 1, 2017 at 22:10
-
\$\begingroup\$ Re-read your code and one more question, in your generalization version, why it is generalized? I see very similar to your "Revised code". \$\endgroup\$Lin Ma– Lin Ma2017年02月02日 00:14:10 +00:00Commented Feb 2, 2017 at 0:14
-
1\$\begingroup\$ @LinMa: Did you compare the two versions to see where they differ? \$\endgroup\$Gareth Rees– Gareth Rees2017年02月02日 09:39:20 +00:00Commented Feb 2, 2017 at 9:39
-
\$\begingroup\$ Hi Gareth, I might be lost in your intention in your code, I think your code is far from resolving LCA problem, but I could be wrong, please feel free to correct me. I have write a version of using MRQ to resolve LCA problem => codereview.stackexchange.com/questions/154319/…, if you have advice there, appreciated! \$\endgroup\$Lin Ma– Lin Ma2017年02月03日 06:16:53 +00:00Commented Feb 3, 2017 at 6:16
Just some really quick thoughts.
This:
j = 0
while 2 ** j - 1 < len(numbers):
j += 1
j -= 1
can be written as:
from math import log
j = int(log(len_numbers, 2))
You're repeating step - 1
and 2 ** (step - 1)
, it's probably a good idea to use variables for those.
You're also accessing an array by both index and value, that's what enumerate
is for:
for index, value in enumerate(numbers):
if step == 0:
rmq_map[(index, step)] = value
continue
prev_step = step - 1
step_power = 2 ** prev_step
if index + step_power < len_numbers:
rmq_map[(index, step)] = min(rmq_map[(index, rev_step)], rmq_map[(index + step_power, prev_step)])
else:
rmq_map[(index, step)] = rmq_map[(index, prev_step)]
Also, 2 ** (step - 1)
is being calculated repeatedly, so there may be a slight performance increase if you keep a dictionary to cache the results. Not entirely sure though, because the overhead of using a dictionary may be too much for such a simple operation. The same goes for j
.
-
\$\begingroup\$ Thanks ChatterOne, love your comments and vote up. Besides the comments, do you think any code (functional/logical) bugs in my code or revised version? \$\endgroup\$Lin Ma– Lin Ma2017年02月06日 01:14:27 +00:00Commented Feb 6, 2017 at 1:14
-
\$\begingroup\$ BTW, how do you handle the issue @Gareth raised, for the math log issue? \$\endgroup\$Lin Ma– Lin Ma2017年02月06日 01:15:13 +00:00Commented Feb 6, 2017 at 1:15
-
\$\begingroup\$ @LinMa I know about the limitations of using
log
, it all depends on which space you're working. @Gareth did not chose 2^48 randomly, because up to 2^47 there's no problem. Integer arithmetic also has limits, they're just higher. Also nowadays depending on cpu/architecture, floating point math can be either slower or faster than integer operations. So as to which one to use I'd say "it depends". \$\endgroup\$ChatterOne– ChatterOne2017年02月06日 07:47:17 +00:00Commented Feb 6, 2017 at 7:47
A tiny detail not mentionned by ChatterOne's excellent answer.
if step == 0:
rmq_map[(i,step)] = numbers[i]
continue
if i + 2 ** (step-1) < len(numbers):
rmq_map[(i,step)] = min(rmq_map[(i, step-1)], rmq_map[(i + 2 ** (step-1), step-1)])
else:
rmq_map[(i,step)] = rmq_map[(i, step-1)]
could be written with an elif
and no continue
.
for i in range(len(numbers)):
if step == 0:
rmq_map[(i,step)] = numbers[i]
elif i + 2 ** (step-1) < len(numbers):
rmq_map[(i,step)] = min(rmq_map[(i, step-1)], rmq_map[(i + 2 ** (step-1), step-1)])
else:
rmq_map[(i,step)] = rmq_map[(i, step-1)]
-
\$\begingroup\$ Thanks Josay, love your comments and vote up. Besides the comments, do you think any code (functional/logical) bugs in my code or revised version? \$\endgroup\$Lin Ma– Lin Ma2017年02月06日 01:14:39 +00:00Commented Feb 6, 2017 at 1:14