I have an unsorted array. I have queries in which I give a range, and the maximum value from that range has to be returned. For example:
array[]={23,17,9,45,78,2,4,6,90,1};
query(both inclusive): 2 6
answer: 78
Which algorithm or data structure should I construct to quickly retrieve the maximum value from any range? (There are a lot of queries.)
EDIT: This is indeed a simplified version of the actual problem. The array size can be as large as 100000 and there can be up to 100000 queries, so I definitely need some preprocessing that will facilitate fast query responses.
- Why is it unsorted? The problem is trivial if it's sorted, so the obvious approach is to sort it. – user7043, May 4, 2013 at 9:41
- @delnan Without some extra mechanism, you lose track of which values were originally in the range to be queried... – Thijs van Dien, May 4, 2013 at 9:54
- Specify your whole problem. If this knowledge (or any other information) matters, one has to know it in order to factor it into the solution. – user7043, May 4, 2013 at 9:56
- Am I missing something, or is this just a matter of visiting items 2 through 6 and finding the maximum value of those elements? – Blrfl, May 4, 2013 at 17:27
- @Blrfl: I don't think you're missing anything, except maybe the part about many queries. It's not really clear whether there's any point in building a structure that makes queries substantially cheaper than a sequential search. (Although there wouldn't be much point in asking the question here if that weren't the idea.) – Mike Sherrill 'Cat Recall', May 4, 2013 at 18:31
6 Answers
I think you could construct some kind of binary tree where each node holds the maximum value of its children:
78
45 78
23 45 78 6
23 17 9 45 78 2 4 6
Then you only need to find a way to determine which nodes you minimally need to check to find the maximum value in the queried range. In this example, to get the maximum value in the index range [2, 6] (inclusive) you would have max(45, 78, 4) instead of max(9, 45, 78, 2, 4). As the tree grows, the gain will be larger.
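Here is a rough sketch of the idea in Python (my own illustration; the helper names are made up). Each level of the tree is stored as a flat list, so the index range a node covers is implicit in its position, which also addresses the point raised in the comments below about knowing which indices a node covers:

def build_levels(a):
    """Level 0 is the original array; each higher level stores pairwise maxima."""
    levels = [list(a)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([max(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels

def range_max(levels, lo, hi):
    """Maximum of a[lo..hi] (both inclusive)."""
    best = float('-inf')
    level = 0
    while lo <= hi:
        if lo % 2 == 1:                      # lo is a right child: its parent also covers lo-1
            best = max(best, levels[level][lo])
            lo += 1
        if hi % 2 == 0:                      # hi is a left child: its parent also covers hi+1
            best = max(best, levels[level][hi])
            hi -= 1
        lo //= 2
        hi //= 2
        level += 1
    return best

# range_max(build_levels([23, 17, 9, 45, 78, 2, 4, 6, 90, 1]), 2, 6)  ->  78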
- For this to work, there's information missing from your example tree: each internal node must have both the maximum and the total number of child nodes it has. Otherwise the search has no way of knowing that (for example) it doesn't have to look at all the children of 78 (and skip the 2), because for all it knows index 6 is in that subtree. – Izkata, May 4, 2013 at 18:14
- Otherwise, +1 as I find this rather inventive. – Izkata, May 4, 2013 at 18:16
- +1: This is a powerful technique for answering queries about subranges of a list in log(N) time, usable whenever the data at the root node can be computed in constant time from the data at the children. – kevin cline, May 4, 2013 at 19:40
- This idea is awesome. It gives O(log n) query time. I think @Izkata made a good point too. We can augment the tree node with information about the left and right ranges it covers. So given a range, it knows how to split the problem into two. Space-wise, all the data are stored at the leaf level, so it requires 2*N space, which is O(N) to store. I don't know what a segment tree is, but is this the idea behind the segment tree? – Kay, Jan 26, 2018 at 20:43
- And in terms of preprocessing, it takes O(n) to construct the tree. – Kay, Jan 26, 2018 at 20:51
To complement ngoaho91's answer.
The best way to solve this problem is to use the segment tree data structure. It allows you to answer such queries in O(log n), which means the total complexity of your algorithm would be O(Q log n), where Q is the number of queries. With the naive algorithm, the total complexity would be O(Qn), which is obviously slower.
There is, however, a drawback to segment trees: they take up a lot of memory. But often you care less about memory than about speed.
I will briefly describe the algorithms used by this DS:
The segment tree is just a special case of a binary tree where every node holds the value (here, the maximum) of the range assigned to it. The root node is assigned the range [0, n-1]. Its left child is assigned the range [0, (n-1)/2] and its right child the range [(n-1)/2+1, n-1]. The tree is built this way all the way down.
Create Tree:
/*
   A[]    -> array of original values
   tree[] -> the segment tree data structure
   node   -> the node we are currently in: remember the left child is 2*node,
             the right child is 2*node+1
   a, b   -> the limits of the segment this node covers. This is needed because
             we are dealing with a recursive function.
*/
int tree[SIZE];

void build_tree(const vector<int>& A, int node, int a, int b) {
    if (a == b) {              // We got down to a single element
        tree[node] = A[a];     // This node stores the only value
    }
    else {
        int leftChild  = 2*node;
        int rightChild = 2*node + 1;   // Or leftChild + 1
        int middle     = (a + b) / 2;
        build_tree(A, leftChild, a, middle);        // Recursively build the tree in the left child
        build_tree(A, rightChild, middle + 1, b);   // Recursively build the tree in the right child
        tree[node] = max(tree[leftChild], tree[rightChild]);  // The value of this node
                                                              // is the max of both children.
    }
}
Query Tree
int query(int node, int a, int b, int p, int q) {
    if (b < p || a > q)           // The node's range is completely outside the query range
        return -INF;              // Return a big negative number. Can you figure out why?
    else if (a >= p && b <= q)    // The node's range is completely inside the query range
        return tree[node];
    int l = 2*node;
    int r = l + 1;
    int m = (a + b) / 2;
    return max(query(l, a, m, p, q), query(r, m+1, b, p, q));  // Return the max of querying both children.
}
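With this indexing, node 1 is the root, so for an array of n elements you build the tree once with build_tree(A, 1, 0, n-1) and answer each query over the inclusive range [p, q] with query(1, 0, n-1, p, q). Sizing tree[] with SIZE around 4*n is enough for this recursive layout.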
If you need further explanation, just let me know.
BTW, a segment tree also supports updating a single element, or a range of elements (with lazy propagation), in O(log n).
- What's the complexity of filling the tree? – Pieter B, Jun 20, 2015 at 7:08
- You have to go through all the elements, and it takes O(log(n)) for each element to be added to the tree. Therefore, the total complexity is O(n log(n)). – Andrés, Jun 21, 2015 at 2:53
The best algorithm would run in O(n) time, as below. Let start and end be the indices of the bounds of the range:
int findMax(int[] a, int start, int end) {
    int max = Integer.MIN_VALUE; // initialize to the smallest integer value
    for (int i = start; i <= end; i++)
        if (a[i] > max)
            max = a[i];
    return max;
}
- -1 for merely repeating the algorithm the OP was trying to improve on. – kevin cline, May 4, 2013 at 19:42
- +1 for posting a solution to the as-stated problem. This really is the only way to do it if you have an array and don't know what the bounds are going to be a priori. (Although I would initialize max to a[i] and start the for loop at i+1.) – Blrfl, May 4, 2013 at 20:36
- @kevincline It's not just restating - it's also saying "Yes, you already have the best algorithm for this task", with a minor improvement (jump to start, stop at end). And I agree, this is the best for a one-time lookup. @ThijsvanDien's answer is only better if the lookup is going to happen multiple times, since it takes longer to set up initially. – Izkata, May 6, 2013 at 17:03
- Granted, at the time of posting this answer, the question did not include the edit confirming that he'll be doing many queries over the same data. – Izkata, May 6, 2013 at 18:44
The binary tree/segment tree-based solutions are indeed pointing in the right direction. One might object that they require a lot of extra memory, however. There are two ways to address that concern:
- Use an implicit data structure instead of a binary tree
- Use an M-ary tree instead of a binary tree
The first point is that, because the tree is highly structured, you can use a heap-like layout to define the tree implicitly rather than representing it with nodes, left and right pointers, intervals, etc. That saves a lot of memory with essentially no performance hit; you just need to do a little more index arithmetic.
The second point is that, at the cost of a little more work during evaluation, you can use an M-ary tree rather than a binary tree. For instance, with a 3-ary tree you compute the max of 3 elements at a time, then 9 at a time, then 27, and so on. The extra storage required is then roughly N/(M-1), which you can show with the geometric series formula: N/M + N/M^2 + N/M^3 + ... = N/(M-1). If you choose M = 11, for example, you need about 1/10th the extra storage of the binary tree method.
You can verify that these naive and optimized implementations in Python give the same results:
class RangeQuerier(object):
    # The naive way
    def __init__(self):
        pass

    def set_array(self, arr):
        # Set, and preprocess
        self.arr = arr

    def query(self, l, r):
        # Maximum of arr[l:r] (r exclusive); None for an empty range
        try:
            return max(self.arr[l:r])
        except ValueError:
            return None
vs.
from math import ceil, floor

class RangeQuerierMultiLevel(object):
    def __init__(self):
        self.arrs = []
        self.sub_factor = 3
        self.len_ = 0

    def set_array(self, arr):
        # Set, and preprocess
        tgt = arr
        self.len_ = len(tgt)
        self.arrs.append(arr)
        while len(tgt) > 1:
            tgt = self.maxify_one_array(tgt)
            self.arrs.append(tgt)

    def maxify_one_array(self, arr):
        # Collapse each group of sub_factor elements into its maximum
        sub_arr = []
        themax = float('-inf')
        for i, el in enumerate(arr):
            themax = max(el, themax)
            if i % self.sub_factor == self.sub_factor - 1:
                sub_arr.append(themax)
                themax = float('-inf')
        return sub_arr

    def query(self, l, r, level=None):
        # Maximum of arr[l:r] (r exclusive); None for an empty range
        if level is None:
            if r <= l:
                return None
            level = len(self.arrs) - 1
        if r <= l:
            return float('-inf')  # empty sub-range inside the recursion
        int_size = self.sub_factor ** level
        lhs, mid, rhs = (float('-inf'), float('-inf'), float('-inf'))
        # Check if there's an imperfect match on the left hand side
        if l % int_size != 0:
            lnew = int(ceil(l / float(int_size))) * int_size
            lhs = self.query(l, min(lnew, r), level - 1)
            l = lnew
        # Check if there's an imperfect match on the right hand side
        if r % int_size != 0:
            rnew = int(floor(r / float(int_size))) * int_size
            rhs = self.query(max(rnew, l), r, level - 1)
            r = rnew
        if r > l:
            # Handle the middle elements, which align exactly with this level
            mid = max(self.arrs[level][l // int_size:r // int_size])
        return max(max(lhs, mid), rhs)
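A quick sanity check (the array size and query count here are arbitrary choices of mine) that both classes agree on the same half-open queries:

import random

data = [random.randint(0, 10**6) for _ in range(500)]
naive = RangeQuerier()
fast = RangeQuerierMultiLevel()
naive.set_array(data)
fast.set_array(data)

for _ in range(10000):
    l = random.randrange(len(data))
    r = random.randrange(l, len(data) + 1)   # r is exclusive and may equal l (empty range)
    assert naive.query(l, r) == fast.query(l, r)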
Try the "segment tree" data structure.
There are 2 steps:
build_tree() - O(n)
query(int min, int max) - O(log n) per query
http://en.wikipedia.org/wiki/Segment_tree
edit:
You guys just didn't read the wiki I linked!
This algorithm is:
- you traverse the array once to build the tree: O(n)
- the next 100000000+ times you want to know the max of any part of the array, you just call the query function: O(log n) for every query
- a C++ implementation is here: geeksforgeeks.org/segment-tree-set-1-range-minimum-query/
The old algorithm is: for every query, just traverse the selected range and find the max.
So if you are only going to process a single query, OK, it is slower than the old way.
But if you are going to process a huge number of queries (billions), it's very efficient.
You can generate a text file like this for testing:
line 1: 50000 random numbers from 0 to 1000000, separated by spaces (this is the array)
line 2: 2 random numbers from 1 to 50000, separated by a space (this is a query)
...
line 200000: like line 2, another random query
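For example, a quick generator for such a file could look like this (the file name and the choice to sort each query's endpoints are my own assumptions):

import random

with open("rmq_test.txt", "w") as f:
    # line 1: the array of 50000 values in [0, 1000000]
    f.write(" ".join(str(random.randint(0, 1000000)) for _ in range(50000)) + "\n")
    # lines 2..200000: random queries with endpoints in [1, 50000]
    for _ in range(199999):
        a, b = sorted(random.randint(1, 50000) for _ in range(2))
        f.write("%d %d\n" % (a, b))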
Here is an example problem (sorry, it's in Vietnamese):
http://vn.spoj.com/problems/NKLINEUP/
If you solve it the old way, you will never pass.
- I don't think that's relevant. An interval tree holds intervals, not integers, and the operations they permit look nothing like what OP asks for. You could, of course, generate all possible intervals and store them in an interval tree, but (1) there are exponentially many of them, so this doesn't scale, and (2) the operations still don't look like what OP asks for. – user7043, May 4, 2013 at 10:15
- My mistake, I meant segment tree, not interval tree. – ngoaho91, May 4, 2013 at 10:23
- Interesting, I think I've never come across this tree! IIUC this still requires storing all possible intervals, though. I think there are O(n^2) of those, which is rather expensive. (Also, shouldn't query be O(log n + k) for k results?) – user7043, May 4, 2013 at 10:30
- Yes, build_tree() must traverse the array and store the max (or min) value for every node. But in many cases, memory cost matters less than speed. – ngoaho91, May 4, 2013 at 10:43
- I can't imagine this being any faster than a plain O(n) search of the array, as described in tarun_telang's answer. First instinct is that O(log n + k) is faster than O(n), but the O(log n + k) is just retrieval of the sub-array - equivalent to O(1) array access given the start and end points. You would still need to traverse it to find the maximum. – Izkata, May 4, 2013 at 17:53
You can achieve O(1) per query (with O(n log n) construction time and space) using a data structure called a sparse table. For each power of 2, save the maximum of every segment of that length. Now, given a segment [l, r), you take the maximum of the maximums over [l, l+2^k) and [r-2^k, r) for the largest k such that 2^k <= r-l. The two segments may overlap, but that's OK, because taking an element into the max twice does no harm.
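A minimal sketch of this idea (the function names are just for illustration), using half-open queries [l, r) as above:

def build_sparse_table(a):
    """table[k][i] holds max(a[i : i + 2**k]); O(n log n) time and space."""
    n = len(a)
    table = [list(a)]
    k = 1
    while (1 << k) <= n:
        prev = table[-1]
        half = 1 << (k - 1)
        table.append([max(prev[i], prev[i + half]) for i in range(n - (1 << k) + 1)])
        k += 1
    return table

def query_max(table, l, r):
    """Maximum of a[l:r] (r exclusive) in O(1); the two blocks may overlap."""
    k = (r - l).bit_length() - 1          # largest k with 2**k <= r - l
    return max(table[k][l], table[k][r - (1 << k)])

# query_max(build_sparse_table([23, 17, 9, 45, 78, 2, 4, 6, 90, 1]), 2, 7)  ->  78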