Suppose I have a sorted array, and each element may appear multiple times.
To make it simple, here is an example:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]
The expected output is the number of each element. In this example, the results are:
number of 1 is 17
number of 2 is 6
number of 3 is 9
I want to find a solution where the algorithm time complexity is lower than \$O(n)\$. My idea is similar to binary search:
- When I meet an equal neighbour, I jump the tail pointer ahead by 1, by 2, by 4, etc., and at the same time I move the cur pointer to the old tail position
- When I meet an unequal neighbour, I shrink the tail pointer back to the middle of cur and tail
- The difference between cur and the last index of the previous element is the count of the current element
Any advice on algorithm time complexity improvement, code bugs or general code style advice is appreciated.
from collections import defaultdict
import random

def generate_count(numbers):
    last_index = -1
    result = defaultdict(int)
    cur_index = last_index + 1
    tail_index = cur_index + 1
    while cur_index < len(numbers):
        # Gallop: while values match, double the step and drag cur forward.
        while cur_index < tail_index and cur_index < len(numbers) and tail_index < len(numbers):
            if numbers[cur_index] == numbers[tail_index]:
                pre = cur_index
                cur_index = tail_index
                tail_index += (tail_index - pre)
            else:
                # Overshot the end of the run: shrink tail back toward cur.
                tail_index = (cur_index+tail_index) / 2
        # cur_index now sits on the last occurrence of the current value.
        result[numbers[cur_index]] = cur_index - last_index
        last_index = cur_index
        cur_index = last_index + 1
        tail_index = cur_index + 1
    return result
if __name__ == "__main__":
    numbers = []
    for i in range(random.randint(1, 20)):
        numbers.append(1)
    print 'number of 1 is', i+1
    for i in range(random.randint(1, 20)):
        numbers.append(2)
    print 'number of 2 is', i+1
    for i in range(random.randint(1, 20)):
        numbers.append(3)
    print 'number of 3 is', i+1
    print numbers
    print generate_count(numbers)
1 Answer
"I want to find a solution [whose] algorithm time complexity is lower than \$O(n)\$."
You'll have to be more specific. What you asked for literally isn't possible; but sometimes people use big-O notation loosely, either out of laziness or because they haven't yet been exposed to the notation that would express the precise gradation of meaning they intend.
Notice that if I give you the input
1 2 3 4 5 6 7 8 9 10 11 12 13 ...
then you must output
number of 1 is 1
number of 2 is 1
number of 3 is 1
number of 4 is 1
...
That is, if the size of the input is \$n\$, then the worst-case size of the output is \$O(n)\$. Printing those \$O(n)\$ bytes of output will take \$O(n)\$ time just by itself. Anything else the program does (besides print its output) can only make the overall program slower, not faster.
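For reference, here is a minimal one-pass counter that already meets that \$O(n)\$ bound (a sketch of mine, not code from the question), so any scheme that hopes to do better must beat this on its chosen inputs:

from itertools import groupby

def linear_count(numbers):
    # One pass over the sorted input: each maximal run of equal
    # values collapses to a single (value, count) pair.
    return dict((value, len(list(run))) for value, run in groupby(numbers))

print linear_count([1, 1, 1, 2, 2, 3])  # prints {1: 3, 2: 2, 3: 1}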
On the above worst-case input, your "repeated binary search" algorithm runs in \$O(n \log n)\$ time, which is worse than \$O(n)\$.
Now, if you're concerned not with worst-case inputs but merely with the average input... well, then you'd have to define a notion of "the average input". You'll have to come up with some model of what kinds of input you expect. One way to do this is to come up with an algorithm for generating inputs, and then treat that generator as your model. For example:
def get_random_input(n):
    return sorted([random.randrange(n) for i in xrange(n)])
Or:
def get_random_input(n):
    return sorted([random.randrange(100) for i in xrange(n)])
On these two different models, your algorithm performs very differently. On the latter, you do well: \$O(\log n)\$. On the former, you do poorly: \$O(n \log n)\$.
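One quick way to see why (a sketch of mine, not part of the original answer): the algorithm pays roughly one gallop per run of equal values, and the two models produce wildly different run counts:

import random

def count_runs(xs):
    # Number of maximal runs of equal values in a sorted list.
    return sum(1 for i in xrange(len(xs)) if i == 0 or xs[i] != xs[i - 1])

n = 100000
print count_runs(sorted(random.randrange(n) for i in xrange(n)))    # about 0.63*n runs
print count_runs(sorted(random.randrange(100) for i in xrange(n)))  # at most 100 runs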
This is kind of like asking for a compression algorithm that compresses all possible inputs. You can get compression for some inputs only at the expense of others; the art is deciding which kinds of inputs are "interesting" and which can safely be blown up. In this case, you can get sub-\$O(n)\$ performance for some inputs only at the expense of others; the art is deciding which kinds of inputs are "interesting", which is to say, deciding on a model for your "average" or "expected" input.
EDIT: My analogy above is flawed, as indicated by LinMa's comments below this answer. See, if you can get \$O(\log n)\$ performance on your "expected" inputs while blowing up the "unexpected" inputs by only a constant factor (e.g. "\$O(n)\$ plus a constant number of comparisons" or "\$O(n)\$ times two"), then you can keep your algorithm's worst-case big-O "constant" while improving the best case.
Hope this helps!
- \$\begingroup\$ Thanks for the reply Quuxplusone, I think my algorithm's worst case (when each element occurs only once) is \$O(n)\$; why do you think it is \$O(n \log n)\$? \$\endgroup\$ (Lin Ma, Feb 21, 2017 at 2:07)
- \$\begingroup\$ BTW, vote up for your post; I agree with most points, and I should make my target clearer. I am more interested in improving average performance to less than \$O(n)\$, and I am fine with worst-case performance being \$O(n)\$. \$\endgroup\$ (Lin Ma, Feb 21, 2017 at 2:08)
- \$\begingroup\$ @LinMa: I admit I didn't look closely at your function, because it looks dense and complicated. I assumed that you were binary searching (\$O(\log n)\$) in the rest of the array for the end of each run of identical values; and there are \$n\$ runs; so that's \$O(n \log n)\$ total. But you're doing something better than just binary-searching the entire rest of the array. I now agree it gets you back to \$O(n)\$ worst case. My point about compression is misleading, then: you're getting \$O(\log n)\$ on the "good" inputs and only hurting the "bad" ones by a constant factor, which is still \$O(n)\$! \$\endgroup\$ (Quuxplusone, Feb 21, 2017 at 2:21)
- \$\begingroup\$ Thanks Quuxplusone, also I am a bit lost on your example, specifically this line: `random.randrange(n) for i in xrange(n)`. I think your purpose is to generate a different number each time (so that the final result is [1, 2, 3, ...]), but you know `randrange(n)` could generate any number between `0` and `n-1`? \$\endgroup\$ (Lin Ma, Feb 21, 2017 at 2:30)
- \$\begingroup\$ @LinMa: `sorted(random.randrange(n) for i in xrange(n))` just generates a sorted list of probably but not necessarily unique numbers; e.g. it might reasonably generate `[1, 3, 3, 4, 6, 7, 7, 10, 11, 12, ... n-2]`, and it has a chance (however small) of generating `[1, 1, 1, 1, 1, ... 1]`. If you knew that your input generator only ever produced `sorted(range(n))`, it would be really easy to write a `generate_count()` that "worked"! ;) \$\endgroup\$ (Quuxplusone, Feb 21, 2017 at 2:58)
Comments on the question:

- \$\begingroup\$ [...] \$O(n)\$. What do you mean by a specific query? There is just one input array (sorted, but with duplicates); there is no concept called "query" here. If I am wrong, please feel free to correct me. \$\endgroup\$
- \$\begingroup\$ Re "My [idea] is similar to binary search": this idea has been named time and again: galloping search, exponential search, doubling search. \$\endgroup\$
- \$\begingroup\$ `tail_index = (cur_index+tail_index) / 2` looks plain wrong. Define "better" (for the exact result as described, I don't see anything faster on a RAM machine). \$\endgroup\$