Suppose I have a sorted array, and each element may appear multiple times.
To make it simple, here is an example:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3]
The expected output is the number of each element. In this example, the results are:
number of 1 is 17
number of 2 is 6
number of 3 is 9
I want to find a solution where the algorithm time complexity is lower than \$O(n)\$. My idea is similar to binary search:
- When I meet an equal neighbour, I jump the tail pointer ahead by 1, by 2, by 4, etc., and at the same time I move the cur pointer to the old tail position
- When I meet an unequal neighbour, I shrink the tail pointer back to the middle of cur and tail
- The difference between cur and the last index of the previous element is the count of the current element
Any advice on algorithm time complexity improvement, code bugs or general code style advice is appreciated.
from collections import defaultdict
import random

def generate_count(numbers):
    last_index = -1
    result = defaultdict(int)
    cur_index = last_index + 1
    tail_index = cur_index + 1
    while cur_index < len(numbers):
        # Gallop: while values match, double the step and drag cur forward.
        while cur_index < tail_index and cur_index < len(numbers) and tail_index < len(numbers):
            if numbers[cur_index] == numbers[tail_index]:
                pre = cur_index
                cur_index = tail_index
                tail_index += (tail_index - pre)
            else:
                # Overshot the end of the run: shrink tail back toward cur.
                tail_index = (cur_index+tail_index) / 2
        # cur_index now sits on the last occurrence of the current value.
        result[numbers[cur_index]] = cur_index - last_index
        last_index = cur_index
        cur_index = last_index + 1
        tail_index = cur_index + 1
    return result
if __name__ == "__main__":
    numbers = []
    for i in range(random.randint(1, 20)):
        numbers.append(1)
    print 'number of 1 is', i+1
    for i in range(random.randint(1, 20)):
        numbers.append(2)
    print 'number of 2 is', i+1
    for i in range(random.randint(1, 20)):
        numbers.append(3)
    print 'number of 3 is', i+1
    print numbers
    print generate_count(numbers)
1 Answer
"I want to find a solution [whose] algorithm time complexity is lower than \$O(n)\$."
You'll have to be more specific. What you asked for literally isn't possible; but sometimes people use big-O notation loosely, either out of laziness or because they haven't yet been exposed to the notation that would express the precise gradation of meaning they intend.
Notice that if I give you the input
1 2 3 4 5 6 7 8 9 10 11 12 13 ...
then you must output
number of 1 is 1
number of 2 is 1
number of 3 is 1
number of 4 is 1
...
That is, if the size of the input is \$n\$, then the worst-case size of the output is \$O(n)\$. Printing those \$O(n)\$ bytes of output will take \$O(n)\$ time just by itself. Anything else the program does (besides print its output) can only make the overall program slower, not faster.
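For reference, here is a minimal one-pass counter that already meets that \$O(n)\$ bound (a sketch of mine, not code from the question), so any scheme that hopes to do better must beat this on its chosen inputs:

from itertools import groupby

def linear_count(numbers):
    # One pass over the sorted input: each maximal run of equal
    # values collapses to a single (value, count) pair.
    return dict((value, len(list(run))) for value, run in groupby(numbers))

print linear_count([1, 1, 1, 2, 2, 3])  # prints {1: 3, 2: 2, 3: 1}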
On the above worst-case input, your "repeated binary search" algorithm runs in \$O(n \log n)\$ time, which is worse than \$O(n)\$.
Now, if you're concerned not with worst-case inputs but merely with the average input... well, then you'd have to define a notion of "the average input". You'll have to come up with some model of what kinds of input you expect. One way to do this is to come up with an algorithm for generating inputs, and then treat that generator as your model. For example:
def get_random_input(n):
    return sorted([random.randrange(n) for i in xrange(n)])
Or:
def get_random_input(n):
    return sorted([random.randrange(100) for i in xrange(n)])
On these two different models, your algorithm performs very differently. On the latter, you do well: \$O(\log n)\$. On the former, you do poorly: \$O(n \log n)\$.
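One quick way to see why (a sketch of mine, not part of the original answer): the algorithm pays roughly one gallop per run of equal values, and the two models produce wildly different run counts:

import random

def count_runs(xs):
    # Number of maximal runs of equal values in a sorted list.
    return sum(1 for i in xrange(len(xs)) if i == 0 or xs[i] != xs[i - 1])

n = 100000
print count_runs(sorted(random.randrange(n) for i in xrange(n)))    # about 0.63*n runs
print count_runs(sorted(random.randrange(100) for i in xrange(n)))  # at most 100 runs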
This is kind of like asking for a compression algorithm that compresses all possible inputs. You can get compression for some inputs only at the expense of others; the art is deciding which kinds of inputs are "interesting" and which can safely be blown up. In this case, you can get sub-\$O(n)\$ performance for some inputs only at the expense of others; the art is deciding which kinds of inputs are "interesting", which is to say, deciding on a model for your "average" or "expected" input.
EDIT: My analogy above is flawed, as indicated by LinMa's comments below this answer. See, if you can get \$O(\log n)\$ performance on your "expected" inputs while blowing up the "unexpected" inputs by only a constant factor (e.g. "\$O(n)\$ plus a constant number of comparisons" or "\$O(n)\$ times two"), then you can keep your algorithm's worst-case big-O "constant" while improving the best case.
Hope this helps!
- \$\begingroup\$ Thanks for the reply Quuxplusone, I think my algorithm's worst case (when each element occurs only once) is \$O(n)\$; why do you think it is \$O(n \log n)\$? \$\endgroup\$ (Lin Ma, Feb 21, 2017 at 2:07)
- \$\begingroup\$ BTW, vote up for your post; I agree with most points, and I should make my target clearer. I am more interested in improving average performance to less than \$O(n)\$, and I am fine with worst-case performance being \$O(n)\$. \$\endgroup\$ (Lin Ma, Feb 21, 2017 at 2:08)
- \$\begingroup\$ @LinMa: I admit I didn't look closely at your function, because it looks dense and complicated. I assumed that you were binary searching (\$O(\log n)\$) in the rest of the array for the end of each run of identical values; and there are \$n\$ runs; so that's \$O(n \log n)\$ total. But you're doing something better than just binary-searching the entire rest of the array. I now agree it gets you back to \$O(n)\$ worst case. My point about compression is misleading, then: you're getting \$O(\log n)\$ on the "good" inputs and only hurting the "bad" ones by a constant factor, which is still \$O(n)\$! \$\endgroup\$ (Quuxplusone, Feb 21, 2017 at 2:21)
- \$\begingroup\$ Thanks Quuxplusone, also I am a bit lost on your example, specifically this line: `random.randrange(n) for i in xrange(n)`. I think your purpose is to generate a different number each time (so that the final result is [1, 2, 3, ...]), but you know `randrange(n)` could generate any number between `0` and `n-1`? \$\endgroup\$ (Lin Ma, Feb 21, 2017 at 2:30)
- \$\begingroup\$ @LinMa: `sorted(random.randrange(n) for i in xrange(n))` just generates a sorted list of probably but not necessarily unique numbers; e.g. it might reasonably generate `[1, 3, 3, 4, 6, 7, 7, 10, 11, 12, ... n-2]`, and it has a chance (however small) of generating `[1, 1, 1, 1, 1, ... 1]`. If you knew that your input generator only ever produced `sorted(range(n))`, it would be really easy to write a `generate_count()` that "worked"! ;) \$\endgroup\$ (Quuxplusone, Feb 21, 2017 at 2:58)
Comments on the question:

- \$\begingroup\$ [...] \$O(n)\$. What do you mean by a specific query? There is just one input array (sorted, but with duplicates); there is no concept called "query" here. If I am wrong, please feel free to correct me. \$\endgroup\$
- \$\begingroup\$ Re "My [idea] is similar to binary search": this idea has been named time and again: galloping search, exponential search, doubling search. \$\endgroup\$
- \$\begingroup\$ `tail_index = (cur_index+tail_index) / 2` looks plain wrong. Define "better" (for the exact result as described, I don't see anything faster on a RAM machine). \$\endgroup\$