I have a list of non-unique values. I want to output a list of the same length where each value corresponds to how many times that value has appeared so far.
The code I have written to do this seems over-complicated, so I think there has to be a better way.
x = [1,1,2,3,1,3]

def times_so_far(ls):
    out = [0]*len(ls)
    counted = {}
    for i,v in enumerate(ls):
        if v in counted:
            counted[v] += 1
        else:
            counted[v] = 0
        out[i] = counted[v]
    return out

times_so_far(x)
#[0, 1, 0, 0, 2, 1]
- What you wrote is actually pretty good, certainly better than the answer you accepted. (user2357112, Jan 29, 2016)
6 Answers
Maybe using `list.count()` you can achieve the same in fewer lines?
x = [1,1,2,3,1,3]

def times_so_far(ls):
    out = [0]*len(ls)
    for i in xrange(len(ls)):
        out[i] = ls[:i].count(ls[i])
    return out
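Note that `xrange` is Python 2 only; on Python 3 the same function would use `range`. A straightforward port, shown as a sketch:

x = [1,1,2,3,1,3]

def times_so_far(ls):
    out = [0]*len(ls)
    for i in range(len(ls)):  # range: Python 3 equivalent of xrange
        # count occurrences of ls[i] in the slice before index i
        out[i] = ls[:i].count(ls[i])
    return out

times_so_far(x)
#[0, 1, 0, 0, 2, 1]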
This can be written as a list comprehension, as mentioned by Caridorc, which removes the first `len(ls)` call (though maybe at the cost of resizing `out` for every element of the list):
def times_so_far(ls):
    return [ls[:i].count(ls[i]) for i in xrange(len(ls))]
Now, if you're going to work with lists of small non-negative integers, I would recommend the following, as it's similar to yours but works with list indices instead of dict keys:
def times_so_far(ls, maxValue):
    temp = [0]*(maxValue+1)
    out = [0]*len(ls)
    for i in xrange(len(ls)):
        out[i] = temp[ls[i]]
        temp[ls[i]] += 1
    return out
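A quick usage check (note the assumptions: `maxValue` must be at least `max(ls)`, and the values must be non-negative integers, since they are used as indices into `temp`):

x = [1,1,2,3,1,3]
times_so_far(x, max(x))
#[0, 1, 0, 0, 2, 1]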
- Yes, `list.count` seems much better. I think I'll use this, though I think it's `ls[i]` and not `ls[i+1]`. (C_Z_, Jan 29, 2016)
- Isn't this slower, and doesn't it also have the same worst-case memory usage? (Jan 29, 2016)
- @JoeWallis It's Python... probably the slowest language out there. If performance is a problem then I would suggest moving on to another language. (hackerman, Jan 29, 2016)
- My point is your algorithm's speed is \$O(n^2)\$ where the OP's was \$O(n)\$, and you also use \$O(3n)\$ memory. IMO that's a bad thing to call a good algorithm, in any language. (Jan 29, 2016)
- @hackerman `out` and `ls` are \2ドルn\$. `ls[:i]` is a copy, which is why `i[:]` is good. And so \$O(3n)\$. And `dict` lookup is \$O(1)\$. Since this is tagged "algorithm" I would assume they want it improved... (Jan 29, 2016)
You can use a `defaultdict` to remove the `if`/`else`.
To achieve this, you can do:
from collections import defaultdict

counted = defaultdict(int)
for i,v in enumerate(x):
    counted[v] += 1
To remove the need for `out` and `enumerate`, you can `yield` on each iteration.
This would make your code very small and readable.
from collections import defaultdict

def times_so_far(list_):
    counted = defaultdict(int)
    for v in list_:
        counted[v] += 1
        # subtract one so the first occurrence yields 0, matching the question
        yield counted[v] - 1

print(list(times_so_far([1,1,2,3,1,3])))
#[0, 1, 0, 0, 2, 1]
It will return a generator object, hence the `list` call. If this is a problem, you can change the `yield` to `out.append()` and re-add `out`, as sketched below.
This has \$O(n)\$ runtime, with a minimum of \$O(n)\$ and a maximum of \$O(2n)\$ memory usage.
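For completeness, here is a minimal sketch of that list-returning variant:

from collections import defaultdict

def times_so_far(list_):
    counted = defaultdict(int)
    out = []
    for v in list_:
        counted[v] += 1
        out.append(counted[v] - 1)  # append instead of yield
    return out

print(times_so_far([1,1,2,3,1,3]))
#[0, 1, 0, 0, 2, 1]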
I propose a very slow solution, \$O(N^2)\$ in time (each element re-counts the entire slice before it).
The benefits of this solution are its clarity and simplicity.
Knowing that:

- `xs[:i]` gives all the elements of `xs` up to the index `i`
- `enumerate` yields an `index, value` pair,
The following is easier to understand than English:
def times_so_far(xs):
    return [xs[:index].count(x) for index, x in enumerate(xs)]
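A quick check against the question's example:

>>> times_so_far([1, 1, 2, 3, 1, 3])
[0, 1, 0, 0, 2, 1]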
In case you care about performance, you may prefer using a `Counter` or a `defaultdict`, as suggested by @Joe Wallis and @Ethan Bierlein.
I would use the "high-performance" (as described in the documentation) `Counter` class for this task. You can actually get a complete counter at every step, but that is of course more than you asked for.
This is my implementation:
from collections import Counter

def times_so_far(nums):
    return list(times_so_far_generator(nums))

def times_so_far_generator(nums):
    counter = Counter()
    for num in nums:
        counter[num] += 1
        # subtract one so the first occurrence yields 0, as in the question
        yield counter[num] - 1
I like this solution because:

- It is fast: generators plus `Counter` give asymptotically optimal \$O(n)\$ performance with respect to both memory and time.
- It is very readable and understandable.
- It generalizes: `yield` the counter itself instead of `counter[num] - 1` to get a full `Counter` at every position (see the sketch after this list). You can also return a tuple of the element with its count via `yield (num, counter[num] - 1)`.
- It uses the specifically designed `Counter` instead of a generic `defaultdict`, and so is more fitting to the task (and can possibly perform better, as values are integers instead of arbitrary objects).
- It works on arbitrary iterables (even infinitely long ones, if you use the generator version).
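Here is a minimal sketch of the generalization mentioned in the list above (the `counts_so_far` name is hypothetical; note the `Counter(counter)` copy, since yielding the live counter would let later increments leak into snapshots the caller already holds):

from collections import Counter

def counts_so_far(nums):
    # hypothetical generalization: yield a snapshot of the full Counter at each position
    counter = Counter()
    for num in nums:
        counter[num] += 1
        yield Counter(counter)  # copy so each snapshot stays independent

for snapshot in counts_so_far([1, 1, 2]):
    print(snapshot)
# Counter({1: 1})
# Counter({1: 2})
# Counter({1: 2, 2: 1})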
- Whoever downvoted... please explain yourselves? (mleyfman, Jan 29, 2016)
- Not the downvoter, but calling `Counter` "high-performance" may be something of a misnomer. The only high-performance aspects of it relative to an ordinary dict or a defaultdict are creating or `update`-ing it from an input iterable, and even then, only on Python 3.3 and up. Other than that, it's either equivalent in speed or slower than dict or defaultdict. (user2357112, Jan 29, 2016)
- @user2357112 I was simply referring to the way it is described in the official documentation. The collections module is described as "High-performance container datatypes". (mleyfman, Jan 29, 2016)
I would suggest a vectorized approach making heavy use of NumPy functions, which might be beneficial since we are focusing on performance here. The steps to solve our case could be put together as shown below:
# Get counts for each label in the input array
counts = np.bincount(X)
counts = counts[counts > 0]  # drop absent values so the group boundaries line up

# Set up an ID array that, when cumulatively summed, gives a ramp-like array
# (0, 1, 2, ... restarting at each group) for the sorted version of the input
id_arr = np.ones(len(X), dtype=int)
id_arr[0] = 0
id_arr[counts[:-1].cumsum()] = -counts[:-1] + 1

# Perform cumulative summation, then go back to the original order
# with a double argsort (stable sort, hence kind='mergesort')
out = id_arr.cumsum()[np.argsort(X, kind='mergesort').argsort()]
Please note that the expected input is a NumPy array of non-negative integers (`np.bincount` requires this).
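To make the ramp trick concrete, here is a trace on the question's example, with intermediate values shown as comments (this assumes the zero-count filtering step added above):

import numpy as np

X = np.array([1, 1, 2, 3, 1, 3])
counts = np.bincount(X)          # [0, 3, 1, 2]: value v occurs counts[v] times
counts = counts[counts > 0]      # [3, 1, 2]: group sizes in sorted order
id_arr = np.ones(len(X), dtype=int)
id_arr[0] = 0
id_arr[counts[:-1].cumsum()] = -counts[:-1] + 1
# id_arr is now [0, 1, 1, -2, 0, 1]; its cumsum [0, 1, 2, 0, 0, 1]
# is the "times so far" ramp for the *sorted* input [1, 1, 1, 2, 3, 3]
out = id_arr.cumsum()[np.argsort(X, kind='mergesort').argsort()]
print(out)                       # [0 1 0 0 2 1]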
Benchmarking
Let's test out all the approaches posted thus far to solve the problem. I have compiled all the solutions as functions named after their respective authors. Listed below are the compiled functions:
def times_so_far_hackerman(ls):
    out = [0]*len(ls)
    for i in xrange(len(ls)):
        out[i] = ls[:i].count(ls[i])
    return out

def times_so_far_JoeWallis(list_):
    counted = defaultdict(int)
    for v in list_:
        counted[v] += 1
        yield counted[v]-1

def times_so_far_generator(nums):
    counter = Counter()
    for num in nums:
        counter[num] += 1
        yield counter[num]-1

def times_so_far_mleyfman(nums):
    return list(times_so_far_generator(nums))

def times_so_far_Caridorc(xs):
    return [xs[:index].count(x) for index, x in enumerate(xs)]

def times_so_far_vectorized(X):
    counts = np.bincount(X)
    counts = counts[counts > 0]  # drop absent values (see the note above)
    id_arr = np.ones(len(X), dtype=int)
    id_arr[0] = 0
    id_arr[counts[:-1].cumsum()] = -counts[:-1]+1
    return id_arr.cumsum()[np.argsort(X, kind='mergesort').argsort()]
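Before timing, a quick sanity check that all the variants agree on the question's example (a sketch; `times_so_far` is the OP's original function, and `np` is NumPy imported as usual):

x = [1, 1, 2, 3, 1, 3]
expected = [0, 1, 0, 0, 2, 1]
assert times_so_far(x) == expected
assert times_so_far_hackerman(x) == expected
assert list(times_so_far_JoeWallis(x)) == expected
assert times_so_far_mleyfman(x) == expected
assert times_so_far_Caridorc(x) == expected
assert times_so_far_vectorized(np.array(x)).tolist() == expected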
Now, let's time these, starting with a relatively decent-sized list of numbers:
In [9]: %timeit times_so_far(x)
...: %timeit times_so_far_hackerman(x)
...: %timeit list(times_so_far_JoeWallis(x))
...: %timeit times_so_far_mleyfman(x)
...: %timeit times_so_far_Caridorc(x)
...: %timeit times_so_far_vectorized(np.array(x))
...:
100 loops, best of 3: 3.27 ms per loop
1 loops, best of 3: 1.3 s per loop
100 loops, best of 3: 3.03 ms per loop
100 loops, best of 3: 8.32 ms per loop
1 loops, best of 3: 1.36 s per loop
100 loops, best of 3: 2.44 ms per loop
So, the three fastest approaches are:

- `times_so_far()`
- `times_so_far_JoeWallis()`
- `times_so_far_vectorized()`
Let's use a bigger array to test out those three:
In [10]: # Input array with random elements between 0 and 500
...: x = np.random.randint(0,500,(50000)).tolist()
In [11]: %timeit times_so_far(x)
...: %timeit list(times_so_far_JoeWallis(x))
...: %timeit times_so_far_vectorized(np.array(x))
...:
100 loops, best of 3: 19.7 ms per loop
100 loops, best of 3: 17.7 ms per loop
100 loops, best of 3: 15.6 ms per loop
Now, if the input is already a NumPy array, `times_so_far_vectorized` can shave off the runtime spent converting from a list to an array. Let's see how that works out:
In [12]: %timeit times_so_far(x)
...: %timeit list(times_so_far_JoeWallis(x))
...: X = np.array(x)
...: %timeit times_so_far_vectorized(X)
...:
100 loops, best of 3: 19.5 ms per loop
100 loops, best of 3: 17.6 ms per loop
100 loops, best of 3: 11.3 ms per loop
So, there's a pretty good performance boost for `times_so_far_vectorized` over the other approaches!
You're right - there is a better way. Python `list`s have a built-in method named `count`. For example, if I wanted to count the number of times the string `"lol"` occurred in a list, I could just do this:
my_list = [1, 2, "lol", 3, 4, "lol", 5, 6, 7, 8, "lol"]
print(my_list.count("lol"))  # 3
This is also ultimately better than generating a list of counts corresponding to each value, since it's less memory-intensive and much easier to work with.
In addition, if you want to take advantage of the powerful `collections` module, you should add this to the top of your code:

from collections import Counter

Then you can use the `Counter` object like this:
>>> print(Counter([1, 2, "lol", 3, 4, "lol", 5, 6, 7, 8, "lol"]))
Counter({'lol': 3, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1})
Hope this helps! :)