I have a list of non-unique values. I want to output a list of the same length where each value corresponds to how many times that value has appeared so far.
The code I have written to do this seems over-complicated, so I think there has to be a better way.
x = [1,1,2,3,1,3]

def times_so_far(ls):
    out = [0]*len(ls)
    counted = {}
    for i,v in enumerate(ls):
        if v in counted:
            counted[v] += 1
        else:
            counted[v] = 0
        out[i] = counted[v]
    return out

times_so_far(x)
#[0, 1, 0, 0, 2, 1]
- What you wrote is actually pretty good, certainly better than the answer you accepted. (user2357112, Jan 29, 2016)
6 Answers
Maybe using `list.count()` you can achieve the same in fewer lines?
x = [1,1,2,3,1,3]

def times_so_far(ls):
    out = [0]*len(ls)
    for i in xrange(len(ls)):
        out[i] = ls[:i].count(ls[i])
    return out
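Note that `xrange` is Python 2 only; on Python 3 the same function would use `range`. A straightforward port, shown as a sketch:

x = [1,1,2,3,1,3]

def times_so_far(ls):
    out = [0]*len(ls)
    for i in range(len(ls)):  # range: Python 3 equivalent of xrange
        # count occurrences of ls[i] in the slice before index i
        out[i] = ls[:i].count(ls[i])
    return out

times_so_far(x)
#[0, 1, 0, 0, 2, 1]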
This can be written as a list comprehension, as mentioned by Caridorc, which removes the first `len(ls)` call (though maybe at the cost of resizing `out` for every element of the list):
def times_so_far(ls):
    return [ls[:i].count(ls[i]) for i in xrange(len(ls))]
Now, if you're going to work with lists of small non-negative integers, I would recommend the following, as it's similar to yours but works with list indices instead of dict keys:
def times_so_far(ls, maxValue):
    temp = [0]*(maxValue+1)
    out = [0]*len(ls)
    for i in xrange(len(ls)):
        out[i] = temp[ls[i]]
        temp[ls[i]] += 1
    return out
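A quick usage check (note the assumptions: `maxValue` must be at least `max(ls)`, and the values must be non-negative integers, since they are used as indices into `temp`):

x = [1,1,2,3,1,3]
times_so_far(x, max(x))
#[0, 1, 0, 0, 2, 1]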
- Yes, `list.count` seems much better. I think I'll use this, though I think it's `ls[i]` and not `ls[i+1]`. (C_Z_, Jan 29, 2016)
- Isn't this slower, and doesn't it also have the same worst-case memory usage? (Jan 29, 2016)
- @JoeWallis It's Python... probably the slowest language out there. If performance is a problem then I would suggest moving on to another language. (hackerman, Jan 29, 2016)
- My point is your algorithm's speed is \$O(n^2)\$ where the OP's was \$O(n)\$, and you also use \$O(3n)\$ memory. IMO that's a bad thing to call a good algorithm, in any language. (Jan 29, 2016)
- @hackerman `out` and `ls` are \2ドルn\$. `ls[:i]` is a copy, which is why `i[:]` is good. And so \$O(3n)\$. And `dict` lookup is \$O(1)\$. Since this is tagged "algorithm" I would assume they want it improved... (Jan 29, 2016)
You can use a `defaultdict` to remove the `if`/`else`.
To achieve this, you can do:
from collections import defaultdict

counted = defaultdict(int)
for i,v in enumerate(x):
    counted[v] += 1
To remove the need for `out` and `enumerate`, you can `yield` on each iteration.
This would make your code very small and readable.
from collections import defaultdict

def times_so_far(list_):
    counted = defaultdict(int)
    for v in list_:
        counted[v] += 1
        # subtract one so the first occurrence yields 0, matching the question
        yield counted[v] - 1

print(list(times_so_far([1,1,2,3,1,3])))
#[0, 1, 0, 0, 2, 1]
It will return a generator object, hence the `list` call. If this is a problem, you can change the `yield` to `out.append()` and re-add `out`, as sketched below.
This has \$O(n)\$ runtime, with a minimum of \$O(n)\$ and a maximum of \$O(2n)\$ memory usage.
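For completeness, here is a minimal sketch of that list-returning variant:

from collections import defaultdict

def times_so_far(list_):
    counted = defaultdict(int)
    out = []
    for v in list_:
        counted[v] += 1
        out.append(counted[v] - 1)  # append instead of yield
    return out

print(times_so_far([1,1,2,3,1,3]))
#[0, 1, 0, 0, 2, 1]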
I propose a very slow solution, \$O(N^2)\$ in time (each element re-counts the entire slice before it).
The benefits of this solution are its clarity and simplicity.
Knowing that:

- `xs[:i]` gives all the elements of `xs` up to the index `i`
- `enumerate` yields an `index, value` pair,
The following is easier to understand than English:
def times_so_far(xs):
    return [xs[:index].count(x) for index, x in enumerate(xs)]
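A quick check against the question's example:

>>> times_so_far([1, 1, 2, 3, 1, 3])
[0, 1, 0, 0, 2, 1]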
In case you care about performance, you may prefer using a `Counter` or a `defaultdict`, as suggested by @Joe Wallis and @Ethan Bierlein.
I would use the "high-performance" (as described in the documentation) `Counter` class for this task. You can actually get a complete counter at every step, but that is of course more than you asked for.
This is my implementation:
from collections import Counter

def times_so_far(nums):
    return list(times_so_far_generator(nums))

def times_so_far_generator(nums):
    counter = Counter()
    for num in nums:
        counter[num] += 1
        # subtract one so the first occurrence yields 0, as in the question
        yield counter[num] - 1
I like this solution because:

- It is fast: generators plus `Counter` give asymptotically optimal \$O(n)\$ performance with respect to both memory and time.
- It is very readable and understandable.
- It generalizes: `yield` the counter itself instead of `counter[num] - 1` to get a full `Counter` at every position (see the sketch after this list). You can also return a tuple of the element with its count via `yield (num, counter[num] - 1)`.
- It uses the specifically designed `Counter` instead of a generic `defaultdict`, and so is more fitting to the task (and can possibly perform better, as values are integers instead of arbitrary objects).
- It works on arbitrary iterables (even infinitely long ones, if you use the generator version).
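Here is a minimal sketch of the generalization mentioned in the list above (the `counts_so_far` name is hypothetical; note the `Counter(counter)` copy, since yielding the live counter would let later increments leak into snapshots the caller already holds):

from collections import Counter

def counts_so_far(nums):
    # hypothetical generalization: yield a snapshot of the full Counter at each position
    counter = Counter()
    for num in nums:
        counter[num] += 1
        yield Counter(counter)  # copy so each snapshot stays independent

for snapshot in counts_so_far([1, 1, 2]):
    print(snapshot)
# Counter({1: 1})
# Counter({1: 2})
# Counter({1: 2, 2: 1})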
- Whoever downvoted... please explain yourselves? (mleyfman, Jan 29, 2016)
- Not the downvoter, but calling `Counter` "high-performance" may be something of a misnomer. The only high-performance aspects of it relative to an ordinary dict or a defaultdict are creating or `update`-ing it from an input iterable, and even then, only on Python 3.3 and up. Other than that, it's either equivalent in speed or slower than dict or defaultdict. (user2357112, Jan 29, 2016)
- @user2357112 I was simply referring to the way it is described in the official documentation. The collections module is described as "High-performance container datatypes". (mleyfman, Jan 29, 2016)
I would suggest a vectorized approach making heavy use of NumPy functions, which might be beneficial since we are focusing on performance here. The steps to solve our case could be put together as shown below:
# Get counts for each label in the input array
counts = np.bincount(X)
counts = counts[counts > 0]  # drop absent values so the group boundaries line up

# Set up an ID array that, when cumulatively summed, gives a ramp-like array
# (0, 1, 2, ... restarting at each group) for the sorted version of the input
id_arr = np.ones(len(X), dtype=int)
id_arr[0] = 0
id_arr[counts[:-1].cumsum()] = -counts[:-1] + 1

# Perform cumulative summation, then go back to the original order
# with a double argsort (stable sort, hence kind='mergesort')
out = id_arr.cumsum()[np.argsort(X, kind='mergesort').argsort()]
Please note that the expected input is a NumPy array of non-negative integers (`np.bincount` requires this).
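To make the ramp trick concrete, here is a trace on the question's example, with intermediate values shown as comments (this assumes the zero-count filtering step added above):

import numpy as np

X = np.array([1, 1, 2, 3, 1, 3])
counts = np.bincount(X)          # [0, 3, 1, 2]: value v occurs counts[v] times
counts = counts[counts > 0]      # [3, 1, 2]: group sizes in sorted order
id_arr = np.ones(len(X), dtype=int)
id_arr[0] = 0
id_arr[counts[:-1].cumsum()] = -counts[:-1] + 1
# id_arr is now [0, 1, 1, -2, 0, 1]; its cumsum [0, 1, 2, 0, 0, 1]
# is the "times so far" ramp for the *sorted* input [1, 1, 1, 2, 3, 3]
out = id_arr.cumsum()[np.argsort(X, kind='mergesort').argsort()]
print(out)                       # [0 1 0 0 2 1]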
Benchmarking
Let's test out all the approaches posted thus far to solve the problem. I have compiled all the solutions as functions named after their respective authors. Listed below are the compiled functions:
def times_so_far_hackerman(ls):
    out = [0]*len(ls)
    for i in xrange(len(ls)):
        out[i] = ls[:i].count(ls[i])
    return out

def times_so_far_JoeWallis(list_):
    counted = defaultdict(int)
    for v in list_:
        counted[v] += 1
        yield counted[v]-1

def times_so_far_generator(nums):
    counter = Counter()
    for num in nums:
        counter[num] += 1
        yield counter[num]-1

def times_so_far_mleyfman(nums):
    return list(times_so_far_generator(nums))

def times_so_far_Caridorc(xs):
    return [xs[:index].count(x) for index, x in enumerate(xs)]

def times_so_far_vectorized(X):
    counts = np.bincount(X)
    counts = counts[counts > 0]  # drop absent values (see the note above)
    id_arr = np.ones(len(X), dtype=int)
    id_arr[0] = 0
    id_arr[counts[:-1].cumsum()] = -counts[:-1]+1
    return id_arr.cumsum()[np.argsort(X, kind='mergesort').argsort()]
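Before timing, a quick sanity check that all the variants agree on the question's example (a sketch; `times_so_far` is the OP's original function, and `np` is NumPy imported as usual):

x = [1, 1, 2, 3, 1, 3]
expected = [0, 1, 0, 0, 2, 1]
assert times_so_far(x) == expected
assert times_so_far_hackerman(x) == expected
assert list(times_so_far_JoeWallis(x)) == expected
assert times_so_far_mleyfman(x) == expected
assert times_so_far_Caridorc(x) == expected
assert times_so_far_vectorized(np.array(x)).tolist() == expected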
Now, let's time these, starting with a relatively decent-sized list of numbers:
In [9]: %timeit times_so_far(x)
...: %timeit times_so_far_hackerman(x)
...: %timeit list(times_so_far_JoeWallis(x))
...: %timeit times_so_far_mleyfman(x)
...: %timeit times_so_far_Caridorc(x)
...: %timeit times_so_far_vectorized(np.array(x))
...:
100 loops, best of 3: 3.27 ms per loop
1 loops, best of 3: 1.3 s per loop
100 loops, best of 3: 3.03 ms per loop
100 loops, best of 3: 8.32 ms per loop
1 loops, best of 3: 1.36 s per loop
100 loops, best of 3: 2.44 ms per loop
So, the three fastest approaches are:

- `times_so_far()`
- `times_so_far_JoeWallis()`
- `times_so_far_vectorized()`
Let's use a bigger array to test out those three:
In [10]: # Input array with random elements between 0 and 500
...: x = np.random.randint(0,500,(50000)).tolist()
In [11]: %timeit times_so_far(x)
...: %timeit list(times_so_far_JoeWallis(x))
...: %timeit times_so_far_vectorized(np.array(x))
...:
100 loops, best of 3: 19.7 ms per loop
100 loops, best of 3: 17.7 ms per loop
100 loops, best of 3: 15.6 ms per loop
Now, if the input is already a NumPy array, `times_so_far_vectorized` can shave off the runtime spent converting from a list to an array. Let's see how that works out:
In [12]: %timeit times_so_far(x)
...: %timeit list(times_so_far_JoeWallis(x))
...: X = np.array(x)
...: %timeit times_so_far_vectorized(X)
...:
100 loops, best of 3: 19.5 ms per loop
100 loops, best of 3: 17.6 ms per loop
100 loops, best of 3: 11.3 ms per loop
So, there's a pretty good performance boost for `times_so_far_vectorized` over the other approaches!
You're right - there is a better way. Python `list`s have a built-in method named `count`. For example, if I wanted to count the number of times the string `"lol"` occurred in a list, I could just do this:
my_list = [1, 2, "lol", 3, 4, "lol", 5, 6, 7, 8, "lol"]
print(my_list.count("lol"))  # 3
This is also ultimately better than generating a list of counts corresponding to each value, since it's less memory-intensive and much easier to work with.
In addition, if you want to take advantage of the powerful `collections` module, you should add this to the top of your code:

from collections import Counter

Then you can use the `Counter` object like this:
>>> print(Counter([1, 2, "lol", 3, 4, "lol", 5, 6, 7, 8, "lol"]))
Counter({'lol': 3, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1})
Hope this helps! :)