So the problem is:
Find the duplicate entries (2nd occurrence onwards) in the given numpy array and mark them as True. First-time occurrences should be False.
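For instance (an illustration of mine, not from the original task): for the input [0, 0, 3, 0, 2, 4, 2], the expected output is [False, True, False, True, False, False, True], since the repeated 0s and the second 2 are duplicates.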
And my solution is:
    import numpy as np

    def duplicates(a):
        uniques = np.unique(a)
        result = []
        for i in a:
            if i not in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques = np.delete(uniques, np.where(uniques == i))
        return result

    np.random.seed(100)
    a = np.random.randint(0, 5, 10)
    print('Array: ', a)
    print(duplicates(a))
What improvements can I make in this program?
- Hey, thanks for your question. I answered it but I edited it a bunch of times, so be sure to check the final version of the answer now. Let me know in the comments if you have questions. – RGS, Apr 13, 2021
3 Answers
Your code is good code, I'd say. There's nothing wrong with it. The only thing is that you can write more or less the same algorithm in a much shorter way, and that essentially means you would refactor all of your code down to a single line.
So, before taking a look at the boiled-down version of what you wrote, let me just point out that variable names should be descriptive, and if you are using abbreviations or one-letter names, you should stick to the most common conventions. Why am I saying this? Well, `i` is typically the name for an unimportant index variable, but in your loop you have `for i in a`, and `i` is an element of the array... So, believe it or not, people looking at the rest of your code might be thrown off by the fact that `i` is not an index.
Intermediate step
So before jumping to the one-liner that is actually readable (i.e. it is good Python code), let's take a look at a more tractable intermediate step. Instead of computing all the unique values and then deleting from there (deleting from the middle of an array is typically slow), you could go the other way around and keep a list that saves all the unique values you have found so far:
    def duplicates(arr):
        result, uniques = [], []
        for elem in arr:
            if elem in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques.append(elem)
        return result
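As a quick sanity check (this example is mine, not from the original answer), first occurrences map to False and repeats map to True:

    print(duplicates([1, 2, 1, 3, 2]))
    # [False, False, True, False, True]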
Next intermediate step
Now take another look at what I wrote. Notice that both in the `if` and in the `else` I am appending to the `result` variable. The only thing that is really "special" is the appending to `uniques`, which only happens in the `else`, so we could write the `.append` unconditionally and append to `uniques` only if needed:
    def duplicates(arr):
        result, uniques = [], []
        for elem in arr:
            result.append(elem in uniques)
            if elem not in uniques:
                uniques.append(elem)
        return result
Final solution (after rethinking algorithm)
Let me know if you can understand my suggested solution and I'll expand on it if you need :)
    def duplicates(arr):
        return [elem in arr[:i] for i, elem in enumerate(arr)]
Notice that this list comprehension reads "return a list that says whether each element can already be found among the elements that come before it (that is, in `arr[:i]`)".
The key here is in phrasing the result in the correct way. If you reason Python code in the correct way, English and the code become one and the same.
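If the comprehension is hard to parse at first, here it is unrolled into an equivalent explicit loop (my own expansion, not part of the original answer):

    def duplicates(arr):
        result = []
        for i, elem in enumerate(arr):
            # arr[:i] holds every element strictly before position i,
            # so this is True exactly when elem has appeared earlier
            result.append(elem in arr[:i])
        return result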
When you have a relatively simple `for` loop that is building a list with successive `.append` calls, that's a sign to try and see if you can use a list comprehension.
I am also making use of the `enumerate` built-in (learn about it here) to traverse the array while looking at the values and the indices at the same time.
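In case you haven't used `enumerate` before, it pairs each element with its index; a tiny illustration of mine:

    for i, value in enumerate(['a', 'b', 'c']):
        print(i, value)  # prints 0 a, then 1 b, then 2 c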
EDIT:
As other answers have noted (and also, see comments on this answer), this one-liner runs in time complexity O(n^2), which is not the best you can do. However, that single line of Python shows you how far you can go with just a simple idea/naive algorithm, and optimised usage of the core features of Python :)
- In that case it should be False, True, True. The False value is given only to the first appearance of the value in the array. Thank you for your explanation and advice, I really appreciate being in this community :) – Levon Avetisyan, Apr 13, 2021
- @LevonAvetisyan yup, I looked more calmly at your code and understood that, of course! – RGS, Apr 13, 2021
- Better make `uniques` a `set`. – Manuel, Apr 13, 2021
- @Manuel better performance in checking if the element is already there? (Won't adding a new element be slower?) Or why would you suggest that? – RGS, Apr 13, 2021
- Yes, making the whole thing O(n) instead of O(n^2). Plus you can remove your extra check then. – Manuel, Apr 13, 2021
As mentioned in the comments of another answer, you should use `set`s for membership tests whenever possible. Combining this with building up `uniques` instead of cutting it down, we get this really simple implementation:
    def duplicates(numbers):
        uniques = set()
        result = []
        for num in numbers:
            result.append(num in uniques)
            uniques.add(num)
        return result
This implementation of `duplicates` is concise, easy to read, and runs in O(n) instead of O(n^2), which is a major difference. One-liners are a fun exercise that can teach you some interesting aspects of Python, and I understand their appeal, but they're often not the best way to go. It's not worth sacrificing readability and performance for the sake of writing fewer lines of code.
EDIT: Performance testing
As requested in the comments I tested the performance of three different approaches:
    def duplicates_set(numbers):
        uniques = set()
        result = []
        for num in numbers:
            result.append(num in uniques)
            uniques.add(num)
        return result

    def duplicates_list(numbers):
        result, uniques = [], []
        for elem in numbers:
            if elem in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques.append(elem)
        return result

    def duplicates_oneliner(numbers):
        return [elem in numbers[:i] for i, elem in enumerate(numbers)]
Note that the following approach for `duplicates_list` should be avoided in any case, since it checks `elem` for membership in `uniques` twice:
    def duplicates_list(numbers):
        result, uniques = [], []
        for elem in numbers:
            result.append(elem in uniques)
            if elem not in uniques:
                uniques.append(elem)
        return result
Here is my code for performance testing using the `timeit` module. All the provided outputs were produced by this code:

    import numpy as np
    from timeit import timeit

    # the wrapper's def line was garbled in extraction; the name
    # test_performance is a reconstruction around the surviving parameters
    def test_performance(num_range: (int, int), random_nums: int, executions: int):
        np.random.seed(100)
        a = np.random.randint(*num_range, random_nums)

        print(f"{executions} executions with {random_nums} random numbers in range {num_range}")
        print("duplicates_set:", "\t" * 2, timeit(stmt=lambda: duplicates_set(a), number=executions))
        print("duplicates_list:", "\t" * 2, timeit(stmt=lambda: duplicates_list(a), number=executions))
        print("duplicates_oneliner:\t", timeit(stmt=lambda: duplicates_oneliner(a), number=executions))
Starting out with 100_000 executions with the parameters from the original question we get this result:
100000 executions with 10 random numbers in range (0, 5)
duplicates_set: 0.22261620000000001
duplicates_list: 0.18974730000000006
duplicates_oneliner: 2.9783912
We can clearly see that the `list` implementation performs best for a small amount of numbers and a small range of random numbers. The one-liner is already at a huge disadvantage.
Upping the length of the random number array to 1_000, we get this result:
1000 executions with 1000 random numbers in range (0, 5)
duplicates_set: 0.1603748
duplicates_list: 0.13001250000000003
duplicates_oneliner: 3.1837349
It seems the length of the array alone does not really affect the relative performance of `duplicates_set` and `duplicates_list`. As long as the range of numbers is small, the `list` comes out on top. Relative performance stayed the same up until 1_000_000 elements, which makes sense since the size of `uniques` is capped by the range of numbers.
Once we start increasing `num_range`, the `set` implementation really starts to shine, since it simply does not care about the size of `uniques` when it comes to runtime:
1000 executions with 1000 random numbers in range (0, 10)
duplicates_set: 0.15972010000000003
duplicates_list: 0.15932980000000002
duplicates_oneliner: 3.1950098
For a range of `(0, 10)`, the `set` and `list` implementations perform about the same.
At a range of `(0, 1_000)` the `list` implementation is already outperformed by the one-liner:
1000 executions with 1000 random numbers in range (0, 1000)
duplicates_set: 0.15644170000000002
duplicates_list: 3.4406095
duplicates_oneliner: 3.1854200999999995
The one-liner, however, is immediately out of the competition once the input gets larger, even at small number ranges:
10 executions with 100000 random numbers in range (0, 10)
duplicates_set: 0.1576303
duplicates_list: 0.24761889999999998
duplicates_oneliner: 30.1282213
The final result is exactly what would be expected:

- `duplicates_list` really starts to struggle once the number of unique numbers goes up, since `if elem in uniques` gets increasingly expensive.
- `duplicates_oneliner` completely breaks down with long input lists, since `elem in numbers[:i]` gets expensive really fast.
- `duplicates_set` is only outperformed at a low number of unique numbers, but at scale it outperforms both other implementations by a huge margin, since `elem in uniques` can be checked in constant time.
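To see that last point in isolation, here is a small experiment of my own (not part of the original answer); membership in a list scans element by element, while membership in a set is a single hash lookup:

    from timeit import timeit

    nums = list(range(100_000))
    nums_set = set(nums)

    # worst case for the list: the element sits at the very end
    print(timeit(lambda: 99_999 in nums, number=1_000))      # scans the whole list per call
    print(timeit(lambda: 99_999 in nums_set, number=1_000))  # one hash lookup per call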
- I agree that your solution is concise and legible and I like your unconditional `uniques.add`, well played! +1 from me :) – RGS, Apr 13, 2021
- However, you do seem to imply my solution is neither readable nor performant. While it is a fact that its complexity is O(n^2) and therefore less performant, I think we can agree that single line is still very legible and showcases nicely how far you can go with a fairly naive algorithm. Finally, I wonder if the solution using `set` really is O(n) if the array has too many unique elements. I would expect either `set` to use way too much memory to be O(1) in insertion and membership checking, or to be something like O(log n) for membership checking, no? Do you have references for me to learn? :) – RGS, Apr 13, 2021
- @RGS You're right in that `set`s use more memory than lists, since hash values need to be stored in addition to the elements themselves. Memory comparison on SO. Your assumption regarding membership checking not being O(1) is incorrect though. `set`s (and `dict`s) are implemented using hash tables, providing O(1) lookups and insertions. Chapter 4 in High Performance Python: Dictionaries and Sets – riskypenguin, Apr 14, 2021
- @RGS That unconditional `add` is btw what I meant with "Plus you can remove your extra check then". After that, I suspect the one-liner and the set solution are about equally readable in general, and the one-liner more readable for people used to them :-) – Manuel, Apr 14, 2021
- Oh well, added benchmarks to my answer now (same cases as yours). – Manuel, Apr 14, 2021
Check out `unique`'s documentation again, it offers to give you indices:
    def duplicates(a):
        # indices of the first occurrence of every unique value
        uniques = np.unique(a, True)[1]
        # a == a builds an all-True boolean array of the same shape as a
        result = a == a
        # first occurrences are not duplicates
        result[uniques] = False
        return result
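For context, a small illustration of mine (not from the answer) of what `np.unique` returns when asked for indices (the positional `True` above is the `return_index` argument):

    import numpy as np

    a = np.array([2, 0, 2, 3, 0])
    values, first_idx = np.unique(a, return_index=True)
    print(values)     # [0 2 3]
    print(first_idx)  # [1 0 3], index of each value's first occurrence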
In benchmarks (reusing riskypenguin's) it's medium for the tiny original case but by far fastest for all larger cases:
100000 executions with 10 random numbers in range (0, 5)
duplicates_original 17.807 17.395 18.284 seconds
duplicates_set 0.726 0.686 0.721 seconds
duplicates_list 0.631 0.548 0.587 seconds
duplicates_oneliner 7.087 6.946 7.822 seconds
duplicates_numpy 2.017 2.177 2.155 seconds
1000 executions with 1000 random numbers in range (0, 5)
duplicates_original 6.904 6.862 6.617 seconds
duplicates_set 0.520 0.535 0.584 seconds
duplicates_list 0.385 0.363 0.341 seconds
duplicates_oneliner 7.241 7.927 7.244 seconds
duplicates_numpy 0.062 0.065 0.063 seconds
1000 executions with 1000 random numbers in range (0, 10)
duplicates_original 6.720 6.885 6.801 seconds
duplicates_set 0.656 0.542 0.576 seconds
duplicates_list 0.459 0.511 0.509 seconds
duplicates_oneliner 7.556 7.064 7.637 seconds
duplicates_numpy 0.080 0.082 0.073 seconds
1000 executions with 1000 random numbers in range (0, 1000)
duplicates_original 25.222 26.108 24.767 seconds
duplicates_set 0.682 0.757 0.759 seconds
duplicates_list 11.608 11.651 11.196 seconds
duplicates_oneliner 8.320 7.949 7.608 seconds
duplicates_numpy 0.101 0.110 0.102 seconds
10 executions with 100000 random numbers in range (0, 10)
duplicates_original 6.577 6.658 6.637 seconds
duplicates_set 0.609 0.585 0.603 seconds
duplicates_list 0.514 0.524 0.508 seconds
duplicates_oneliner 36.337 36.677 35.764 seconds
duplicates_numpy 0.082 0.083 0.083 seconds
Benchmark code:
    from timeit import timeit
    import numpy as np

    def duplicates_original(a):
        uniques = np.unique(a)
        result = []
        for i in a:
            if i not in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques = np.delete(uniques, np.where(uniques == i))
        return result

    def duplicates_set(numbers):
        uniques = set()
        result = []
        for num in numbers:
            result.append(num in uniques)
            uniques.add(num)
        return result

    def duplicates_list(numbers):
        result, uniques = [], []
        for elem in numbers:
            if elem in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques.append(elem)
        return result

    def duplicates_oneliner(numbers):
        return [elem in numbers[:i] for i, elem in enumerate(numbers)]

    def duplicates_numpy(a):
        uniques = np.unique(a, True)[1]
        result = a == a
        result[uniques] = False
        return result

    def benchmark(num_range: (int, int), random_nums: int, executions: int):
        np.random.seed(100)
        a = np.random.randint(*num_range, random_nums)
        print(f"{executions} executions with {random_nums} random numbers in range {num_range}")
        solutions = [
            duplicates_original,
            duplicates_set,
            duplicates_list,
            duplicates_oneliner,
            duplicates_numpy,
        ]

        # Correctness
        expect = list(solutions[0](a))
        for solution in solutions:
            assert list(solution(a)) == expect

        # Speed
        for solution in solutions:
            ts = ['%6.3f' % timeit(stmt=lambda: solution(a), number=executions)
                  for _ in range(3)]
            print(solution.__name__.ljust(20), *ts, 'seconds')
        print()

    benchmark((0, 5), 10, 100000)
    benchmark((0, 5), 1000, 1000)
    benchmark((0, 10), 1000, 1000)
    benchmark((0, 1000), 1000, 1000)
    benchmark((0, 10), 100000, 10)
- This is a nice solution as well! Knowing your library functions is important :) +1 – RGS, Apr 13, 2021
- Although I would suggest renaming `uniques` to something that says you have indices, and not values, so `uniques_idx` or smth like that. Also, might I suggest `_, idx = np.unique(a, True)`? I'd argue you shouldn't unpack with indices when you can avoid it. – RGS, Apr 13, 2021
- @RGS And here I was, thinking `a == a` would be what people would complain about :-D – Manuel, Apr 13, 2021
- I've been doing a lot of array-oriented programming in APL lately, so `a == a` doesn't shock me that much :p – RGS, Apr 13, 2021
- Really interesting benchmarks, `numpy` shines once again! – riskypenguin, Apr 14, 2021