So the problem is:
Find the duplicate entries (2nd occurrence onwards) in the given numpy array and mark them as True. First-time occurrences should be False.
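For instance (an illustration of mine, not from the original task): for the input [0, 0, 3, 0, 2, 4, 2], the expected output is [False, True, False, True, False, False, True], since the repeated 0s and the second 2 are duplicates.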
And my solution is:
    import numpy as np

    def duplicates(a):
        uniques = np.unique(a)
        result = []
        for i in a:
            if i not in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques = np.delete(uniques, np.where(uniques == i))
        return result

    np.random.seed(100)
    a = np.random.randint(0, 5, 10)
    print('Array: ', a)
    print(duplicates(a))
What improvements can I make in this program?
- Hey, thanks for your question. I answered it but I edited it a bunch of times, so be sure to check the final version of the answer now. Let me know in the comments if you have questions. – RGS, Apr 13, 2021
3 Answers
Your code is good code, I'd say. There's nothing wrong with it. The only thing is that you can write more or less the same algorithm in a much shorter way, and that essentially means you would refactor all of your code down to a single line.
So, before taking a look at the boiled-down version of what you wrote, let me just point out that variable names should be descriptive, and if you are using abbreviations or one-letter names, you should stick to the most common conventions. Why am I saying this? Well, `i` is typically the name for an unimportant index variable, but in your loop you have `for i in a`, and `i` is an element of the array... So, believe it or not, people looking at the rest of your code might be thrown off by the fact that `i` is not an index.
Intermediate step
So before jumping to the one-liner that is actually readable (i.e. it is good Python code), let's take a look at a more tractable intermediate step. Instead of computing all the unique values and then deleting from there (deleting from the middle of an array is typically slow), you could go the other way around and keep a list that saves all the unique values you have found so far:
    def duplicates(arr):
        result, uniques = [], []
        for elem in arr:
            if elem in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques.append(elem)
        return result
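As a quick sanity check (this example is mine, not from the original answer), first occurrences map to False and repeats map to True:

    print(duplicates([1, 2, 1, 3, 2]))
    # [False, False, True, False, True]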
Next intermediate step
Now take another look at what I wrote. Notice that both in the `if` and in the `else` I am appending to the `result` variable. The only thing that is really "special" is the appending to `uniques`, which only happens in the `else`, so we could write the `.append` unconditionally and append to `uniques` only if needed:
    def duplicates(arr):
        result, uniques = [], []
        for elem in arr:
            result.append(elem in uniques)
            if elem not in uniques:
                uniques.append(elem)
        return result
Final solution (after rethinking algorithm)
Let me know if you can understand my suggested solution and I'll expand on it if you need :)
    def duplicates(arr):
        return [elem in arr[:i] for i, elem in enumerate(arr)]
Notice that this list comprehension reads "return a list that says whether each element can already be found among the elements that come before it (that is, in `arr[:i]`)".
The key here is in phrasing the result in the correct way. If you reason Python code in the correct way, English and the code become one and the same.
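If the comprehension is hard to parse at first, here it is unrolled into an equivalent explicit loop (my own expansion, not part of the original answer):

    def duplicates(arr):
        result = []
        for i, elem in enumerate(arr):
            # arr[:i] holds every element strictly before position i,
            # so this is True exactly when elem has appeared earlier
            result.append(elem in arr[:i])
        return result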
When you have a relatively simple `for` loop that is building a list with successive `.append` calls, that's a sign to try and see if you can use a list comprehension.
I am also making use of the `enumerate` built-in (learn about it here) to traverse the array while looking at the values and the indices at the same time.
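In case you haven't used `enumerate` before, it pairs each element with its index; a tiny illustration of mine:

    for i, value in enumerate(['a', 'b', 'c']):
        print(i, value)  # prints 0 a, then 1 b, then 2 c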
EDIT:
As other answers have noted (and also, see comments on this answer), this one-liner runs in time complexity O(n^2), which is not the best you can do. However, that single line of Python shows you how far you can go with just a simple idea/naive algorithm, and optimised usage of the core features of Python :)
- In that case it should be False, True, True. The False value is given only to the first appearance of the value in the array. Thank you for your explanation and advice, I really appreciate being in this community :) – Levon Avetisyan, Apr 13, 2021
- @LevonAvetisyan yup, I looked more calmly at your code and understood that, of course! – RGS, Apr 13, 2021
- Better make `uniques` a `set`. – Manuel, Apr 13, 2021
- @Manuel better performance in checking if the element is already there? (Won't adding a new element be slower?) Or why would you suggest that? – RGS, Apr 13, 2021
- Yes, making the whole thing O(n) instead of O(n^2). Plus you can remove your extra check then. – Manuel, Apr 13, 2021
As mentioned in the comments of another answer, you should use `set`s for membership tests whenever possible. Combining this with building up `uniques` instead of cutting it down, we get this really simple implementation:
    def duplicates(numbers):
        uniques = set()
        result = []
        for num in numbers:
            result.append(num in uniques)
            uniques.add(num)
        return result
This implementation of `duplicates` is concise, easy to read, and runs in O(n) instead of O(n^2), which is a major difference. One-liners are a fun exercise that can teach you some interesting aspects of Python, and I understand their appeal, but they're often not the best way to go. It's not worth sacrificing readability and performance for the sake of writing fewer lines of code.
EDIT: Performance testing
As requested in the comments I tested the performance of three different approaches:
    def duplicates_set(numbers):
        uniques = set()
        result = []
        for num in numbers:
            result.append(num in uniques)
            uniques.add(num)
        return result

    def duplicates_list(numbers):
        result, uniques = [], []
        for elem in numbers:
            if elem in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques.append(elem)
        return result

    def duplicates_oneliner(numbers):
        return [elem in numbers[:i] for i, elem in enumerate(numbers)]
Note that the following approach for `duplicates_list` should be avoided in any case, since it checks `elem` for membership in `uniques` twice:
    def duplicates_list(numbers):
        result, uniques = [], []
        for elem in numbers:
            result.append(elem in uniques)
            if elem not in uniques:
                uniques.append(elem)
        return result
Here is my code for performance testing using the `timeit` module. All the provided outputs were produced by this code:

    import numpy as np
    from timeit import timeit

    # the wrapper's def line was garbled in extraction; the name
    # test_performance is a reconstruction around the surviving parameters
    def test_performance(num_range: (int, int), random_nums: int, executions: int):
        np.random.seed(100)
        a = np.random.randint(*num_range, random_nums)

        print(f"{executions} executions with {random_nums} random numbers in range {num_range}")
        print("duplicates_set:", "\t" * 2, timeit(stmt=lambda: duplicates_set(a), number=executions))
        print("duplicates_list:", "\t" * 2, timeit(stmt=lambda: duplicates_list(a), number=executions))
        print("duplicates_oneliner:\t", timeit(stmt=lambda: duplicates_oneliner(a), number=executions))
Starting out with 100_000 executions with the parameters from the original question we get this result:
100000 executions with 10 random numbers in range (0, 5)
duplicates_set: 0.22261620000000001
duplicates_list: 0.18974730000000006
duplicates_oneliner: 2.9783912
We can clearly see that the `list` implementation performs best for a small amount of numbers and a small range of random numbers. The one-liner is already at a huge disadvantage.
Upping the length of the random number array to 1_000, we get this result:
1000 executions with 1000 random numbers in range (0, 5)
duplicates_set: 0.1603748
duplicates_list: 0.13001250000000003
duplicates_oneliner: 3.1837349
It seems the length of the array alone does not really affect the relative performance of `duplicates_set` and `duplicates_list`. As long as the range of numbers is small, the `list` comes out on top. Relative performance stayed the same up until 1_000_000 elements, which makes sense since the size of `uniques` is capped by the range of numbers.
Once we start increasing `num_range`, the `set` implementation really starts to shine, since it simply does not care about the size of `uniques` when it comes to runtime:
1000 executions with 1000 random numbers in range (0, 10)
duplicates_set: 0.15972010000000003
duplicates_list: 0.15932980000000002
duplicates_oneliner: 3.1950098
For a range of `(0, 10)`, the `set` and `list` implementations perform about the same.
At a range of `(0, 1_000)` the `list` implementation is already outperformed by the one-liner:
1000 executions with 1000 random numbers in range (0, 1000)
duplicates_set: 0.15644170000000002
duplicates_list: 3.4406095
duplicates_oneliner: 3.1854200999999995
The one-liner, however, is immediately out of the competition once the input gets larger, even at small number ranges:
10 executions with 100000 random numbers in range (0, 10)
duplicates_set: 0.1576303
duplicates_list: 0.24761889999999998
duplicates_oneliner: 30.1282213
The final result is exactly what would be expected:

- `duplicates_list` really starts to struggle once the number of unique numbers goes up, since `if elem in uniques` gets increasingly expensive.
- `duplicates_oneliner` completely breaks down with long input lists, since `elem in numbers[:i]` gets expensive really fast.
- `duplicates_set` is only outperformed at a low number of unique numbers, but at scale it outperforms both other implementations by a huge margin, since `elem in uniques` can be checked in constant time.
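To see that last point in isolation, here is a small experiment of my own (not part of the original answer); membership in a list scans element by element, while membership in a set is a single hash lookup:

    from timeit import timeit

    nums = list(range(100_000))
    nums_set = set(nums)

    # worst case for the list: the element sits at the very end
    print(timeit(lambda: 99_999 in nums, number=1_000))      # scans the whole list per call
    print(timeit(lambda: 99_999 in nums_set, number=1_000))  # one hash lookup per call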
- I agree that your solution is concise and legible and I like your unconditional `uniques.add`, well played! +1 from me :) – RGS, Apr 13, 2021
- However, you do seem to imply my solution is neither readable nor performant. While it is a fact that its complexity is O(n^2) and therefore less performant, I think we can agree that single line is still very legible and showcases nicely how far you can go with a fairly naive algorithm. Finally, I wonder if the solution using `set` really is O(n) if the array has too many unique elements. I would expect either `set` to use way too much memory to be O(1) in insertion and membership checking, or to be something like O(log n) for membership checking, no? Do you have references for me to learn? :) – RGS, Apr 13, 2021
- @RGS You're right in that `set`s use more memory than lists, since hash values need to be stored in addition to the elements themselves. Memory comparison on SO. Your assumption regarding membership checking not being O(1) is incorrect though. `set`s (and `dict`s) are implemented using hash tables, providing O(1) lookups and insertions. Chapter 4 in High Performance Python: Dictionaries and Sets – riskypenguin, Apr 14, 2021
- @RGS That unconditional `add` is btw what I meant with "Plus you can remove your extra check then". After that, I suspect the one-liner and the set solution are about equally readable in general, and the one-liner more readable for people used to them :-) – Manuel, Apr 14, 2021
- Oh well, added benchmarks to my answer now (same cases as yours). – Manuel, Apr 14, 2021
Check out `unique`'s documentation again, it offers to give you indices:
    def duplicates(a):
        # indices of the first occurrence of every unique value
        uniques = np.unique(a, True)[1]
        # a == a builds an all-True boolean array of the same shape as a
        result = a == a
        # first occurrences are not duplicates
        result[uniques] = False
        return result
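For context, a small illustration of mine (not from the answer) of what `np.unique` returns when asked for indices (the positional `True` above is the `return_index` argument):

    import numpy as np

    a = np.array([2, 0, 2, 3, 0])
    values, first_idx = np.unique(a, return_index=True)
    print(values)     # [0 2 3]
    print(first_idx)  # [1 0 3], index of each value's first occurrence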
In benchmarks (reusing riskypenguin's) it's medium for the tiny original case but by far fastest for all larger cases:
100000 executions with 10 random numbers in range (0, 5)
duplicates_original 17.807 17.395 18.284 seconds
duplicates_set 0.726 0.686 0.721 seconds
duplicates_list 0.631 0.548 0.587 seconds
duplicates_oneliner 7.087 6.946 7.822 seconds
duplicates_numpy 2.017 2.177 2.155 seconds
1000 executions with 1000 random numbers in range (0, 5)
duplicates_original 6.904 6.862 6.617 seconds
duplicates_set 0.520 0.535 0.584 seconds
duplicates_list 0.385 0.363 0.341 seconds
duplicates_oneliner 7.241 7.927 7.244 seconds
duplicates_numpy 0.062 0.065 0.063 seconds
1000 executions with 1000 random numbers in range (0, 10)
duplicates_original 6.720 6.885 6.801 seconds
duplicates_set 0.656 0.542 0.576 seconds
duplicates_list 0.459 0.511 0.509 seconds
duplicates_oneliner 7.556 7.064 7.637 seconds
duplicates_numpy 0.080 0.082 0.073 seconds
1000 executions with 1000 random numbers in range (0, 1000)
duplicates_original 25.222 26.108 24.767 seconds
duplicates_set 0.682 0.757 0.759 seconds
duplicates_list 11.608 11.651 11.196 seconds
duplicates_oneliner 8.320 7.949 7.608 seconds
duplicates_numpy 0.101 0.110 0.102 seconds
10 executions with 100000 random numbers in range (0, 10)
duplicates_original 6.577 6.658 6.637 seconds
duplicates_set 0.609 0.585 0.603 seconds
duplicates_list 0.514 0.524 0.508 seconds
duplicates_oneliner 36.337 36.677 35.764 seconds
duplicates_numpy 0.082 0.083 0.083 seconds
Benchmark code:
    from timeit import timeit
    import numpy as np

    def duplicates_original(a):
        uniques = np.unique(a)
        result = []
        for i in a:
            if i not in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques = np.delete(uniques, np.where(uniques == i))
        return result

    def duplicates_set(numbers):
        uniques = set()
        result = []
        for num in numbers:
            result.append(num in uniques)
            uniques.add(num)
        return result

    def duplicates_list(numbers):
        result, uniques = [], []
        for elem in numbers:
            if elem in uniques:
                result.append(True)
            else:
                result.append(False)
                uniques.append(elem)
        return result

    def duplicates_oneliner(numbers):
        return [elem in numbers[:i] for i, elem in enumerate(numbers)]

    def duplicates_numpy(a):
        uniques = np.unique(a, True)[1]
        result = a == a
        result[uniques] = False
        return result

    def benchmark(num_range: (int, int), random_nums: int, executions: int):
        np.random.seed(100)
        a = np.random.randint(*num_range, random_nums)
        print(f"{executions} executions with {random_nums} random numbers in range {num_range}")
        solutions = [
            duplicates_original,
            duplicates_set,
            duplicates_list,
            duplicates_oneliner,
            duplicates_numpy,
        ]

        # Correctness
        expect = list(solutions[0](a))
        for solution in solutions:
            assert list(solution(a)) == expect

        # Speed
        for solution in solutions:
            ts = ['%6.3f' % timeit(stmt=lambda: solution(a), number=executions)
                  for _ in range(3)]
            print(solution.__name__.ljust(20), *ts, 'seconds')
        print()

    benchmark((0, 5), 10, 100000)
    benchmark((0, 5), 1000, 1000)
    benchmark((0, 10), 1000, 1000)
    benchmark((0, 1000), 1000, 1000)
    benchmark((0, 10), 100000, 10)
- This is a nice solution as well! Knowing your library functions is important :) +1 – RGS, Apr 13, 2021
- Although I would suggest renaming `uniques` to something that says you have indices, and not values, so `uniques_idx` or smth like that. Also, might I suggest `_, idx = np.unique(a, True)`? I'd argue you shouldn't unpack with indices when you can avoid it. – RGS, Apr 13, 2021
- @RGS And here I was, thinking `a == a` would be what people would complain about :-D – Manuel, Apr 13, 2021
- I've been doing a lot of array-oriented programming in APL lately, so `a == a` doesn't shock me that much :p – RGS, Apr 13, 2021
- Really interesting benchmarks, `numpy` shines once again! – riskypenguin, Apr 14, 2021