Two ways to remove duplicates from a list

Question 1

The question is - Write a program to remove the duplicates in a list

Here's how I did it-

numbers = [1, 1, 1, 3, 3, 7, 7]
for i in numbers:
 while numbers.count(i) > 1:
 numbers.pop(numbers.index(i))
print(numbers)

Here's how the teacher in the youtube video did it-

numbers = [2, 2, 4, 4, 6, 6]
uniques = []
for number in numbers:
 if number not in uniques:
 uniques.append(number)
print(uniques)

I know both the programs are objectively different in terms of how they do the task and the second program will I reckon consume more memory to create the new list however given the question which is the better approach and for what reasons?

Question 2

Too short for an answer, so: numbers = list(set(numbers))

Question 3

If you don't care about the order then just using set(numbers) will do the job

Question 4

@FoundABetterNamd OMG! Was it Mosh Hamedani’s tutorial? I watched the same one!

Question 5

yup :) @fartgeek

Question 6

The teacher is obviously using a better way. Check/read more about time complexity for lists in python here (python wiki).

In your case, you are going for a time complexity:

\$ O(n) \$ for iteration
\$ O(n) \$ for .count()
\$ O(n) \$ for intermediate pop
\$ O(n) \$ for .index

for a final (worst case) time: \$ O(n^2) \$ (see explanation in comments from superb rain).

In case of the tutorial:

\$ O(n) \$ for iteration
\$ O(k) \$ for not in check (using \$ k \$ since the list is different now.
\$ O(1) \$ for append

generating a worst case performance of \$ O(n \cdot k) \$.

A more efficient solution, disregarding the order of elements would be the call:

uniques = list(set(numbers))

as suggested in comments. This would have \$ O(n) \$ time.

Question 7

Theirs is O(n^2), not just O(n^3).

Question 8

Improvement on uniques = list(set(numbers)) would be uniques = list(dict.fromkeys(numbers)); on modern Python (CPython/PyPy 3.6+, all Python 3.7+) dicts are insertion ordered, so the unique elements will remain in order of first appearance in the input, where set would apply a quasi-randomized order.

Question 9

@hjpotter92 I have no idea why you think they take n^3. They don't. Only n^2. In that "worst case", you have n count, n-1 index and n-1 pop, each of which takes time O(n). So ~3n*O(n) = O(n^2).

Question 10

@hjpotter92 Actually that's a best case. Someone sadly deleted the explanation I posted earlier, but here's a benchmark now.

Question 11

@hjpotter92 The example you gave contains only duplicates of a single number. Nice way to choose an example so the numbers tell what you want them to tell!

Question 12

Your teacher's approach is better than yours, the main reason is the one noted in the answer by @hjpotter92.

However, their approach can be optimized even more. Note that there is the \$O(k)\$ check from not in. This is because when you check if an element is in a list, it goes through the whole list and tries to find it.

A set on the other hand stores elements via their hashes, giving you \$O(1)\$ in performance.

If you don't care about the order of elements, you can just pass the input directly to set:

numbers = [2, 2, 4, 4, 6, 6]
uniques = list(set(numbers))

If you do, however, care about maintaining the order of elements (and always using the position of where the element first appeared), you have to fill both a list and a set:

uniques = []
seen = set()
for number in numbers:
 if number not in seen:
 uniques.append(number)
 seen.add(number)
print(uniques)

There are two caveats here, though:

This takes additional memory, specifically \$O(k)\$, where \$k\$ is the number of unique elements.
This only works if all elements of the input list are "hashable". This basically boils down to them being not mutable, i.e. you can't use it with a list of lists or dicts. Numbers and strings are perfectly fine, though.

Another thing you will probably learn about, if you haven't already, are functions. They allow you to encapsulate some functionality in one place, give it a name and reuse it:

def unique(x):
 uniques = []
 seen = set()
 for number in numbers:
 if number not in seen:
 uniques.append(number)
 seen.add(number)
 return uniques
numbers = [2, 2, 4, 4, 6, 6]
print(unique (numbers))

Actually, there are two more reasons why your teacher's solution is better:

You mutate the input list when you do pop. This means that you would have to make a copy if you need the original list afterwards.
Your teacher's code works as long as numbers is iterable, i.e. you can do for x in numbers. Your code relies on less universal methods like pop, count and index which are not implemented by all data structures. Being able to be iterated over is very common, though.

Question 13

Thanks for the explanation

Question 14

It appears the two variables seen and uniques play the exact same role. You could simplify the code further: uniques = list(set(numbers)).

Question 15

@Stef They do not. uniques is the output, which preserves the input order because it is a list. seen provides fast in comparisons, but no order information, since it is a set. Your approach is perfectly fine if the order does not matter (I forgot to add that as a note in my answer, though!).

Question 16

Oops, you're right.

Question 17

numbers.pop(numbers.index(i)) is equivalent to numbers.remove(i).

Given that your example lists are sorted and that your solution would fail for example [1, 3, 1, 2, 3, 2] (which it turns into [3, 1, 3, 2], not removing the duplicate 3), I'm going to assume that the input being sorted is part of the task specification. In which case that should be taken advantage of. Neither of the two solutions does.

The teacher's can take advantage of it simply by checking if number not in uniques[-1:] instead of if number not in uniques:. Then it's O(n) instead of O(n^2) time.

An alternative with itertools.groupby:

unique = [k for k, _ in groupby(numbers)]

Or in-place with just a counter of the unique values so far:

u = 0
for x in numbers:
 if u == 0 or x > numbers[u - 1]:
 numbers[u] = x
 u += 1
del numbers[u:]

Both take O(n) time. The latter is O(1) space if the final deletion doesn't reallocate or reallocates in place.

Question 18

not in uniques [-1:] is equivalent to != uniques [-1]?

Question 19

@hjpotter92 No, the latter would crash with an IndexError, as uniques starts empty.

hjpotter92 hjpotter92 8,9011 gold badge26 silver badges49 bronze badges · Accepted Answer · 2020-11-03 06:47:06Z

11

\$\begingroup\$

The teacher is obviously using a better way. Check/read more about time complexity for lists in python here (python wiki).

In your case, you are going for a time complexity:

\$ O(n) \$ for iteration
\$ O(n) \$ for .count()
\$ O(n) \$ for intermediate pop
\$ O(n) \$ for .index

for a final (worst case) time: \$ O(n^2) \$ (see explanation in comments from superb rain).

In case of the tutorial:

\$ O(n) \$ for iteration
\$ O(k) \$ for not in check (using \$ k \$ since the list is different now.
\$ O(1) \$ for append

generating a worst case performance of \$ O(n \cdot k) \$.

A more efficient solution, disregarding the order of elements would be the call:

uniques = list(set(numbers))

as suggested in comments. This would have \$ O(n) \$ time.

Share

edited Nov 4, 2020 at 9:08

answered Nov 3, 2020 at 6:47

hjpotter92's user avatar

hjpotter92 hjpotter92

8,9011 gold badge26 silver badges49 bronze badges

\$\endgroup\$

15

3

\$\begingroup\$ Theirs is O(n^2), not just O(n^3). \$\endgroup\$

superb rain
– superb rain

2020年11月03日 13:44:31 +00:00
Commented Nov 3, 2020 at 13:44
1

\$\begingroup\$ Improvement on uniques = list(set(numbers)) would be uniques = list(dict.fromkeys(numbers)); on modern Python (CPython/PyPy 3.6+, all Python 3.7+) dicts are insertion ordered, so the unique elements will remain in order of first appearance in the input, where set would apply a quasi-randomized order. \$\endgroup\$

ShadowRanger
– ShadowRanger

2020年11月03日 15:09:13 +00:00
Commented Nov 3, 2020 at 15:09
1

\$\begingroup\$ @hjpotter92 I have no idea why you think they take n^3. They don't. Only n^2. In that "worst case", you have n count, n-1 index and n-1 pop, each of which takes time O(n). So ~3n*O(n) = O(n^2). \$\endgroup\$

superb rain
– superb rain

2020年11月03日 21:26:02 +00:00
Commented Nov 3, 2020 at 21:26
1

\$\begingroup\$ @hjpotter92 Actually that's a best case. Someone sadly deleted the explanation I posted earlier, but here's a benchmark now. \$\endgroup\$

superb rain
– superb rain

2020年11月04日 01:44:20 +00:00
Commented Nov 4, 2020 at 1:44
1

\$\begingroup\$ @hjpotter92 The example you gave contains only duplicates of a single number. Nice way to choose an example so the numbers tell what you want them to tell! \$\endgroup\$

Stef
– Stef

2020年11月04日 13:52:59 +00:00
Commented Nov 4, 2020 at 13:52

| Show 10 more comments

Stack Exchange Network

Two ways to remove duplicates from a list

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Two ways to remove duplicates from a list

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions