The question is - Write a program to remove the duplicates in a list
Here's how I did it-
numbers = [1, 1, 1, 3, 3, 7, 7]
for i in numbers:
while numbers.count(i) > 1:
numbers.pop(numbers.index(i))
print(numbers)
Here's how the teacher in the youtube video did it-
numbers = [2, 2, 4, 4, 6, 6]
uniques = []
for number in numbers:
if number not in uniques:
uniques.append(number)
print(uniques)
I know both the programs are objectively different in terms of how they do the task and the second program will I reckon consume more memory to create the new list however given the question which is the better approach and for what reasons?
3 Answers 3
The teacher is obviously using a better way. Check/read more about time complexity for lists in python here (python wiki).
In your case, you are going for a time complexity:
- \$ O(n) \$ for iteration
- \$ O(n) \$ for
.count()
- \$ O(n) \$ for intermediate pop
- \$ O(n) \$ for
.index
for a final (worst case) time: \$ O(n^2) \$ (see explanation in comments from superb rain).
In case of the tutorial:
- \$ O(n) \$ for iteration
- \$ O(k) \$ for
not in
check (using \$ k \$ since the list is different now. - \$ O(1) \$ for append
generating a worst case performance of \$ O(n \cdot k) \$.
A more efficient solution, disregarding the order of elements would be the call:
uniques = list(set(numbers))
as suggested in comments. This would have \$ O(n) \$ time.
-
3\$\begingroup\$ Theirs is O(n^2), not just O(n^3). \$\endgroup\$superb rain– superb rain2020年11月03日 13:44:31 +00:00Commented Nov 3, 2020 at 13:44
-
1\$\begingroup\$ Improvement on
uniques = list(set(numbers))
would beuniques = list(dict.fromkeys(numbers))
; on modern Python (CPython/PyPy 3.6+, all Python 3.7+)dict
s are insertion ordered, so the unique elements will remain in order of first appearance in the input, whereset
would apply a quasi-randomized order. \$\endgroup\$ShadowRanger– ShadowRanger2020年11月03日 15:09:13 +00:00Commented Nov 3, 2020 at 15:09 -
1\$\begingroup\$ @hjpotter92 I have no idea why you think they take n^3. They don't. Only n^2. In that "worst case", you have n count, n-1 index and n-1 pop, each of which takes time O(n). So ~3n*O(n) = O(n^2). \$\endgroup\$superb rain– superb rain2020年11月03日 21:26:02 +00:00Commented Nov 3, 2020 at 21:26
-
1\$\begingroup\$ @hjpotter92 Actually that's a best case. Someone sadly deleted the explanation I posted earlier, but here's a benchmark now. \$\endgroup\$superb rain– superb rain2020年11月04日 01:44:20 +00:00Commented Nov 4, 2020 at 1:44
-
1\$\begingroup\$ @hjpotter92 The example you gave contains only duplicates of a single number. Nice way to choose an example so the numbers tell what you want them to tell! \$\endgroup\$Stef– Stef2020年11月04日 13:52:59 +00:00Commented Nov 4, 2020 at 13:52
Your teacher's approach is better than yours, the main reason is the one noted in the answer by @hjpotter92.
However, their approach can be optimized even more. Note that there is the \$O(k)\$ check from not in
. This is because when you check if an element is in
a list, it goes through the whole list and tries to find it.
A set
on the other hand stores elements via their hashes, giving you \$O(1)\$ in
performance.
If you don't care about the order of elements, you can just pass the input directly to set
:
numbers = [2, 2, 4, 4, 6, 6]
uniques = list(set(numbers))
If you do, however, care about maintaining the order of elements (and always using the position of where the element first appeared), you have to fill both a list
and a set
:
uniques = []
seen = set()
for number in numbers:
if number not in seen:
uniques.append(number)
seen.add(number)
print(uniques)
There are two caveats here, though:
- This takes additional memory, specifically \$O(k)\$, where \$k\$ is the number of unique elements.
- This only works if all elements of the input list are "hashable". This basically boils down to them being not mutable, i.e. you can't use it with a list of lists or dicts. Numbers and strings are perfectly fine, though.
Another thing you will probably learn about, if you haven't already, are functions. They allow you to encapsulate some functionality in one place, give it a name and reuse it:
def unique(x):
uniques = []
seen = set()
for number in numbers:
if number not in seen:
uniques.append(number)
seen.add(number)
return uniques
numbers = [2, 2, 4, 4, 6, 6]
print(unique (numbers))
Actually, there are two more reasons why your teacher's solution is better:
- You mutate the input list when you do
pop
. This means that you would have to make a copy if you need the original list afterwards. - Your teacher's code works as long as
numbers
is iterable, i.e. you can dofor x in numbers
. Your code relies on less universal methods likepop
,count
andindex
which are not implemented by all data structures. Being able to be iterated over is very common, though.
-
\$\begingroup\$ Thanks for the explanation \$\endgroup\$FoundABetterName– FoundABetterName2020年11月03日 07:38:55 +00:00Commented Nov 3, 2020 at 7:38
-
\$\begingroup\$ It appears the two variables
seen
anduniques
play the exact same role. You could simplify the code further:uniques = list(set(numbers))
. \$\endgroup\$Stef– Stef2020年11月04日 13:05:40 +00:00Commented Nov 4, 2020 at 13:05 -
\$\begingroup\$ @Stef They do not.
uniques
is the output, which preserves the input order because it is alist
.seen
provides fastin
comparisons, but no order information, since it is aset
. Your approach is perfectly fine if the order does not matter (I forgot to add that as a note in my answer, though!). \$\endgroup\$Graipher– Graipher2020年11月04日 13:09:51 +00:00Commented Nov 4, 2020 at 13:09 -
\$\begingroup\$ Oops, you're right. \$\endgroup\$Stef– Stef2020年11月04日 13:10:43 +00:00Commented Nov 4, 2020 at 13:10
numbers.pop(numbers.index(i))
is equivalent to numbers.remove(i)
.
Given that your example lists are sorted and that your solution would fail for example [1, 3, 1, 2, 3, 2]
(which it turns into [3, 1, 3, 2]
, not removing the duplicate 3
), I'm going to assume that the input being sorted is part of the task specification. In which case that should be taken advantage of. Neither of the two solutions does.
The teacher's can take advantage of it simply by checking if number not in uniques[-1:]
instead of if number not in uniques:
. Then it's O(n) instead of O(n^2) time.
An alternative with itertools.groupby
:
unique = [k for k, _ in groupby(numbers)]
Or in-place with just a counter of the unique values so far:
u = 0
for x in numbers:
if u == 0 or x > numbers[u - 1]:
numbers[u] = x
u += 1
del numbers[u:]
Both take O(n) time. The latter is O(1) space if the final deletion doesn't reallocate or reallocates in place.
-
1\$\begingroup\$
not in uniques [-1:]
is equivalent to!= uniques [-1]
? \$\endgroup\$hjpotter92– hjpotter922020年11月03日 16:46:28 +00:00Commented Nov 3, 2020 at 16:46 -
3\$\begingroup\$ @hjpotter92 No, the latter would crash with an IndexError, as
uniques
starts empty. \$\endgroup\$superb rain– superb rain2020年11月03日 22:48:13 +00:00Commented Nov 3, 2020 at 22:48
numbers = list(set(numbers))
\$\endgroup\$set(numbers)
will do the job \$\endgroup\$