homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: set(range(100000)).difference(set()) is slow
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: belopolsky, flox, pitrou, pjenvey, rhettinger, spiv, stutzbach
Priority: low Keywords: patch

Created on 2010年05月11日 09:04 by spiv, last changed 2022年04月11日 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
set-difference-speedup.diff spiv, 2010年05月11日 09:04 Small patch to improve the performance of large_set.difference(small_set)
set-difference-speedup-2.diff spiv, 2010年05月11日 10:13 Corrected patch.
set-mem.py spiv, 2010年05月12日 13:15 Quick script to show memory consumption of set.difference on linux
Messages (19)
msg105489 - (view) Author: Andrew Bennetts (spiv) Date: 2010年05月11日 09:04
set.difference(s), when s is also a set, basically does::
 res = set()
 for elem in self:
 if elem not in other:
 res.add(elem)
This is wasteful when len(self) is much greater than len(other):
$ python -m timeit -s "s = set(range(100000)); sd = s.difference; empty = set()" "sd(empty)"
100 loops, best of 3: 12.8 msec per loop
$ python -m timeit -s "s = set(range(10)); sd = s.difference; empty = set()" "sd(empty)"
1000000 loops, best of 3: 1.18 usec per loop
Here's a patch that compares the lengths of self and other before that loop, and if len(self) is greater, swaps them. The new timeit results are:
$ python -m timeit -s "s = set(range(100000)); sd = s.difference; empty = set()" "sd(empty)"
1000000 loops, best of 3: 0.289 usec per loop
$ python -m timeit -s "s = set(range(10)); sd = s.difference; empty = set()" "sd(empty)"
1000000 loops, best of 3: 0.294 usec per loop
msg105490 - (view) Author: Andrew Bennetts (spiv) Date: 2010年05月11日 09:24
Oops, obvious bug in this patch. set('abc') - set('bcd') != set('bcd') - set('abc'). I'll see if I can make a more sensible improvement.
See also <http://bugs.python.org/issue8425>. Thanks dickinsm on #python-dev.
msg105494 - (view) Author: Andrew Bennetts (spiv) Date: 2010年05月11日 10:13
Ok, this time test_set* passes :)
Currently if you have large set and small set the code will do len(large) lookups in the small set. When large is >> than small, it is cheaper to copy large and do len(small) lookups in large. On my laptop a size difference of 4 times is a clear winner for copy+difference_update over the status quo, even for sets of millions of entries. For more similarly sized sets (even only factor of 2 size difference) the cost of allocating a large set that is likely to be shrunk significantly is greater than the benefit. So my patch only switches behaviour for len(x)/4 > len(y).
This patch is complementary to the patch in issue8425, I think.
msg105524 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010年05月11日 16:11
I have two problems with this proposal:
1. In constrained memory environments, creating a temporary internal copy of a large set may cause the difference operation to fail that would otherwise succeed.
2. The break-even point between extra lookups and a copy is likely to be different on different systems or even on the same system under different loads.
Programs that suffer from poor large_set.difference(small_set) performance can be rewritten as large_set_copy = large_set.copy(); large_set_copy.difference_updste(small_set) or even simply as large_set.difference_updste(small_set) if program logic allows it.
msg105585 - (view) Author: Andrew Bennetts (spiv) Date: 2010年05月12日 13:14
Regarding memory, good question... but this patch turns out to be an improvement there too.
This optimisation only applies when len(x) > len(y) * 4. So the minimum size of the result is a set with 3/4 of the elems of x (and possibly would be a full copy of x anyway).
So if you like this optimisation is simply taking advantage of the fact we're going to be copying almost all of these elements anyway. We could make it less aggressive, but large sets are tuned to be between 1/2 and 1/3 empty internally anyway, so 1/4 overhead seems reasonable.
Also, because this code immediately makes the result set be about the right size, rather than growing it one element at a time, the memory consumption is actually *better*. I'll attach a script that demonstrates this; for me it shows that large_set.difference(small_set) [where large_set has 4M elems, small_set has 100] peaks at 50MB memory consumption without my patch, but only 18MB with. (after discounting the memory required for large_set itself, etc.)
msg105586 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年05月12日 13:19
> 1. In constrained memory environments, creating a temporary internal
> copy of a large set may cause the difference operation to fail that
> would otherwise succeed.
It's a space/time tradeoff. There's nothing wrong about that.
(do note that hash tables themselves take much more space than the "equivalent" list)
> 2. The break-even point between extra lookups and a copy is likely to
> be different on different systems or even on the same system under
> different loads.
So what? It's just a matter of choosing reasonable settings. There are other optimization heuristics in the interpreter.
The optimization here looks ok to me.
msg105823 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年05月15日 20:53
The current patch gives much smaller benefits than the originally posted benchmarks, although they are still substantial:
$ ./python -m timeit -s "a = set(range(100000)); sd = a.difference; b = set(range(1000))" "sd(b)"
- before: 5.56 msec per loop
- after: 3.18 msec per loop
$ ./python -m timeit -s "a = set(range(1000000)); sd = a.difference; b = set(range(10))" "sd(b)"
- before: 67.9 msec per loop
- after: 41.8 msec per loop
msg105885 - (view) Author: Andrew Bennetts (spiv) Date: 2010年05月16日 22:16
Antoine: Thanks for the updated benchmark results! I should have done that myself earlier.
msg105909 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010年05月17日 15:09
Will look at this when I get back to the U.S.
msg113405 - (view) Author: Andrew Bennetts (spiv) Date: 2010年08月09日 11:24
On 2010年05月17日 rhettinger wrote:
> Will look at this when I get back to the U.S.
Ping! This patch (set-difference-speedup-2.diff) has been sitting around for a fair few weeks now. It's a small patch, so it should be relatively easy to review. It makes a significant improvement to speed and memory in one case (which we have encountered and worked around in bzr), and has no significant downside to any other cases.
Thanks :)
msg113418 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010年08月09日 14:48
Andrew,
This issue is somewhat similar to issue8425. I may be reading too much into the "priority" field, but it looks like Raymond would like to review #8425 first. You can help by commenting on how the two issues relate to each other. I believe the two are complementary, but I did not attempt to apply both patches.
(The patch still applies with little fuzz.)
msg113430 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010年08月09日 17:34
I'll be looking at it shortly. Py3.2 is still aways from release so there is no hurry.
msg113498 - (view) Author: Andrew Bennetts (spiv) Date: 2010年08月09日 23:40
Alexander: yes, they are complementary. My patch improves set.difference, which always creates a new set. issue8425 on the other hand improves in-place difference (via the -= operator or set.difference_update). Looking at the two patches, my patch will not improve in-place difference, and issue8425's patch will not improve set.difference. So they are complementary.
msg115295 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2010年09月01日 13:07
> I'll be looking at it shortly. Py3.2 is still aways from release so there is no hurry.
I would consider reviewing and possibly apply this change, but I don't want to invade anyone's territory.
msg115315 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年09月01日 15:19
> I would consider reviewing and possibly apply this change, but I don't 
> want to invade anyone's territory.
I don't think there would be any invasion. I think the patch is simple enough, and seems to provide a nice benefit.
msg115319 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010年09月01日 17:24
Please leave this for me.
Thank you.
msg122936 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年11月30日 21:46
Raymond, unless you object, I'd like to commit this before beta1.
msg122937 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010年11月30日 21:56
Thx.
msg122942 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010年11月30日 22:24
Modified patch committed in r86905. Thanks!
History
Date User Action Args
2022年04月11日 14:57:00adminsetgithub: 52931
2010年11月30日 22:24:53pitrousetstatus: open -> closed
resolution: accepted -> fixed
messages: + msg122942

stage: patch review -> resolved
2010年11月30日 21:56:22rhettingersetassignee: rhettinger -> pitrou
resolution: accepted
messages: + msg122937
2010年11月30日 21:54:58pjenveysetnosy: + pjenvey
2010年11月30日 21:46:59pitrousetmessages: + msg122936
2010年09月01日 21:29:40stutzbachsetnosy: + stutzbach
2010年09月01日 19:37:49exarkunsetnosy: - exarkun
2010年09月01日 17:24:06rhettingersetmessages: + msg115319
2010年09月01日 15:19:14pitrousetmessages: + msg115315
2010年09月01日 13:07:52exarkunsetnosy: + exarkun
messages: + msg115295
2010年08月09日 23:40:12spivsetmessages: + msg113498
2010年08月09日 17:34:43rhettingersetmessages: + msg113430
2010年08月09日 14:48:35belopolskysetmessages: + msg113418
stage: patch review
2010年08月09日 14:21:34floxsetnosy: + flox
2010年08月09日 11:24:40spivsetmessages: + msg113405
2010年05月17日 15:09:21rhettingersetpriority: normal -> low

messages: + msg105909
2010年05月16日 22:16:12spivsetmessages: + msg105885
2010年05月15日 20:53:36pitrousetmessages: + msg105823
versions: - Python 2.7
2010年05月12日 13:19:14pitrousetnosy: + pitrou
messages: + msg105586
2010年05月12日 13:15:39spivsetfiles: + set-mem.py
2010年05月12日 13:14:56spivsetmessages: + msg105585
2010年05月11日 16:11:01belopolskysetmessages: + msg105524
2010年05月11日 15:52:26belopolskysetnosy: + belopolsky
2010年05月11日 10:13:59spivsetfiles: + set-difference-speedup-2.diff

messages: + msg105494
2010年05月11日 09:27:06mark.dickinsonsetassignee: rhettinger

nosy: + rhettinger
versions: + Python 3.2
2010年05月11日 09:24:06spivsetmessages: + msg105490
2010年05月11日 09:04:17spivcreate

AltStyle によって変換されたページ (->オリジナル) /