Issue 2013: Long object free list optimization

➜

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46297

classification

Title:	Long object free list optimization
Type:	performance	Stage:
Components:	Interpreter Core	Versions:	Python 3.0

process

Dependencies:	Superseder:
Status:	closed	Resolution:	rejected
Assigned To:	gregory.p.smith	Nosy List:	christian.heimes, gregory.p.smith, pitrou
Priority:	normal	Keywords:	patch

Created on 2008年02月05日 03:07 by christian.heimes, last changed 2022年04月11日 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
long_freelist.patch	christian.heimes, 2008年02月05日 03:07
py3k_longfreelist.patch	christian.heimes, 2008年02月06日 17:09
py3k_longfreelist2.patch	pitrou, 2008年02月12日 22:19
py3k_longfreelist2-gps01.patch	gregory.p.smith, 2008年03月22日 19:54	improvement over longfreelist2
py3k_longfreelist2-gps02.patch	gregory.p.smith, 2008年03月23日 01:24
issue2013-benchmarks-gps02.txt	gregory.p.smith, 2008年03月23日 02:04	benchmarks of the -gps02 patch
py3k_longfreelist2_gps03.patch	gregory.p.smith, 2008年08月18日 04:56	updated, now works in debug mode.
py3k_longfreelist2_gps04.patch	gregory.p.smith, 2008年08月18日 04:56	resimplified for only 1 digit longs

Messages (13)
msg62060 - (view)	Author: Christian Heimes (christian.heimes) * (Python committer)	Date: 2008年02月05日 03:07
The patch adds a free list of long objects with size 1 or -1. It has quite a tremendous effect on code like list(range(1000)). $ ./python -m timeit "for i in range(100): list(range(1000))" Without patch: 10 loops, best of 3: 79 msec per loop With patch: 10 loops, best of 3: 20.8 msec per loop
msg62106 - (view)	Author: Christian Heimes (christian.heimes) * (Python committer)	Date: 2008年02月06日 17:09
The updated patch limits the free list.
msg62338 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2008年02月12日 22:19
I don't get the same impressive speedup as you do, although it's still significant. $ ./python -m timeit "for i in range(100): list(range(1000))" Without patch: 100 loops, best of 3: 5.05 msec per loop With patch: 100 loops, best of 3: 3.57 msec per loop Also, your patch is leaky. I'm attaching a fixed version (for py3k, didn't check the trunk version).
msg62339 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2008年02月12日 22:32
Christian, maybe you did your measurements with the DEBUG flag turned on? That would explain the discrepancy. Also, I'm not sure the patch is useful for 2.x since long objects with size -1 or 1 should be represented as ints instead :-)
msg62342 - (view)	Author: Christian Heimes (christian.heimes) * (Python committer)	Date: 2008年02月12日 22:48
Antoine Pitrou wrote: > Christian, maybe you did your measurements with the DEBUG flag turned > on? That would explain the discrepancy. > > Also, I'm not sure the patch is useful for 2.x since long objects with > size -1 or 1 should be represented as ints instead :-) Yes, I've used a Py_Debug build to measure the speed difference. You are right. The patch makes no sense for the 2.x series. Christian
msg64331 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2008年03月22日 16:45
Here are some updated timings against the current py3k branch: $ ./python -m timeit "sum(range(10000))" Without patch: 1000 loops, best of 3: 675 usec per loop With patch: 1000 loops, best of 3: 462 usec per loop $ ./python -m timeit -s "n=1000000" "sum(range(n, n+10000))" Without patch: 1000 loops, best of 3: 1.36 msec per loop With patch: 1000 loops, best of 3: 1.38 msec per loop $ ./python -m timeit "min(range(255))" Without patch: 10000 loops, best of 3: 18.7 usec per loop With patch: 10000 loops, best of 3: 19.4 usec per loop As you can see the patch makes things quite a bit better for 1-digit long objects, and there is only a small slowdown for longer or tiny ints. Given that 1-digit long objects should be prevalent in most code I think it's probably a winner overall.
msg64337 - (view)	Author: Gregory P. Smith (gregory.p.smith) * (Python committer)	Date: 2008年03月22日 19:54
I'm attaching a slightly improved version of the longfreelist2 patch. It moves the important test to be first and removes the unneeded ? : conditional. ........................ Here are some more benchmarks: ........................ --- issue2013-without-patch.txt 2008年03月22日 12:18:37.000000000 -0700 +++ issue2013-longfreelist2-gps01-size-first.txt 2008年03月22日 12:34:44.000000000 -0700 @@ -1,43 +1,43 @@ -3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) +3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) [GCC 4.0.1 (Apple Computer, Inc. build 5367)] ./python.exe -m timeit min(range(255)) -100000 loops, best of 3: 17.2 usec per loop +100000 loops, best of 3: 15.8 usec per loop ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000)) -1000 loops, best of 3: 1.21 msec per loop +1000 loops, best of 3: 1.22 msec per loop ./python.exe -m timeit sum(range(-32768,32768)) -100 loops, best of 3: 3.98 msec per loop +100 loops, best of 3: 2.63 msec per loop ./python.exe -m timeit sum(range(-1000000,1000000)) -10 loops, best of 3: 296 msec per loop +10 loops, best of 3: 273 msec per loop ./python.exe -m timeit sum(range(4000)) -1000 loops, best of 3: 239 usec per loop +10000 loops, best of 3: 151 usec per loop ./python.exe -m timeit sum(list(range(4000))) -1000 loops, best of 3: 360 usec per loop +1000 loops, best of 3: 274 usec per loop ./python.exe -m timeit sum(range(10000)) -1000 loops, best of 3: 615 usec per loop +1000 loops, best of 3: 382 usec per loop ./python.exe -m timeit sum(list(range(10000))) -1000 loops, best of 3: 887 usec per loop +1000 loops, best of 3: 802 usec per loop ./python.exe -m timeit sum(range(40000)) -100 loops, best of 3: 2.55 msec per loop +100 loops, best of 3: 1.79 msec per loop ./python.exe -m timeit sum(range(80000)) -100 loops, best of 3: 6.4 msec per loop +100 loops, best of 3: 5.77 msec per loop ........................ And a benchmarks of the longfreelist2 patch before my improvement: ........................ --- issue2013-without-patch.txt 2008年03月22日 12:18:37.000000000 -0700 +++ issue2013-longfreelist2.txt 2008年03月22日 12:19:46.000000000 -0700 @@ -1,43 +1,43 @@ -3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) +3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) [GCC 4.0.1 (Apple Computer, Inc. build 5367)] ./python.exe -m timeit min(range(255)) -100000 loops, best of 3: 17.2 usec per loop +100000 loops, best of 3: 16.1 usec per loop ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000)) -1000 loops, best of 3: 1.21 msec per loop +1000 loops, best of 3: 1.25 msec per loop ./python.exe -m timeit sum(range(-32768,32768)) -100 loops, best of 3: 3.98 msec per loop +100 loops, best of 3: 2.95 msec per loop ./python.exe -m timeit sum(range(-1000000,1000000)) -10 loops, best of 3: 296 msec per loop +10 loops, best of 3: 283 msec per loop ./python.exe -m timeit sum(range(4000)) -1000 loops, best of 3: 239 usec per loop +1000 loops, best of 3: 177 usec per loop ./python.exe -m timeit sum(list(range(4000))) -1000 loops, best of 3: 360 usec per loop +1000 loops, best of 3: 276 usec per loop ./python.exe -m timeit sum(range(10000)) -1000 loops, best of 3: 615 usec per loop +1000 loops, best of 3: 453 usec per loop ./python.exe -m timeit sum(list(range(10000))) -1000 loops, best of 3: 887 usec per loop +1000 loops, best of 3: 792 usec per loop ./python.exe -m timeit sum(range(40000)) -100 loops, best of 3: 2.55 msec per loop +100 loops, best of 3: 2.03 msec per loop ./python.exe -m timeit sum(range(80000)) -100 loops, best of 3: 6.4 msec per loop +100 loops, best of 3: 6.01 msec per loop ........................ And diffs of longfreelist2 vs longfreelist2-gps01: ........................ --- issue2013-longfreelist2.txt 2008年03月22日 12:19:46.000000000 -0700 +++ issue2013-longfreelist2-gps01-size-first.txt 2008年03月22日 12:34:44.000000000 -0700 @@ -1,43 +1,43 @@ -3.0a3+ (py3k:61746M, Mar 22 2008, 12:18:57) +3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) [GCC 4.0.1 (Apple Computer, Inc. build 5367)] ./python.exe -m timeit min(range(255)) -100000 loops, best of 3: 16.1 usec per loop +100000 loops, best of 3: 15.8 usec per loop ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000)) -1000 loops, best of 3: 1.25 msec per loop +1000 loops, best of 3: 1.22 msec per loop ./python.exe -m timeit sum(range(-32768,32768)) -100 loops, best of 3: 2.95 msec per loop +100 loops, best of 3: 2.63 msec per loop ./python.exe -m timeit sum(range(-1000000,1000000)) -10 loops, best of 3: 283 msec per loop +10 loops, best of 3: 273 msec per loop ./python.exe -m timeit sum(range(4000)) -1000 loops, best of 3: 177 usec per loop +10000 loops, best of 3: 151 usec per loop ./python.exe -m timeit sum(list(range(4000))) -1000 loops, best of 3: 276 usec per loop +1000 loops, best of 3: 274 usec per loop ./python.exe -m timeit sum(range(10000)) -1000 loops, best of 3: 453 usec per loop +1000 loops, best of 3: 382 usec per loop ./python.exe -m timeit sum(list(range(10000))) -1000 loops, best of 3: 792 usec per loop +1000 loops, best of 3: 802 usec per loop ./python.exe -m timeit sum(range(40000)) -100 loops, best of 3: 2.03 msec per loop +100 loops, best of 3: 1.79 msec per loop ./python.exe -m timeit sum(range(80000)) -100 loops, best of 3: 6.01 msec per loop +100 loops, best of 3: 5.77 msec per loop
msg64351 - (view)	Author: Gregory P. Smith (gregory.p.smith) * (Python committer)	Date: 2008年03月23日 01:24
Looking at how much memory was actually used by PyLongObjects the way our code allocates them... we can use this freelist for 2 digit PyLongs on 32-bit systems and for 4 digit PyLongs on 64-bit systems. Doing this speeds things up even more as numbers > 32767 are still quite common. why? sizeof(PyVarObject) == 12, sizeof(digit) == 2. Objects/obmalloc.c allocates on 8, 16 and 32 byte boundaries. Theres no such thing as a 14 byte allocation, its 16 byte aligned. The same applies for 64-bit platforms where the sizeof(PyVarObject) == 24, our allocation size is 32 bytes there. patch attached as -gps02.
msg64352 - (view)	Author: Gregory P. Smith (gregory.p.smith) * (Python committer)	Date: 2008年03月23日 02:04
And the benchmarks of gps02... see the attached file. At the end I had it run pybench rather than these microbenchmarks that obviously show the speedup, the -gps02 patch ran 1.3% faster best case a 0.5% faster on average (32-bit; my mac's toolchain can't build enough of py3k 64-bit to run pybench). Things left for someone to do? Determine what a good size for the PyLong free list is. 4096 was a magic number chosen by tiran.
msg64373 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2008年03月23日 19:05
The problem with choosing a sensible freelist size is that we don't have any reference workloads. However, I just tested with 10000 and it doesn't seem to slow anything down anyway. It doesn't make our microbenchmarks I thought the patch to compact freelists at each full gc collection had been committed, but it doesn't seem there. Perhaps it will change matters quite a bit. On the one hand, it will allow for bigger freelists with less worries of degrading memory footprint (but still, potential cache pollution). On the other hand, the bigger the freelists, the more expensive it is to deallocate them.
msg69018 - (view)	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2008年06月30日 21:20
Gregory, your patch probably deserves checking in, it doesn't seem to me there is any concern preventing you to do that.
msg71306 - (view)	Author: Gregory P. Smith (gregory.p.smith) * (Python committer)	Date: 2008年08月18日 01:15
preventing this right now is that when i apply this to py3k today, it fails miserably in a debug build. first, my patch has an invalid assert(size > 0) in it in _PyLong_New as 0 size is used when the value is 0. get rid of that line. then things at least run but you'll end up in an infinite loop when the interpreter exits at best if you've compiled in debug mode. things work great in a non-pydebug build. i believe the reason is this change is not properly looking at the structure allocation sizes. debug builds add extra structure fields. i'm investigating. the free_list code in floatobject.c does not have this problem so at least there's a good example to go from. and yet another reason why a more general free list library for various internals to use would be useful...
msg71314 - (view)	Author: Gregory P. Smith (gregory.p.smith) * (Python committer)	Date: 2008年08月18日 04:56
attached is an updated patch that also works in debug mode. to do this i had to exclude all zero length (value = 0) PyLongObjects from the free list. i'm still not sure why, they all have the same underlying allocation size. simplifying it to only freelist one digit longs the only useful differences are in the microbenchmark of -32767..32768. otherwise things are slightly slower. pybench is identical and pystone drops a bit. i'm closing this as not worth it as things are written. if someone wants to pick up the idea and play with freelists go ahead but this doesn't need to be an open tracker issue. there is room for improvement but these patches aren't it.

History
Date	User	Action	Args
2022年04月11日 14:56:30	admin	set	github: 46297
2008年08月18日 04:56:57	gregory.p.smith	set	files: + py3k_longfreelist2_gps04.patch
2008年08月18日 04:56:17	gregory.p.smith	set	status: open -> closed resolution: rejected messages: + msg71314 files: + py3k_longfreelist2_gps03.patch
2008年08月18日 01:15:45	gregory.p.smith	set	assignee: gregory.p.smith messages: + msg71306
2008年06月30日 21:20:46	pitrou	set	messages: + msg69018
2008年03月23日 19:05:33	pitrou	set	messages: + msg64373
2008年03月23日 02:04:46	gregory.p.smith	set	files: + issue2013-benchmarks-gps02.txt messages: + msg64352
2008年03月23日 01:24:20	gregory.p.smith	set	files: + py3k_longfreelist2-gps02.patch messages: + msg64351
2008年03月22日 19:54:14	gregory.p.smith	set	files: + py3k_longfreelist2-gps01.patch type: enhancement -> performance messages: + msg64337 nosy: + gregory.p.smith versions: - Python 2.6
2008年03月22日 16:45:03	pitrou	set	messages: + msg64331
2008年02月12日 22:48:19	christian.heimes	set	messages: + msg62342
2008年02月12日 22:32:17	pitrou	set	messages: + msg62339
2008年02月12日 22:19:05	pitrou	set	files: + py3k_longfreelist2.patch nosy: + pitrou messages: + msg62338
2008年02月06日 17:09:57	christian.heimes	set	files: + py3k_longfreelist.patch type: enhancement messages: + msg62106
2008年02月05日 03:07:46	christian.heimes	create

homepage