
Author yselivanov
Recipients Yury.Selivanov, casevh, josh.r, lemburg, mark.dickinson, pitrou, rhettinger, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Date 2016年02月05日.01:37:39
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1454636263.61.0.377841355544.issue21955@psf.upfronthosting.co.za>
In-reply-to
Content
tl;dr I'm attaching a new patch -- fastint4 -- the fastest of them all. It incorporates Serhiy's suggestion to export long/float functions and use them. I think it's reasonably complete -- please review it, and let's get it committed.
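For context, all of these patches share the same shape: in the eval loop's binary opcodes, check for exact int/float operands first and handle them via the exported (or inlined) long/float helpers, falling back to the generic number protocol otherwise. A minimal sketch of that dispatch for addition (illustrative only -- this is not the fastint4 code, and the real patch also covers single-digit ints and int/float mixes):
 #include <Python.h>

 /* Sketch of the fast-path dispatch shape (not the actual patch). */
 static PyObject *
 sketch_binary_add(PyObject *left, PyObject *right)
 {
     if (PyFloat_CheckExact(left) && PyFloat_CheckExact(right)) {
         /* Both operands are exact floats: add the raw doubles directly. */
         return PyFloat_FromDouble(PyFloat_AS_DOUBLE(left) +
                                   PyFloat_AS_DOUBLE(right));
     }
     /* Exact ints (with overflow checks) and mixed int/float operands are
        handled similarly in the real patch; here they just fall through. */
     return PyNumber_Add(left, right);
 }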
== Benchmarks ==
spectral_norm (fastint_alt) -> 1.07x faster
spectral_norm (fastintfloat) -> 1.08x faster
spectral_norm (fastint3.patch) -> 1.29x faster
spectral_norm (fastint4.patch) -> 1.16x faster
spectral_norm (fastint**.patch) -> 1.31x faster
nbody (fastint**.patch) -> 1.16x faster
Where:
- fastint3 -- my previous patch that nobody likes (it inlines a lot of logic from longobject/floatobject)
- fastint4 -- the patch I'm attaching and ideally want to commit
- fastint** -- a modification of fastint4. This one is very interesting: when I started profiling the different approaches, I found two bottlenecks that made Serhiy's and my other patches slower than fastint3 -- PyLong_AsDouble can be optimized significantly, and PyLong_FloorDiv is super inefficient.
PyLong_AsDouble can be sped up several times if we add a fast path for 1-digit longs:
 /* longobject.c: PyLong_AsDouble */
 if (PyLong_CheckExact(v) && Py_ABS(Py_SIZE(v)) <= 1) {
     /* Fast path: a single-digit int always fits in a double exactly. */
     return (double)MEDIUM_VALUE((PyLongObject *)v);
 }
PyLong_FloorDiv (which fastint4 adds) can be specialized for single-digit operands, which gives it a tremendous boost; a sketch of that specialization follows below.
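Here is a minimal sketch of what a single-digit floor-division fast path can look like (hedged: the helper name and placement are made up for this example; fastint4's actual change lives in longobject.c):
 #include <Python.h>
 #include <longintrepr.h>   /* for ob_digit access on PyLongObject */

 /* Sketch only: positive single-digit // positive single-digit reduces to
    one C integer division; everything else falls back to the general path.
    A long with Py_SIZE == 1 always has a non-zero digit, so no division
    by zero is possible here. */
 static PyObject *
 single_digit_floordiv(PyLongObject *a, PyLongObject *b)
 {
     if (Py_SIZE(a) == 1 && Py_SIZE(b) == 1) {
         return PyLong_FromLong((long)(a->ob_digit[0] / b->ob_digit[0]));
     }
     return PyNumber_FloorDivide((PyObject *)a, (PyObject *)b);
 }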
With those two optimizations, fastint4 becomes as fast as fastint3. I'll create separate issues for PyLong_AsDouble and FloorDiv.
== Micro-benchmarks ==
Floats + ints: -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
2.7: 0.42 (usec)
3.5: 0.619
fastint_alt: 0.619
fastintfloat: 0.52
fastint3: 0.289
fastint4: 0.51
fastint**: 0.314
===
Ints: -m timeit -s "x=2" "x + 10 + x * 20 - x // 3 + x* 10 + 20 -x"
2.7: 0.151 (usec)
3.5: 0.19
fastint_alt: 0.136
fastintfloat: 0.135
fastint3: 0.135
fastint4: 0.122
fastint**: 0.122
P.S. I have another variant of fastint4 that uses fast_* functions in the ceval loop instead of a big macro. Its performance is slightly worse than with the macro.
