
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: faster long multiplication
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: christian.heimes, fredrikj, mark.dickinson, pernici
Priority: normal Keywords: patch

Created on 2008-09-23 10:25 by pernici, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
longobject2.diff pernici, 2008-09-29 14:15
longobject_diff1 pernici, 2009-03-22 09:53 patch from issue 4258, modified
faster_long_mul.patch mark.dickinson, 2009-03-24 10:07
Messages (14)
msg73624 - Author: Pernici Mario (pernici) Date: 2008-09-23 10:25
In this patch x_mul(a, b) uses fewer bit operations for a != b:
asymptotically half as many.
On the three computers I tried, the speed-up is around 5% for size=4
and increases up to 45-60% just below the Karatsuba cutoff, then
decreases a bit after the cutoff. (On one computer the speed-up is
only 16% at KARATSUBA_CUTOFF=70, but raising the cutoff to 140,
where the current code's multiplication is also faster, gives a
45% speed-up.)
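For orientation, the baseline being improved here can be sketched in Python (an illustration only; the real x_mul is C in Objects/longobject.c, and the helper names below are invented for this sketch). Each inner step masks and shifts the accumulator to split it into a digit and a carry; the patch halves how often that split happens.

```python
# Sketch of grade-school multiplication on 15-bit digits, as in CPython's
# x_mul. Names (to_digits, x_mul_sketch, ...) are illustrative, not CPython's.
SHIFT = 15
MASK = (1 << SHIFT) - 1

def to_digits(n):
    """Little-endian base-2**15 digits of a nonnegative int."""
    digits = []
    while n:
        digits.append(n & MASK)
        n >>= SHIFT
    return digits or [0]

def from_digits(digits):
    return sum(d << (SHIFT * i) for i, d in enumerate(digits))

def x_mul_sketch(a, b):
    z = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            carry += z[i + j] + ai * bj  # fits in a C "twodigit"
            z[i + j] = carry & MASK      # the mask operation the patch halves
            carry >>= SHIFT              # and the shift operation
        z[i + len(b)] += carry
    return z

assert from_digits(x_mul_sketch(to_digits(12345), to_digits(67890))) == 12345 * 67890
```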
msg73694 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-09-24 09:19
Just to be clear: this patch halves the number of shifts and masks, 
asymptotically; it doesn't affect the number of adds and multiplies 
(except for saving a few additions to zero by setting the initial carry 
intelligently). Is that correct? It's quite surprising that the bit 
operations have such a large effect on the running time.
Perhaps this is an argument for considering changing PyLong_SHIFT to 16 
(or better, 32, since surely almost every compiler provides uint64_t or 
equivalent these days)? Though that would be quite an involved task.
While I believe the idea is sound, the patch isn't quite correct---it's 
got some assertions that won't always hold. For example, with the 
patch, in a debug build of Python, I get:
>>> a = b = 2**30-1
[36413 refs]
>>> a*b
Assertion failed: (carry <= PyLong_MASK), function x_mul, file 
Objects/longobject.c, line 2228.
Abort trap
I'm fairly sure that the assert(carry <= PyLong_MASK) doesn't matter, 
and that the assertion at the end of the main outer loop 
(assert((carry >> PyLong_SHIFT) == 0)) should still hold. But some more 
comments and analysis in the code would be good. For example, if 
carry >= PyLong_MASK then the comment that 2*MASK + 2*MASK*MASK is 
contained in a twodigit isn't so useful. How big can carry get?
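As a quick numeric check of the bound quoted from the comment (a sanity check only; whether carry itself stays below MASK is exactly the open question raised above), 2*MASK + 2*MASK*MASK does fit comfortably in an unsigned 32-bit twodigit:

```python
# Numeric check of the comment's bound, with 15-bit digits.
MASK = (1 << 15) - 1
bound = 2 * MASK + 2 * MASK * MASK  # old digit + carry + two full products
assert bound < 2 ** 32              # fits a C "twodigit" (uint32)
print(bound, bound.bit_length())
```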
msg73722 - Author: Pernici Mario (pernici) Date: 2008-09-24 15:57
Yes, I think that the speed-up is due to reducing the number of 
shifts and masks.
Changing PyLong_SHIFT to 16 would be complicated; for instance, in 
v_iadd() carry could no longer be a 16-bit digit. Writing code
specific to 64-bit machines would surely improve performance;
maybe with PyLong_SHIFT=30 only a few changes to the code would be needed?
I did not modify the case a = b.
I changed the documentation, which was wrong,
adding detailed bounds on carry
in the various steps to check that it does not overflow.
I corrected the wrong assertion (carry <= PyLong_MASK).
msg73773 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-09-25 08:44
Thanks for the updated patch! Looks good, on a quick scan.
(One comment typo I noticed: there's a line
BASE - 3 = 2*MASK - 1
presumably this should be 2*BASE - 3 on the LHS.)
Just out of interest, is it possible to go further, and combine 4
partial multiplications at once instead of 2? Or does the extra
bookkeeping involved make it not worth it?
I think it's important to make sure that any changes to longobject.c
don't slow down operations on small integers (where "small" means "less
than 2**32") noticeably.
Re: possible changes to PyLong_SHIFT
Yes, changing PyLong_SHIFT to 16 (or 32) would be complicated, and would
involve almost a complete rewrite of longobject.c, together with much
else... It wasn't really a serious suggestion, but it probably would
provide a speedup. The code in GMP gives some idea how things might work.
Changing PyLong_SHIFT to 30 doesn't seem like a totally ridiculous idea,
though. One problem is that there's no 64-bit integer type (for
twodigits) in *standard* C89; so since Python is only allowed to assume
C89 there would have to be some fallback code for those (very few,
surely) platforms that didn't have a 64-bit integer type available.
On 64-bit machines one could presumably go further, and have
PyLong_SHIFT be 60 (or 62, or 63 --- but those break the assumption
in long_pow that PyLong_SHIFT is a multiple of 5). This would
depend on the compiler providing a 128-bit type for twodigits (like
__uint128_t on gcc/x86-64). Probably not worth it, especially if it
ends up slowing down operations on everyday small integers.
Any of these changes is also going to affect a good few other parts of
the codebase (e.g. marshal, pickle?, struct?, floatobject.c, ...). It
shouldn't be difficult to find most of the files affected (just look to
see which files include longintrepr.h), but I have a suspicion there are
a couple of other places that just assume PyLong_SHIFT is 15.
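The arithmetic behind the PyLong_SHIFT=30 suggestion can be checked directly (a worked check, not code from any patch): one inner step of schoolbook multiplication accumulates at most digit + carry + product, which fits exactly within a 64-bit twodigit.

```python
# Why 30-bit digits work with a 64-bit "twodigit".
SHIFT = 30
MASK = (1 << SHIFT) - 1
worst = MASK + MASK + MASK * MASK  # equals (2**30 - 1)*(2**30 + 1) = 2**60 - 1
assert worst < 2 ** 64
assert SHIFT % 5 == 0              # keeps the long_pow assumption mentioned above
print(worst.bit_length())
```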
msg74028 - Author: Pernici Mario (pernici) Date: 2008-09-29 14:15
Mark, following your suggestions about using bigger integer types,
I added code to convert Python numbers to arrays of twodigits,
when a 64 bit integer type is supported, and for numbers with size
larger than 20; otherwise the code of the previous patch is used.
This 64 bit integer is used only inside multiplication, so no
modifications need to be made in other parts of the Python code.
Now, with numbers of 300 decimal digits or more, the speedup is
2x on a 32-bit machine and 3x on a 64-bit machine (50% and 2x
respectively for squaring).
There is a macro HAVE_INT64 to control whether a 64-bit type exists;
the preprocessor instructions should be OK with gcc, but other
compilers might have a 64-bit type without long long, so HAVE_INT64
is wrongly left undefined and the code falls back to multiplying arrays of
16-bit digits; these preprocessor instructions need to be fixed.
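The idea described above can be sketched in Python (a sketch under stated assumptions, not the patch itself; names are invented here): pack the number into 30-bit chunks, multiply in base 2**30 with 64-bit-safe accumulation, and both the digit count and the shift/mask count are halved.

```python
# Sketch: multiply in base 2**30 so every intermediate fits in 64 bits.
SHIFT2 = 30                   # two 15-bit digits packed together
MASK2 = (1 << SHIFT2) - 1

def to_chunks(n):
    """Little-endian base-2**30 chunks of a nonnegative int."""
    out = []
    while n:
        out.append(n & MASK2)
        n >>= SHIFT2
    return out or [0]

def mul_chunks(a, b):
    z = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            carry += z[i + j] + ai * bj
            assert carry < 2 ** 64   # would fit a C uint64
            z[i + j] = carry & MASK2
            carry >>= SHIFT2
        z[i + len(b)] += carry
    return sum(d << (SHIFT2 * k) for k, d in enumerate(z))

x, y = 10 ** 300 + 7, 10 ** 299 + 3
assert mul_chunks(to_chunks(x), to_chunks(y)) == x * y
```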
The speed difference for small integers is small;
here is a summary of some benchmarks on
Pentium M 1.6GHz, Athlon XP 2600+,
Athlon 64 X2 Dual Core 3800+, all with Debian;
speedup of this patch with respect to the current revision
(+ means the patch is faster):
In pybench, SimpleIntegerArithmetic: from -0.5% to +0.5%
 SimpleLongArithmetic: from -1% to +7%
pystone: from +0.5% to +1.5%
msg74029 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-09-29 14:36
Nice work! Seems like we're going to be able to look forward to faster 
integer arithmetic in Python 2.7 / 3.1. I'll try to find time to review 
your patch properly sometime soon.
Regarding the HAVE_INT64, there's a standard autoconf macro 
AC_TYPE_INT64_T (and its unsigned counterpart AC_TYPE_UINT64_T) that 
automatically sets int64_t (or uint64_t) to a 64-bit (unsigned) integer 
type whenever it exists. So don't worry about the preprocessor stuff for 
the moment, so long as it's working on your machine; we can fix it in 
Python's configure script instead sometime.
msg74035 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-09-29 17:36
Nice work :)
I'm changing the target versions to 2.7 and 3.1. The proposed changes
are too large for a maintenance release.
msg74430 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-10-07 09:15
It looks as though changing PyLong_SHIFT to 30 everywhere is much simpler 
than I feared. Here's a short patch that does exactly that. It:
 - changes the definitions in longintrepr.h
 - changes marshal.c to write digits as longs, not shorts
 - adds some casts to longobject.c (most of which should really
 have been there already---clearly Python's never encountered
 a machine where ints are only 2 bytes long, even though the
 standard seems to permit it).
With this patch, all tests pass on my machine with the exception of
the getsizeof tests in test_sys; and sys.getsizeof is working fine---it's 
the tests that need to be changed.
Still to do:
 - use uint64 and uint32 instead of unsigned long long and unsigned long,
 when available; this avoids wasting lots of space on platforms
 where a long is 64 bits.
 - provide fallback definitions for platforms that don't have any 64-bit
 type available
 - (?)expose the value of PyLong_SHIFT to Python somewhere (in sys?); then
 the getsizeof tests could use this value to determine whether a digit
 is expected to take 2 bytes or 4 (or ...)
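The last bullet did eventually happen: since Python 3.1, sys.int_info exposes the digit size chosen at build time, which is what the getsizeof tests need.

```python
import sys

# sys.int_info (Python 3.1+) reports the build-time digit configuration.
info = sys.int_info
print(info.bits_per_digit, info.sizeof_digit)
# 30-bit-digit builds typically report (30, 4); 15-bit builds report (15, 2).
```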
msg75495 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-04 15:37
I've opened a separate issue (issue 4258) for the idea of using 30-bit
long digits instead of 15-bit ones. I'll remove the patch from this 
issue.
Pernici Mario's idea applies even better to base 2**30 longs: one can
clump together 16 (!) of the multiplications at once, essentially
eliminating the overhead of shifts almost completely.
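The arithmetic behind the "clump together 16" remark (a worked check, not code from the patch): sixteen full-size 30-bit-by-30-bit products, plus a digit and a carry, still fit in an unsigned 64-bit accumulator, so the shift/mask split is needed only once per 16 products.

```python
MASK = (1 << 30) - 1               # largest 30-bit digit
acc = 16 * MASK * MASK + 2 * MASK  # 16 full products + digit + carry
assert acc < 2 ** 64
print(2 ** 64 // (MASK * MASK))    # how many full products fit: 16
```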
msg75750 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008-11-11 17:12
See issue 4258 for a patch that combines 30-bit digits with
this multiplication optimization. The code is quite different
from the patch posted here, but the basic idea is the same.
msg83964 - Author: Pernici Mario (pernici) Date: 2009-03-22 09:53
This patch comes from 30bit_longdigit13+optimizations1.patch in issue4258
with modification for the case of multiplication by 0; it passes
test_long.py and pidigits is a bit faster.
msg84069 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-24 09:36
Thanks! Unfortunately, it looks like I messed this up yesterday by 
removing mul1 (after the division patch went in, mul1 wasn't used any 
more). I'll fix this.
msg84072 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-24 10:07
Updated version of longobject_diff1:
 - add mul1 back in
 - rename MAX_PARTIALS to the more descriptive BLOCK_MUL_SIZE
 - rewrite digits_multiply so that the call to digits_multiply_add
 always has b_size=BLOCK_MUL_SIZE, then hard-code this and get
 rid of the b_size argument. This should give the compiler some
 opportunities for loop-unrolling in digits_multiply_add.
msg95792 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-11-28 16:01
I'm going to close this: it's a really nice idea, but after the 30-bit 
long digits were implemented, the speedup doesn't seem to be worth the 
extra code complication.
The only situation I've found where this optimization really does make a 
big difference is when using 60-bit digits, but allowing those in Python 
would take a bit more work (essentially because it requires using some 
inline assembler to get at the CPU widening multiply and 128-bit-by-64-bit 
division instructions).
History
Date                 User              Action  Args
2022-04-11 14:56:39  admin             set     github: 48194
2009-11-28 16:01:03  mark.dickinson    set     status: open -> closed; resolution: rejected; messages: + msg95792
2009-11-15 15:41:37  mark.dickinson    set     priority: high -> normal
2009-03-24 10:07:47  mark.dickinson    set     files: + faster_long_mul.patch; messages: + msg84072
2009-03-24 09:36:37  mark.dickinson    set     messages: + msg84069
2009-03-22 09:53:19  pernici           set     files: + longobject_diff1; messages: + msg83964
2008-11-11 17:12:03  mark.dickinson    set     messages: + msg75750
2008-11-11 02:03:32  vstinner          set     files: - longobject1.diff
2008-11-11 02:03:28  vstinner          set     files: - longobject_diff
2008-11-04 15:38:13  mark.dickinson    set     files: - 30bit.patch
2008-11-04 15:37:52  mark.dickinson    set     messages: + msg75495
2008-10-07 09:16:02  mark.dickinson    set     files: + 30bit.patch; messages: + msg74430
2008-09-29 17:36:09  christian.heimes  set     priority: high; nosy: + christian.heimes; messages: + msg74035; components: + Interpreter Core; versions: + Python 3.1, Python 2.7, - Python 3.0
2008-09-29 14:36:01  mark.dickinson    set     messages: + msg74029
2008-09-29 14:15:59  pernici           set     files: + longobject2.diff; messages: + msg74028
2008-09-26 16:11:42  fredrikj          set     nosy: + fredrikj
2008-09-25 08:44:32  mark.dickinson    set     messages: + msg73773
2008-09-24 15:57:54  pernici           set     files: + longobject1.diff; keywords: + patch; messages: + msg73722
2008-09-24 09:20:49  mark.dickinson    set     assignee: mark.dickinson
2008-09-24 09:19:19  mark.dickinson    set     nosy: + mark.dickinson; messages: + msg73694
2008-09-23 10:25:58  pernici           create
