homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Unified hash for numeric types.
Type: enhancement Stage: resolved
Components: Versions: Python 3.2
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: mark.dickinson Nosy List: Rhamphoryncus, casevh, eric.smith, mark.dickinson, rhettinger, skrah
Priority: normal Keywords: patch

Created on 2010年03月20日 20:34 by mark.dickinson, last changed 2022年04月11日 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
numeric_hash.patch mark.dickinson, 2010年03月20日 20:34
numeric_hash2.patch mark.dickinson, 2010年03月20日 23:01
numeric_hash3.patch mark.dickinson, 2010年03月21日 11:02
numeric_hash4.patch mark.dickinson, 2010年03月23日 14:01
numeric_hash5.patch mark.dickinson, 2010年03月27日 11:49
numeric_hash6.patch mark.dickinson, 2010年04月04日 13:32
doc_stdtypes.patch skrah, 2010年04月06日 14:20 apply after numeric_hash6.patch
numeric_hash7.patch mark.dickinson, 2010年04月13日 10:38
numeric_hash8.patch mark.dickinson, 2010年05月22日 10:14
Messages (25)
msg101395 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月20日 20:34
Here's a patch that makes hash(x) == hash(y) for any numeric types (int, float, complex, Decimal, Fraction, bool) when x and y are numerically equal.
This is a prerequisite for making all numeric types accurately comparable with each other.
msg101397 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月20日 20:54
Uploaded to Rietveld:
http://codereview.appspot.com/660042 
msg101401 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月20日 22:14
Updated patch, with a bit of cleanup and some comments describing the hashing strategy; I'll update the Rietveld issue as well.
msg101402 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月20日 22:36
Whoops; that patch included some accidental Lib/test/test_decimal changes. Here's the correct patch.
msg101403 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2010年03月20日 22:41
Why aren't you using 64-bit hashes on 64-bit architectures?
msg101404 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月20日 22:43
> Why aren't you using 64-bit hashes on 64-bit architectures?
Mostly because I haven't got around to putting that in yet. :)
Ideal would be to use _PyHASH_BITS=61 for 64-bit machines, throughout.
msg101405 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2010年03月20日 22:44
I assume you mean 63. ;)
msg101406 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月20日 22:45
No, I mean 61. 2**61 - 1 is prime; 2**63-1 is not. (So 2 bits of the hash get wasted, but that's not a big deal, especially since they're the high-end bits and Python mostly cares about the lower-order bits.)
msg101407 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月20日 23:01
Restore tests accidentally omitted from second patch.
msg101417 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月21日 11:02
Updated patch:
 - put hash parameters into pyport.h, to avoid repetition; make them
 available to Python code via a private attribute sys._hash_info.
 - use a modulus of 2**61-1 on systems where SIZEOF_LONG >= 8, and
 a modulus of 2**31 - 1 otherwise.
 - remove _invmod from fractions module. It's faster (and easier) to
 use 3-argument pow to compute inverses modulo a prime.
 - add a few more tests.
msg101581 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月23日 14:01
Another update, partly to address comments raised by Guido on Rietveld. I'll upload these changes to Rietveld later today.
 - rename sys._hash_info to sys.hash_info and make it public rather than private (it still needs docs somewhere)
 - add some explanatory comments to long_hash; remove an outdated comment
 - fix missing error check (in previous patch) in slot_tp_hash. slot_tp_hash also now always raises a TypeError if __hash__ returns a non-integer; this is a change from current behaviour, which allows small floats to be returned by __hash__, but not large floats (where large means > 2**31 or > 2**63 in absolute value, depending on the system). I'm assuming this was unintentional (the docs specify that __hash__ should return an integer).
 - simplify specification of hash function slightly: for nonnegative x it simply computes the reduction of x; previously it computed 1 + reduction of (x-1) for positive values. This extra +-1 doesn't really add anything of value, and makes it slightly more complicated and error-prone to write your own hash function.
msg101824 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年03月27日 11:49
Here's a version of the patch that adds exact comparisons between the various numeric types. The only slightly tricky comparison is the Fraction <-> Decimal one: an obvious strategy is to convert the Decimal exactly to a Fraction and then use the fraction comparison, but this is inefficient for Decimal instances with large exponent. So instead, we compare a Decimal `x` with a Fraction `n/d` by comparing `x*d` with `n` in the Decimal domain.
msg102339 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年04月04日 13:32
New patch:
 - document and test sys.hash_info
 - document numeric hash definition (in Doc/library/stdtypes.rst; I'm
 not sure whether this is the best place for it)
 - document Decimal change (Decimal instances are now comparable
 with instances of float, fraction.Fraction)
 - refresh patch to apply cleanly to current svn.
I think this is close to final form: I intend to apply this patch (or something very much like it) soon; any review would be appreciated.
msg102341 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年04月04日 13:35
I've refreshed the Rietveld patch as well:
http://codereview.appspot.com/660042 
msg102347 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年04月04日 18:57
Mark, very nice concept! - I'm just starting to review the patch, but I
think the unsigned longs in_Py_HashDouble() and long_hash() should be
uint64_t on a 64-bit OS.
For instance, now on Windows 64-bit:
>>> hash(2**61-1)
1073741823 
msg102350 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年04月04日 20:53
Actually the current long_hash() is affected as well. On Windows 64-bit:
>>> hash(2**31)
-2147483648
>>> hash(2**32)
1
msg102352 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年04月04日 21:25
Yes, hash values are C longs, regardless of platform. I think that's probably too ingrained to consider changing it (we'd have to change hashes of all the non-numeric types, too).
msg102467 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010年04月06日 14:20
I've finished reviewing the patch and I think it's quite ready to be
applied.
The documentation in stdtypes.rst says that P = 2**61-1 on 64-bit
machines. This should be changed to reflect the fact that actually
sizeof long is the deciding factor. I attach a patch that also fixes
the typo pointed out by Christophe Simonis.
msg103031 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年04月13日 10:37
Many thanks for reviewing, Stefan, and for the patch.
Here's an updated patch:
 - specify 32-bit/64-bit C long rather than 32-bit/64-bit machine
 - apply hash->hash_ fix to Python hash recipe
 - use _PyHASH_MODULUS instead of _PyHASH_MASK throughout (this
 was bugging me).
 - reorganize the stdtypes doc entry slightly
 - update against current svn, and remove outdated test_float tests
 for the values of float('inf') and float('nan')
One unresolved issue: it would probably make sense to specify (publicly) a hash algorithm for complex types, too, so that someone implementing e.g. Gaussian integers can make their hash function agree with that for the complex type where appropriate.
That hash algorithm would probably be as simple as:
 hr = hash(x.real)
 hi = hash(x.imag)
 return <some suitably bit-mixing combination of hi and hr>
where the algorithm for the combination needs to be specified explicitly, and any relevant parameters put into sys.hash_info.
(Unfortunately, -1 doesn't have square roots modulo the prime P used, else we could do the cute thing and make use of a square root of -1 modulo P.)
Another tiny detail: I'm wondering whether hash(m/P) should care about the sign of m: currently it doesn't, which means that the symmetry hash(-x) = -hash(x) *almost* always holds for a rational x, but not always. An almost-always-true symmetry seems like a recipe for hard-to-find bugs.
msg103032 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年04月13日 10:38
... and here's the actual patch... Forget my own head next. :)
msg103800 - (view) Author: Case Van Horsen (casevh) Date: 2010年04月21日 03:23
I've spent some time working through the new hash function by re-implementing it for gmpy. Very elegant design.
I like _PyHASH_MODULUS better, too.
Regarding a hash function for complex types, I think documenting the existing behavior for PyComplex is sufficient. The magic number 1000003 could be documented in hash_info as 'multiplier' and _PyHASH_MULTIPLIER. The same constant, but a different algorithm, is also used when hashing a tuple.
I think hash(m/P) should preserve sign. It just seems more symmetrical. :)
msg103843 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年04月21日 13:47
> Regarding a hash function for complex types, I think documenting the 
> existing behavior for PyComplex is sufficient. The magic number 1000003 > could be documented in hash_info as 'multiplier' and _PyHASH_MULTIPLIER. 
Seems reasonable; I'm tempted to call the constant it hash_info.imaginary and _PyHASH_IMAGINARY, though. :) 
There's also an implicit parameter in this algorithm, namely the size of a C long; I think this should go into sys.hash_info, too.
complex_hash does need fixing in one respect: it currently depends on signed overflow wrapping modulo 2**BIT_IN_LONG, but that's undefined behaviour; it should use unsigned arithmetic instead.
> I think hash(m/P) should preserve sign. It just seems more symmetrical. :)
Agreed. Along similar lines, I think I'm also going to get rid of _PyHASH_NINF and just use -PyHASH_INF instead.
msg106291 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年05月22日 10:14
Updated patch:
 - make hash(m/P) preserve sign, as discussed earlier
 - add details for computing the hash of a complex number
 - reorganize sys.hash_info
 - drop sys.hash_info.bits (the exponent of the Mersenne prime);
 it's not needed in the Python code, and it can be deduced from
 the prime itself if necessary. This also means that there's no
 public requirement that the prime be a Mersenne prime.
 - drop sys.hash_info.ninf; just use -sys.hash_info.inf instead
 - add sys.hash_info.width: the underlying width in bits for hashes
 of all descriptions; in other words, it's just the number of bits
 in a C long in the current implementation
 - add sys.hash_info.imag, the multiplier used for the imaginary
 part of a complex number
msg106333 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年05月23日 13:35
Committed the hash changes in r81486. This commit just changes the method for computing hash values; it doesn't include the changes to the decimal module that make Decimal instances comparable with Fraction instances.
msg107539 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2010年06月11日 10:51
Committed the Decimal-to-Fraction comparisons in r81893. All numeric types should now compare nicely with each other.
History
Date User Action Args
2022年04月11日 14:56:58adminsetgithub: 52435
2010年06月11日 10:51:51mark.dickinsonsetmessages: + msg107539
2010年05月23日 13:35:21mark.dickinsonsetstatus: open -> closed
resolution: accepted
messages: + msg106333

stage: commit review -> resolved
2010年05月22日 10:14:48mark.dickinsonsetfiles: + numeric_hash8.patch

messages: + msg106291
2010年04月21日 13:47:35mark.dickinsonsetmessages: + msg103843
2010年04月21日 03:23:05casevhsetnosy: + casevh
messages: + msg103800
2010年04月13日 10:41:40mark.dickinsonsetpriority: normal
2010年04月13日 10:39:02mark.dickinsonsetfiles: + numeric_hash7.patch

messages: + msg103032
2010年04月13日 10:37:47mark.dickinsonsetassignee: mark.dickinson
2010年04月13日 10:37:33mark.dickinsonsetmessages: + msg103031
2010年04月06日 14:20:45skrahsetfiles: + doc_stdtypes.patch

messages: + msg102467
2010年04月04日 21:25:31mark.dickinsonsetmessages: + msg102352
2010年04月04日 20:53:24skrahsetmessages: + msg102350
2010年04月04日 18:57:51skrahsetmessages: + msg102347
2010年04月04日 13:35:38mark.dickinsonsetmessages: + msg102341
2010年04月04日 13:32:15mark.dickinsonsetfiles: + numeric_hash6.patch

nosy: + rhettinger
messages: + msg102339

stage: commit review
2010年03月27日 11:49:27mark.dickinsonsetfiles: + numeric_hash5.patch

messages: + msg101824
2010年03月24日 13:20:19skrahsetnosy: + skrah
2010年03月23日 14:01:33mark.dickinsonsetfiles: + numeric_hash4.patch

messages: + msg101581
2010年03月21日 11:02:48mark.dickinsonsetfiles: + numeric_hash3.patch

messages: + msg101417
2010年03月20日 23:19:48eric.smithsetnosy: + eric.smith
2010年03月20日 23:01:36mark.dickinsonsetfiles: - numeric_hash2.patch
2010年03月20日 23:01:27mark.dickinsonsetfiles: + numeric_hash2.patch

messages: + msg101407
2010年03月20日 22:45:57mark.dickinsonsetmessages: + msg101406
2010年03月20日 22:44:46Rhamphoryncussetmessages: + msg101405
2010年03月20日 22:43:20mark.dickinsonsetmessages: + msg101404
2010年03月20日 22:41:27Rhamphoryncussetnosy: + Rhamphoryncus
messages: + msg101403
2010年03月20日 22:36:46mark.dickinsonsetfiles: + numeric_hash2.patch

messages: + msg101402
2010年03月20日 22:36:13mark.dickinsonsetfiles: - numeric_hash2.patch
2010年03月20日 22:14:06mark.dickinsonsetfiles: + numeric_hash2.patch

messages: + msg101401
2010年03月20日 20:54:00mark.dickinsonsetmessages: + msg101397
2010年03月20日 20:34:28mark.dickinsoncreate

AltStyle によって変換されたページ (->オリジナル) /