Python quickly hash mutable object

Question 1

I have a bytearray that I need to use as a key to a dictionary. Ideally I'd like to do this without doing a copy of memory the size of the bytearray. Is there anyway to do this? Basically,

b = some bytearray
d[byte(b)] = x

Is there any faster way to do this? byte(b) is an O(len(bytearray)) operation which is undesirable.

Question 2

How would you handle hash collisions?

Question 3

The problem isn't collisions, it's what happens if you mutate the key.

Question 4

What version of Python are you using?

Question 5

Are you concerned with memory or processing time, because 2(O(len(b))) [conversion to bytes, then hash] is still O(len(b)).

Question 6

Any hash algorithm that actually does its job correctly will use O(len(b)) time. So the answer to "is there any faster way to do this" is no.

If your actual concern is memory usage, then you could, in principle, add a __hash__ method to a subclass of bytearray. But that's a pretty bad idea. Look what happens:

>>> class HashableBytearray(bytearray):
... def __hash__(self):
... return hash(str(self))
... 
>>> h = HashableBytearray('abcd')
>>> hash(h)
-2835746963027601024
>>> h[2] = 'z'
>>> hash(h)
-2835746963002600949

So the same object could hash to two different spots in the dictionary, which isn't supposed to happen. And it gets worse:

>>> d = dict()
>>> hb1 = HashableBytearray('abcd')
>>> hb2 = HashableBytearray('abcd')
>>> d[hb1] = 0
>>> d[hb2] = 1
>>> d
{bytearray(b'abcd'): 1}

Ok, so far, so good. The values are equal, so there should be only one item in the dictionary. Everything is working as expected. Now let's see what happens when we change hb1:

>>> hb1[2] = 'z'
>>> d[hb2] = 2
>>> d
{bytearray(b'abzd'): 1, bytearray(b'abcd'): 2}

See how even though hb2 didn't change at all, it created a new key-value pair in the dictionary this time?

Every time I passed a key to d, that key was equal to 'abcd'. But because the value of the first key changed after being added to the dictionary, Python couldn't tell that the value of the new key was the same as the old key had been when it was added. Now there are two key-value pairs in the dictionary, when there should be only one.

This is only one of many ways that using mutable values as keys can lead to unpredictable and very wrong behavior. Just convert the bytearray to an immutable type, or work with immutable types in the first place.

And for the inquisitive: sure, buffer caches the first hash, but that doesn't help at all. There are only two key values, so this should generate only two dict entries:

>>> a, b, c = bytearray('abcd'), bytearray('abcd'), bytearray('abzd')
>>> a_buf, b_buf, c_buf = buffer(a), buffer(b), buffer(c)
>>> d = {b_buf:1, c_buf:2}
>>> b[2] = 'z'
>>> d[a_buf] = 0

But it generates three:

>>> d
{<read-only buffer for 0x1004a2300, size -1, offset 0 at 0x100499cb0>: 1, 
 <read-only buffer for 0x1004a2420, size -1, offset 0 at 0x100499cf0>: 0, 
 <read-only buffer for 0x1004a22d0, size -1, offset 0 at 0x100499c70>: 2}

Question 7

An old-style buffer in 2.x caches the first computed hash, even if the data is modified (e.g. using a bytearray).

Question 8

@eryksun, that still leads to inconsistent results. See above.

Question 9

Which version of Python 2.x are you using? I only end up with 2 dict entries in 2.7.3, which is what I'd expect since the hash is created when the dict is created.

Question 10

In 2.7.3, the b_hash field of the PyBufferObject is initialized to -1, and gets set the first time tp_hash is called. But anyway, this was just an offhand comment. Don't waste your time running down source code, etc.

Question 11

Or you can get an orphaned item in the dict. Start with two buffers that have the same hash. Modify one underlying object such that the buffers are no longer equal (a memcmp). Use the buffers to create a dict with 2 items. If the buffers become equal again, then they both map to the same item, and the other item is inaccessible.

Question 12

If you're concerned about time, and the key that you are using is always the same object, you can use its id (location in memory) as the key in your dictionary:

b = some byte array
d[id(b)] = x

If you're concerned about memory, you can use a good cryptographic hash function over your byte array, and you'll probably never get a collision (git, for example, uses sha1, and there are some discussions out on the internet about how likely it is to get an inadvertent sha1 collision). If you're okay with that infinitesimal risk, you could:

b = some byte array
d[hashlib.sha1(b).hexdigest()] = x

That's going to be O(n) in the size of your byte array in time (each time you calculate the hash), but you'd be able to have a different byte array read in at a later time, but representing the same sequence of bytes, that would hash to the same dictionary key.

And @senderle is absolutely right; you don't want to use an object that is actually mutable, when using it by value (as opposed to an immutable function of it, like id()) as the key to a dictionary. The hash of an object used as dictionary key must not change; it violates an invariant of what the dictionary object expects out of a hash function.

Question 13

Note that if you use id values, you need to ensure that the object whose id you take lives longer than the id. i.e don't use this technique to map objects that might be garbage collected before the dictionary is. Otherwise any object allocated at the same memory address as the old object (not an unlikely scenario, since the deallocation will have opened up an convenient hole in memory for a subsequent allocation) will appear to already be in the dictionary.

Question 14

I really like the cryptographic hash suggestion for those situations (embedded systems, unusually big keys) where there is a genuine risk of running out of memory.

Question 15

I think this may be close to what you want. It's relatively quick and doesn't copy memory the size of the bytearray, however it is O(len(bytearray)) -- as I can't think of any way to avoid that and also always produce unique values.

def byte(ba):
 """ decode a bytearray as though it were a base 256 number """
 return reduce(lambda a,d: a*256 + d, ba, 0)
ba = bytearray('this is a bytearray')
d = {}
d[byte(ba)] = 42
ba[8] = 'X' # now 'this is X bytearray'
d[byte(ba)] = 17 # accesses a separate entry in dict
print d

senderle senderle 152k36 gold badges217 silver badges243 bronze badges · Accepted Answer · 2012-10-24 01:22:49Z

Any hash algorithm that actually does its job correctly will use O(len(b)) time. So the answer to "is there any faster way to do this" is no.

If your actual concern is memory usage, then you could, in principle, add a __hash__ method to a subclass of bytearray. But that's a pretty bad idea. Look what happens:

>>> class HashableBytearray(bytearray):
... def __hash__(self):
... return hash(str(self))
... 
>>> h = HashableBytearray('abcd')
>>> hash(h)
-2835746963027601024
>>> h[2] = 'z'
>>> hash(h)
-2835746963002600949

So the same object could hash to two different spots in the dictionary, which isn't supposed to happen. And it gets worse:

>>> d = dict()
>>> hb1 = HashableBytearray('abcd')
>>> hb2 = HashableBytearray('abcd')
>>> d[hb1] = 0
>>> d[hb2] = 1
>>> d
{bytearray(b'abcd'): 1}

Ok, so far, so good. The values are equal, so there should be only one item in the dictionary. Everything is working as expected. Now let's see what happens when we change hb1:

>>> hb1[2] = 'z'
>>> d[hb2] = 2
>>> d
{bytearray(b'abzd'): 1, bytearray(b'abcd'): 2}

See how even though hb2 didn't change at all, it created a new key-value pair in the dictionary this time?

Every time I passed a key to d, that key was equal to 'abcd'. But because the value of the first key changed after being added to the dictionary, Python couldn't tell that the value of the new key was the same as the old key had been when it was added. Now there are two key-value pairs in the dictionary, when there should be only one.

This is only one of many ways that using mutable values as keys can lead to unpredictable and very wrong behavior. Just convert the bytearray to an immutable type, or work with immutable types in the first place.

And for the inquisitive: sure, buffer caches the first hash, but that doesn't help at all. There are only two key values, so this should generate only two dict entries:

>>> a, b, c = bytearray('abcd'), bytearray('abcd'), bytearray('abzd')
>>> a_buf, b_buf, c_buf = buffer(a), buffer(b), buffer(c)
>>> d = {b_buf:1, c_buf:2}
>>> b[2] = 'z'
>>> d[a_buf] = 0

But it generates three:

>>> d
{<read-only buffer for 0x1004a2300, size -1, offset 0 at 0x100499cb0>: 1, 
 <read-only buffer for 0x1004a2420, size -1, offset 0 at 0x100499cf0>: 0, 
 <read-only buffer for 0x1004a22d0, size -1, offset 0 at 0x100499c70>: 2}

An old-style buffer in 2.x caches the first computed hash, even if the data is modified (e.g. using a bytearray).
@eryksun, that still leads to inconsistent results. See above.
Which version of Python 2.x are you using? I only end up with 2 dict entries in 2.7.3, which is what I'd expect since the hash is created when the dict is created.
In 2.7.3, the b_hash field of the PyBufferObject is initialized to -1, and gets set the first time tp_hash is called. But anyway, this was just an offhand comment. Don't waste your time running down source code, etc.
Or you can get an orphaned item in the dict. Start with two buffers that have the same hash. Modify one underlying object such that the buffers are no longer equal (a memcmp). Use the buffers to create a dict with 2 items. If the buffers become equal again, then they both map to the same item, and the other item is inaccessible.

CollectivesTM on Stack Overflow

Python quickly hash mutable object

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related