I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.
The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value.
What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop
In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop
In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop
In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop
In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indices), arr.tostring offers the best performance.
While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
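For reference, a minimal sketch of the caching pattern in question, using the array's raw bytes as the dict key (the compute() function and the cache dict are illustrative names; tobytes() is the modern spelling of tostring()):
import numpy as np

cache = {}

def compute(arr):
    # Stand-in for the expensive function whose result is being cached.
    return arr.sum()

def cached_compute(arr):
    # The raw buffer contents serve as the key: value equality matters,
    # object identity does not.
    key = arr.tobytes()
    if key not in cache:
        cache[key] = compute(arr)
    return cache[key]

y = np.array([3, 1, 4, 1, 5])
print(cached_compute(y))         # computed
print(cached_compute(y.copy()))  # same values -> same key -> served from cache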
6 Answers
You can simply hash the underlying buffer, if you make it read-only:
>>> a = random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop
For very large arrays, hash(str(a)) is a lot faster, but then it only takes a small part of the array into account.
>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
- In Python 3.4 I found I had to use hash(a.data.tobytes()). – ariddell, Oct 16, 2014
- Sorry for coming to this kind of late, but using hash(a.data.tobytes()) as @ariddell suggested means I don't have to set a.flags.writeable = False. Any reason for this and any potential problems in doing so? – SCB, Sep 30, 2015
- Attention: the hash calculated with this method changes between processes, i.e. hash(np.array([0,1]).data.tobytes()) has a different value in a different Python process. It is not a deterministic value calculated from the array content. To get deterministic behavior you need to set the PYTHONHASHSEED environment variable. More info: stackoverflow.com/questions/27522626/… – Harald Thomson, Apr 16, 2018
- Using hash(str(a)) is dangerous and leads to wrong results for long arrays, as only the shortened string representation of the array, namely '[63 30 33 ..., 96 25 60]', is hashed (this is probably also the reason why this method is faster?). In particular, all arrays a having the same starting and ending values end up with the same hash. Could you please add a short note on that in your answer? – Tobias Windisch, Jan 9, 2021
- If you use hashlib you 1. get a deterministic value, 2. don't have to set the array to read-only, and 3. the operation is not restricted to any format: hashlib.sha256(a.data) (see the sketch below). – Jann Poppinga, Nov 1, 2021
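Following up on the last comment, a minimal sketch of a deterministic key built with hashlib (sha256 here; a.tobytes() is used so the array needs neither a read-only flag nor a particular buffer format):
import hashlib
import numpy as np

a = np.random.randint(10, 100, 100000)

# Stable across processes, independent of PYTHONHASHSEED, and no need to
# set a.flags.writeable = False first.
key = hashlib.sha256(a.tobytes()).hexdigest()
print(key)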
You can try xxhash via its Python binding. For large arrays this is much faster than hash(x.tostring()).
Example IPython session:
>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop
And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
Nevertheless, hash collisions happen all the time. And if all you need is to implement __hash__ for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of __hash__ itself and let Python handle the hash collisions [1].
[1] You may need to override __eq__ too, to help Python manage hash collisions. You would want __eq__ to return a boolean, rather than an array of booleans as numpy does.
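To illustrate footnote [1], a rough sketch of a hashable wrapper class (the name HashableArray is made up; it snapshots the byte content at construction, which is fine here since mutability is not a concern):
import numpy as np

class HashableArray:
    """Wraps an ndarray so it can be used as a dict or set key."""

    def __init__(self, arr):
        self.arr = arr
        self._bytes = arr.tobytes()  # snapshot of the current values

    def __hash__(self):
        return hash(self._bytes)

    def __eq__(self, other):
        # Return a single bool, not numpy's element-wise array of bools.
        return isinstance(other, HashableArray) and self._bytes == other._bytes

cache = {HashableArray(np.array([1, 2, 3])): "cached value"}
print(cache[HashableArray(np.array([1, 2, 3]))])  # "cached value"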
- I think non-cryptographic hashes also try to prevent collisions for 'normal' data, right? The crypto part is that a malicious attacker cannot be more likely to find a collision or learn anything about the hashed object. So as this answer says, definitely don't use sha1 or md5 when performance is an issue and security is not. – Mark, Nov 8, 2016
- The fourth line should be h = xxhash.xxh64(). – Micah Smith, Apr 12, 2017
- @MicahSmith Thanks. Fixed. – Cong Ma, Apr 13, 2017
- xxhash from Python has a limit of something like 2**32 bytes, so you may need to chunk up your array and combine the hashes using something like: stackoverflow.com/questions/113511/… (see the chunked sketch below). – nimish, Feb 3, 2018
- Note that xxhash is deterministic and yields the same result for different Python processes. This is not the case (by default) for hash, which uses a random seed for every new process. See Harald Thomson's comment or this SO thread. – normanius, Mar 3, 2020
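Regarding nimish's comment above, a hedged sketch of feeding a large array to xxhash in chunks (the chunk size is arbitrary; update() is simply called once per slice of the raw bytes):
import numpy as np
import xxhash

x = np.random.rand(1024 * 1024 * 16)

h = xxhash.xxh64()
mv = memoryview(x).cast('B')   # raw bytes of the (C-contiguous) array
chunk = 1 << 20                # 1 MiB per update
for start in range(0, len(mv), chunk):
    h.update(mv[start:start + chunk].tobytes())
print(h.intdigest())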
If your np.array() is small and in a tight loop, then one option is to skip hash() completely and just use np.array().data.tobytes() directly as your dict key:
grid = np.array([[True, False, True],[False, False, True]])
hash = grid.data.tobytes()
cache = cache or {}
if hash not in cache:
    cache[hash] = function(grid)
return cache[hash]
- This works great for me. Remember, a dictionary is going to hash your keys anyway, so as long as the arrays are small and memory isn't too big a concern, this is a good choice. – gregamis, Mar 6, 2023
- If you have large arrays, instead use hash = hashlib.sha1(grid.tobytes()).hexdigest(). The digest is only 40 characters. However, be aware that tobytes will ignore the SHAPE of the array, so if you have multiple arrays with the same content but different shapes, they will all have the SAME hash value (a shape-aware sketch follows below). – John Henckel, Jul 31, 2024
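To address the shape caveat in the last comment, a small sketch that mixes dtype and shape into the digest so that differently-shaped arrays with identical bytes get different keys (array_key is a made-up helper name):
import hashlib
import numpy as np

def array_key(a):
    # Include dtype and shape alongside the raw bytes so that e.g. a (2, 3)
    # and a (3, 2) array with the same content do not collide.
    h = hashlib.sha1()
    h.update(str(a.dtype).encode())
    h.update(str(a.shape).encode())
    h.update(a.tobytes())
    return h.hexdigest()

a = np.arange(6).reshape(2, 3)
b = np.arange(6).reshape(3, 2)
print(array_key(a) == array_key(b))  # False, even though the raw bytes match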
Coming late to the party, but for large arrays, I think a decent way to do it is to randomly subsample the matrix and hash that sample:
def subsample_hash(a):
    rng = np.random.RandomState(89)
    inds = rng.randint(low=0, high=a.size, size=1000)
    b = a.flat[inds]
    b.flags.writeable = False
    return hash(b.data)
I think this is better than doing hash(str(a)), because the latter could confuse arrays that have unique data in the middle but zeros around the edges.
- People doing one-hot coding will be sad. – user48956, Oct 23, 2019
- @user48956: Yes, when you have sparse data (including one-hot coded data), any approach that works from the dense version of the matrix will be sub-optimal (hashing the whole thing would be slow, and hashing only part could miss the non-zero data values). Better to work from a sparse representation of the data (i.e. indices and values of the non-zero elements) and hash both the indices and values (see the sketch below). – hunse, Oct 24, 2019
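Expanding on that comment, a rough sketch of hashing only the sparse structure (positions and values of the non-zero entries), which stays cheap for mostly-zero arrays such as one-hot encodings (sparse_hash is a made-up helper):
import numpy as np

def sparse_hash(a):
    # Hash the shape plus the positions and values of the non-zero elements.
    idx = np.flatnonzero(a)
    return hash((a.shape, idx.tobytes(), a.flat[idx].tobytes()))

one_hot = np.zeros(10000)
one_hot[42] = 1.0
print(sparse_hash(one_hot))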
What kind of data do you have?
- array size
- do you have an index several times in the array?
If your array consists only of a permutation of indices, you can use a base conversion:
(1, 0, 2) -> 1 * 3**0 + 0 * 3**1 + 2 * 3**2 = 19
and use 19 as hash_key via
import numpy as num

base_size = 3
base = base_size ** num.arange(base_size)        # [1, 3, 9]
max_base = (base * num.arange(base_size)).sum()  # largest key a permutation can produce
array = num.array([1, 0, 2])                     # the index array to encode
hashed_array = (base * array).sum()              # 19
Now you can use an array of shape (max_base + 1,) as a lookup table indexed by hashed_array, instead of a dict, in order to access the values.
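A minimal usage sketch of this idea, with a flat numpy array standing in for the dict (the lookup and compute names are illustrative):
import numpy as num

base_size = 3
base = base_size ** num.arange(base_size)
max_base = (base * num.arange(base_size)).sum()

# Lookup table indexed by the encoded key instead of a dict; NaN marks "empty".
cache = num.full(max_base + 1, num.nan)

def lookup(index_array, compute):
    key = (base * index_array).sum()
    if num.isnan(cache[key]):
        cache[key] = compute(index_array)
    return cache[key]

print(lookup(num.array([1, 0, 2]), lambda a: a.sum()))  # computed once, then cached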
- Why the list comprehension? This can be done much faster in NumPy as base_size ** np.arange(base_size). – Fred Foo, May 16, 2013
- Interesting approach, although slower for small arrays. I'll keep this in mind if I need to play with anything large :) – sapi, May 16, 2013
While the other answers are good, intelligent analysis is almost always the way to go with this sort of challenge. When we know what it is that makes an array uniquely interesting, then it can often be encoded in a meaningful way. There are major advantages-
- We can determine when a clash is a clash - for example, it might be that 1e-20 is close enough to 2e-20 that we want them to clash.
- We can choose an encoding which compactly represents the array, and is thereby not costly.
- We aren't using a sledgehammer to crack a nut.
We must have an idea about what is being captured - and so domain expertise is required, whereas general purpose hashing obviates such a need.
For example, we might be attempting to manage a large array of polygons (or polyhedra) in space. If we know that each polygon must have a distinct centroid, we can 'hash' merely by using that centroid.
import numpy as np
np.random.seed(42)
count = 100000
polys = np.random.rand(count, 4, 2) # 100k 2d polygons.
keys = np.mean(polys, axis=1).round(4) # four decimal places...
hashes = set()
unique = []
for k, p in zip(keys, polys):
    if tuple(k) not in hashes:
        hashes.add(tuple(k))
        unique.append(p)
print(f'Unique values: {len(unique)}, {count-len(unique)} culled')
# Unique values: 99808, 192 culled
Comments on the question:
- arr.tostring() does the same and is more aesthetically pleasing. If you have really big arrays you could try stringifying only a small part of the array.
- tostring also appears to be orders of magnitude faster for small arrays (although 4× slower for an array of 10000 elements).
- str only formats the head and tail of the array.
- str(arr.data) simply wrong? I used this on different arrays and got the same strings back...!?
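For what it's worth, that last point is easy to reproduce on Python 3, where arr.data is a memoryview and str() of it shows only the object's address rather than its contents, so it is useless as a content-based key:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([9, 9, 9])
print(str(a.data))  # e.g. '<memory at 0x7f9c3c0d1a00>' -- an address, not the values
print(str(b.data))  # may even print the same string if the temporary
                    # memoryview is reused at the same address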