I'm implementing a system where, when it comes to the heavy mathematical lifting, I want to do as little as possible.

I'm aware that there are issues with memoisation of numpy arrays (they aren't hashable), and as such implemented a lazy-key cache to avoid the whole "premature optimisation" argument:
def magic(self, numpyarg, intarg):
    # key the cache on the string form of the array plus the int
    key = str(numpyarg) + str(intarg)
    try:
        return self._cache[key]   # cache hit
    except KeyError:
        pass
    # ... here be dragons: the expensive computation producing `value` ...
    self._cache[key] = value
    return value
but since string conversion takes quite a while...
>>> t = timeit.Timer("str(a)", "import numpy; a = numpy.random.rand(10,10)")
>>> t.timeit(number=100000) / 100000
0.00132  # seconds per call
What do people suggest as being 'the better way' to do it?
3 Answers
Borrowed from this answer... so really I guess this is a duplicate:
>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> hashlib.sha1(b).hexdigest()
'15c61fba5c969e5ed12cee619551881be908f11b'
>>> t = timeit.Timer("hashlib.sha1(a.view(numpy.uint8)).hexdigest()",
...                  "import hashlib; import numpy; a = numpy.random.rand(10,10)")
>>> t.timeit(number=10000)/10000
2.5790500640869139e-05
- Nice! For multidimensional arrays this gives a different hash (for the "same" array) depending on whether it's Fortran- or C-contiguous. If that's an issue, calling np.ascontiguousarray should solve it. – jorgeca, Jan 27, 2014
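jorgeca's point is easy to check directly; the `h` helper below is a hypothetical wrapper around the answer's hashing recipe with the suggested normalisation applied:

```python
import hashlib
import numpy as np

def h(x):
    # hash after normalising to C order, per jorgeca's suggestion
    return hashlib.sha1(np.ascontiguousarray(x).view(np.uint8)).hexdigest()

a = np.arange(6, dtype=np.float64).reshape(2, 3)
c = np.ascontiguousarray(a)   # C-contiguous copy
f = np.asfortranarray(a)      # same values, Fortran memory order

# The in-memory byte layouts differ...
assert c.tobytes(order='A') != f.tobytes(order='A')
# ...but after normalisation both orders hash identically.
assert h(c) == h(f)
```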
- Not sure why a known slow hash function, sha1, is chosen. SHA-1 is OK for minimising hash collisions but poor at speed. For speed you'll need something like murmurhash or xxhash (the latter claims to be even faster). – Cong Ma, Aug 5, 2015
- @CongMa, thanks for the extra info. There are lots of options! But as you'll notice, this is already two orders of magnitude faster. And speed is never the only concern. It's probably worth using a well-understood hash if the alternative is only a few millionths of a second faster. – senderle, Aug 8, 2015
There is a package for this called joblib. Found from this question.
from joblib import Memory
location = './cachedir'
memory = Memory(location)
# Create caching version of magic
magic_cached = memory.cache(magic)
result = magic_cached(...)
# Or (for one-time use)
result = memory.eval(magic, ...)
- It would be better to have a quote from those links copied over in your answer, in case these websites go offline. – Alex Fortin, Sep 16, 2018
For small numpy arrays, this might also be suitable:

tuple(map(float, a))

if a is the numpy array.
- Oh yes, a tuple is hashable, in contrast to a list! – Maksym Ganenko, Oct 1, 2017
str(a) only shows part of the array, as was later pointed out in this comment: stackoverflow.com/questions/16589791/…
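The truncation is easy to demonstrate: with numpy's default print threshold, large arrays are elided in the middle, so elements hidden behind the ellipsis never reach the string, and two different arrays can end up with the same cache key (the arrays below are illustrative):

```python
import numpy as np

big = np.arange(2000)
s = str(big)
# numpy elides the middle of large arrays by default,
# e.g. '[   0    1    2 ... 1997 1998 1999]'
assert '...' in s

small = np.arange(5)
# small arrays are printed in full
assert '...' not in str(small)
```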