I have a 3D numpy array, arr, with shape m*n*k.

For every set of values along the m axis (e.g. arr[:, 0, 0]) I want to generate a single value to represent this set, so that I may end up with a 2D matrix, n*k. If a set of values along the m axis is repeated, then we should generate the same value each time, i.e. it is a hashing problem.
I created a solution to the problem using a dictionary, but it drastically reduces performance. For each set of values, I call this function:
def getCellId(self, valueSet):
    # Turn the set of values (a numpy vector) into a tuple so it can be hashed
    key = tuple(valueSet)
    # Try to simply return an existing ID for this key
    try:
        return self.attributeDict[key]
    except KeyError:
        # If the key was new (and didn't exist), try to generate a new ID by
        # adding one to the max of all current IDs. This will fail the very
        # first time (as there will be no IDs yet), so in that case just
        # assign the value 1 to newId.
        try:
            newId = max(self.attributeDict.values()) + 1
        except ValueError:
            newId = 1
        self.attributeDict[key] = newId
        return newId
The array itself is typically of size 30*256*256, so a single set of values has 30 values. I have hundreds of these arrays to process at any one time. Currently, all the processing up to calculating the hash takes 1.3 s for a block of 100 arrays; including the hashing bumps that up to 75 s.
Is there a faster way to generate the single representative value?
3 Answers
This could be one approach using basic numpy functions -
import numpy as np
# Random input for demo
arr = np.random.randint(0,3,[2,5,4])
# Get dimensions for later usage
m,n,k = arr.shape
# Reshape arr to a 2D array that has each slice arr[:, i, j] in each row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])
# Perform lexsort & get corresponding indices and sorted array
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d = arr2d[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)
# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)
# Get unique labels
labels = df2.cumsum(0)
# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])
Sample run -
In [216]: arr
Out[216]:
array([[[2, 1, 2, 1],
        [1, 0, 2, 1],
        [2, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 2]],

       [[2, 1, 2, 2],
        [0, 0, 2, 1],
        [2, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 0]]])

In [217]: out
Out[217]:
array([[6, 4, 6, 5],
       [1, 0, 6, 4],
       [6, 3, 1, 1],
       [3, 0, 4, 1],
       [1, 3, 3, 2]], dtype=int32)
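As a side note, on newer NumPy versions (1.13+, where np.unique accepts an axis argument) the same grouping can be written more compactly. This is only a sketch of an alternative; the actual label values will differ from the lexsort-based ones, even though equal slices still share a label -
import numpy as np
# Random input for demo
arr = np.random.randint(0,3,[2,5,4])
m,n,k = arr.shape
# Each row of arr2d is one m-length slice arr[:, i, j]
arr2d = arr.reshape(m,-1).T
# return_inverse maps every row to the index of its unique row
_, labels = np.unique(arr2d, axis=0, return_inverse=True)
out = labels.reshape(n,k)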
Depending on how many new vs. old keys need to be generated, it's hard to say what will be optimal. But using your logic, the following should be fairly fast:
import collections
import hashlib

_key = 0

def _get_new_key():
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    global attributes
    return attributes[hashlib.md5(series.tostring()).digest()]
Edit:
I have now updated this to loop over all the data series from your question, using strides:
In [99]: import numpy as np
In [100]: A = np.random.random((30, 256, 256))
In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))
In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop
The above does 256x256 lookups/assignments of 30-element arrays each. There is of course no guarantee that the md5 hashes won't collide. If that should be an issue, you can change to another hash from the same lib.
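As one example, here is a minimal sketch of that swap using hashlib.sha1 (the name get_cell_id_sha1 is just illustrative, and the attributes defaultdict is the one defined above; only the digest function changes):
def get_cell_id_sha1(series):
    # sha1 gives a 160-bit digest vs. 128 bits for md5, so accidental
    # collisions become even less likely; the caching scheme is unchanged
    return attributes[hashlib.sha1(series.tostring()).digest()]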
Edit 2:
Given that you seem to do the majority of costly operations on the first axis of your 3D array, I would suggest you reorganize your array:
In [254]: A2 = np.random.random((256, 256, 30))
In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))
In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop
Not having to jump long distances in memory gives about a 25% speed-up.
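If the data arrives in the original (30, 256, 256) layout, one way to get that reorganization is a single contiguous copy. This is only a sketch (A stands in for your real data), and the copy itself costs time, so whether it pays off depends on how often each array is reused:
import numpy as np

A = np.random.random((30, 256, 256))
# Move the 30-element axis last and force a contiguous copy, so each
# series A[:, i, j] becomes a contiguous 30-element row in memory
A2 = np.ascontiguousarray(np.transpose(A, (1, 2, 0)))
# A plain reshape now gives the same 2D view as the stride trick above
A2_strided = A2.reshape(-1, A2.shape[2])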
Edit 3:
If there is no actual need to cache a hash-to-int lookup and you just need the hashes themselves, then (with the A2 and A2_strided organization, and an int8-typed 3D array) the time can be reduced some more. Of this, about 15 ms is the tuple-looping.
In [9]: from hashlib import md5
In [10]: %timeit tuple(md5(series.tostring()).digest() for series in A2_strided)
10 loops, best of 3: 72.2 ms per loop
If it is just about hashing, try this:
import numpy as np
import numpy.random
# create random data
a = numpy.random.randint(10,size=(5,3,3))
# create some identical 0-axis data
a[:,0,0] = np.arange(5)
a[:,0,1] = np.arange(5)
# create matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)),0,a)
h[0,0]==h[0,1]
# Output: True
However, use it with caution and test this code against yours first; all I can say is that it works for this simple example.
In addition, two different sets of values might end up with the same hash value. Such collisions can always happen with a hash function, but they are very unlikely.
Edit: In order to compare with the other solutions
timeit(np.apply_along_axis(lambda x: hash(tuple(x)),0,a))
# output: 1 loops, best of 3: 677 ms per loop
Comments:
- deinonychusaur (Apr 9, 2015 at 13:13): Try using my hashlib.md5 and tostring solution instead and you should gain some on that time.
- plonser (Apr 9, 2015 at 13:16): @deinonychusaur: I completely agree that the python-builtin hash is slower ... but I don't want to steal ideas from other solutions ;) ... besides that, I still wonder whether he wants 'nice' integers in the matrix or some 'ugly' hash integers.