
I have a 3D numpy array, arr, with shape m*n*k.

For every set of values along the m axis (e.g. arr[:, 0, 0]) I want to generate a single value to represent that set, so that I end up with a 2D matrix, n*k. If a set of values along the m axis is repeated, then we should generate the same value each time.

I.e. it is a hashing problem.
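
For example (a toy sketch of the behaviour I want):

import numpy as np

arr = np.zeros((3, 2, 2), dtype=int)
arr[:, 0, 0] = [1, 2, 3]
arr[:, 1, 1] = [1, 2, 3]   # identical to the series at [:, 0, 0]

# The desired 2x2 result should hold the same value at positions
# (0, 0) and (1, 1), since those series along the m axis are identical.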

I created a solution to the problem using a dictionary, but it drastically reduces performance. For each set of values, I call this function:

def getCellId(self, valueSet):
    # Turn the set of values (a numpy vector) into a tuple so it can be hashed
    key = tuple(valueSet)
    # Try to simply return an existing ID for this key
    try:
        return self.attributeDict[key]
    except KeyError:
        # The key is new, so generate a new ID by adding one to the max of
        # all current IDs. This fails the very first time (when there are
        # no IDs yet), in which case the new ID is simply 1.
        try:
            newId = max(self.attributeDict.values()) + 1
        except ValueError:
            newId = 1
        self.attributeDict[key] = newId
        return newId

The array itself is typically of size 30*256*256, so a single set of values has 30 values. I have hundreds of these arrays to process at any one time. Currently, all the processing that needs to be done up to calculating the hash takes 1.3s for a block of 100 arrays. Including the hashing bumps that up to 75s.

Is there a faster way to generate the single representative value?

asked Apr 9, 2015 at 9:57
  • Does the representative value have to look nice? ... or can it be "anything"? Commented Apr 9, 2015 at 10:15
  • @plonser: Any integer Commented Apr 9, 2015 at 11:10
  • Are all those arrays of the same shape 30 x 256 x 256? Commented Apr 9, 2015 at 12:17
  • @divakar, yes, always Commented Apr 9, 2015 at 12:17
  • I am wondering if there would be a solution based around numpy.cross ? This may give very good performance. Commented Apr 9, 2015 at 12:18

3 Answers


This could be one approach using basic numpy functions -

import numpy as np

# Random input for demo
arr = np.random.randint(0, 3, [2, 5, 4])

# Get dimensions for later usage
m, n, k = arr.shape

# Reshape arr to a 2D array that holds each series arr[:, i, j] as a row
arr2d = np.transpose(arr, (1, 2, 0)).reshape([-1, m])

# Lexsort the rows and get the corresponding indices and sorted array
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d = arr2d[sorted_idx, :]

# Differences between consecutive rows of the sorted array
df1 = np.diff(sorted_arr2d, axis=0)

# Rows where anything changed mark the start of a new label
df2 = np.append([False], np.any(df1 != 0, 1), 0)

# Assign a running label to each group of identical rows
labels = df2.cumsum(0)

# Scatter the labels back into the original row order, then reshape to n x k
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n, k])

Sample run -

In [216]: arr
Out[216]: 
array([[[2, 1, 2, 1],
        [1, 0, 2, 1],
        [2, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 2]],

       [[2, 1, 2, 2],
        [0, 0, 2, 1],
        [2, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 0]]])

In [217]: out
Out[217]: 
array([[6, 4, 6, 5],
       [1, 0, 6, 4],
       [6, 3, 1, 1],
       [3, 0, 4, 1],
       [1, 3, 3, 2]], dtype=int32)
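
As a side note, on NumPy 1.13 or newer (the axis keyword postdates this answer) the lexsort/diff/cumsum steps collapse into a single np.unique call; a sketch of the same labelling:

import numpy as np

arr = np.random.randint(0, 3, [2, 5, 4])
m, n, k = arr.shape
arr2d = np.transpose(arr, (1, 2, 0)).reshape(-1, m)

# return_inverse labels every row by the index of its unique row
_, inverse = np.unique(arr2d, axis=0, return_inverse=True)
out = inverse.reshape(n, k)

The particular label values may differ from the lexsort version (np.unique sorts rows with the first element most significant), but identical series still receive identical labels.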
answered Apr 9, 2015 at 11:18

It's hard to say what will be optimal without knowing how many new keys vs. old keys need to be generated. But using your logic, the following should be fairly fast:

import collections
import hashlib

_key = 0

def _get_new_key():
    # Called by the defaultdict whenever a key is missing
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    # Hash the raw bytes of the series; identical series map to the same ID
    return attributes[hashlib.md5(series.tostring()).digest()]
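
A quick usage sketch (hypothetical 30-element series, fresh attributes dict):

import numpy as np

series = np.random.random(30)
print(get_cell_id(series))   # -> 1 (first unseen series)
print(get_cell_id(series))   # -> 1 again: same bytes, same ID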

Edit:

I have now updated it to loop over all the data series from your question, using strides:

In [99]: import numpy as np
In [100]: A = np.random.random((30, 256, 256))
In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))
In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop

The above does 256x256 lookups/assignments of 30-element arrays each. There is of course no guarantee that the md5 hash won't collide. If that is a concern, you could of course change to other hashes in the same lib.
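
If collisions must be ruled out entirely, note that the raw bytes themselves are hashable, so they can serve as the dict key directly; Python's dict resolves internal collisions itself, at the cost of storing 240 bytes per unique key instead of a 16-byte digest. A sketch using the same attributes dict (get_cell_id_exact is a hypothetical name):

def get_cell_id_exact(series):
    # bytes keys: two different series can never be assigned the same ID
    return attributes[series.tostring()]   # .tobytes() on modern NumPy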

Edit 2:

Given that you seem to do the majority of costly operations on the first axis of your 3D array, I would suggest you reorganize your array:

In [254]: A2 = np.random.random((256, 256, 30))
In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))
In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop

Not having to jump around long distances in memory makes for about a 25% speed-up.
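
If the data starts out in the (30, 256, 256) layout, one way to get this organization is a one-time copy (a sketch):

# Transpose puts the 30-element series last; ascontiguousarray copies so
# that each series becomes contiguous in memory
A2 = np.ascontiguousarray(np.transpose(A, (1, 2, 0)))   # shape (256, 256, 30)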

Edit 3:

If there is no actual need to cache a hash-to-int look-up and you just need the actual hashes, and if the 3D array is of int8 type, then given the A2 and A2_strided organization, time can be reduced some more. Of this, 15 ms is the tuple-looping.

In [9]: from hashlib import md5
In [10]: %timeit tuple(md5(series.tostring()).digest() for series in A2_strided) 
10 loops, best of 3: 72.2 ms per loop
answered Apr 9, 2015 at 10:48

If it is just about hashing, try this:

import numpy as np

# Create random data
a = np.random.randint(10, size=(5, 3, 3))

# Make two 0-axis series identical
a[:, 0, 0] = np.arange(5)
a[:, 0, 1] = np.arange(5)

# Build the matrix of hash values, one per 0-axis series
h = np.apply_along_axis(lambda x: hash(tuple(x)), 0, a)

h[0, 0] == h[0, 1]
# Output: True

However, use it with caution and test it against your code first. ... all I can say is that it works for this simple example.

In addition, two different sets of values could end up with the same hash value even though they differ. That can always happen when using a hash function, but such collisions are very unlikely.
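
If "nice" compact integers are wanted rather than raw hash values (as asked in the comments), the hash matrix can be relabelled afterwards in one vectorized step; a sketch:

# Map arbitrary hash values to labels 0..N-1 with the same equality structure
_, nice = np.unique(h, return_inverse=True)
nice = nice.reshape(h.shape)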

Edit: In order to compare with the other solutions

%timeit np.apply_along_axis(lambda x: hash(tuple(x)), 0, a)
# output: 1 loops, best of 3: 677 ms per loop
answered Apr 9, 2015 at 10:31
  • Try using my hashlib.md5 and tostring solution instead and you should gain some on that time. Commented Apr 9, 2015 at 13:13
  • @deinonychusaur: I completely agree that the python-builtin hash is slower ... but I don't want to steal ideas from other solutions ;) ... besides that, I still wonder whether he wants 'nice' integers in the matrix or some 'ugly' hash integers Commented Apr 9, 2015 at 13:16
