I have a 3D numpy array, arr, with shape m*n*k.

For every set of values along the m axis (e.g. arr[:, 0, 0]) I want to generate a single value to represent this set, so that I may end up with a 2D matrix, n*k. If a set of values along the m axis is repeated, then we should generate the same value each time, i.e. it is a hashing problem.
I created a solution to the problem using a dictionary, but it drastically reduces performance. For each set of values, I call this function:
def getCellId(self, valueSet):
    # Turn the set of values (a numpy vector) into a tuple so it can be hashed
    key = tuple(valueSet)
    # Try to simply return an existing ID for this key
    try:
        return self.attributeDict[key]
    except KeyError:
        # If the key was new (and didn't exist), try to generate a new ID by
        # adding one to the max of all current IDs. This will fail the very
        # first time (as there will be no IDs yet), so in that case just
        # assign the value 1 to newId.
        try:
            newId = max(self.attributeDict.values()) + 1
        except ValueError:
            newId = 1
        self.attributeDict[key] = newId
        return newId
The array itself is typically of size 30*256*256, so a single set of values has 30 values. I have hundreds of these arrays to process at any one time. Currently, all the processing up to calculating the hash takes 1.3 s for a block of 100 arrays; including the hashing bumps that up to 75 s.
Is there a faster way to generate the single representative value?
3 Answers
This could be one approach using basic numpy functions -
import numpy as np
# Random input for demo
arr = np.random.randint(0,3,[2,5,4])
# Get dimensions for later usage
m,n,k = arr.shape
# Reshape arr to a 2D array that has each slice arr[:, i, j] in each row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])
# Perform lexsort & get corresponding indices and sorted array
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d = arr2d[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)
# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)
# Get unique labels
labels = df2.cumsum(0)
# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])
Sample run -
In [216]: arr
Out[216]:
array([[[2, 1, 2, 1],
        [1, 0, 2, 1],
        [2, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 2]],

       [[2, 1, 2, 2],
        [0, 0, 2, 1],
        [2, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 0]]])

In [217]: out
Out[217]:
array([[6, 4, 6, 5],
       [1, 0, 6, 4],
       [6, 3, 1, 1],
       [3, 0, 4, 1],
       [1, 3, 3, 2]], dtype=int32)
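As a side note, on newer NumPy versions (1.13+, where np.unique accepts an axis argument) the same grouping can be written more compactly. This is only a sketch of an alternative; the actual label values will differ from the lexsort-based ones, even though equal slices still share a label -
import numpy as np
# Random input for demo
arr = np.random.randint(0,3,[2,5,4])
m,n,k = arr.shape
# Each row of arr2d is one m-length slice arr[:, i, j]
arr2d = arr.reshape(m,-1).T
# return_inverse maps every row to the index of its unique row
_, labels = np.unique(arr2d, axis=0, return_inverse=True)
out = labels.reshape(n,k)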
Depending on how many new vs. old keys need to be generated, it's hard to say what will be optimal. But using your logic, the following should be fairly fast:
import collections
import hashlib

_key = 0

def _get_new_key():
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    global attributes
    return attributes[hashlib.md5(series.tostring()).digest()]
Edit:
I have now updated this to loop over all the data series from your question, using strides:
In [99]: import numpy as np
In [100]: A = np.random.random((30, 256, 256))
In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))
In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop
The above does 256x256 lookups/assignments of 30-element arrays each. There is of course no guarantee that the md5 hashes won't collide. If that should be an issue, you can change to another hash from the same lib.
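As one example, here is a minimal sketch of that swap using hashlib.sha1 (the name get_cell_id_sha1 is just illustrative, and the attributes defaultdict is the one defined above; only the digest function changes):
def get_cell_id_sha1(series):
    # sha1 gives a 160-bit digest vs. 128 bits for md5, so accidental
    # collisions become even less likely; the caching scheme is unchanged
    return attributes[hashlib.sha1(series.tostring()).digest()]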
Edit 2:
Given that you seem to do the majority of costly operations on the first axis of your 3D array, I would suggest you reorganize your array:
In [254]: A2 = np.random.random((256, 256, 30))
In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))
In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop
Not having to jump long distances in memory gives about a 25% speed-up.
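If the data arrives in the original (30, 256, 256) layout, one way to get that reorganization is a single contiguous copy. This is only a sketch (A stands in for your real data), and the copy itself costs time, so whether it pays off depends on how often each array is reused:
import numpy as np

A = np.random.random((30, 256, 256))
# Move the 30-element axis last and force a contiguous copy, so each
# series A[:, i, j] becomes a contiguous 30-element row in memory
A2 = np.ascontiguousarray(np.transpose(A, (1, 2, 0)))
# A plain reshape now gives the same 2D view as the stride trick above
A2_strided = A2.reshape(-1, A2.shape[2])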
Edit 3:
If there is no actual need to cache a hash-to-int lookup and you just need the hashes themselves, then (with the A2 and A2_strided organization, and an int8-typed 3D array) the time can be reduced some more. Of this, about 15 ms is the tuple-looping.
In [9]: from hashlib import md5
In [10]: %timeit tuple(md5(series.tostring()).digest() for series in A2_strided)
10 loops, best of 3: 72.2 ms per loop
If it is just about hashing, try this:
import numpy as np
import numpy.random
# create random data
a = numpy.random.randint(10,size=(5,3,3))
# create some identical 0-axis data
a[:,0,0] = np.arange(5)
a[:,0,1] = np.arange(5)
# create matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)),0,a)
h[0,0]==h[0,1]
# Output: True
However, use it with caution and test this code against yours first; all I can say is that it works for this simple example.
In addition, two different sets of values might end up with the same hash value. Such collisions can always happen with a hash function, but they are very unlikely.
Edit: In order to compare with the other solutions
timeit(np.apply_along_axis(lambda x: hash(tuple(x)),0,a))
# output: 1 loops, best of 3: 677 ms per loop
Comments:
- deinonychusaur (Apr 9, 2015 at 13:13): Try using my hashlib.md5 and tostring solution instead and you should gain some on that time.
- plonser (Apr 9, 2015 at 13:16): @deinonychusaur: I completely agree that the python-builtin hash is slower ... but I don't want to steal ideas from other solutions ;) ... besides that, I still wonder whether he wants 'nice' integers in the matrix or some 'ugly' hash integers.