
I have the following:

  1. A set of K time series in a NumPy array with dimensions T x K.
  2. A set of P permuted approximations of them in a NumPy array with dimensions P x T.

I need a dictionary that tells me which permutation is the most probable. For that I've created the following function, but I would like to know if this can be done more efficiently and with less code.

import numpy as np

def find_permutation(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    N = true.shape[1]
    max_comps = permuted.shape[0]
    permutation_dict = {}
    used_comps = []
    corr_matrix = np.zeros((N, max_comps))
    # Find correlations
    for i in range(N):
        for j in range(max_comps):
            corr_matrix[i, j] = np.corrcoef(true[:, i], permuted[j, :])[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    for i in range(N):
        for j in per_matrix[i, :]:
            if j in used_comps:
                continue
            else:
                permutation_dict[i] = j
                used_comps.append(j)
                break
    return permutation_dict
if __name__ == "__main__":
    import numpy as np
    a = np.array([1, 2, 3, 4.])
    b = np.array([4, 8, 9, 12.])
    c = np.array([9, 5, 8, 9.])
    true = np.vstack([a, b, c]).transpose()
    permuted = np.vstack([b * 0.2, c * 0.5, a * 0.7])
    print(find_permutation(true, permuted))
    # {0: 2, 1: 0, 2: 1}

Here is a Cython version:

# C imports first
cimport numpy as np
# other imports
import numpy as np
import cython

# Type declarations
DTYPE = np.float
ctypedef np.float_t DTYPE_t

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def find_permutation(np.ndarray[DTYPE_t, ndim=2] true, np.ndarray[DTYPE_t, ndim=2] permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    cdef unsigned int N = true.shape[1]
    cdef unsigned int max_comps = permuted.shape[0]
    cdef dict permutation_dict = {}
    cdef list used_comps = []
    cdef np.ndarray[DTYPE_t, ndim=2] corr_matrix
    corr_matrix = np.zeros((N, max_comps))
    cdef Py_ssize_t i
    cdef Py_ssize_t j
    # Find correlations
    for i in range(N):
        for j in range(max_comps):
            corr_matrix[i, j] = np.corrcoef(true[:, i], permuted[j, :])[0, 1]
    # Find best order
    cdef np.ndarray[long, ndim=2] per_matrix
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    for i in range(N):
        for j in per_matrix[i, :]:
            if j in used_comps:
                continue
            else:
                permutation_dict[i] = j
                used_comps.append(j)
                break
    return permutation_dict

Any suggestion is more than welcome.

asked Apr 6, 2020 at 14:37
  • Do you have some (dummy) sample data with which you call this? Commented Apr 7, 2020 at 10:04
  • Added it in the Python code part :) Commented Apr 7, 2020 at 12:17
  • The Cython code as-is will not work for me. There are some imports and typedefs missing. Can you include those? Commented Apr 7, 2020 at 13:08
  • In your docstring you mention P > K, but in your example P == K. Is every time series represented at least once? Or better, is there a 1 in each row of corr_matrix? Commented Apr 7, 2020 at 13:14
  • Added type definitions and imports; I hope I didn't miss anything. The objective of the function is to find, among the P time series, those most correlated with the K true ones. That is why only K are returned. Ideally there would be at least one 1 in each row, but two series in permuted can be most similar to a single series in true; that is prevented by tracking them in used_comps. Commented Apr 7, 2020 at 14:44
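As the last comment explains, used_comps greedily prevents two permuted series from mapping to one true series, but a greedy pass is not guaranteed to maximize the total correlation. A globally optimal one-to-one assignment can be computed with SciPy's Hungarian-algorithm implementation. A sketch, assuming SciPy is available (the name find_permutation_lsa is mine, not from the post):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def find_permutation_lsa(true, permuted):
    """Match each true series (a column of true) to a distinct permuted row."""
    K = true.shape[1]
    # np.corrcoef treats each row as a variable: stack the K true series on
    # top of the P permuted ones, then keep the true-vs-permuted block.
    corr = np.corrcoef(true.T, permuted)[:K, K:]
    # Hungarian algorithm: minimize negative |corr|, i.e. maximize total |corr|.
    rows, cols = linear_sum_assignment(-np.abs(corr))
    return dict(zip(rows.tolist(), cols.tolist()))
```

For the sample data above this returns the same mapping as the greedy version, but the two can differ when several permuted series correlate strongly with the same true series.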

1 Answer


Pythonic

I rewrote a few of the loops to avoid iterating over indices. I also changed used_comps to a set, which has O(1) containment checks. For smaller arrays this will not matter much; for larger ones it can make a difference.

I also moved the permutation_dict and used_comps definitions closer to the place they are used.

def find_permutation2(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    corr_matrix = np.zeros((true.shape[1], permuted.shape[0]))
    # Find correlations
    for i, column in enumerate(true.T):
        for j, row in enumerate(permuted):
            corr_matrix[i, j] = np.corrcoef(column, row)[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        for j in row:
            if j in used_comps:
                continue
            permutation_dict[i] = j
            used_comps.add(j)
            break
    return permutation_dict

numba

You can use numba, which compiles the Python through LLVM. I'm no expert, but I got it to work with these settings:

find_permutation_jit = numba.jit(find_permutation2, looplift=False, forceobj=True)
find_permutation_jit(true, permuted)

np.setdiff1d

You can use np.setdiff1d. This will be slower for smaller arrays, but might be faster for larger arrays.

def find_permutation3(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    corr_matrix = np.zeros((true.shape[1], permuted.shape[0]))
    # Find correlations
    for i, column in enumerate(true.T):
        for j, row in enumerate(permuted):
            corr_matrix[i, j] = np.corrcoef(column, row)[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix))
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        # setdiff1d needs an array-like, not a set; with assume_unique=True
        # it preserves the order of row, so [0] is the best unused match.
        j = np.setdiff1d(row, list(used_comps), assume_unique=True)[0]
        permutation_dict[i] = j
        used_comps.add(j)
    return permutation_dict

timings

All of these changes have very little effect on the speed of the algorithm:

%timeit find_permutation(true, permuted)
950 μs ± 23.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation2(true, permuted)
978 μs ± 55.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation3(true, permuted)
1.05 ms ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation_jit(true, permuted)
1.08 ms ± 139 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation_cython(true, permuted)
1.06 ms ± 135 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But this can change with a larger dataset.

The close timings are probably because Python is not the bottleneck; the numpy operations are, most likely corrcoef. You'll need to do some profiling to see whether this is true.
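If corrcoef does dominate, the double loop can be collapsed into a single call: np.corrcoef accepts 2-D input and treats each row as a variable, so one call over the stacked series yields every pairwise correlation at once. A sketch (find_permutation_vec is a hypothetical name, not from the answer):

```python
import numpy as np

def find_permutation_vec(true, permuted):
    K = true.shape[1]
    # One corrcoef call over all K + P series; keep the true-vs-permuted block.
    corr_matrix = np.corrcoef(true.T, permuted)[:K, K:]
    # Same greedy matching as the original function.
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        for j in row:
            if j not in used_comps:
                permutation_dict[i] = int(j)
                used_comps.add(j)
                break
    return permutation_dict
```

Whether this beats the loop depends on T, K, and P; it allocates a (K + P) x (K + P) matrix, so it trades memory for fewer Python-level calls.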

answered Apr 7, 2020 at 16:04
