
I have the following:

  1. A set of K time series in a NumPy array with dimensions T x K.
  2. A set of P permuted approximations of them in a NumPy array with dimensions P x T.

I need a dictionary that tells me which permutation is the most probable. For that I've created the following function, but I would like to know if this can be done more efficiently and with less code.

import numpy as np

def find_permutation(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    N = true.shape[1]
    max_comps = permuted.shape[0]
    permutation_dict = {}
    used_comps = []
    corr_matrix = np.zeros((N, max_comps))
    # Find correlations
    for i in range(N):
        for j in range(max_comps):
            corr_matrix[i, j] = np.corrcoef(true[:, i], permuted[j, :])[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    for i in range(N):
        for j in per_matrix[i, :]:
            if j in used_comps:
                continue
            else:
                permutation_dict[i] = j
                used_comps.append(j)
                break
    return permutation_dict
if __name__ == "__main__":
    import numpy as np
    a = np.array([1, 2, 3, 4.])
    b = np.array([4, 8, 9, 12.])
    c = np.array([9, 5, 8, 9.])
    true = np.vstack([a, b, c]).transpose()
    permuted = np.vstack([b * 0.2, c * 0.5, a * 0.7])
    print(find_permutation(true, permuted))
    # {0: 2, 1: 0, 2: 1}

Here is a Cython version:

# C imports first
cimport numpy as np
# other imports
import numpy as np
import cython

# Type declarations
DTYPE = np.float
ctypedef np.float_t DTYPE_t

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def find_permutation(np.ndarray[DTYPE_t, ndim=2] true, np.ndarray[DTYPE_t, ndim=2] permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    cdef unsigned int N = true.shape[1]
    cdef unsigned int max_comps = permuted.shape[0]
    cdef dict permutation_dict = {}
    cdef list used_comps = []
    cdef np.ndarray[DTYPE_t, ndim=2] corr_matrix
    corr_matrix = np.zeros((N, max_comps))
    cdef Py_ssize_t i
    cdef Py_ssize_t j
    # Find correlations
    for i in range(N):
        for j in range(max_comps):
            corr_matrix[i, j] = np.corrcoef(true[:, i], permuted[j, :])[0, 1]
    # Find best order
    cdef np.ndarray[long, ndim=2] per_matrix
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    for i in range(N):
        for j in per_matrix[i, :]:
            if j in used_comps:
                continue
            else:
                permutation_dict[i] = j
                used_comps.append(j)
                break
    return permutation_dict

Any suggestion is more than welcome.

asked Apr 6, 2020 at 14:37
  • Do you have some (dummy) sample data with which you call this? Commented Apr 7, 2020 at 10:04
  • Added it in the Python code part :) Commented Apr 7, 2020 at 12:17
  • The Cython code as-is will not work for me. There are some imports and typedefs missing. Can you include those? Commented Apr 7, 2020 at 13:08
  • In your docstring you mention P > K, but in your example P == K. Is every time series represented at least once? Or better, is there a 1 in each row of corr_matrix? Commented Apr 7, 2020 at 13:14
  • Added type definitions and imports; I hope I didn't miss anything. The objective of the function is to find, among the P time series, those most correlated with the K true ones. That is why only K are returned. Ideally there would be at least one 1 in each row, but two series in permuted can be most similar to a single series in true; that is prevented by tracking them in used_comps. Commented Apr 7, 2020 at 14:44
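As the last comment explains, used_comps greedily prevents two permuted series from mapping to one true series, but a greedy pass is not guaranteed to maximize the total correlation. A globally optimal one-to-one assignment can be computed with SciPy's Hungarian-algorithm implementation. A sketch, assuming SciPy is available (the name find_permutation_lsa is mine, not from the post):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def find_permutation_lsa(true, permuted):
    """Match each true series (a column of true) to a distinct permuted row."""
    K = true.shape[1]
    # np.corrcoef treats each row as a variable: stack the K true series on
    # top of the P permuted ones, then keep the true-vs-permuted block.
    corr = np.corrcoef(true.T, permuted)[:K, K:]
    # Hungarian algorithm: minimize negative |corr|, i.e. maximize total |corr|.
    rows, cols = linear_sum_assignment(-np.abs(corr))
    return dict(zip(rows.tolist(), cols.tolist()))
```

For the sample data above this returns the same mapping as the greedy version, but the two can differ when several permuted series correlate strongly with the same true series.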

1 Answer


Pythonic

I rewrote a few of the loops to avoid iterating over indices. I also changed used_comps to a set, which has O(1) containment checks. For smaller arrays this will not matter much; for larger ones it can make a difference.

I also moved the permutation_dict and used_comps definitions closer to the place they are used.

def find_permutation2(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    corr_matrix = np.zeros((true.shape[1], permuted.shape[0]))
    # Find correlations
    for i, column in enumerate(true.T):
        for j, row in enumerate(permuted):
            corr_matrix[i, j] = np.corrcoef(column, row)[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        for j in row:
            if j in used_comps:
                continue
            permutation_dict[i] = j
            used_comps.add(j)
            break
    return permutation_dict

numba

You can use numba, which compiles the Python through LLVM. I'm no expert, but I got it to work with these settings:

find_permutation_jit = numba.jit(find_permutation2, looplift=False, forceobj=True)
find_permutation_jit(true, permuted)

np.setdiff1d

You can use np.setdiff1d. This will be slower for smaller arrays, but might be faster for larger arrays.

def find_permutation3(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T times K
    :param permuted: Permuted time series of shape P times T. P > K
    :return: A dict containing {true idx: permuted idx}
    """
    corr_matrix = np.zeros((true.shape[1], permuted.shape[0]))
    # Find correlations
    for i, column in enumerate(true.T):
        for j, row in enumerate(permuted):
            corr_matrix[i, j] = np.corrcoef(column, row)[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix))
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        # setdiff1d needs an array-like, not a set; with assume_unique=True
        # it preserves the order of row, so [0] is the best unused match.
        j = np.setdiff1d(row, list(used_comps), assume_unique=True)[0]
        permutation_dict[i] = j
        used_comps.add(j)
    return permutation_dict

timings

All of these changes have very little effect on the speed of the algorithm:

%timeit find_permutation(true, permuted)
950 μs ± 23.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation2(true, permuted)
978 μs ± 55.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation3(true, permuted)
1.05 ms ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation_jit(true, permuted)
1.08 ms ± 139 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation_cython(true, permuted)
1.06 ms ± 135 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But this can change with a larger dataset.

The close timings are probably because Python is not the bottleneck; the numpy operations are, most likely corrcoef. You'll need to do some profiling to see whether this is true.
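If corrcoef does dominate, the double loop can be collapsed into a single call: np.corrcoef accepts 2-D input and treats each row as a variable, so one call over the stacked series yields every pairwise correlation at once. A sketch (find_permutation_vec is a hypothetical name, not from the answer):

```python
import numpy as np

def find_permutation_vec(true, permuted):
    K = true.shape[1]
    # One corrcoef call over all K + P series; keep the true-vs-permuted block.
    corr_matrix = np.corrcoef(true.T, permuted)[:K, K:]
    # Same greedy matching as the original function.
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        for j in row:
            if j not in used_comps:
                permutation_dict[i] = int(j)
                used_comps.add(j)
                break
    return permutation_dict
```

Whether this beats the loop depends on T, K, and P; it allocates a (K + P) x (K + P) matrix, so it trades memory for fewer Python-level calls.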

answered Apr 7, 2020 at 16:04
