I have the following:
- A set of K time series in a numpy array with dimensions T x K.
- A set of P permuted approximations of them in a numpy array with dimensions P x T.

I need a dictionary that tells me which is the most probable permutation. For that I've created the following function, but I would like to know whether this can be done more efficiently and with less code.
def find_permutation(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T x K
    :param permuted: permuted time series of shape P x T, P >= K
    :return: a dict containing {true idx: permuted idx}
    """
    N = true.shape[1]
    max_comps = permuted.shape[0]
    permutation_dict = {}
    used_comps = []
    corr_matrix = np.zeros((N, max_comps))
    # Find correlations
    for i in range(N):
        for j in range(max_comps):
            corr_matrix[i, j] = np.corrcoef(true[:, i], permuted[j, :])[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    for i in range(N):
        for j in per_matrix[i, :]:
            if j in used_comps:
                continue
            else:
                permutation_dict[i] = j
                used_comps.append(j)
                break
    return permutation_dict
import numpy as np

if __name__ == "__main__":
    a = np.array([1, 2, 3, 4.])
    b = np.array([4, 8, 9, 12.])
    c = np.array([9, 5, 8, 9.])
    true = np.vstack([a, b, c]).transpose()
    permuted = np.vstack([b * 0.2, c * 0.5, a * 0.7])
    print(find_permutation(true, permuted))
    # {0: 2, 1: 0, 2: 1}
Here is a Cython version:
# C imports first
cimport numpy as np
# other imports
import numpy as np
import cython

# Type declarations (np.float is removed in recent NumPy; use float64)
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t


@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def find_permutation(np.ndarray[DTYPE_t, ndim=2] true, np.ndarray[DTYPE_t, ndim=2] permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T x K
    :param permuted: permuted time series of shape P x T, P >= K
    :return: a dict containing {true idx: permuted idx}
    """
    cdef unsigned int N = true.shape[1]
    cdef unsigned int max_comps = permuted.shape[0]
    cdef dict permutation_dict = {}
    cdef list used_comps = []
    cdef np.ndarray[DTYPE_t, ndim=2] corr_matrix
    corr_matrix = np.zeros((N, max_comps))
    cdef Py_ssize_t i
    cdef Py_ssize_t j
    # Find correlations
    for i in range(N):
        for j in range(max_comps):
            corr_matrix[i, j] = np.corrcoef(true[:, i], permuted[j, :])[0, 1]
    # Find best order (argsort returns intp, which is not `long` on all platforms)
    cdef np.ndarray[np.intp_t, ndim=2] per_matrix
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    for i in range(N):
        for j in per_matrix[i, :]:
            if j in used_comps:
                continue
            else:
                permutation_dict[i] = j
                used_comps.append(j)
                break
    return permutation_dict
Any suggestion is more than welcome.
1 Answer
Pythonic

I rewrote a few of the loops to avoid looping over the index. I also changed `used_comps` to a `set`, which has O(1) containment checks. For smaller arrays this will not matter much, but for larger ones it can make a difference.
I also moved the `permutation_dict` and `used_comps` definitions closer to where they are used.
def find_permutation2(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T x K
    :param permuted: permuted time series of shape P x T, P >= K
    :return: a dict containing {true idx: permuted idx}
    """
    corr_matrix = np.zeros((true.shape[1], permuted.shape[0]))
    # Find correlations
    for i, column in enumerate(true.T):
        for j, row in enumerate(permuted):
            corr_matrix[i, j] = np.corrcoef(column, row)[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        for j in row:
            if j in used_comps:
                continue
            permutation_dict[i] = j
            used_comps.add(j)
            break
    return permutation_dict
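The set-versus-list containment difference is easy to see in isolation; a minimal sketch (the sizes are arbitrary):

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)

# Worst case for the list: the element sits at the very end, so the
# `in` check scans all n entries; the set does a single hash lookup.
t_list = timeit.timeit(lambda: (n - 1) in as_list, number=100)
t_set = timeit.timeit(lambda: (n - 1) in as_set, number=100)
print(t_list > t_set)  # the list scan is vastly slower
```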
numba

You can use `numba`, which compiles the Python code to machine code via LLVM. I'm no expert, but I got it to work with these settings.
find_permutation_jit = numba.jit(find_permutation2, looplift=False, forceobj=True)
find_permutation_jit(true, permuted)
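With `forceobj=True` numba stays in object mode, which rarely speeds anything up. If you want `nopython` mode, where the real gains are, one option is to compute the Pearson correlation by hand inside the jitted loop. A sketch under that assumption (it falls back to plain Python if numba is not installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fall back to plain Python if numba is missing
    njit = lambda f: f

@njit
def corr_loop(true, permuted):
    # Pearson correlation written out by hand, so the whole double loop
    # compiles in nopython mode without touching Python objects.
    K = true.shape[1]
    P = permuted.shape[0]
    out = np.empty((K, P))
    for i in range(K):
        x = true[:, i] - true[:, i].mean()
        for j in range(P):
            y = permuted[j] - permuted[j].mean()
            out[i, j] = (x * y).sum() / np.sqrt((x * x).sum() * (y * y).sum())
    return out
```

On the example data this produces the same matrix as the `np.corrcoef` loop; the greedy assignment step stays unchanged.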
np.setdiff1d

You can use `np.setdiff1d`. This will be slower for smaller arrays, but might be faster for larger ones.
def find_permutation3(true, permuted):
    """
    Finds the most probable permutation of true time series among the permuted time series
    :param true: true ordered time series of shape T x K
    :param permuted: permuted time series of shape P x T, P >= K
    :return: a dict containing {true idx: permuted idx}
    """
    corr_matrix = np.zeros((true.shape[1], permuted.shape[0]))
    # Find correlations
    for i, column in enumerate(true.T):
        for j, row in enumerate(permuted):
            corr_matrix[i, j] = np.corrcoef(column, row)[0, 1]
    # Find best order
    per_matrix = np.argsort(-np.abs(corr_matrix))
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        # setdiff1d expects array-likes, so pass a list rather than the set;
        # assume_unique=True preserves row's correlation-sorted order
        j = np.setdiff1d(row, list(used_comps), assume_unique=True)[0]
        permutation_dict[i] = j
        used_comps.add(j)
    return permutation_dict
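Two details are worth noting here: `np.setdiff1d` expects array-likes (passing a raw `set` gets turned into a useless 0-d object array by `np.asarray`), and `assume_unique=True` is what keeps the result in `row`'s correlation-sorted order rather than sorted by value, so index `[0]` really is the best unused match. A quick check of that behavior:

```python
import numpy as np

row = np.array([2, 0, 1])   # candidate indices, best match first
used = {0, 2}               # already-assigned indices

# Pass a list, not the set itself; with assume_unique=True the result
# keeps row's order instead of being sorted by value.
remaining = np.setdiff1d(row, list(used), assume_unique=True)
print(remaining)  # [1]
```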
timings

All these things have very little effect on the speed of the algorithm:
%timeit find_permutation(true, permuted)
950 μs ± 23.3 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation2(true, permuted)
978 μs ± 55.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation3(true, permuted)
1.05 ms ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation_jit(true, permuted)
1.08 ms ± 139 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit find_permutation_cython(true, permuted)
1.06 ms ± 135 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But this can change with a larger dataset.
The close timings are probably because the Python code is not the bottleneck; the numpy operations are, most likely the `corrcoef` calls. You'll need to do some profiling to see whether this is true.
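If `corrcoef` does turn out to dominate, the Python-level double loop over pairs can be replaced by a single `np.corrcoef` call on the stacked data, which computes all K x P correlations at once; a sketch of that idea:

```python
import numpy as np

def find_permutation_vec(true, permuted):
    """Same greedy matching, but one corrcoef call instead of K * P."""
    K = true.shape[1]
    # Stack the true series (as rows) on top of the permuted ones; corrcoef
    # then returns a (K + P) x (K + P) matrix whose upper-right K x P block
    # holds every true-vs-permuted correlation.
    full = np.corrcoef(np.vstack([true.T, permuted]))
    corr_matrix = full[:K, K:]
    per_matrix = np.argsort(-np.abs(corr_matrix), axis=1)
    permutation_dict = {}
    used_comps = set()
    for i, row in enumerate(per_matrix):
        for j in row:
            if j not in used_comps:
                permutation_dict[i] = j
                used_comps.add(j)
                break
    return permutation_dict
```

On the example data this returns the same `{0: 2, 1: 0, 2: 1}` as the original.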
Comment: You say P > K, but in your example P == K. Is every time series represented at least once? Or better, is there a 1 in each row of `corr_matrix`?
Reply: The idea is to find, among the P permuted series, those that are most correlated with the K true ones; that is why only K are returned. In the ideal case there would be at least one 1 in each row, but it can happen that two series in `permuted` are most similar to a single one in `true`; that is prevented by tracking them with `used_comps`.