13
\$\begingroup\$

I wrote a function that takes as input two arrays of zeros and ones, ~8000 elements per array. The input array size is variable and can be much larger. The density of ones is expected to be higher than that of zeros in most cases, but outliers may exist. My function eps calculates a statistic on these array pairs and returns the output. This statistic is just a way to find whether two timeseries correlate with each other. The operations are relatively trivial: check for 0, note the index where each 0 is found, then do some calculations. I tried my best to optimize for speed, but the best I could get is about 4.5~5 seconds (for 18k array pairs) measured with the timeit library (1000 runs). Fast execution is important because I need to run this function on billions of array pairs.

UPDATE: The solution I tried from the answers below improved the runtime for larger inputs compared to my original approach. However, I still face the problem of running this statistical function 1000 times per array pair. The background is that I perform a permutation test on the statistic with 'x' (1000) permutations to arrive at a significance level of p = .01. Some asked below why I need to run this function billions of times: my data is timeseries data for events 'E1, E2, E3, ..., En', and I need to find all possible event pairs from this data, i.e. nCr: (E1,E2), (E2,E3), ... and so on. This leads to a very large number of event pairs, on the order of billions for my data. Add to that the cost of the permutation test (1000x per pair). Any suggestions on whether other parts, such as the permutation test, can be improved? Thanks
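For concreteness, the pair generation looks roughly like this (a sketch only; the event names and series are placeholders, and eps is the function below):

from itertools import combinations
import numpy as np

# hypothetical: one 0/1 timeseries per event E1..E5
event_series = {f'E{i}': np.random.randint(0, 2, 8000) for i in range(1, 6)}

# every unordered pair (nC2 of them) gets eps() plus a 1000-shuffle permutation test
for (name_a, ts_a), (name_b, ts_b) in combinations(event_series.items(), 2):
    Q_tau, q_tau = eps(ts_a, ts_b, tau=5)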
UPDATE 2: Is size always ~8000? No, it depends on the input data; array size can be between 2000 and 10,000. What's your tau value? tau can be any positive integer or float and we just test for different tau values, so I cannot give a specific tau.

#e.g. inputs
#ts_1 = [0,1,1,0,0,1,1,0,......] #timeseries1
#ts_2 = [1,1,1,1,1,1,1,0,......] #timeseries2
# tau = positive integer/float #time lag
import math
import numpy as np
from itertools import product


def eps(ts_1, ts_2, tau):
    Q_tau = 0
    q_tau = 0

    event_index1, = np.where(np.array(ts_1) == 0)
    n1 = event_index1.shape[0]

    event_index2, = np.where(np.array(ts_2) == 0)
    n2 = event_index2.shape[0]

    if n1 != 0 and n2 != 0:
        matching_idx = set(event_index1).intersection(event_index2)
        c_ij = c_ji = 0.5 * len(matching_idx)

        for x, y in product(event_index1, event_index2):
            if x - y > 0 and (x - y) <= tau:
                c_ij += 1
            elif y - x > 0 and (y - x) <= tau:
                c_ji += 1

        Q_tau = (c_ij + c_ji) / math.sqrt(n1 * n2)
        q_tau = (c_ij - c_ji) / math.sqrt(n1 * n2)

    return Q_tau, q_tau


# permutation test
def eps_permutationtest(ts_1, ts_2, Q_tau, permutations=1000):  # Q_tau = original statistic to evaluate significance for
    lst_Q_tau = []  # list to hold Q_tau values from each shuffle

    for i in range(permutations):
        np.random.shuffle(ts_1)  # shuffle ts_1 every time
        shuffled_Q_tau, _ = eps(ts_1=ts_1, ts_2=ts_2, tau=5)
        lst_Q_tau.append(shuffled_Q_tau)
    lst_Q_tau = np.array(lst_Q_tau)

    p_val = len(np.where(lst_Q_tau >= Q_tau)[0]) / permutations

    return p_val  # significant if p < .01 alpha
asked Aug 16, 2021 at 20:05
\$\endgroup\$
6
  • 1
    \$\begingroup\$ Welcome to CodeReview@SE. See How do I ask a Good Question? on how to title your question. Please include clarifications in your post instead of in comments. (And keep everything DRY.) \$\endgroup\$ Commented Aug 17, 2021 at 7:05
  • 2
    \$\begingroup\$ What are your actual densities? Is size always ~8000? What's your tau value? Do you actually need the billions of results from the billions of array pairs, or are you actually after something else, which might be doable without doing all those pairs? \$\endgroup\$ Commented Aug 17, 2021 at 13:50
  • 4
    \$\begingroup\$ Since the statistic calculation is the slow part, I think it would make sense for you to explain what it actually computes. It saves us having to figure out your intentions from the code. \$\endgroup\$ Commented Aug 17, 2021 at 14:12
  • \$\begingroup\$ Why are you running it 1000x per pair? What changes between each of these runs? When you say "nCr", do you really mean "nC2", that is, you have n arrays, and you want to calculate this statistic for all possible pairs of arrays? Are all n arrays the same size, even if that size varies? Your edit provided some information, but it's still too terse for us to tell what you're actually doing. \$\endgroup\$ Commented Aug 20, 2021 at 12:39
  • \$\begingroup\$ Hi. I mentioned this loop per pair of 1000 times is called permutation test. The ts_1 is shuffled each time and the statistic is calculated again. This is so as to understand if results are significant, and did not merely occur by chance. This could be called as bootstrapping/resampling. nCr where n is total events and r = 2. This statistic needs to be calculated for all array pairs. \$\endgroup\$ Commented Aug 20, 2021 at 15:15

5 Answers

18
\$\begingroup\$

First, some basics of your implementation:

  • Type-hint your arguments; I had to read and investigate a bunch to deduce that tau is a positive integer
  • (y-x) <= tau does not need parens
  • save math.sqrt( n1 * n2 ) to a variable instead of repeating it

Full matrix implementation

This is an n**2 operation, which is on the order of 64,000,000 elements if your arrays are 8,000 elements long. That's beyond the point where I would have given up and used C and something like BLAS. If you actually care about performance, now is the time.

Your solution scales according to event density, being slower as event density increases. There is a solution that is constant-time in density:

  • Don't call np.where
  • Don't use set intersection; use vectorized comparison
  • Recognize that you can represent your c_ij summation as the sum over a partial triangular matrix (ij upper, ji lower) stopping at the diagonal set by tau
  • The matrix itself is formed by the broadcast of the two event vectors, and receives a mask for the upper and lower halves
  • The initial values for c_** are the halved trace (diagonal sum) of this broadcast matrix
  • For a given tau and vector length, you could save time by pre-calculating the masks (not shown in the reference code; a minimal sketch follows at the end of this section). This is very important and can reduce the execution time of this method by about two thirds. Many kinds of datasets in the wild have constant dimensions and can benefit from this.

Somewhere around 85% ones-density, this method has performance equivalent to the original (for the test vector length of 2000) if you don't pre-calculate the mask.
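Pre-calculating those masks could look something like this (a minimal sketch; make_masks is a hypothetical helper, and bcast_eps would then accept the masks as parameters instead of rebuilding them on every call):

import numpy as np

def make_masks(N: int, tau: int):
    # Same masks as built inside bcast_eps, computed once for a fixed N and tau
    all_true = np.ones((N, N), dtype=bool)
    upper_mask = np.logical_or(np.tril(all_true), np.triu(all_true, k=+1+tau))
    lower_mask = np.logical_or(np.triu(all_true), np.tril(all_true, k=-1-tau))
    return upper_mask, lower_mask

# build once, reuse for every pair of equal-length vectors, e.g.:
# upper_mask, lower_mask = make_masks(8000, tau=5)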

Sparse approach

A library that supports sparse matrices should do better than the above. Scipy has them. Things to note about this approach:

  • The input data need to be inverted - 1 for events, 0 everywhere else (a construction sketch follows after this list)
  • Be careful not to call np.* methods; exclusively use sparse methods
  • Rather than forming the triangular bands and then summing at the end, it's faster to perform a difference-of-sums using tau
  • This scales with density, unlike the full-matrix approach, and uses less memory
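As a sketch of that first point, the inverted sparse inputs expected by the sparse methods can be built from a dense 0/1 series like so (names here are illustrative):

import numpy as np
import scipy.sparse

ts_1 = np.array([0, 1, 1, 0, 0, 1, 1, 0], dtype=np.uint8)
# 1 where the original series has an event (a zero), 0 everywhere else,
# stored as a 1xN sparse row vector
events_1 = scipy.sparse.csr_matrix((ts_1 == 0).astype(np.uint8).reshape(1, -1))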

Sorted logarithmic traversal

Riffing on @vpn's idea a little:

  • This doesn't need to be n**2 if you tell Numpy to correctly apply a binary search
  • This binary search can be vectorized via searchsorted
  • Rather than your Python-native set intersection, you can use intersect1d

For very high ones-densities this method remains the fastest, but it has poorer scaling characteristics than the full-matrix trace-only method at lower ones-densities (higher event densities).

Full or sparse matrix, traces only

(Yet another) approach is:

  • For either sparse or full matrix
  • non-masked
  • based on diagonal trace sum only, with no calls to the triangular methods
  • constant time in density

Particularly for the full matrix, this method is very fast: always faster than sparse, masked broadcast and original; and faster than logarithmic traversal for densities below ~70%.

Cumulative sum

For completeness the reference source below includes @KellyBundy's cumulative sum approach, which is competitive for all but very high densities.

Output

All method durations (ms)
 d% trace sorted cumsum old bcast sparse sparsetr
50.0 0.70 4.12 0.21 491.82 41.73 80.62 21.62 
55.0 0.64 3.27 0.18 411.49 40.09 65.46 18.25 
60.0 0.58 2.69 0.18 311.74 40.69 50.05 14.60 
65.0 0.58 2.27 0.23 250.12 41.50 35.52 13.45 
70.0 1.03 1.84 0.18 182.37 41.84 24.03 10.20 
75.0 1.34 1.40 0.18 113.49 42.36 16.37 7.31 
80.0 0.61 1.06 0.18 77.04 40.99 11.78 6.11 
85.0 0.81 0.69 0.17 44.68 52.77 10.30 5.95 
90.0 1.56 0.56 0.22 24.50 46.26 6.32 3.94 
95.0 1.50 0.26 0.18 6.27 40.60 4.80 3.48 
96.0 1.77 0.21 0.18 3.71 41.01 4.25 3.31 
96.5 1.79 0.17 0.18 1.89 40.69 4.16 3.24 
97.0 1.79 0.16 0.18 1.75 43.89 4.69 3.30 
97.5 1.87 0.14 0.18 1.16 43.34 4.31 3.22 
98.0 2.10 0.15 0.19 1.22 41.16 4.10 3.33 
98.5 1.75 0.10 0.18 0.42 40.75 3.94 3.15 
99.0 1.73 0.09 0.22 0.16 40.71 3.97 3.11 
99.5 1.78 0.08 0.18 0.09 41.29 4.09 3.42 
Fast method durations (us)
 d% trace sorted cumsum
50.0 604 3993 124 
55.0 556 3504 124 
60.0 577 2895 125 
65.0 569 2440 130 
70.0 556 2041 126 
75.0 574 1372 125 
80.0 556 1078 123 
85.0 562 686 126 
90.0 563 477 128 
95.0 561 238 126 
96.0 558 176 126 
96.5 569 139 124 
97.0 557 137 131 
97.5 568 104 124 
98.0 601 116 128 
98.5 565 61 128 
99.0 567 51 129 
99.5 566 43 127 

Reference source

import math
from numbers import Real
from timeit import timeit
from typing import Tuple
import numpy as np
from itertools import product
import scipy.sparse
from scipy.sparse import spmatrix
def old_eps(ts_1, ts_2, tau):
    Q_tau = 0
    q_tau = 0
    event_index1, = np.where(np.array(ts_1) == 0)
    n1 = event_index1.shape[0]
    event_index2, = np.where(np.array(ts_2) == 0)
    n2 = event_index2.shape[0]
    if (n1 != 0 and n2 != 0):
        matching_idx = set(event_index1).intersection(event_index2)
        c_ij = c_ji = 0.5 * len(matching_idx)
        for x, y in product(event_index1, event_index2):
            if x - y > 0 and (x - y) <= tau:
                c_ij += 1
            elif y - x > 0 and (y - x) <= tau:
                c_ji += 1
        Q_tau = (c_ij + c_ji) / math.sqrt(n1 * n2)
        q_tau = (c_ij - c_ji) / math.sqrt(n1 * n2)
    return Q_tau, q_tau

def bcast_eps(ts_1: np.ndarray, ts_2: np.ndarray, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 1:
        raise ValueError('Vectors must be flat and of the same length')
    N, = ts_1.shape
    events_1 = ts_1 == 0
    events_2 = ts_2 == 0
    n1 = np.sum(events_1)
    n2 = np.sum(events_2)
    if n1 == 0 or n2 == 0:
        return 0, 0
    all_true = np.ones((N, N), dtype=bool)
    upper_mask = np.logical_or(np.tril(all_true), np.triu(all_true, k=+1+tau))
    lower_mask = np.logical_or(np.triu(all_true), np.tril(all_true, k=-1-tau))
    matches = np.logical_and(events_1[np.newaxis, :], events_2[:, np.newaxis])
    matches_u = np.ma.array(matches, mask=upper_mask)
    matches_l = np.ma.array(matches, mask=lower_mask)
    n_matches = np.trace(matches)
    c_ij = c_ji = n_matches / 2
    c_ij += np.sum(matches_u)
    c_ji += np.sum(matches_l)
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau

def trace_eps(ts_1: np.ndarray, ts_2: np.ndarray, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 1:
        raise ValueError('Vectors must be flat and of the same length')
    events_1 = ts_1 == 0
    events_2 = ts_2 == 0
    n1 = np.sum(events_1)
    n2 = np.sum(events_2)
    if n1 == 0 or n2 == 0:
        return 0, 0
    matches = np.logical_and(events_1[np.newaxis, :], events_2[:, np.newaxis])
    n_matches = np.trace(matches)
    c_ij = c_ji = n_matches / 2
    for k in range(1, tau+1):
        c_ij += np.trace(matches, k)
        c_ji += np.trace(matches, -k)
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau

def sorted_eps(ts_1: np.ndarray, ts_2: np.ndarray, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 1:
        raise ValueError('Vectors must be flat and of the same length')
    event_index1, = np.where(np.array(ts_1) == 0)
    event_index2, = np.where(np.array(ts_2) == 0)
    n1, = event_index1.shape
    n2, = event_index2.shape
    if n1 == 0 or n2 == 0:
        return 0, 0
    n_matches, = np.intersect1d(
        event_index1, event_index2, assume_unique=True,
    ).shape
    c_ij = c_ji = n_matches/2
    insertions = np.searchsorted(
        a=event_index1,  # array to pretend insertion into
        v=event_index2,  # values to insert
    )
    for insertion, y in zip(insertions, event_index2):
        i1 = insertion
        if i1 < n1:
            if y == event_index1[i1]:
                i1 += 1
            for x in event_index1[i1:]:
                if x - y > tau:
                    break
                c_ij += 1
        i2 = insertion - 1
        if i2 >= 0:
            for x in event_index1[i2::-1]:
                if y - x > tau:
                    break
                c_ji += 1
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau

def sparse_eps(ts_1: spmatrix, ts_2: spmatrix, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 2 or ts_1.shape[0] != 1:
        raise ValueError('Vectors must be flat and of the same length')
    n1 = ts_1.sum()
    n2 = ts_2.sum()
    if n1 == 0 or n2 == 0:
        return 0, 0
    matches = ts_2.T * ts_1
    matches_u = scipy.sparse.triu(matches, +1).sum() - scipy.sparse.triu(matches, k=+1+tau).sum()
    matches_l = scipy.sparse.tril(matches, -1).sum() - scipy.sparse.tril(matches, k=-1-tau).sum()
    n_matches = matches.diagonal().sum()
    c_ij = c_ji = n_matches / 2
    c_ij += matches_u
    c_ji += matches_l
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau

def sparsetr_eps(ts_1: spmatrix, ts_2: spmatrix, tau: int) -> Tuple[Real, Real]:
    if ts_1.shape != ts_2.shape or len(ts_1.shape) != 2 or ts_1.shape[0] != 1:
        raise ValueError('Vectors must be flat and of the same length')
    n1 = ts_1.sum()
    n2 = ts_2.sum()
    if n1 == 0 or n2 == 0:
        return 0, 0
    matches = ts_2.T * ts_1
    n_matches = matches.diagonal().sum()
    c_ij = c_ji = n_matches / 2
    for k in range(1, tau+1):
        c_ij += matches.diagonal(+k).sum()
        c_ji += matches.diagonal(-k).sum()
    den = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / den
    q_tau = (c_ij - c_ji) / den
    return Q_tau, q_tau

def cumsum_eps(ts_1: spmatrix, ts_2: spmatrix, tau: int) -> Tuple[Real, Real]:
    ts_1 = 1 - ts_1
    ts_2 = 1 - ts_2
    n1 = ts_1.sum()
    n2 = ts_2.sum()
    if n1 == 0 or n2 == 0:
        return 0, 0
    cs1 = np.pad(ts_1.cumsum(), (0, tau), 'edge')
    cs2 = np.pad(ts_2.cumsum(), (0, tau), 'edge')
    c_ij = c_ji = 0.5 * (ts_1 * ts_2).sum()
    c_ij += ((cs1[tau:] - cs1[:-tau]) * ts_2).sum()
    c_ji += ((cs2[tau:] - cs2[:-tau]) * ts_1).sum()
    sqrt = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / sqrt
    q_tau = (c_ij - c_ji) / sqrt
    return Q_tau, q_tau

def test() -> None:
    shape = 2, 2000
    full_ts = np.ones(shape, dtype=np.uint8)
    sparse_ts = scipy.sparse.lil_matrix(shape, dtype=np.uint8)
    density = 0.9
    rand = np.random.default_rng(seed=0).random
    events = rand(shape) > density
    full_ts[events] = 0
    sparse_ts[events] = 1
    # Add some interesting values to test boundary conditions: on the diagonal
    full_ts[:, 5] = 0
    sparse_ts[:, 5] = 1
    Q_exp = 1.9446638724075895
    q_exp = 0.026566446344365977
    methods = (
        (old_eps, full_ts),
        (bcast_eps, full_ts),
        (trace_eps, full_ts),
        (sorted_eps, full_ts),
        (sparse_eps, sparse_ts),
        (sparsetr_eps, sparse_ts),
        (cumsum_eps, full_ts),
    )
    for eps, ts in methods:
        Q_tau, q_tau = eps(*ts, tau=10)
        assert math.isclose(Q_tau, Q_exp, abs_tol=1e-12, rel_tol=0)
        assert math.isclose(q_tau, q_exp, abs_tol=1e-12, rel_tol=0)

def compare(fast: bool) -> None:
    shape = 2, 2000
    rand = np.random.default_rng(seed=0).random
    n = 400 if fast else 1
    methods = (
        trace_eps,
        sorted_eps,
        cumsum_eps,
    )
    if fast:
        print('Fast method durations (us)')
        factor = 1e6
        fmt = '{:8.0f}'
    else:
        print('All method durations (ms)')
        factor = 1e3
        fmt = '{:8.2f}'
        methods += (
            old_eps,
            bcast_eps,
            sparse_eps,
            sparsetr_eps,
        )
    print(' d%', ' '.join(
        f'{method.__name__.removesuffix("_eps"):>8}'
        for method in methods
    ))
    densities = np.hstack((
        np.linspace(0.50, 0.95, 10),
        np.linspace(0.960, 0.995, 8),
    ))
    for density in densities:
        print(f'{100*density:4.1f}', end=' ')
        full_ts = np.ones(shape, dtype=np.uint8)
        sparse_ts = scipy.sparse.lil_matrix(shape, dtype=np.uint8)
        events = rand(shape) > density
        full_ts[events] = 0
        sparse_ts[events] = 1
        inputs = (
            full_ts,
            full_ts,
            full_ts,
        )
        if not fast:
            inputs += (
                full_ts,
                full_ts,
                sparse_ts,
                sparse_ts,
            )
        for eps, ts in zip(methods, inputs):
            t = timeit(lambda: eps(*ts, tau=10), number=n)
            print(fmt.format(t/n*factor), end=' ')
        print()
    print()


if __name__ == '__main__':
    test()
    compare(False)
    compare(True)
answered Aug 17, 2021 at 0:07
\$\endgroup\$
2
  • 3
    \$\begingroup\$ Thank you for the detailed explanation and the performance tests. I liked the sorted solution more for the straightforwardness, but will be deploying all on the real data for a more hands-on performance. \$\endgroup\$ Commented Aug 17, 2021 at 7:50
  • 2
    \$\begingroup\$ Thanks. Much cleaner display, too :-). Now I wonder what the OP's tau is really like. I think your trace solution gets faster with smaller tau, while mine gets faster with larger tau. At very high densities, the sorted solution approaches linear time, like mine always is, so they should differ by a constant factor there. Maybe either could be improved further to tweak this constant factor, but meh. I suspect bigger further improvement at a higher level, not by handling each vector pair faster but by reducing the number of pairs to do at all (or maybe it shouldn't even be done pairwise). \$\endgroup\$ Commented Aug 17, 2021 at 16:07
10
\$\begingroup\$

for x,y in product(event_index1,event_index2) looks like an efficiency killer. If the lengths of event_index1 and event_index2 are L1 and L2 respectively, this loop has O(L1 * L2) time complexity. You may get away with O(L1 * log L2) (or O(L2 * log L1)), depending on which one is larger.

Notice that for any given x, c_ij is incremented by the number of ys in the [x - tau, x) range. Two binary searches over event_index2 give you this range, along the lines of

for x in event_index1:
    c_ij += bisect_left(event_index2, x) - bisect_left(event_index2, x - tau)

and ditto for c_ji. Of course, from bisect import bisect_left.
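The same counting can also be vectorized with NumPy's searchsorted instead of a Python-level loop (a sketch; event_index1 and event_index2 are the sorted event-position arrays from the question, which np.where already returns in increasing order):

import numpy as np

# for every x in event_index1, count the events of ts_2 lying in [x - tau, x)
left = np.searchsorted(event_index2, event_index1 - tau, side='left')
right = np.searchsorted(event_index2, event_index1, side='left')
c_ij += (right - left).sum()
# and symmetrically for c_ji with the two index arrays swapped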

answered Aug 16, 2021 at 23:10
\$\endgroup\$
3
  • 1
    \$\begingroup\$ In my testing, your binary search suggestion is by far the fastest (though I used Numpy's search_sorted implementation rather than native Python). \$\endgroup\$ Commented Aug 17, 2021 at 2:41
  • 2
    \$\begingroup\$ @Reinderien search_sorted shall indeed be better than native bisect. I am not too versed in numpy to suggest it. Thanks. \$\endgroup\$ Commented Aug 17, 2021 at 2:52
  • \$\begingroup\$ @vnp I really like this solution. I never even thought it could be done using sorting instead of loop iteration. \$\endgroup\$ Commented Aug 17, 2021 at 9:01
6
\$\begingroup\$

You have quadratic runtime; it can be reduced to linear. You want to count the number of zeros in certain ranges. Let's flip zeros and ones and count the number of ones in ranges by subtracting prefix sums:

import math
import numpy as np

def Kelly_eps(ts_1, ts_2, tau):
    ts_1 = 1 - ts_1
    ts_2 = 1 - ts_2
    n1 = ts_1.sum()
    n2 = ts_2.sum()
    if n1 == 0 or n2 == 0:
        return 0, 0
    cs1 = np.pad(ts_1.cumsum(), (0, tau), 'edge')
    cs2 = np.pad(ts_2.cumsum(), (0, tau), 'edge')
    c_ij = c_ji = 0.5 * (ts_1 * ts_2).sum()
    c_ij += ((cs1[tau:] - cs1[:-tau]) * ts_2).sum()
    c_ji += ((cs2[tau:] - cs2[:-tau]) * ts_1).sum()
    sqrt = math.sqrt(n1 * n2)
    Q_tau = (c_ij + c_ji) / sqrt
    q_tau = (c_ij - c_ji) / sqrt
    return Q_tau, q_tau

Using Reinderien's script to benchmark against its two fastest solutions:

d=50% trace_eps: 1.37 ms sorted_eps: 8.33 ms Kelly_eps: 0.19 ms 
d=55% trace_eps: 0.80 ms sorted_eps: 7.10 ms Kelly_eps: 0.17 ms 
d=61% trace_eps: 1.20 ms sorted_eps: 5.79 ms Kelly_eps: 0.19 ms 
d=66% trace_eps: 1.25 ms sorted_eps: 4.20 ms Kelly_eps: 0.22 ms 
d=72% trace_eps: 0.98 ms sorted_eps: 2.50 ms Kelly_eps: 0.19 ms 
d=77% trace_eps: 0.73 ms sorted_eps: 1.62 ms Kelly_eps: 0.13 ms 
d=83% trace_eps: 0.73 ms sorted_eps: 1.23 ms Kelly_eps: 0.12 ms 
d=88% trace_eps: 0.72 ms sorted_eps: 0.65 ms Kelly_eps: 0.12 ms 
d=94% trace_eps: 0.71 ms sorted_eps: 0.37 ms Kelly_eps: 0.12 ms 
d=99% trace_eps: 0.69 ms sorted_eps: 0.07 ms Kelly_eps: 0.12 ms
answered Aug 17, 2021 at 15:24
\$\endgroup\$
2
  • \$\begingroup\$ Thank you for this solution. In multiple tests, your solution gave higher runtimes than the original and Reinderien's solution. For one particular set the runtimes were: Original: 434 secs, Reinderien: 424 secs, Kelly: 598 secs \$\endgroup\$ Commented Aug 20, 2021 at 9:52
  • \$\begingroup\$ @skynaive Well it very much depends on the parameters/context. I asked about them under your question, but so far you ignored that. \$\endgroup\$ Commented Aug 20, 2021 at 10:30
2
\$\begingroup\$

(This might not be 100% relevant since it's not about Python, and it's about just the intersection part, not the harder within-threshold-distance matching part. But maybe at least food for thought.)


matching_idx is the list of all places where both inputs have a 0, I think? You don't actually need that list, just a count of how many positions have ts_1[i] and ts_2[i] both zero. It would be very fast to compute that from the original inputs, like count_zeros( ts_1 | ts_2 ), instead of using intersection on two lists of indices.

That wouldn't share any work with finding nearby indices, though, so you'd probably still need to generate the lists of indices where the inputs are zero. Still, it might be cheaper than the intersection part.

(IDK how to get numpy to do this (I don't know numpy), but it's straightforward in C with SIMD intrinsics, or any other way of getting a programming language to compile or JIT to machine code with SIMD instructions that gives you a similar level of control.)

To make it efficient, do the OR on the fly (one SIMD vector at a time, not storing the result in a new array), and compute count_zeros as len - count_ones(ts1|ts2).

If you store your input elements as packed bitmaps, then you can OR 128 or 256 bits per 16 or 32-byte SIMD vector, and efficiently popcount (https://github.com/WojciechMula/sse-popcount/). Some SIMD popcount algorithms would work equally efficiently to count zeros directly, in which case you can do that, but out of the box you'll generally find optimized code to count ones, hence my suggestion.

Or if you store your input arrays as unpacked 8-bit or 32-bit integer elements, you can just sum += ts_1[i] | ts_2[i] (ideally SIMD vectorized). To efficiently vectorize a sum of 8-bit 0 / 1 elements, see https://stackoverflow.com/questions/54541129/how-to-count-character-occurrences-using-simd - very similar problem to accumulating _mm256_cmpeq_epi8 results (0 / -1).

For 32-bit elements, you don't need to widen to avoid overflow when summing the OR results, but your data density would be very low (4x or 32x lower than possible), so hopefully you aren't doing that.

8000 bits is only 31.25 SIMD vectors of 256 bits each, or 1000 bytes, so this should be blazing fast, like maybe 100 clock cycles or so with the data already hot in L1d cache. (https://github.com/WojciechMula/sse-popcount/blob/master/results/skylake/skylake-i7-6700-gcc5.3.0-avx2.rst#input-size-1024b)
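For reference, the whole-array version of that count can be written at a high level in NumPy (a sketch of the idea only, assuming the two series are already integer NumPy arrays; it won't hit the SIMD speeds described above, but it shows the shape of the computation):

import numpy as np

ts_1 = np.array([0, 1, 1, 0, 0, 1, 1, 0], dtype=np.uint8)
ts_2 = np.array([1, 1, 1, 1, 1, 1, 1, 0], dtype=np.uint8)

# positions where both series are zero are exactly where their OR is still zero
n_matches = np.count_nonzero((ts_1 | ts_2) == 0)
# equivalently: ts_1.size - np.count_nonzero(ts_1 | ts_2)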

answered Aug 17, 2021 at 21:35
\$\endgroup\$
3
  • \$\begingroup\$ Thank you for posting your answer. Your answer comments about things I've never heard of, and you're talking about things I can't think of a way to implement in Python. Your answer is the first compelling argument I've seen against Python (wrt being a high-level language). However I don't think your answer is inappropriate as using C is an accepted practice when you need speed Python can't get you. I'd say it's idiomatic Python, but given I, like many other Python programmers, can't program in C, it may be a stretch to say idiomatic... \$\endgroup\$ Commented Aug 17, 2021 at 21:55
  • \$\begingroup\$ @Peilonrayz: Just to be clear, count_zeros( ts_1 | ts_2 ) is pseudo-code, not something you can literally write in C. You can't OR whole arrays, you need to loop over them. (You could do basically that with C++ std::bitset<8000> - it does overload the | operator and has a .count() method to count ones which hopefully uses an optimized popcnt under the hood.) \$\endgroup\$ Commented Aug 17, 2021 at 21:59
  • 1
    \$\begingroup\$ FWIW I found that to be clear. The reason I mentioned C specifically, is because C is a common way to speed up Python. Such as in much of std-lib like bisect/_bisect or numpy which are coincidentally mentioned in the above answers. \$\endgroup\$ Commented Aug 17, 2021 at 22:07
1
\$\begingroup\$

The closest you come to documentation is telling us that you are calculating a statistic. What is this statistic? I tried searching for "Q statistic", "tau statistic" and "eps statistic", but didn't find anything that seemed to match what you've done. You may think you know what it's doing and would not forget, but you will completely forget in six months time, and when you come back to this you'll have as much knowledge as us. Your explanation of the input variables might as well go into a docstring.

Intersecting the event indices and subtracting x and y suggests that the input arrays have the same length. That should be documented if true.

You should know what a variable is just by its name, without looking back at how you defined it. eps, Q_tau, q_tau, and tau might pass that test (depending on what "this statistic" actually is). ts_1, ts_2, event_index1, event_index2, n1, n2, matching_idx, c_ij, c_ji, x and y don't. I would suggest naming them input_array_1, input_array_2, zero_indices_of_1, zero_indices_of_2, num_zeros_1, num_zeros_2, zero_indices_of_both, above_diagonal, below_diagonal, index_1, and index_2. (Did you notice that you knew what all of those should be except for the diagonals?)

It appears you aren't using pandas, so let's not import it.

I like to return early if possible. You could have if (n1 == 0 or n2 == 0): return 0, 0 near the top. This would free you from declaring Q_tau and q_tau at the beginning and allow you to remove a level of indentation at the end; a sketch follows below.
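A sketch of how the top of eps could look with a docstring, the suggested names, and the early return (the docstring wording is a guess at the intent):

import numpy as np

def eps(input_array_1, input_array_2, tau):
    """Statistic measuring how often zeros (events) in two equal-length 0/1
    timeseries co-occur within a lag of tau samples. Returns (Q_tau, q_tau)."""
    zero_indices_of_1, = np.where(np.asarray(input_array_1) == 0)
    zero_indices_of_2, = np.where(np.asarray(input_array_2) == 0)
    num_zeros_1 = zero_indices_of_1.shape[0]
    num_zeros_2 = zero_indices_of_2.shape[0]
    if num_zeros_1 == 0 or num_zeros_2 == 0:
        return 0, 0
    ...  # rest of the calculation, now one indentation level shallower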

answered Aug 18, 2021 at 2:08
\$\endgroup\$
