Given 2 numpy arrays a1 and a2 (composed only of 0's and 1's):

1. Find the index locations of all 0's in a1 and a2.
2. Find the matching index positions, if any, between a1 and a2.
3. Calculate a metric.
Both a1 and a2 have shape (2161,).
e.g.

a1 = [0,1,1,0,1,0,1]
a2 = [1,1,0,0,1,1,0]

The indices of all 0's in a1 are 0, 3, and 5. The indices of all 0's in a2 are 2, 3, and 6. The only common index between a1 and a2 is thus 3.
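The worked example above can be reproduced directly with NumPy (a minimal sketch of steps 1 and 2):

```python
import numpy as np

a1 = np.array([0, 1, 1, 0, 1, 0, 1])
a2 = np.array([1, 1, 0, 0, 1, 1, 0])

idx1 = np.flatnonzero(a1 == 0)       # indices of all 0's in a1 -> [0 3 5]
idx2 = np.flatnonzero(a2 == 0)       # indices of all 0's in a2 -> [2 3 6]
common = np.intersect1d(idx1, idx2)  # matching index positions -> [3]
```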
function_1 performs steps 1, 2, and 3. function1_iterations repeats function_1, randomly shuffling a1 before each metric calculation, 1000 times. This is for the purpose of finding whether the metric is statistically significant.
I run the code below on 100 million array pairs, multiprocessed across 256 cores. The best runtime for 100 million array pairs is about 40 minutes. Is there any way I can make it significantly more efficient? I need to run this on billions of array pairs.

My code below is the fastest I could come up with, with some earlier help from people on Code Review:
def function_1(self, a1, a2):
    event_index1, = np.where(a1 == 0)
    event_index2, = np.where(a2 == 0)
    n1, = event_index1.shape
    n2, = event_index2.shape
    if n1 == 0 or n2 == 0:
        return 0
    n_matches, = np.intersect1d(event_index1, event_index2, assume_unique=True).shape
    c_ij = c_ji = n_matches / 2
    metric_1 = (c_ij + c_ji) / math.sqrt(n1 * n2)
    return metric_1

def function1_iterations(self, a1, a2, original_metric1, repeat=1000):
    list_metric1 = []
    a1_copy = copy.deepcopy(a1)
    for i in range(repeat):
        np.random.shuffle(a1_copy)  # shuffle bits in array and recalculate 1000 times
        metric_1 = self.function_1(a1=a1_copy, a2=a2)
        list_metric1.append(metric_1)
    list_metric1 = np.array(list_metric1)
    significance_val = len(np.where(list_metric1 >= original_metric1)[0]) / repeat
    return significance_val
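As an aside on the shuffling loop: NumPy's Generator.permuted shuffles each row of a matrix independently, so all `repeat` permutations of a1 can be drawn in a single call instead of a Python loop. A minimal sketch (the function name and seed handling are illustrative, not from the original code):

```python
import numpy as np

def function1_iterations_vec(a1, a2, original_metric1, repeat=1000, seed=0):
    # Vectorised permutation test: all `repeat` shuffles of a1 at once.
    rng = np.random.default_rng(seed)
    n1 = np.count_nonzero(a1 == 0)
    n2 = np.count_nonzero(a2 == 0)
    if n1 == 0 or n2 == 0:
        return 0.0
    # Build a (repeat, size) matrix; rng.permuted with axis=1 shuffles
    # the values within each row independently.
    shuffled = rng.permuted(np.tile(a1, (repeat, 1)), axis=1)
    # Matches per row: positions where both the shuffled a1 and a2 are 0.
    n_matches = np.count_nonzero((shuffled == 0) & (a2 == 0), axis=1)
    metrics = n_matches / np.sqrt(n1 * n2)  # n1 is unchanged by shuffling
    return np.count_nonzero(metrics >= original_metric1) / repeat
```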
2 Answers
Ignoring your function1_iterations: function_1 should call neither where nor intersect1d. The numerator of your metric expression just evaluates to n_matches. A faster vectorised approach is to perform the equivalent of sum(!a1 & !a2), in either the integer or the boolean domain:
import math
from timeit import timeit

import numpy as np


def function_1_old(a1, a2):
    event_index1, = np.where(a1 == 0)
    event_index2, = np.where(a2 == 0)
    n1, = event_index1.shape
    n2, = event_index2.shape
    if n1 == 0 or n2 == 0:
        return 0
    n_matches, = np.intersect1d(event_index1, event_index2, assume_unique=True).shape
    c_ij = c_ji = n_matches / 2
    metric_1 = (c_ij + c_ji) / math.sqrt(n1 * n2)
    return metric_1


def function_1_arith(a1: np.ndarray, a2: np.ndarray) -> float:
    zeros1 = 1 - a1
    zeros2 = 1 - a2
    n1 = np.sum(zeros1)
    n2 = np.sum(zeros2)
    if n1 == 0 or n2 == 0:
        return 0
    n_matches = np.sum(zeros1 * zeros2)
    return n_matches / np.sqrt(n1 * n2)


def function_1_bool(a1: np.ndarray, a2: np.ndarray) -> float:
    zeros1 = np.logical_not(a1)
    zeros2 = np.logical_not(a2)
    n1 = np.sum(zeros1)
    n2 = np.sum(zeros2)
    if n1 == 0 or n2 == 0:
        return 0
    n_matches = np.sum(np.logical_and(zeros1, zeros2))
    return n_matches / np.sqrt(n1 * n2)


def function_1_demorgan(a1: np.ndarray, a2: np.ndarray) -> float:
    n1 = a1.size - np.sum(a1)
    n2 = a2.size - np.sum(a2)
    if n1 == 0 or n2 == 0:
        return 0
    n_matches = np.sum(np.logical_not(np.logical_or(a1, a2)))
    return n_matches / np.sqrt(n1 * n2)


def function_1_count(a1: np.ndarray, a2: np.ndarray) -> float:
    n1 = a1.size - np.count_nonzero(a1)
    n2 = a2.size - np.count_nonzero(a2)
    if n1 == 0 or n2 == 0:
        return 0
    n_matches = np.count_nonzero(np.logical_not(np.logical_or(a1, a2)))
    return n_matches / np.sqrt(n1 * n2)


def function_1_binary(a1: np.ndarray, a2: np.ndarray) -> float:
    n1 = a1.size - np.count_nonzero(a1)
    n2 = a2.size - np.count_nonzero(a2)
    if n1 == 0 or n2 == 0:
        return 0
    n_matches = np.count_nonzero(~(a1 | a2))
    return n_matches / np.sqrt(n1 * n2)


def test() -> None:
    methods = (function_1_old, function_1_arith, function_1_bool,
               function_1_demorgan, function_1_count, function_1_binary)

    a1 = np.array((0, 1, 1, 0, 1, 0, 1), dtype=bool)
    a2 = np.array((1, 1, 0, 0, 1, 1, 0), dtype=bool)
    # Testing against expected results, with OP's original sample arrays
    for method in methods:
        assert np.isclose(1/3, method(a1, a2), rtol=0, atol=1e-12)

    rand = np.random.default_rng(seed=0)  # for reproducible results

    def sample():
        return rand.integers(low=0, high=2, size=(10_000,), dtype=bool)

    a1, a2 = sample(), sample()

    # Test for regression with bigger sample arrays
    reference = None
    for method in methods:
        result = method(a1, a2)
        if method is function_1_old:
            reference = result
        else:
            assert np.isclose(reference, result, rtol=0, atol=1e-12)

    # Performance measurement with those arrays
    n = 1_000
    for method in methods:
        dur = timeit(lambda: method(a1, a2), number=n)
        print(f'{method.__name__}: {dur/n*1e6:.1f} us')


'''
function_1_old: 386.6 us
function_1_arith: 54.8 us
function_1_bool: 44.2 us
function_1_demorgan: 43.1 us
function_1_count: 6.9 us
function_1_binary: 6.7 us
'''

if __name__ == '__main__':
    test()
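One further direction, given the billions of pairs mentioned in the question (a hedged sketch of a standard NumPy technique, not part of the code measured above): batch many pairs into 2-D boolean arrays and compute every metric in a single pass, which amortises the per-call NumPy overhead that dominates at this array size:

```python
import numpy as np

def function_1_batched(A1, A2):
    # A1, A2: bool arrays of shape (n_pairs, n_bits); row i of A1 is
    # paired with row i of A2. Returns an array of n_pairs metrics.
    z1 = ~A1
    z2 = ~A2
    n1 = np.count_nonzero(z1, axis=1)
    n2 = np.count_nonzero(z2, axis=1)
    n_matches = np.count_nonzero(z1 & z2, axis=1)
    denom = np.sqrt(n1 * n2)
    out = np.zeros(len(A1))
    # Rows with no zeros in either array get metric 0, as in the originals.
    np.divide(n_matches, denom, out=out, where=denom > 0)
    return out
```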
- skynaive (commented Jan 9, 2022): \$\begingroup\$ Thanks for this comparison. I went ahead with the binary bool array version. I was wondering how to approach such efficiency problems in the future, e.g. knowing that the binary version is much faster than the one I was naively using. \$\endgroup\$
- Reinderien (commented Jan 9, 2022): \$\begingroup\$ A mix of knowledge (full-matrix vectorisation is generally faster than indexing) and guessing (are bitwise operators faster than boolean operators?) backed up by measurement. Always, always measure. \$\endgroup\$
I'm not 100% confident that I understand your intent, but looking at your code it seems like all we care about is the number of zeroes in each list and the number of times those zeroes line up by index. The actual indexes themselves don't seem to matter.
I found that just ignoring the arrays, and treating the whole thing as a simple list problem, was more than twice as fast.
def function_1_new(a1, a2):
    n_matches = 0
    n_zeroes_a1 = 0
    n_zeroes_a2 = 0
    for sub_a1, sub_a2 in zip(a1, a2):
        if sub_a1 == sub_a2 == 0:
            n_matches += 1
            n_zeroes_a1 += 1
            n_zeroes_a2 += 1
        else:
            if sub_a1 == 0:
                n_zeroes_a1 += 1
            if sub_a2 == 0:
                n_zeroes_a2 += 1
    if n_zeroes_a1 == 0 or n_zeroes_a2 == 0:  # avoid division by zero
        return 0
    c_ij = c_ji = n_matches / 2
    metric_1 = (c_ij + c_ji) / math.sqrt(n_zeroes_a1 * n_zeroes_a2)
    return metric_1
Running your original function 1,000,000 times took me 26.5 seconds on average. The more naïve approach above averaged around 11 seconds. And here's how I tested to make sure the new code produces the same results as the old code.
def function1_iterations_sameness(a1, a2, original_metric1, repeat=1000000):
    a1_copy = copy.deepcopy(a1)
    for i in range(repeat):
        np.random.shuffle(a1_copy)  # shuffle bits in array and recalculate
        metric_1_old = function_1(a1=a1_copy, a2=a2)
        metric_1_new = function_1_new(a1_copy, a2)
        assert metric_1_old == metric_1_new
- Reinderien (commented Jan 7, 2022): \$\begingroup\$ This implementation is logically correct but in terms of performance it's a disaster. I encourage you to revisit your performance check. Where the fastest method I evaluated completes in 7.8 microseconds, yours took almost four thousand times (!) as long. \$\endgroup\$
- \$\begingroup\$ …bool. Otherwise, bitwise operations will produce different results. \$\endgroup\$