I have a function in my program that works out a correlation coefficient. It takes two flat (one-dimensional) numpy arrays and performs the necessary calculations on them to work out the correlation between the two arrays of numbers (they are currency values, of type float). This function is executed 136 times per loop, taking around 0.05 seconds per loop, for as long as the program runs. The following code calculates the coefficient as expected:
import numpy as np
from math import sqrt

def CC(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    ab = np.sum(a*b)
    asq = np.sum(a**2)
    bsq = np.sum(b**2)
    cc = round(ab / sqrt(asq * bsq), 5)
    return cc
However it eventually results in a memory leak. A solution to this memory leak is to change the function to:
def CC(a, b):
    cc = round(np.sum((a - np.mean(a)) * (b - np.mean(b))) / sqrt(np.sum(a**2) * np.sum(b**2)), 5)
    return cc
This works it all out in one line and doesn't create any new arrays, hence saving memory and avoiding the leak.
However, for some bizarre reason, when using method 2, the returned value starts at roughly 0.1 and then trends down to 0 over the course of about 20 seconds, and then stays at 0 from then on. This happens every time without fail. I've tried alternatives to method 2 as well, i.e. splitting it into 1 or 2 extra calculation steps - same result. I've isolated all possible sources of error by process of elimination and it has all boiled down to what's happening inside the function itself, so the problem has to be there. What on earth could be causing this? It's as though the function CC disregards the inputs it's given, if it's set up a certain way.
1 Answer
Your two versions aren't equivalent: the first one reassigns a and b in the first step:
a = a - np.mean(a)
b = b - np.mean(b)
and all subsequent operations use the updated a and b. Your second approach, however, ignores the updated values in the sqrt term:
sqrt(np.sum(a**2) * np.sum(b**2))
For it to be equivalent, the sqrt term would have to be:
sqrt(np.sum((a-a.mean())**2) * np.sum((b-b.mean())**2))
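To see that the two denominators really differ, here's a minimal sketch with made-up numbers:

```python
import numpy as np
from math import sqrt

a = np.array([1.0, 2.0, 4.0])
b = np.array([3.0, 1.0, 5.0])

ac = a - a.mean()   # centered copies, as in the first function
bc = b - b.mean()

wrong = sqrt(np.sum(a**2) * np.sum(b**2))    # second function: raw arrays
right = sqrt(np.sum(ac**2) * np.sum(bc**2))  # first function: centered arrays
print(wrong, right)  # the denominators differ, so the results differ
```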
Some additional comments:
Which works it all out in one line and doesn't create any new lists, hence saving memory.
That's not true (at least not always); it will still produce new temporary arrays. But I can see two places where you could avoid creating an intermediate array:
np.subtract(a, a.mean(), out=a)
# instead of "a = a - np.mean(a)"
# "a -= a.mean()" should also avoid the temporary array, but I'm not 100% sure.
The same applies to b = b - np.mean(b).
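A small sketch showing that out= reuses the existing buffer; note the caveat that it modifies the caller's array in place:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
buf_before = a.__array_interface__['data'][0]  # address of the data buffer

np.subtract(a, a.mean(), out=a)  # centers a without allocating a new array

buf_after = a.__array_interface__['data'][0]
print(buf_before == buf_after)  # True: the same buffer was reused
print(a)                        # the caller's array has been changed in place
```

Because the caller's array is modified, this is only safe if you don't need the original values afterwards.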
However it eventually results in a memory leak.
I can't find any evidence for a memory leak in the first function.
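If you want to check this yourself, one way is Python's tracemalloc module. This sketch runs the first function repeatedly and measures allocation growth between two snapshots; if the temporaries leaked, the growth would be on the order of megabytes:

```python
import tracemalloc
import numpy as np
from math import sqrt

def CC(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    ab = np.sum(a * b)
    asq = np.sum(a ** 2)
    bsq = np.sum(b ** 2)
    return round(ab / sqrt(asq * bsq), 5)

x = np.random.random(10000)
y = np.random.random(10000)

tracemalloc.start()
for _ in range(50):       # warm up before the first snapshot
    CC(x, y)
first = tracemalloc.take_snapshot()
for _ in range(1000):
    CC(x, y)
second = tracemalloc.take_snapshot()
tracemalloc.stop()

growth = sum(s.size_diff for s in second.compare_to(first, 'lineno'))
print(growth)  # should stay small if nothing leaks
```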
If you care about intermediate arrays you can always do the operation yourself. I show it with numba but this can be easily ported to cython or similar (but I don't need to add the type annotations):
import numpy as np
import numba as nb
from math import sqrt

@nb.njit
def CC_helper(a, b):
    sum_ab = 0.
    sum_aa = 0.
    sum_bb = 0.
    for idx in range(a.size):
        sum_ab += a[idx] * b[idx]
        sum_aa += a[idx] * a[idx]
        sum_bb += b[idx] * b[idx]
    return sum_ab / sqrt(sum_aa * sum_bb)

def CC1(a, b):
    np.subtract(a, a.mean(), out=a)
    np.subtract(b, b.mean(), out=b)
    res = CC_helper(a, b)
    return round(res, 5)
I compared the performance to your two functions:
def CC2(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    ab = np.sum(a*b)
    asq = np.sum(a**2)
    bsq = np.sum(b**2)
    cc = round(ab / sqrt(asq * bsq), 5)
    return cc

def CC3(a, b):
    cc = round(np.sum((a - np.mean(a)) * (b - np.mean(b))) / sqrt(np.sum((a - np.mean(a))**2) * np.sum((b - np.mean(b))**2)), 5)
    return cc
I made sure the results were the same (note that CC1 modifies its input arrays in place) and timed them:
arr1 = np.random.random(100000)
arr2 = np.random.random(100000)

assert CC1(arr1, arr2) == CC2(arr1, arr2)
assert CC1(arr1, arr2) == CC3(arr1, arr2)
%timeit CC1(arr1, arr2) # 100 loops, best of 3: 2.06 ms per loop
%timeit CC2(arr1, arr2) # 100 loops, best of 3: 5.98 ms per loop
%timeit CC3(arr1, arr2) # 100 loops, best of 3: 7.98 ms per loop
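As a side note: what CC computes (before the rounding) is the Pearson correlation coefficient, which NumPy already provides as np.corrcoef. A quick sanity check, where cc_unrounded is just your first function without the rounding:

```python
import numpy as np
from math import sqrt

def cc_unrounded(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    return np.sum(a * b) / sqrt(np.sum(a**2) * np.sum(b**2))

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

# corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the correlation between x and y
print(np.isclose(cc_unrounded(x, y), np.corrcoef(x, y)[0, 1]))  # True
```

So unless the rounding matters, you could also just use np.corrcoef(a, b)[0, 1] directly.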