I have a function in my program that works out a correlation coefficient. It takes two flat (one-dimensional) numpy arrays and performs the necessary calculations on them to work out the correlation between the two arrays of numbers (they are currency values, of type float). This function is executed 136 times per loop, taking around 0.05 seconds per loop, for as long as the program runs. The following code calculates the coefficient as expected:
import numpy as np
from math import sqrt

def CC(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    ab = np.sum(a*b)
    asq = np.sum(a**2)
    bsq = np.sum(b**2)
    cc = round(ab / sqrt(asq * bsq), 5)
    return cc
However it eventually results in a memory leak. A solution to this memory leak is to change the function to:
def CC(a, b):
    cc = round(np.sum((a - np.mean(a)) * (b - np.mean(b))) / sqrt(np.sum(a**2) * np.sum(b**2)), 5)
    return cc
This works it all out in one line and doesn't create any new arrays, hence saving memory and avoiding the leak.
However, for some bizarre reason, when using method 2, the returned value starts at roughly 0.1 and then trends down to 0 over the course of about 20 seconds, and then stays at 0 from then on. This happens every time without fail. I've tried alternatives to method 2 as well, i.e. splitting it into 1 or 2 extra calculation steps - same result. I've isolated all possible sources of error by process of elimination and it has all boiled down to what's happening inside the function itself, so the problem has to be there. What on earth could be causing this? It's as though the function CC disregards the inputs it's given, if it's set up a certain way.
1 Answer
Your two versions aren't equivalent: the first one reassigns a and b in the first step:
a = a - np.mean(a)
b = b - np.mean(b)
and all subsequent operations use the updated a and b. Your second approach, however, ignores the updated values in the sqrt term:
sqrt(np.sum(a**2) * np.sum(b**2))
For it to be equivalent, the sqrt term would have to be:
sqrt(np.sum((a-a.mean())**2) * np.sum((b-b.mean())**2))
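To see that the two denominators really differ, here's a minimal sketch with made-up numbers:

```python
import numpy as np
from math import sqrt

a = np.array([1.0, 2.0, 4.0])
b = np.array([3.0, 1.0, 5.0])

ac = a - a.mean()   # centered copies, as in the first function
bc = b - b.mean()

wrong = sqrt(np.sum(a**2) * np.sum(b**2))    # second function: raw arrays
right = sqrt(np.sum(ac**2) * np.sum(bc**2))  # first function: centered arrays
print(wrong, right)  # the denominators differ, so the results differ
```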
Some additional comments:
Which works it all out in one line and doesn't create any new lists, hence saving memory.
That's not true (at least not always); it will still produce new temporary arrays. But I can see two places where you could avoid creating an intermediate array:
np.subtract(a, a.mean(), out=a)
# instead of "a = a - np.mean(a)"
# "a -= a.mean()" should also avoid the temporary array, but I'm not 100% sure.
The same applies to b = b - np.mean(b).
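A small sketch showing that out= reuses the existing buffer; note the caveat that it modifies the caller's array in place:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
buf_before = a.__array_interface__['data'][0]  # address of the data buffer

np.subtract(a, a.mean(), out=a)  # centers a without allocating a new array

buf_after = a.__array_interface__['data'][0]
print(buf_before == buf_after)  # True: the same buffer was reused
print(a)                        # the caller's array has been changed in place
```

Because the caller's array is modified, this is only safe if you don't need the original values afterwards.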
However it eventually results in a memory leak.
I can't find any evidence for a memory leak in the first function.
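If you want to check this yourself, one way is Python's tracemalloc module. This sketch runs the first function repeatedly and measures allocation growth between two snapshots; if the temporaries leaked, the growth would be on the order of megabytes:

```python
import tracemalloc
import numpy as np
from math import sqrt

def CC(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    ab = np.sum(a * b)
    asq = np.sum(a ** 2)
    bsq = np.sum(b ** 2)
    return round(ab / sqrt(asq * bsq), 5)

x = np.random.random(10000)
y = np.random.random(10000)

tracemalloc.start()
for _ in range(50):       # warm up before the first snapshot
    CC(x, y)
first = tracemalloc.take_snapshot()
for _ in range(1000):
    CC(x, y)
second = tracemalloc.take_snapshot()
tracemalloc.stop()

growth = sum(s.size_diff for s in second.compare_to(first, 'lineno'))
print(growth)  # should stay small if nothing leaks
```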
If you care about intermediate arrays you can always do the operation yourself. I show it with numba but this can be easily ported to cython or similar (but I don't need to add the type annotations):
import numpy as np
import numba as nb
from math import sqrt

@nb.njit
def CC_helper(a, b):
    sum_ab = 0.
    sum_aa = 0.
    sum_bb = 0.
    for idx in range(a.size):
        sum_ab += a[idx] * b[idx]
        sum_aa += a[idx] * a[idx]
        sum_bb += b[idx] * b[idx]
    return sum_ab / sqrt(sum_aa * sum_bb)

def CC1(a, b):
    np.subtract(a, a.mean(), out=a)
    np.subtract(b, b.mean(), out=b)
    res = CC_helper(a, b)
    return round(res, 5)
I compared the performance to your two functions:
def CC2(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    ab = np.sum(a*b)
    asq = np.sum(a**2)
    bsq = np.sum(b**2)
    cc = round(ab / sqrt(asq * bsq), 5)
    return cc

def CC3(a, b):
    cc = round(np.sum((a - np.mean(a)) * (b - np.mean(b))) / sqrt(np.sum((a - np.mean(a))**2) * np.sum((b - np.mean(b))**2)), 5)
    return cc
I made sure the results were the same (note that CC1 modifies its input arrays in place) and timed them:
arr1 = np.random.random(100000)
arr2 = np.random.random(100000)

assert CC1(arr1, arr2) == CC2(arr1, arr2)
assert CC1(arr1, arr2) == CC3(arr1, arr2)
%timeit CC1(arr1, arr2) # 100 loops, best of 3: 2.06 ms per loop
%timeit CC2(arr1, arr2) # 100 loops, best of 3: 5.98 ms per loop
%timeit CC3(arr1, arr2) # 100 loops, best of 3: 7.98 ms per loop
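As a side note: what CC computes (before the rounding) is the Pearson correlation coefficient, which NumPy already provides as np.corrcoef. A quick sanity check, where cc_unrounded is just your first function without the rounding:

```python
import numpy as np
from math import sqrt

def cc_unrounded(a, b):
    a = a - np.mean(a)
    b = b - np.mean(b)
    return np.sum(a * b) / sqrt(np.sum(a**2) * np.sum(b**2))

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

# corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the correlation between x and y
print(np.isclose(cc_unrounded(x, y), np.corrcoef(x, y)[0, 1]))  # True
```

So unless the rounding matters, you could also just use np.corrcoef(a, b)[0, 1] directly.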