I have the following code for determining TP, TN, FP, and FN values for binary classification, given two sparse vectors as input (using the sparse library):
import numpy as np

def confused(sys1, ann1):
    # True Positive (TP): we predict a label of 1 (positive), and the true label is 1.
    TP = np.sum(np.logical_and(ann1 == 1, sys1 == 1))
    # True Negative (TN): we predict a label of 0 (negative), and the true label is 0.
    TN = np.sum(np.logical_and(ann1 == 0, sys1 == 0))
    # False Positive (FP): we predict a label of 1 (positive), but the true label is 0.
    FP = np.sum(np.logical_and(ann1 == 0, sys1 == 1))
    # False Negative (FN): we predict a label of 0 (negative), but the true label is 1.
    FN = np.sum(np.logical_and(ann1 == 1, sys1 == 0))
    return TP, TN, FP, FN
I'm trying to find a way to optimize this for speed. It is based on how-to-compute-truefalse-positives-and-truefalse-negatives-in-python-for-binary-classification-problems; my addition was to use sparse arrays to reduce memory usage, since the input vectors for the problem I'm currently trying to solve have over 7.9 million elements, and the positive cases (i.e., 1) are few and far between relative to the negative cases (i.e., 0).
I've done profiling of my code and about half the time is spent in this method.
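For context, here is a minimal sketch of how such inputs might be constructed with the pydata/sparse COO class; the length, the coordinates, and the expected counts below are made up purely for illustration, and the confused function above is assumed to be in scope:

import numpy as np
import sparse

n = 7_900_000                                    # illustrative length
pred_idx = np.array([[10, 500, 123_456]])        # positions predicted as 1 (made up)
true_idx = np.array([[10, 123_456, 7_000_000]])  # positions that are truly 1 (made up)
sys1 = sparse.COO(pred_idx, data=np.ones(3, dtype=np.int8), shape=(n,))
ann1 = sparse.COO(true_idx, data=np.ones(3, dtype=np.int8), shape=(n,))

# With these toy inputs we would expect TP=2, TN=7_899_996, FP=1, FN=1.
print(confused(sys1, ann1))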
- If you compute the first 3 metrics, then the last can be a simple subtraction (see the sketch below). – Ted Brownlow
- Nice! That shaved ~20 seconds off the processing time with no hit on memory. – horcle_buzz
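A minimal sketch of the subtraction trick from the comment above, assuming 1-D inputs and that numpy is imported as np; only the derivation of FN from the total is new here:

TP = np.sum(np.logical_and(ann1 == 1, sys1 == 1))
TN = np.sum(np.logical_and(ann1 == 0, sys1 == 0))
FP = np.sum(np.logical_and(ann1 == 0, sys1 == 1))
# The four counts partition the vector, so the last one is plain arithmetic:
FN = ann1.shape[0] - TP - TN - FP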
1 Answer
Well, an obvious improvement is to not redo work. You are currently doing twice as many comparisons as needed because you don't save their results:
def confused(sys1, ann1):
    predicted_true, predicted_false = sys1 == 1, sys1 == 0
    true_true, true_false = ann1 == 1, ann1 == 0
    # True Positive (TP): we predict a label of 1 (positive), and the true label is 1.
    TP = np.sum(np.logical_and(true_true, predicted_true))
    # True Negative (TN): we predict a label of 0 (negative), and the true label is 0.
    TN = np.sum(np.logical_and(true_false, predicted_false))
    # False Positive (FP): we predict a label of 1 (positive), but the true label is 0.
    FP = np.sum(np.logical_and(true_false, predicted_true))
    # False Negative (FN): we predict a label of 0 (negative), but the true label is 1.
    FN = np.sum(np.logical_and(true_true, predicted_false))
    return TP, TN, FP, FN
This should speed up the calculation, at the cost of keeping things in memory slightly longer. Make sure you have enough memory available.
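To verify the gain on your actual data, a quick comparison with the standard library's timeit might look like the following sketch; confused_original and confused_cached are hypothetical names for the question's version and the one above, and sys1 and ann1 are your real vectors:

import timeit

t_orig = timeit.timeit(lambda: confused_original(sys1, ann1), number=5)
t_new = timeit.timeit(lambda: confused_cached(sys1, ann1), number=5)
print(f"original: {t_orig:.2f}s, cached comparisons: {t_new:.2f}s")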
I'm not sure I got your true and predicted labels right, which goes to show that ann1 and sys1 are really bad names. Something like true and predicted would be vastly more readable. And while you're at it, write out the other variables as well. Characters don't cost extra.
np.logical_and works perfectly fine with integers (at least on normal numpy vectors; you should check that this is also the case for sparse vectors, as in the quick check sketched further below), so as long as your vectors can only contain 0 or 1, you can directly use the input vectors and save half the memory:
not_true = np.logical_not(true)
not_predicted = np.logical_not(predicted)
true_positive = np.sum(np.logical_and(true, predicted))
true_negative = np.sum(np.logical_and(not_true, not_predicted))
false_positive = np.sum(np.logical_and(not_true, predicted))
false_negative = np.sum(np.logical_and(true, not_predicted))
Note that you cannot use ~true and ~predicted here: on integer arrays, ~ is the bitwise NOT (so ~0 is -1 and ~1 is -2), not a logical negation, which would leave every element truthy.
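A quick sanity check along those lines, assuming the pydata/sparse COO class; the tiny vectors here are made up, and we simply compare against the dense result:

import numpy as np
import sparse

dense_true = np.array([0, 1, 0, 1, 0], dtype=np.int8)
dense_pred = np.array([0, 1, 1, 0, 0], dtype=np.int8)
sp_true = sparse.COO.from_numpy(dense_true)
sp_pred = sparse.COO.from_numpy(dense_pred)

# If logical_and dispatches correctly on the sparse inputs, the counts agree.
assert np.sum(np.logical_and(sp_true, sp_pred)) == np.sum(np.logical_and(dense_true, dense_pred))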
Finally, note that I could not find proof for or against np.logical_and (or np.logical_not, for that matter) being implemented efficiently for sparse vectors, so maybe that is where you actually lose speed. In that case, go and implement it in C and/or Cython and write a pull request for sparse, I guess...
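Alternatively, since the positives are rare, one can sidestep the elementwise operations entirely and count from the nonzero coordinates. A sketch, assuming 1-D pydata/sparse COO inputs that contain only 0s and 1s (coords is part of the COO API; the function name is mine):

import numpy as np

def confused_by_coords(sys1, ann1):
    n = ann1.shape[0]
    pred_pos = sys1.coords[0]                      # indices predicted as 1
    true_pos = ann1.coords[0]                      # indices that are truly 1
    TP = np.intersect1d(pred_pos, true_pos).size   # positive in both
    FP = pred_pos.size - TP                        # predicted 1, truly 0
    FN = true_pos.size - TP                        # predicted 0, truly 1
    TN = n - TP - FP - FN                          # everything else
    return TP, TN, FP, FN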
- By Jove! This shaved off another ~15 seconds on my test data. I did try using Cython, btw, with no real increase in speed. – horcle_buzz