I have the following code for determining TP, TN, FP, and FN values for binary classification, given two sparse vectors as input (using the sparse library):
import numpy as np

def confused(sys1, ann1):
    # True Positive (TP): we predict a label of 1 (positive), and the true label is 1.
    TP = np.sum(np.logical_and(ann1 == 1, sys1 == 1))
    # True Negative (TN): we predict a label of 0 (negative), and the true label is 0.
    TN = np.sum(np.logical_and(ann1 == 0, sys1 == 0))
    # False Positive (FP): we predict a label of 1 (positive), but the true label is 0.
    FP = np.sum(np.logical_and(ann1 == 0, sys1 == 1))
    # False Negative (FN): we predict a label of 0 (negative), but the true label is 1.
    FN = np.sum(np.logical_and(ann1 == 1, sys1 == 0))
    return TP, TN, FP, FN
I'm trying to find a way to optimize this for speed. It is based on how-to-compute-truefalse-positives-and-truefalse-negatives-in-python-for-binary-classification-problems; my addition was to use sparse arrays to reduce memory usage, since the input vectors for the problem I'm currently trying to solve have over 7.9 million elements, and the positive cases (i.e., 1) are few and far between relative to the negative cases (i.e., 0).
I've done profiling of my code and about half the time is spent in this method.
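For context, here is a minimal sketch of how such inputs might be constructed with the pydata/sparse COO class; the length, the coordinates, and the expected counts below are made up purely for illustration, and the confused function above is assumed to be in scope:

import numpy as np
import sparse

n = 7_900_000                                    # illustrative length
pred_idx = np.array([[10, 500, 123_456]])        # positions predicted as 1 (made up)
true_idx = np.array([[10, 123_456, 7_000_000]])  # positions that are truly 1 (made up)
sys1 = sparse.COO(pred_idx, data=np.ones(3, dtype=np.int8), shape=(n,))
ann1 = sparse.COO(true_idx, data=np.ones(3, dtype=np.int8), shape=(n,))

# With these toy inputs we would expect TP=2, TN=7_899_996, FP=1, FN=1.
print(confused(sys1, ann1))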
- If you compute the first 3 metrics, then the last can be a simple subtraction (see the sketch below). – Ted Brownlow
- Nice! That shaved ~20 seconds off the processing time with no hit on memory. – horcle_buzz
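A minimal sketch of the subtraction trick from the comment above, assuming 1-D inputs and that numpy is imported as np; only the derivation of FN from the total is new here:

TP = np.sum(np.logical_and(ann1 == 1, sys1 == 1))
TN = np.sum(np.logical_and(ann1 == 0, sys1 == 0))
FP = np.sum(np.logical_and(ann1 == 0, sys1 == 1))
# The four counts partition the vector, so the last one is plain arithmetic:
FN = ann1.shape[0] - TP - TN - FP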
1 Answer
Well, an obvious improvement is to not redo work. You are currently doing twice as many comparisons as needed because you don't save their results:
def confused(sys1, ann1):
    predicted_true, predicted_false = sys1 == 1, sys1 == 0
    true_true, true_false = ann1 == 1, ann1 == 0
    # True Positive (TP): we predict a label of 1 (positive), and the true label is 1.
    TP = np.sum(np.logical_and(true_true, predicted_true))
    # True Negative (TN): we predict a label of 0 (negative), and the true label is 0.
    TN = np.sum(np.logical_and(true_false, predicted_false))
    # False Positive (FP): we predict a label of 1 (positive), but the true label is 0.
    FP = np.sum(np.logical_and(true_false, predicted_true))
    # False Negative (FN): we predict a label of 0 (negative), but the true label is 1.
    FN = np.sum(np.logical_and(true_true, predicted_false))
    return TP, TN, FP, FN
This should speed up the calculation, at the cost of keeping things in memory slightly longer. Make sure you have enough memory available.
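To verify the gain on your actual data, a quick comparison with the standard library's timeit might look like the following sketch; confused_original and confused_cached are hypothetical names for the question's version and the one above, and sys1 and ann1 are your real vectors:

import timeit

t_orig = timeit.timeit(lambda: confused_original(sys1, ann1), number=5)
t_new = timeit.timeit(lambda: confused_cached(sys1, ann1), number=5)
print(f"original: {t_orig:.2f}s, cached comparisons: {t_new:.2f}s")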
I'm not sure I got your true and predicted labels right, which goes to show that ann1 and sys1 are really bad names. Something like true and predicted would be vastly more readable. And while you're at it, write out the other variables as well. Characters don't cost extra.
np.logical_and works perfectly fine with integers (at least on normal numpy vectors; you should check that this is also the case for sparse vectors, as in the quick check sketched further below), so as long as your vectors can only contain 0 or 1, you can directly use the input vectors and save half the memory:
not_true = np.logical_not(true)
not_predicted = np.logical_not(predicted)
true_positive = np.sum(np.logical_and(true, predicted))
true_negative = np.sum(np.logical_and(not_true, not_predicted))
false_positive = np.sum(np.logical_and(not_true, predicted))
false_negative = np.sum(np.logical_and(true, not_predicted))
Note that you cannot use ~true and ~predicted here: on integer arrays, ~ is the bitwise NOT (so ~0 is -1 and ~1 is -2), not a logical negation, which would leave every element truthy.
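A quick sanity check along those lines, assuming the pydata/sparse COO class; the tiny vectors here are made up, and we simply compare against the dense result:

import numpy as np
import sparse

dense_true = np.array([0, 1, 0, 1, 0], dtype=np.int8)
dense_pred = np.array([0, 1, 1, 0, 0], dtype=np.int8)
sp_true = sparse.COO.from_numpy(dense_true)
sp_pred = sparse.COO.from_numpy(dense_pred)

# If logical_and dispatches correctly on the sparse inputs, the counts agree.
assert np.sum(np.logical_and(sp_true, sp_pred)) == np.sum(np.logical_and(dense_true, dense_pred))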
Finally, note that I could not find proof for or against np.logical_and (or np.logical_not, for that matter) being implemented efficiently for sparse vectors, so maybe that is where you actually lose speed. In that case, go and implement it in C and/or Cython and write a pull request for sparse, I guess...
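Alternatively, since the positives are rare, one can sidestep the elementwise operations entirely and count from the nonzero coordinates. A sketch, assuming 1-D pydata/sparse COO inputs that contain only 0s and 1s (coords is part of the COO API; the function name is mine):

import numpy as np

def confused_by_coords(sys1, ann1):
    n = ann1.shape[0]
    pred_pos = sys1.coords[0]                      # indices predicted as 1
    true_pos = ann1.coords[0]                      # indices that are truly 1
    TP = np.intersect1d(pred_pos, true_pos).size   # positive in both
    FP = pred_pos.size - TP                        # predicted 1, truly 0
    FN = true_pos.size - TP                        # predicted 0, truly 1
    TN = n - TP - FP - FN                          # everything else
    return TP, TN, FP, FN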
- By Jove! This shaved off another ~15 seconds on my test data. I did try using Cython, btw, with no real increase in speed. – horcle_buzz