Improve Performance of Comparing two Numpy Arrays

Question 1

I had a code challenge for a class I'm taking in which I built an NN algorithm. I got it to work, but I used really basic methods to solve it. There are two 1D NumPy arrays of equal length, both containing the values 0-2. They represent two different sets of train and test data. The output is a confusion matrix that shows which received the right predictions and which received the wrong ones (doesn't matter ;).

This code is correct; I just feel I took the lazy way out by working with lists and then turning those lists into an ndarray. I would love to see if people have tips on using NumPy for this. Anything clever?

import numpy as np
x = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0]
y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
testy = np.array(x)
testy_fit = np.array(y)
row_no = [0,0,0]
row_dh = [0,0,0]
row_sl = [0,0,0]
# Code for the first row - NO
for i in range(len(testy)):
    if testy.item(i) == 0 and testy_fit.item(i) == 0:
        row_no[0] += 1
    elif testy.item(i) == 0 and testy_fit.item(i) == 1:
        row_no[1] += 1
    elif testy.item(i) == 0 and testy_fit.item(i) == 2:
        row_no[2] += 1
# Code for the second row - DH
for i in range(len(testy)):
    if testy.item(i) == 1 and testy_fit.item(i) == 0:
        row_dh[0] += 1
    elif testy.item(i) == 1 and testy_fit.item(i) == 1:
        row_dh[1] += 1
    elif testy.item(i) == 1 and testy_fit.item(i) == 2:
        row_dh[2] += 1
# Code for the third row - SL
for i in range(len(testy)):
    if testy.item(i) == 2 and testy_fit.item(i) == 0:
        row_sl[0] += 1
    elif testy.item(i) == 2 and testy_fit.item(i) == 1:
        row_sl[1] += 1
    elif testy.item(i) == 2 and testy_fit.item(i) == 2:
        row_sl[2] += 1
confusion = np.array([row_no,row_dh,row_sl])
print(confusion)

The printed result is correct:

[[16 10 0]
 [ 2 10 0]
 [ 2 0 22]]

Question 2

Good thing this got an answer on SO before it was moved. Performance questions for numpy are routine on SO.

Question 3

This can be implemented concisely by using numpy.add.at:

In [2]: c = np.zeros((3, 3), dtype=int) 
In [3]: np.add.at(c, (x, y), 1) 
In [4]: c 
Out[4]: 
array([[16, 10, 0],
 [ 2, 10, 0],
 [ 2, 0, 22]])
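As a minimal sketch of why `np.add.at` is used here instead of a plain `c[x, y] += 1`: with fancy indexing, a repeated index pair is only incremented once, whereas `np.add.at` accumulates every occurrence. The arrays `actual` and `predicted` below are made-up toy labels, not the data from the question.

```python
import numpy as np

# Hypothetical toy labels (not the question's data); classes are 0, 1, 2.
actual = np.array([0, 0, 1, 2, 2, 2])
predicted = np.array([0, 1, 1, 2, 2, 0])

# Unbuffered in-place add: every (actual, predicted) pair is counted,
# including repeats such as (2, 2).
counts = np.zeros((3, 3), dtype=int)
np.add.at(counts, (actual, predicted), 1)

# Buffered fancy-index add: a repeated index pair is applied only once.
buffered = np.zeros((3, 3), dtype=int)
buffered[actual, predicted] += 1

print(counts)
print(buffered)
```

The two results differ only in the cell for the repeated pair (2, 2), which is exactly the case a confusion matrix has to count correctly.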

Question 4

Oh my! I thought there would be something better, but I didn't think it would be 1 line of code! Wow. So glad I asked, and thank you!

Question 5

Rule #1 of numpy is if you want to do something, check the docs first to check for a 1 line solution.

Question 6

For now disregarding that there is a (way) better numpy solution to this, as explained in the answer by @WarrenWeckesser, here is a short code review of your actual code.

testy.item(i) is a very unusual way to say testy[i]. It is probably also slower as it involves an attribute lookup.
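Since "probably slower" is a guess, it may be worth measuring rather than assuming. The sketch below uses `timeit` on made-up sample data; the helper names `count_zeros_item` and `count_zeros_index` are hypothetical, and the timings will depend on the NumPy version and machine. One difference that is certain: `testy.item(i)` returns a plain Python `int`, while `testy[i]` returns a NumPy scalar.

```python
import timeit

import numpy as np

# Hypothetical sample data, not the arrays from the question.
testy = np.array([0, 1, 2, 1, 0, 2] * 10)


def count_zeros_item(arr):
    """Count zeros using arr.item(i), as in the question's code."""
    return sum(1 for i in range(len(arr)) if arr.item(i) == 0)


def count_zeros_index(arr):
    """Count zeros using plain indexing, arr[i]."""
    return sum(1 for i in range(len(arr)) if arr[i] == 0)


# Time both variants on the same input; compare the numbers on your own setup.
t_item = timeit.timeit(lambda: count_zeros_item(testy), number=1000)
t_index = timeit.timeit(lambda: count_zeros_index(testy), number=1000)
print(f"item(i): {t_item:.4f} s, [i]: {t_index:.4f} s")
```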

Don't repeat yourself. You test e.g. if testy.item(i) == 0 three times, each time with a different second condition. Just nest them in an if block:

for i in range(len(testy)):
    if testy[i] == 0:
        if testy_fit[i] == 0:
            row_no[0] += 1
        elif testy_fit[i] == 1:
            row_no[1] += 1
        elif testy_fit[i] == 2:
            row_no[2] += 1

Loop like a native. Don't iterate over the indices of iterables, iterate over the iterable(s)! You can also use the fact that the value encodes the position you want to increment:
```
for test, fit in zip(testy, testy_fit):
    if test == 0 and fit in {0, 1, 2}:
        row_no[fit] += 1
```

You can even use the fact that the first value encodes the list you want to use and iterate only once. Or even better, make it a list of lists right away:

n = 3
confusion_matrix = [[0] * n for _ in range(n)]
for test, fit in zip(testy, testy_fit):
    confusion_matrix[test][fit] += 1
print(np.array(confusion_matrix))

Don't put everything into the global space, where it runs whenever you interact with the script at all. Put your code into functions, document them with a docstring, and execute them under an if __name__ == "__main__": guard, which allows you to import from this script in another script without your code running:

def confusion_matrix(x, y):
    """Return the confusion matrix for two vectors `x` and `y`.

    x and y must only have integer values from 0 to n - 1 and
    0 to m - 1, respectively.
    """
    # The matrix needs one row/column per label value, including 0.
    n, m = np.max(x) + 1, np.max(y) + 1
    matrix = [[0] * m for _ in range(n)]
    for a, b in zip(x, y):
        matrix[a][b] += 1
    return matrix

if __name__ == "__main__":
    x = ...
    y = ...
    print(np.array(confusion_matrix(x, y)))

Once you have come this far, you can just swap the implementation of this function to the faster numpy one without changing anything (except that it then directly returns a numpy.array instead of a list of lists).
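As a rough sketch of that swap, the function below keeps the same name and signature but builds the matrix with `np.add.at`, the call from the other answer. It assumes `x` and `y` are equal-length sequences of non-negative integers; the toy data under the guard is made up for illustration.

```python
import numpy as np


def confusion_matrix(x, y):
    """Return the confusion matrix for two label vectors `x` and `y`.

    Assumes x and y are equal-length sequences of non-negative integers.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    # One row per value in x and one column per value in y, including 0.
    n, m = x.max() + 1, y.max() + 1
    matrix = np.zeros((n, m), dtype=int)
    # Unbuffered add so that repeated (x, y) pairs are all counted.
    np.add.at(matrix, (x, y), 1)
    return matrix


if __name__ == "__main__":
    # Hypothetical toy data, not the arrays from the question.
    x = [0, 0, 1, 2, 2, 2]
    y = [0, 1, 1, 2, 2, 0]
    print(confusion_matrix(x, y))
```

Callers that wrapped the old result in `np.array(...)` keep working, since wrapping an existing array just copies it.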

Warren Weckesser · Accepted Answer · 2019-05-05 23:41:43Z

Comments on the accepted answer: broepke (May 6, 2019 at 2:04), Oscar Smith (May 6, 2019 at 5:39)
