I have two matrices, and I need to calculate the Euclidean distance between every row of matrix A and every row of matrix B.
In the first solution I loop over the rows of the two matrices: for each row in matrix A, I take a row from B, square the element-wise subtraction, then sum it and take the square root.
I end up with a matrix of size (#rows in A) x (#rows in B):
import numpy as np
A = np.random.random((50, 3072))
B = np.random.random((500, 3072))
# First solution
d = np.zeros((50, 500))
for i in range(50):
    for j in range(500):
        d[i, j] = np.sqrt(np.sum(np.square(B[j] - A[i])))
In the second solution I did this with broadcasting, and it works. But when I increase the number of rows in A and B ... it becomes very, very slow. Is there a faster way without looping?
Solution 2:
# Second solution
test = np.sqrt(np.sum(np.square(A[:, np.newaxis] - B), axis=2))
# print check
print(np.unique(test == d))
2 Answers
It's not pretty, but this gives a factor-3 speed improvement:
d = (A**2).sum(axis=-1)[:, np.newaxis] + (B**2).sum(axis=-1)
d -= 2 * np.squeeze(A.dot(B[..., np.newaxis]), axis=-1)
d **= 0.5
This is based on the identity
$$ (a - b)^2 = a^2 + b^2 - 2ab $$
and so, ignoring the fudging with indices,
$$ \sum(a - b)^2 = \sum a^2 + \sum b^2 - 2\sum ab $$
The squared terms are just
(A**2).sum(axis=-1)[:, np.newaxis] + (B**2).sum(axis=-1)
and $\sum ab = \vec A \cdot \vec B$. This can be broadcast with a bit of fudging of the axes:
np.squeeze(A.dot(B[..., np.newaxis]), axis=-1)
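Putting the pieces together, here is a minimal self-contained sketch of the approach. Two small deviations from the snippet above, both my own additions: the cross term is written as the simpler `A.dot(B.T)` (equivalent to the `squeeze` version), and the squared distances are clamped at zero, since floating-point cancellation in `a² + b² - 2ab` can leave tiny negative values before the square root.

```python
import numpy as np

def pairwise_euclidean(A, B):
    # Expand (a - b)^2 = a^2 + b^2 - 2ab row-wise:
    # sq[i, j] = sum(A[i]**2) + sum(B[j]**2) - 2 * dot(A[i], B[j])
    sq = (A**2).sum(axis=-1)[:, np.newaxis] + (B**2).sum(axis=-1)
    sq -= 2 * A.dot(B.T)           # cross term for every row pair
    np.maximum(sq, 0, out=sq)      # clamp tiny negatives from rounding
    return np.sqrt(sq)

A = np.random.random((50, 3072))
B = np.random.random((500, 3072))
d = pairwise_euclidean(A, B)
print(d.shape)  # (50, 500)
```

This avoids ever materializing the (50, 500, 3072) intermediate array that the broadcasting solution creates, which is where most of the slowdown comes from.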
You may also try sklearn.metrics.pairwise_distances. Basically, here's what it will look like:
import numpy as np
from sklearn.metrics import pairwise_distances
A = np.random.random((50, 3072))
B = np.random.random((500, 3072))
d = pairwise_distances(A, B, metric='euclidean')
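If SciPy is available, `scipy.spatial.distance.cdist` is another common option for the same job (not mentioned in the answers above, so treat it as a suggestion); it computes the full distance matrix in compiled code:

```python
import numpy as np
from scipy.spatial.distance import cdist  # assumes SciPy is installed

A = np.random.random((50, 3072))
B = np.random.random((500, 3072))
d = cdist(A, B, metric='euclidean')  # shape (50, 500)
```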