Below is a naive algorithm to find nearest neighbours for a point in some n-dimensional space.
import numpy as np
import scipy.sparse as ss
# brute force method to get k nearest neighbours to point p
def knn_naive(k, p, X):
    stacked_p = ss.vstack([p for n in range(X.get_shape()[0])])
    D = X - stacked_p
    D = np.sqrt(D.multiply(D).sum(1))
    result = sorted(D)[:k]
    return result
Types for the arguments to knn_naive(...) are:
# type(k): int
# type(p): <class 'scipy.sparse.csr.csr_matrix'>
# type(X): <class 'scipy.sparse.csr.csr_matrix'>
An example of dimensions would be:
X.shape: (100, 3004)
p.shape: (1, 3004)
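For reference, inputs with matching types and shapes can be generated from random sparse data along these lines (the density value here is an arbitrary placeholder, not taken from the real data):

import numpy as np
import scipy.sparse as ss

X = ss.rand(100, 3004, density=0.01, format='csr')   # 100 points, 3004 features
p = ss.rand(1, 3004, density=0.01, format='csr')     # query point

nearest = knn_naive(5, p, X)   # distances to the 5 nearest neighbours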
Are there techniques to improve the above implementation in terms of computational time and/or memory?
1 Answer
I wasn't aware that scipy's sparse matrices did not support broadcasting. What a shame that you can't simply do D = X - p! Your construction of stacked_p can be sped up quite a bit, but you need to access the internals of the CSR matrix directly. If you aren't familiar with the details, the Wikipedia article has a good description of the format.
In any case, you can construct stacked_p as follows:
rows, cols = X.shape
stacked_p = ss.csr_matrix((np.tile(p.data, rows), np.tile(p.indices, rows),
                           np.arange(0, rows*p.nnz + 1, p.nnz)), shape=X.shape)
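A quick way to convince yourself that this produces exactly the same matrix as the vstack version (a sketch with a small random p; the sizes here are arbitrary):

import numpy as np
import scipy.sparse as sps

rows = 1000
p = sps.rand(1, 1000, density=0.01, format='csr')

# build stacked_p both ways
slow = sps.vstack([p for n in range(rows)]).tocsr()
fast = sps.csr_matrix((np.tile(p.data, rows), np.tile(p.indices, rows),
                       np.arange(0, rows*p.nnz + 1, p.nnz)), shape=(rows, p.shape[1]))

# the two constructions should have no differing entries
assert (slow != fast).nnz == 0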
With some dummy data:
import numpy as np
import scipy.sparse as sps
p = sps.rand(1, 1000, density=0.01, format='csr')
In [15]: %timeit sps.vstack([p for n in xrange(1000)]).tocsr()
10 loops, best of 3: 173 ms per loop
In [16]: %timeit sps.csr_matrix((np.tile(p.data, 1000), np.tile(p.indices, 1000),
np.arange(0, 1000*p.nnz + 1, p.nnz)), (1000, 1000))
10000 loops, best of 3: 94.1 μs per loop
That is close to 2000 times faster.
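Putting it together, here is a sketch of the original function with the faster construction dropped in (knn_fast is just a name I'm using here; flattening the distances to a plain array at the end is my own tweak so the result is a sorted array of floats rather than a list of 1x1 matrices):

import numpy as np
import scipy.sparse as sps

def knn_fast(k, p, X):
    rows = X.shape[0]
    # stack p by building the CSR arrays directly instead of vstack-ing p repeatedly
    stacked_p = sps.csr_matrix((np.tile(p.data, rows), np.tile(p.indices, rows),
                                np.arange(0, rows*p.nnz + 1, p.nnz)), shape=X.shape)
    D = X - stacked_p
    dists = np.sqrt(np.asarray(D.multiply(D).sum(1)).ravel())
    # full sort of the distances; np.partition would avoid it if only
    # the k smallest values are needed
    return np.sort(dists)[:k]

X = sps.rand(100, 3004, density=0.01, format='csr')
p = sps.rand(1, 3004, density=0.01, format='csr')
print(knn_fast(5, p, X))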