Below is a naive algorithm to find nearest neighbours for a point in some n-dimensional space.
import numpy as np
import scipy.sparse as ss
# brute force method to get k nearest neighbours to point p
def knn_naive(k, p, X):
    stacked_p = ss.vstack([p for n in range(X.get_shape()[0])])
    D = X - stacked_p
    D = np.sqrt(D.multiply(D).sum(1))
    result = sorted(D)[:k]
    return result
Types for the arguments to knn_naive(...) are:
# type(k): int
# type(p): <class 'scipy.sparse.csr.csr_matrix'>
# type(X): <class 'scipy.sparse.csr.csr_matrix'>
An example of dimensions would be:
X.shape: (100, 3004)
p.shape: (1, 3004)
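For reference, inputs with matching types and shapes can be generated from random sparse data along these lines (the density value here is an arbitrary placeholder, not taken from the real data):

import numpy as np
import scipy.sparse as ss

X = ss.rand(100, 3004, density=0.01, format='csr')   # 100 points, 3004 features
p = ss.rand(1, 3004, density=0.01, format='csr')     # query point

nearest = knn_naive(5, p, X)   # distances to the 5 nearest neighbours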
Are there techniques to improve the above implementation in terms of computational time and/or memory?
1 Answer
I wasn't aware that scipy's sparse matrices did not support broadcasting. What a shame that you can't simply do D = X - p! Your construction of stacked_p can be sped up quite a bit, but you need to access the internals of the CSR matrix directly. If you aren't familiar with the details, the Wikipedia article has a good description of the format.
In any case, you can construct stacked_p as follows:
rows, cols = X.shape
stacked_p = ss.csr_matrix((np.tile(p.data, rows), np.tile(p.indices, rows),
                           np.arange(0, rows*p.nnz + 1, p.nnz)), shape=X.shape)
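A quick way to convince yourself that this produces exactly the same matrix as the vstack version (a sketch with a small random p; the sizes here are arbitrary):

import numpy as np
import scipy.sparse as sps

rows = 1000
p = sps.rand(1, 1000, density=0.01, format='csr')

# build stacked_p both ways
slow = sps.vstack([p for n in range(rows)]).tocsr()
fast = sps.csr_matrix((np.tile(p.data, rows), np.tile(p.indices, rows),
                       np.arange(0, rows*p.nnz + 1, p.nnz)), shape=(rows, p.shape[1]))

# the two constructions should have no differing entries
assert (slow != fast).nnz == 0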
With some dummy data:
import numpy as np
import scipy.sparse as sps
p = sps.rand(1, 1000, density=0.01, format='csr')
In [15]: %timeit sps.vstack([p for n in xrange(1000)]).tocsr()
10 loops, best of 3: 173 ms per loop
In [16]: %timeit sps.csr_matrix((np.tile(p.data, 1000), np.tile(p.indices, 1000),
np.arange(0, 1000*p.nnz + 1, p.nnz)), (1000, 1000))
10000 loops, best of 3: 94.1 μs per loop
That is close to 2000 times faster.
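Putting it together, here is a sketch of the original function with the faster construction dropped in (knn_fast is just a name I'm using here; flattening the distances to a plain array at the end is my own tweak so the result is a sorted array of floats rather than a list of 1x1 matrices):

import numpy as np
import scipy.sparse as sps

def knn_fast(k, p, X):
    rows = X.shape[0]
    # stack p by building the CSR arrays directly instead of vstack-ing p repeatedly
    stacked_p = sps.csr_matrix((np.tile(p.data, rows), np.tile(p.indices, rows),
                                np.arange(0, rows*p.nnz + 1, p.nnz)), shape=X.shape)
    D = X - stacked_p
    dists = np.sqrt(np.asarray(D.multiply(D).sum(1)).ravel())
    # full sort of the distances; np.partition would avoid it if only
    # the k smallest values are needed
    return np.sort(dists)[:k]

X = sps.rand(100, 3004, density=0.01, format='csr')
p = sps.rand(1, 3004, density=0.01, format='csr')
print(knn_fast(5, p, X))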