Using Python SciPy, I am trying to divide every number in each column of a sparse matrix (400K × 500K, density 0.0005) by the sum of the squares of the numbers in that column. If a column is [[0], [2], [4]], the sum of the squares is 20, so after the computation the column should be [[0], [0.1], [0.2]].
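To make the intended computation concrete for a single column (a toy dense example, not the sparse code itself):

```python
import numpy as np

col = np.array([0.0, 2.0, 4.0])
result = col / np.square(col).sum()   # sum of squares = 0 + 4 + 16 = 20
print(result)                         # [0.  0.1 0.2]
```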
This was my first attempt:
import numpy as np
import scipy.sparse as sp

# Loading the sparse matrix
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']),
                    shape=csc['shape'], dtype=np.float)

# Computing the sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in xrange(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))

# Division of non-zero elements by the corresponding sum
for i in xrange(csc.shape[1]):
    x, y = csc[:, i].nonzero()
    del y
    if x.shape[0] > 0:
        csc[x, i] = np.array(csc[x, i].todense()) / maxv[i]
However, this seemed to take forever. I improved the second part (using the approach from "SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)"):
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']),
                    shape=csc['shape'], dtype=np.float)

# THIS PART is slow
# Computing the sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in xrange(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))

# Division of non-zero elements by the corresponding sum
csc = sp.csr_matrix(csc)
xs, ys = csc.nonzero()
csc.data /= maxv[ys]
csc = sp.csc_matrix(csc)
... but I wonder if the computation of squares part can be improved further.
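For what it's worth, here is a self-check of the whole operation on a toy matrix, using the naive per-column loop (a correctness check only, not the fast version):

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0.0, 1.0],
                  [2.0, 0.0],
                  [4.0, 3.0]])
csc = sp.csc_matrix(dense)

# Sum of squares per column with the naive loop
maxv = np.zeros(csc.shape[1])
for i in range(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))   # [20., 10.]

# Broadcasting divides each column by its own sum of squares
expected = dense / maxv
print(expected)
```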
1 Answer
When trying to speed up Numpy code, it's important to scrutinize every loop. A loop that has to run in the Python interpreter can be hundreds of times slower than a vectorized loop running inside Numpy.
Also, when trying to improve the performance of code, there's no substitute for measurement. So let's set up a test case. The matrix csc
here has the same density as yours, but is one-hundredth the size (to make the runtimes tractable):
import numpy as np
from scipy.sparse import csc_matrix
from timeit import timeit
shape = 40000, 50000
density = 0.0005
n = int(np.prod(shape) * density)
rowi, coli = [np.random.randint(s, size=n) for s in shape]
csc = csc_matrix((np.random.rand(n), (rowi, coli)), shape=shape)
Here's your sum-of-squares-by-column algorithm:
def sum_squares_by_column_1():
    maxv = np.zeros(csc.shape[1])
    for i in range(csc.shape[1]):
        maxv[i] = sum(np.square(csc[:, i].data))
    return maxv
>>> timeit(sum_squares_by_column_1, number=1)
19.718280024942942
So how can we get rid of that for loop? Well, if this were an ordinary Numpy array then we could write:

np.sum(np.square(csc), axis=1)

but this doesn't work here because scipy.sparse.csc_matrix is a matrix rather than an array (see the answers to this Stack Overflow question), and so np.square doesn't work. That's because np.square(a) just multiplies a by itself (as if you'd written a * a), and this is matrix multiplication when a is a matrix.
So what we need to do instead is to read carefully through the scipy.sparse.csc_matrix
documentation to see if there are methods for the operations we need. There's no square
method, but there is a multiply
method that does "point-wise multiplication" and a sum
method that "sums the matrix over the given axis". So, putting these together:
def sum_squares_by_column_2():
    return csc.multiply(csc).sum(1)
>>> timeit(sum_squares_by_column_2, number=1)
0.04036429896950722
which is about 500 times faster.
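As a further sketch (my own variant, not part of the original answer, and not measured above): since csc.data holds all the stored non-zeros in column order and csc.indptr delimits each column's slice, the per-column sums of squares can also be accumulated with np.bincount, without building a second sparse matrix:

```python
import numpy as np
from scipy.sparse import csc_matrix

csc = csc_matrix(np.array([[0.0, 1.0],
                           [2.0, 0.0],
                           [4.0, 3.0]]))

# Column index of each stored non-zero, derived from the CSC column pointers
cols = np.repeat(np.arange(csc.shape[1]), np.diff(csc.indptr))
sums = np.bincount(cols, weights=np.square(csc.data),
                   minlength=csc.shape[1])
print(sums)   # [20. 10.]
```

Unlike np.add.reduceat over indptr, bincount handles empty columns correctly (they simply get a sum of zero).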
Comments:

- Thanks! One query: summing over axis 1 gives me the sum over rows, and summing over axis 0 gives the sum over columns. Isn't axis 0 supposed to be over rows, or have I misunderstood the concept? — Avisek, Feb 12, 2015
- Yes, you want .sum(0) rather than .sum(1). Sorry about that. — Gareth Rees, Feb 12, 2015