6
\$\begingroup\$

Using Python Scipy, I am trying to divide all numbers in all columns of a sparse matrix (400K ×ばつ 500K, density 0.0005), by the sum of the squares of all numbers in a column.

If a column is [ [ 0 ] , [ 2 ] , [ 4 ] ], the sum of the squares is 20, so after computation the column should be [ [ 0 ] , [ 0.1 ] , [ 0.2 ] ].

This was my first attempt:

# Loading the sparse matrix
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']), shape = csc['shape'], dtype=np.float)
# Computing sum of squares, per column
maxv = np.zeros((csc.shape[1]))
for i in xrange(csc.shape[1]) :
 maxv[i] = sum(np.square(csc[:,i].data))
# Division of non-zero elements by the corresponding sum
for i in xrange(csc.shape[1]) :
 x,y = csc[:,i].nonzero()
 del y
 if x.shape[0] > 0 :
 csc[x,i] = np.array(csc[x,i].todense()) / maxv[i]

However, this seemed to take forever. I improved the second part (using SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)):

csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']), shape = csc['shape'], dtype=np.float)
# THIS PART is slow
# Computing sum of squares, per column
maxv = np.zeros((csc.shape[1]))
for i in xrange(csc.shape[1]) :
 maxv[i] = sum(np.square(csc[:,i].data))
# Division of non-zero elements by the corresponding sum
csc = sp.csr_matrix(csc)
xs,ys = csc.nonzero()
csc.data /= maxv[ys]
csc = sp.csc_matrix(csc)

... but I wonder if the computation of squares part can be improved further.

asked Feb 11, 2015 at 11:57
\$\endgroup\$

1 Answer 1

4
\$\begingroup\$

When trying to speed up Numpy code, it's important to scrutinize every loop. A loop that has to run in the Python interpreter can be hundreds of times slower than a vectorized loop running inside Numpy.

Also, when trying to improve the performance of code, there's no substitute for measurement. So let's set up a test case. The matrix csc here has the same density as yours, but is one-hundredth the size (to make the runtimes tractable):

import numpy as np
from scipy.sparse import csc_matrix
from timeit import timeit
shape = 40000, 50000
density = 0.0005
n = np.prod(shape) * density
rowi, coli = [np.random.randint(s, size=n) for s in shape]
csc = csc_matrix((np.random.rand(n), (rowi, coli)), shape=shape)

Here's your sum-of-squares-by-column algorithm:

def sum_squares_by_column_1():
 maxv = np.zeros((csc.shape[1]))
 for i in range(csc.shape[1]) :
 maxv[i] = sum(np.square(csc[:,i].data))
 return maxv
>>> timeit(sum_squares_by_column_1, number=1)
19.718280024942942

So how can we get rid of that for loop? Well, if this were an ordinary Numpy array then we could write:

np.sum(np.square(csc), axis=1)

but this doesn't work here because scipy.sparse.csc_matrix is a matrix rather than an array (see the answers to this Stack Overflow question) and so np.square doesn't work. That's because np.square(a) just multiplies a by itself (as if you'd written a * a), and this is matrix multiplication when a is a matrix.

So what we need to do instead is to read carefully through the scipy.sparse.csc_matrix documentation to see if there are methods for the operations we need. There's no square method, but there is a multiply method that does "point-wise multiplication" and a sum method that "sums the matrix over the given axis". So, putting these together:

def sum_squares_by_column_2():
 return csc.multiply(csc).sum(1)
>>> timeit(sum_squares_by_column_2, number=1)
0.04036429896950722

which is about 500 times faster.

answered Feb 12, 2015 at 15:46
\$\endgroup\$
2
  • \$\begingroup\$ Thanks! One query, summing over axis 1 gives me sum over rows, summing over axis 0 gives sum over columns. Isnt axis 0 supposed to be over rows, or have I misunderstood the concept. \$\endgroup\$ Commented Feb 12, 2015 at 21:16
  • \$\begingroup\$ Yes, you want .sum(0) rather than .sum(1). Sorry about that. \$\endgroup\$ Commented Feb 12, 2015 at 21:21

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.