Using Python SciPy, I am trying to divide every number in each column of a sparse matrix (400K × 500K, density 0.0005) by the sum of the squares of the numbers in that column. If a column is [[0], [2], [4]], the sum of the squares is 20, so after the computation the column should be [[0], [0.1], [0.2]].
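To make the intended computation concrete for a single column (a toy dense example, not the sparse code itself):

```python
import numpy as np

col = np.array([0.0, 2.0, 4.0])
result = col / np.square(col).sum()   # sum of squares = 0 + 4 + 16 = 20
print(result)                         # [0.  0.1 0.2]
```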
This was my first attempt:
import numpy as np
import scipy.sparse as sp

# Loading the sparse matrix
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']),
                    shape=csc['shape'], dtype=np.float)

# Computing the sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in xrange(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))

# Division of non-zero elements by the corresponding sum
for i in xrange(csc.shape[1]):
    x, y = csc[:, i].nonzero()
    del y
    if x.shape[0] > 0:
        csc[x, i] = np.array(csc[x, i].todense()) / maxv[i]
However, this seemed to take forever. I improved the second part (using the approach from "SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)"):
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']),
                    shape=csc['shape'], dtype=np.float)

# THIS PART is slow
# Computing the sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in xrange(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))

# Division of non-zero elements by the corresponding sum
csc = sp.csr_matrix(csc)
xs, ys = csc.nonzero()
csc.data /= maxv[ys]
csc = sp.csc_matrix(csc)
... but I wonder if the computation of squares part can be improved further.
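For what it's worth, here is a self-check of the whole operation on a toy matrix, using the naive per-column loop (a correctness check only, not the fast version):

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0.0, 1.0],
                  [2.0, 0.0],
                  [4.0, 3.0]])
csc = sp.csc_matrix(dense)

# Sum of squares per column with the naive loop
maxv = np.zeros(csc.shape[1])
for i in range(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))   # [20., 10.]

# Broadcasting divides each column by its own sum of squares
expected = dense / maxv
print(expected)
```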
1 Answer
When trying to speed up Numpy code, it's important to scrutinize every loop. A loop that has to run in the Python interpreter can be hundreds of times slower than a vectorized loop running inside Numpy.
Also, when trying to improve the performance of code, there's no substitute for measurement. So let's set up a test case. The matrix csc
here has the same density as yours, but is one-hundredth the size (to make the runtimes tractable):
import numpy as np
from scipy.sparse import csc_matrix
from timeit import timeit
shape = 40000, 50000
density = 0.0005
n = int(np.prod(shape) * density)
rowi, coli = [np.random.randint(s, size=n) for s in shape]
csc = csc_matrix((np.random.rand(n), (rowi, coli)), shape=shape)
Here's your sum-of-squares-by-column algorithm:
def sum_squares_by_column_1():
    maxv = np.zeros(csc.shape[1])
    for i in range(csc.shape[1]):
        maxv[i] = sum(np.square(csc[:, i].data))
    return maxv
>>> timeit(sum_squares_by_column_1, number=1)
19.718280024942942
So how can we get rid of that for loop? Well, if this were an ordinary Numpy array then we could write:

np.sum(np.square(csc), axis=1)

but this doesn't work here because scipy.sparse.csc_matrix is a matrix rather than an array (see the answers to this Stack Overflow question), and so np.square doesn't work. That's because np.square(a) just multiplies a by itself (as if you'd written a * a), and this is matrix multiplication when a is a matrix.
So what we need to do instead is to read carefully through the scipy.sparse.csc_matrix
documentation to see if there are methods for the operations we need. There's no square
method, but there is a multiply
method that does "point-wise multiplication" and a sum
method that "sums the matrix over the given axis". So, putting these together:
def sum_squares_by_column_2():
    return csc.multiply(csc).sum(1)
>>> timeit(sum_squares_by_column_2, number=1)
0.04036429896950722
which is about 500 times faster.
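As a further sketch (my own variant, not part of the original answer, and not measured above): since csc.data holds all the stored non-zeros in column order and csc.indptr delimits each column's slice, the per-column sums of squares can also be accumulated with np.bincount, without building a second sparse matrix:

```python
import numpy as np
from scipy.sparse import csc_matrix

csc = csc_matrix(np.array([[0.0, 1.0],
                           [2.0, 0.0],
                           [4.0, 3.0]]))

# Column index of each stored non-zero, derived from the CSC column pointers
cols = np.repeat(np.arange(csc.shape[1]), np.diff(csc.indptr))
sums = np.bincount(cols, weights=np.square(csc.data),
                   minlength=csc.shape[1])
print(sums)   # [20. 10.]
```

Unlike np.add.reduceat over indptr, bincount handles empty columns correctly (they simply get a sum of zero).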
Comments:

- Thanks! One query: summing over axis 1 gives me the sum over rows, and summing over axis 0 gives the sum over columns. Isn't axis 0 supposed to be over rows, or have I misunderstood the concept? — Avisek, Feb 12, 2015
- Yes, you want .sum(0) rather than .sum(1). Sorry about that. — Gareth Rees, Feb 12, 2015