SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)

Question 1

I have a sparse matrix (term-document) containing integers (word counts/tf) and I am trying to compute the tf-idf, for every non-zero value in the sparse-matrix.

The formula for tf-idf I am using is:

log(1 + tf) * log(N / (1 + df)) # N is the number of coloumns of the matrix
# tf is the value at a cell of the matrix
# df is the number of non-zero elements in a row

So for a matrix csr, at an index [i,j] with a non-zero value, I want to compute:

csr[i,j] = log(1 + csr[i, j]) * log(csr.shape[1] / (1 + sum(csr[i] != 0))

Since I have a large matrix, I am using sparse matrices from scipy.sparse. Is it possible to do the tf-idf computation more efficiently?

import numpy as np
import scipy.sparse
import scipy.io
csr = scipy.sparse.csr_matrix(scipy.io.mmread('thedata'))
for iter1 in xrange(csr.shape[0]) :
 # Finding indices of non-zero data in the matrix
 tmp,non_zero_indices = csr[iter1].nonzero()
 # dont need tmp
 df = len(non_zero_indices)
 if df > 0 :
 # This line takes a long time...
 csr[iter1,non_zero_indices] = np.log(1.0+csr[iter1,non_zero_indices].todense())*np.log((csr.shape[1])/(1.0+df))

Question 2

What are the approximate dimensions of csr and how dense is it? I'd like to be able to run this on my computer to test but I need to be able to mock thedata.

Question 3

The size of the matrix is (1M X 500K), and the density is 0.0002. I was testing this code with a random matrix of smaller size, like : coo_mat = scipy.sparse.rand(100000, 50000, density=0.0002, format='coo')

Question 4

I'm making a fair few assumptions about the internal format that may not be justified, but this works on the demo data I tried:

factors = csr.shape[1] / (1 + np.diff(csr.indptr))
xs, ys = csr.nonzero()
csr.data = np.log(csr.data + 1.0) * np.log(factors[xs])

All I do is work on the internal dense data structure directly.

Question 5

Wow, thanks, this is great! Took me some time to understand how this was working. In the first line, won't using 1.0 instead of 1 in (1 + np.diff(csr.indptr)) lead to more accuracy?

Question 6

@Avisek Depends on the version of Python you're using. The inaccurate part would be integer division but Python 3 does floating division by default (// does integer division).

Question 7

Oh i see, yes on Python 2.7 it does integer division by default.

Veedrac VeedracVeedrac 9,77323 silver badges38 bronze badges · Accepted Answer · 2014-12-29 11:32:44Z

2

\$\begingroup\$

I'm making a fair few assumptions about the internal format that may not be justified, but this works on the demo data I tried:

factors = csr.shape[1] / (1 + np.diff(csr.indptr))
xs, ys = csr.nonzero()
csr.data = np.log(csr.data + 1.0) * np.log(factors[xs])

All I do is work on the internal dense data structure directly.

Share

edited Dec 29, 2014 at 11:52

answered Dec 29, 2014 at 11:32

Veedrac's user avatar

Veedrac VeedracVeedrac

9,77323 silver badges38 bronze badges

\$\endgroup\$

3

\$\begingroup\$ Wow, thanks, this is great! Took me some time to understand how this was working. In the first line, won't using 1.0 instead of 1 in (1 + np.diff(csr.indptr)) lead to more accuracy? \$\endgroup\$

Avisek
– Avisek

2014年12月30日 08:15:27 +00:00
Commented Dec 30, 2014 at 8:15
\$\begingroup\$ @Avisek Depends on the version of Python you're using. The inaccurate part would be integer division but Python 3 does floating division by default (// does integer division). \$\endgroup\$

Veedrac
– Veedrac

2014年12月30日 08:18:47 +00:00
Commented Dec 30, 2014 at 8:18
\$\begingroup\$ Oh i see, yes on Python 2.7 it does integer division by default. \$\endgroup\$

Avisek
– Avisek

2014年12月30日 08:53:11 +00:00
Commented Dec 30, 2014 at 8:53

Add a comment |

Stack Exchange Network

SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions