2
\$\begingroup\$

I have a sparse matrix (term-document) containing integers (word counts/tf) and I am trying to compute the tf-idf, for every non-zero value in the sparse-matrix.

The formula for tf-idf I am using is:

log(1 + tf) * log(N / (1 + df)) # N is the number of coloumns of the matrix
# tf is the value at a cell of the matrix
# df is the number of non-zero elements in a row

So for a matrix csr, at an index [i,j] with a non-zero value, I want to compute:

csr[i,j] = log(1 + csr[i, j]) * log(csr.shape[1] / (1 + sum(csr[i] != 0))

Since I have a large matrix, I am using sparse matrices from scipy.sparse. Is it possible to do the tf-idf computation more efficiently?

import numpy as np
import scipy.sparse
import scipy.io
csr = scipy.sparse.csr_matrix(scipy.io.mmread('thedata'))
for iter1 in xrange(csr.shape[0]) :
 # Finding indices of non-zero data in the matrix
 tmp,non_zero_indices = csr[iter1].nonzero()
 # dont need tmp
 df = len(non_zero_indices)
 if df > 0 :
 # This line takes a long time...
 csr[iter1,non_zero_indices] = np.log(1.0+csr[iter1,non_zero_indices].todense())*np.log((csr.shape[1])/(1.0+df))
asked Dec 28, 2014 at 14:25
\$\endgroup\$
2
  • \$\begingroup\$ What are the approximate dimensions of csr and how dense is it? I'd like to be able to run this on my computer to test but I need to be able to mock thedata. \$\endgroup\$ Commented Dec 29, 2014 at 3:50
  • \$\begingroup\$ The size of the matrix is (1M X 500K), and the density is 0.0002. I was testing this code with a random matrix of smaller size, like : coo_mat = scipy.sparse.rand(100000, 50000, density=0.0002, format='coo') \$\endgroup\$ Commented Dec 29, 2014 at 8:04

1 Answer 1

2
\$\begingroup\$

I'm making a fair few assumptions about the internal format that may not be justified, but this works on the demo data I tried:

factors = csr.shape[1] / (1 + np.diff(csr.indptr))
xs, ys = csr.nonzero()
csr.data = np.log(csr.data + 1.0) * np.log(factors[xs])

All I do is work on the internal dense data structure directly.

answered Dec 29, 2014 at 11:32
\$\endgroup\$
3
  • \$\begingroup\$ Wow, thanks, this is great! Took me some time to understand how this was working. In the first line, won't using 1.0 instead of 1 in (1 + np.diff(csr.indptr)) lead to more accuracy? \$\endgroup\$ Commented Dec 30, 2014 at 8:15
  • \$\begingroup\$ @Avisek Depends on the version of Python you're using. The inaccurate part would be integer division but Python 3 does floating division by default (// does integer division). \$\endgroup\$ Commented Dec 30, 2014 at 8:18
  • \$\begingroup\$ Oh i see, yes on Python 2.7 it does integer division by default. \$\endgroup\$ Commented Dec 30, 2014 at 8:53

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.