I have a sparse term-document matrix of integer word counts (tf), and I am trying to compute the tf-idf for every non-zero value in it.
The formula for tf-idf I am using is:
log(1 + tf) * log(N / (1 + df)) # N is the number of columns of the matrix
# tf is the value at a cell of the matrix
# df is the number of non-zero elements in a row
So for a matrix csr, at an index [i,j] with a non-zero value, I want to compute:
csr[i,j] = log(1 + csr[i, j]) * log(csr.shape[1] / (1 + sum(csr[i] != 0)))
Since I have a large matrix, I am using sparse matrices from scipy.sparse. Is it possible to do the tf-idf computation more efficiently?
import numpy as np
import scipy.sparse
import scipy.io

csr = scipy.sparse.csr_matrix(scipy.io.mmread('thedata'))
for iter1 in xrange(csr.shape[0]):
    # Finding indices of non-zero data in the matrix
    tmp, non_zero_indices = csr[iter1].nonzero()
    # don't need tmp
    df = len(non_zero_indices)
    if df > 0:
        # This line takes a long time...
        csr[iter1, non_zero_indices] = np.log(1.0 + csr[iter1, non_zero_indices].todense()) * np.log((csr.shape[1]) / (1.0 + df))
1 Answer
I'm making a fair few assumptions about the internal format that may not be justified, but this works on the demo data I tried:
factors = csr.shape[1] / (1 + np.diff(csr.indptr))
xs, ys = csr.nonzero()
csr.data = np.log(csr.data + 1.0) * np.log(factors[xs])
All I do is work directly on csr.data, the internal dense array that holds only the stored values.
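For anyone who wants to check the idea on their own data, here is a minimal sketch of the same approach wrapped in a function. The name tfidf_rows is made up for illustration, and the sketch assumes Python 3. It differs from the snippet above in two ways that are my own choices: it calls eliminate_zeros() first, so that np.diff(csr.indptr) really counts non-zeros even if the matrix stores explicit zeros, and it uses np.repeat in place of csr.nonzero() to line up each row's factor with csr.data.

```python
import numpy as np
import scipy.sparse


def tfidf_rows(counts):
    """Return a CSR matrix holding log(1 + tf) * log(N / (1 + df)).

    N is the number of columns and df is the number of non-zeros in the row.
    tfidf_rows is a hypothetical helper name, not part of SciPy.
    """
    # Work on a float copy so the caller's matrix is left untouched.
    csr = scipy.sparse.csr_matrix(counts, dtype=np.float64, copy=True)
    # Drop explicitly stored zeros so indptr differences equal df.
    csr.eliminate_zeros()

    n_cols = csr.shape[1]
    df = np.diff(csr.indptr)              # stored (non-zero) entries per row
    idf = np.log(n_cols / (1.0 + df))     # 1.0 forces float division

    # CSR keeps data row by row, so repeating each row's idf df times
    # yields one factor per entry of csr.data, in the same order.
    csr.data = np.log1p(csr.data) * np.repeat(idf, df)
    return csr


counts = np.array([[1, 0, 2, 0],
                   [0, 0, 0, 0],
                   [3, 1, 1, 0]])
weighted = tfidf_rows(counts)
```

Note that a row with df = N - 1 gets a factor of log(1) = 0, so its entries become stored zeros in the result.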
- Wow, thanks, this is great! Took me some time to understand how this was working. In the first line, won't using 1.0 instead of 1 in (1 + np.diff(csr.indptr)) lead to more accuracy? - Avisek, Dec 30, 2014 at 8:15
- @Avisek Depends on the version of Python you're using. The inaccurate part would be integer division, but Python 3 does floating division by default (// does integer division). - Veedrac, Dec 30, 2014 at 8:18
- Oh I see, yes, on Python 2.7 it does integer division by default. - Avisek, Dec 30, 2014 at 8:53
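To make the division point in these comments concrete, here is a small sketch with made-up row counts (row_counts and n_cols are illustrative values, not taken from the question). It uses // to reproduce what Python 2's / did on integers, and 1.0 to get float division on either version.

```python
import numpy as np

row_counts = np.array([1, 2, 3])   # hypothetical non-zeros per row
n_cols = 4                         # hypothetical number of columns

# Floor division: what `/` did for integers on Python 2.
floor_factors = n_cols // (1 + row_counts)

# Writing 1.0 promotes the denominator to float, so the result is exact
# on both Python 2 and Python 3.
true_factors = n_cols / (1.0 + row_counts)
```

Here the second row goes from 4/3 to 1 under floor division, and log(1) = 0 would wipe out that row's weights.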
- [...] csr and how dense is it? I'd like to be able to run this on my computer to test, but I need to be able to mock thedata.
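One possible way to mock the input, as a rough sketch: the shape, density, and count range below are pure guesses, since the real size and density of thedata are not stated anywhere in the thread.

```python
import numpy as np
import scipy.sparse

rs = np.random.RandomState(0)  # fixed seed so the mock is reproducible

# Assumed values: 200 x 1000 matrix, 1% of cells filled, counts in 1..9.
mock = scipy.sparse.random(
    200, 1000, density=0.01, format="csr", random_state=rs,
    data_rvs=lambda size: rs.randint(1, 10, size=size),
).astype(np.int64)
```

The result can stand in for csr in the code above; adjust the shape and density once the real figures are known.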