replaced http://codereview.stackexchange.com/ with https://codereview.stackexchange.com/

edited Apr 13, 2017 at 12:40

Using SciPy in Python, I am trying to divide every number in each column of a sparse matrix (400K × 500K, density 0.0005) by the sum of the squares of all numbers in that column.

If a column is [ [ 0 ] , [ 2 ] , [ 4 ] ], the sum of the squares is 20, so after computation the column should be [ [ 0 ] , [ 0.1 ] , [ 0.2 ] ].
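
The arithmetic in this example can be checked on a small dense array. This is only an illustrative sketch: the array `dense` and its second column are made up for the demonstration, and a dense array like this would not be practical for the real 400K × 500K matrix.

```python
import numpy as np

# Hypothetical toy data: the first column is the example from the text.
dense = np.array([[0.0, 1.0],
                  [2.0, 0.0],
                  [4.0, 3.0]])

# Sum of squares of each column (axis=0 collapses the rows).
col_sum_sq = (dense ** 2).sum(axis=0)

# Broadcasting divides every entry by the sum of squares of its own column.
# This toy array has no all-zero column, so there is no division by zero here.
scaled = dense / col_sum_sq

print(col_sum_sq)
print(scaled)
```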

This was my first attempt:

import numpy as np
import scipy.sparse as sp
# Loading the sparse matrix
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']), shape=csc['shape'], dtype=np.float64)
# Computing sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in range(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))
# Division of non-zero elements by the corresponding sum
for i in range(csc.shape[1]):
    x, y = csc[:, i].nonzero()
    del y
    if x.shape[0] > 0:
        csc[x, i] = np.array(csc[x, i].todense()) / maxv[i]

However, this seemed to take forever. I improved the second part (using "SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)"):

import numpy as np
import scipy.sparse as sp
# Loading the sparse matrix
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']), shape=csc['shape'], dtype=np.float64)
# THIS PART is slow
# Computing sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in range(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))
# Division of non-zero elements by the corresponding sum
csc = sp.csr_matrix(csc)
# ys holds the column of each non-zero entry, in the same order as csc.data
# (assuming no explicitly stored zeros)
xs, ys = csc.nonzero()
csc.data /= maxv[ys]
csc = sp.csc_matrix(csc)

... but I wonder whether the sum-of-squares computation can be improved further.
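
One possible way to avoid the Python-level loop, offered as a sketch under assumptions rather than a definitive fix: in CSC format the stored values of each column sit contiguously in `data`, and `indptr` records where each column starts. The column index of every stored value can therefore be rebuilt with `np.repeat`, and the squared values summed per column with `np.bincount`. The function name `scale_columns_by_sum_of_squares` and the toy matrix are hypothetical, and skipping columns whose sum is zero is an assumption about the desired behaviour.

```python
import numpy as np
import scipy.sparse as sp


def scale_columns_by_sum_of_squares(mat):
    """Return a CSC copy of `mat` with each column divided by its sum of squares."""
    csc = sp.csc_matrix(mat, dtype=np.float64, copy=True)
    csc.sum_duplicates()  # make sure each (row, column) position is stored once
    n_cols = csc.shape[1]
    # Column index of every stored value: column j owns
    # indptr[j + 1] - indptr[j] consecutive entries of `data`.
    cols = np.repeat(np.arange(n_cols), np.diff(csc.indptr))
    # Per-column sum of squares, computed without a Python loop.
    sum_sq = np.bincount(cols, weights=csc.data ** 2, minlength=n_cols)
    # Divide each stored value by the sum for its column; entries whose
    # column sum is 0 (only possible with explicitly stored zeros) are left as-is.
    denom = sum_sq[cols]
    np.divide(csc.data, denom, out=csc.data, where=denom != 0)
    return csc


# Hypothetical toy matrix; its first column is the example from the question.
toy = sp.csc_matrix(np.array([[0.0, 1.0, 0.0],
                              [2.0, 0.0, 0.0],
                              [4.0, 3.0, 0.0]]))
scaled = scale_columns_by_sum_of_squares(toy)
print(scaled.toarray())
```

Note that `cols` has one entry per stored value, so for a matrix with on the order of 10^8 non-zeros it is itself a large temporary array; whether that trade-off is acceptable depends on the available memory.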

deleted 40 characters in body

edited Feb 12, 2015 at 18:16

200_success

added 638 characters in body

edited Feb 12, 2015 at 8:17

Avisek

Using SciPy in Python, I am trying to divide every number in each column of a sparse matrix (400K × 500K, density 0.0005) by the sum of the squares of all numbers in that column.

If a column is [ [ 0 ] , [ 2 ] , [ 4 ] ], the sum of the squares is 20, so after computation the column should be [ [ 0 ] , [ 0.1 ] , [ 0.2 ] ].

This is what I have been trying:

import numpy as np
import scipy.sparse as sp
# Loading the sparse matrix
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']), shape=csc['shape'], dtype=np.float64)
# Computing sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in range(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))
# Division of non-zero elements by the corresponding sum
for i in range(csc.shape[1]):
    x, y = csc[:, i].nonzero()
    del y
    if x.shape[0] > 0:
        csc[x, i] = np.array(csc[x, i].todense()) / maxv[i]

However, this seems to take forever. Is there a way to do this faster?

EDIT: I improved the second part (using "SciPy sparse: optimize computation on non-zero elements of a sparse matrix (for tf-idf)"), but I wonder whether the sum-of-squares computation can be improved further:

import numpy as np
import scipy.sparse as sp
# Loading the sparse matrix
csc = np.load('sparse_matrix.npz')
csc = sp.csc_matrix((csc['data'], csc['indices'], csc['indptr']), shape=csc['shape'], dtype=np.float64)
# THIS PART is slow
# Computing sum of squares, per column
maxv = np.zeros(csc.shape[1])
for i in range(csc.shape[1]):
    maxv[i] = sum(np.square(csc[:, i].data))
# Division of non-zero elements by the corresponding sum
csc = sp.csr_matrix(csc)
# ys holds the column of each non-zero entry, in the same order as csc.data
# (assuming no explicitly stored zeros)
xs, ys = csc.nonzero()
csc.data /= maxv[ys]
csc = sp.csc_matrix(csc)
