NumPy Scipy optimization

Question 1

I have an optimization issue and I'm not sure if I can improve the overall speed of my function.

The function draw_w is my actual implementation that gives me the right update for my vector w. You can see that the code is running for 1000 iterations in 9s. I had a first implementation (draw_w_fw) that was quiet good because, well-vectorized, that is an order of magnitude quicker by 0.8s. Unfortunately, this implementation is wrong because the cache[0] is not updated correctly.

I would like to know if someone knows a way to speed up draw_w from what I've wrote.

import scipy.sparse as sps
import numpy as np
import timeit
def draw_w(X, w, cache):
 X = X.tocsr()
 for i in xrange(w.shape[0]):
 x_li = X.getrow(i)
 Y = w[i] * x_li
 Y.data -= np.take(cache[0], Y.indices) 
 h = x_li.multiply(-Y)
 w_mean = h.sum()
 w_sigma_sqr = (x_li.multiply(x_li)).sum() 
 w_sigma_sqr = 1.0 / w_sigma_sqr
 w_mean = - w_sigma_sqr * w_mean
 # update w:
 w_old = np.copy(w[i])
 if np.isinf(w[i]):
 w[i] = 0
 elif np.isnan(w[i]):
 w[i] = 0
 else:
 w[i] = w_mean
 # update error:
 cache[0] -= (w_old - w[i]) * x_li
 #print 'draw_w', w
 #True result w [ 2.34125626 2.37726446 4.00792293 3.71059779 4.00792293 0.11100713
 # -0.28899287 -0.04393113 0.21429929]
####################
def draw_w_fw(X, w, cache):
 x_li = X.tocsr()
 nnz_per_row = np.diff(x_li.indptr)
 Y = sps.csr_matrix((x_li.data * np.repeat(w, nnz_per_row), x_li.indices, x_li.indptr), shape=x_li.shape)
 Y.data -= np.take(cache[0], Y.indices) #not good because cache[0] is updated...
 h = x_li.multiply(-Y)
 w_mean = np.asarray(h.sum(axis=1).transpose())[0]
 w_sigma_sqr = np.asarray(( x_li.multiply(x_li) ).sum(axis=1).transpose())[0]
 w_sigma_sqr = 1.0 / w_sigma_sqr
 w_mean = - w_sigma_sqr * w_mean
 # update w:
 w_old = np.copy(w)
 w[~np.isnan(w) & ~np.isinf(w)] = w_mean[~np.isnan(w) & ~np.isinf(w)]
 w[np.isinf(w)] = 0.0
 w[np.isnan(w)] = 0.0
 # update error:
 cache[0] -= (w_old - w) * x_li
 #print 'draw_w_fw', w
##########################
def test():
 data = [ 1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.]
 cols = [ 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14]
 rows = [0,5,1,5,2,5,3,5,4,5,0,6,1,6,2,6,3,6,4,6,1,7,3,7,0,8,2,8,4,8]
 X = sps.coo_matrix((data, (rows, cols)), shape = (9,15))
 X = X.tocsr()
 w = np.array([ 0.16243454, -0.06117564, -0.05281718, -0.10729686, 0.08654076, -0.23015387, 0.17448118, -0.07612069, 0.03190391])
 cache = np.zeros((2, 15))
 cache[0] = np.asarray([-5.06771933, -5.29132951, -4.28297104, -1.33745073, -2.14361311, -0.66308429,
 -0.88669446, -2.878336, -4.93281569, -4.73897806, -1.13729633, -5.18341755,
 -0.80566155, -5.02091327, -4.88155533])
 #draw_w(X, w, cache)
 draw_w_fw(X, w, cache)
if __name__ == '__main__':
 print(timeit.timeit("test()", setup="from __main__ import test", number=1000))
 #draw_w: 9.26 for 1000 iterations
 #draw_w_fw: 0.80 for 1000 iterations

Question 2

Before streamlining the code, it almost always pays off to streamline the math. The h you calculate is the dot product of two vectors, X[i] and Y[i] = cache[0] - w[i]*X[i], so we can rewrite it as

h = X[i].cache[0] - w[i]*X[i].X[i]

When you go on to calculate w_mean, you are actually negating h and then dividing it by a number that turns out to be X[i].X[i], so you can calculate w_mean as

w_mean = w[i] - X[i].cache[0] / X[i].X[i]

Setting aside the np.isnan and np.isinf checks you are doing, you then update cache[0] with a value where w[i] cancels out, and you are basically doing:

cache[0] -= X[i] * (X[i].cache[0] / X[i].X[i])

Putting it all together into a single function, I got this very compact code:

from numpy.lib.stride_tricks import as_strided
def draw_w_bis(X, w, cache):
 X = X.tocsr()
 x_rows_sqr = np.add.reduceat(X.data*X.data, X.indptr[:-1])
 rows, cols = X.shape
 row_start_stop = as_strided(X.indptr, shape=(rows, 2),
 strides=2*X.indptr.strides)
 for row, (start, stop) in enumerate(row_start_stop):
 data = X.data[start:stop]
 cols = X.indices[start:stop]
 delta = np.dot(data, cache[0, cols]) / x_rows_sqr[row]
 w[row] -= delta
 cache[0, cols] -= delta * data

If you are not comfortable with calling as_strided you could replace the definition of row_start_stop with:

row_start_stop = np.column_stack((X.indptr[:-1], X.indptr[1:]))

On my system your test code runs the 1000 iterations over an order of magnitude faster (6 sec vs. 0.4 sec) which seems to be what you were after. There still is a for loop in there, so if you are going to run this on large arrays, it is still going to be slow, and you may want to consider using things like Cython.

Question 3

There are 2 parts to your speedup. 1) the last 3 lines that streamline the inner loop. 2) iteration using as_strided(X.indptr...). The streamlining saves about 20% when plugged into my fastest pure python version. Using strided instead of lil in my cython code drops the time from 0.43 to 0.41.

Question 4

At least with your test X, draw_x_fw works with 2 calls. The rows of X fall into two groups, each with a unique set of columns. I don't know if that's a structural part of the problem, or just a random characteristic your chosen test.

 n = 5
 draw_w_fw(X[:n,:], w[:n], cache) # n items
 draw_w_fw(X[n:,:], w[n:], cache)

With simple slicing like this, w[:n] is a view, so changes to w inside the function are visible outside it. If we do a more general row selection, we have to explicitly set the new w values.

Question 5

This part is a coincidence. On real dataset, I can't just find some slicing like that but good point.

Question 6

This is a simplified version of draw_w, working with the data for one row.

def draw_np(Xdata, Xindices, w, cache):
 # Xdata, Xindices are from a csr row
 # w scalar
 # cache vector
 Ydata = w * Xdata
 Ydata -= np.take(cache, Xindices)
 h = - Xdata * Ydata
 w_mean = h.sum()
 w_sigma_sqr = (Xdata*Xdata).sum()
 w_mean = - w_mean / w_sigma_sqr
 cache[Xindices] -= (w - w_mean) * Xdata
 return w_mean
for i in xrange(w.size):
 xli = X.getrow(i)
 w[i] = draw_np(xli.data, xli.indices , w[i], cache[0])

Compared to the 9 sec for draw_w, this is 1.6 sec. I've omitted the isInf and isNan tests (which should apply to w_mean), but I don't think that's significant. My guess is the speed comes from working with the simpler numpy data vectors rather than the full sparse matrix.

I wrote this simplified Python version as a step toward a faster (hopefully) cython function:

def draw_cy(Xdata, Xindices, w, cache):
 cdef double w_mean
 cdef double w_sigma_sqr
 cdef double ydata
 cdef double value
 cdef int size
 cdef int ind
 size = Xdata.shape[0]
 w_mean = 0.0
 w_sigma_sqr = 0.0
 for i in range(size):
 value = Xdata[i]
 ind = Xindices[i]
 ydata = w * value
 ydata -= cache[ind]
 ydata = - value * ydata
 w_mean += ydata
 w_sigma_sqr += value * value
 w_mean = - w_mean / w_sigma_sqr
 for i in range(size):
 value = Xdata[i]
 ind = Xindices[i]
 cache[ind] -= (w - w_mean) * value
 return w_mean
# requires a compile to .c and .so and import
for i in xrange(w.size):
 xli = X.getrow(i)
 w[i] = draw_cy(xli.data, xli.indices , w[i], cache[0])

This times at 1.1 sec, nearly as good as the fully vectorized (but erroneous) draw_w_fw. But the speedup compared to draw_np isn't that great.

more on timings. I pulled out various pieces from this last iteration, and deduced that

bare loop (pass) took 0.27 sec (that includes setup)
9 calls to getrow took 0.64 sec
9 calls to draw_cy took 0.10 sec
9 calls to draw_np took 0.60 sec

It's significant that the X.getrow() took as long as the barebones numpy draw_np calculation. My impression is that scipy.sparse is a convenient way of setting up sparse matricies, and a good way of doing large linear algebra problems (e.g. FEM), but it is not a fast way of doing general matrix operations.

I tried a dense numpy iterator:

Xa = X.A
for i in xrange(w.size):
 x1 = Xa[i,:]; nz = x1.nonzero()[0]; x1=x1[nz]
 w[i] = cy.draw_cy(x1, nz, w[i], cache[0])

The time for this is 0.62 sec., 0.5 less than the getrow version. Using draw_np gives a similar time savings.

In other words, if X isn't so large that you can't store it as a dense array, it might be faster to skip the sparse module entirely. But I haven't explored how these timings vary with the size of X and w.

lil iteration is even a bit faster, since its data and rows are matching lists:

Xl = X.tolil()
for i, (xd, xr) in enumerate(zip(Xl.data, Xl.rows)):
 #w[i] = cy.draw_cy(np.array(xd), xr, w[i], cache[0])
 w[i] = cpy.draw_py(np.array(xd), xr , w[i], cache[0])

If the X.data are all 1 (as in the test case), the calculations can be simplified even more (only X.rows has useful information).

Question 7

I tested it on my computer and don't have the same speed up. I have 3.75 so it's definitely an improve but still too slow for what I need. I'm impress by the speedup for so little change. I'll definitely take a look at Cython and give you the bounty if nobody else answer.

Question 8

Further testing shows that X.getrow() takes a nontrivial amount of time.

Jaime Jaime 6,62721 silver badges30 bronze badges · Accepted Answer · 2013-11-02 15:40:34Z

Before streamlining the code, it almost always pays off to streamline the math. The h you calculate is the dot product of two vectors, X[i] and Y[i] = cache[0] - w[i]*X[i], so we can rewrite it as

h = X[i].cache[0] - w[i]*X[i].X[i]

When you go on to calculate w_mean, you are actually negating h and then dividing it by a number that turns out to be X[i].X[i], so you can calculate w_mean as

w_mean = w[i] - X[i].cache[0] / X[i].X[i]

Setting aside the np.isnan and np.isinf checks you are doing, you then update cache[0] with a value where w[i] cancels out, and you are basically doing:

cache[0] -= X[i] * (X[i].cache[0] / X[i].X[i])

Putting it all together into a single function, I got this very compact code:

from numpy.lib.stride_tricks import as_strided
def draw_w_bis(X, w, cache):
 X = X.tocsr()
 x_rows_sqr = np.add.reduceat(X.data*X.data, X.indptr[:-1])
 rows, cols = X.shape
 row_start_stop = as_strided(X.indptr, shape=(rows, 2),
 strides=2*X.indptr.strides)
 for row, (start, stop) in enumerate(row_start_stop):
 data = X.data[start:stop]
 cols = X.indices[start:stop]
 delta = np.dot(data, cache[0, cols]) / x_rows_sqr[row]
 w[row] -= delta
 cache[0, cols] -= delta * data

If you are not comfortable with calling as_strided you could replace the definition of row_start_stop with:

row_start_stop = np.column_stack((X.indptr[:-1], X.indptr[1:]))

On my system your test code runs the 1000 iterations over an order of magnitude faster (6 sec vs. 0.4 sec) which seems to be what you were after. There still is a for loop in there, so if you are going to run this on large arrays, it is still going to be slow, and you may want to consider using things like Cython.

There are 2 parts to your speedup. 1) the last 3 lines that streamline the inner loop. 2) iteration using as_strided(X.indptr...). The streamlining saves about 20% when plugged into my fastest pure python version. Using strided instead of lil in my cython code drops the time from 0.43 to 0.41.

Stack Exchange Network

NumPy Scipy optimization

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

NumPy Scipy optimization

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions