1
\$\begingroup\$

I have an optimization issue and I'm not sure if I can improve the overall speed of my function.

The function draw_w is my actual implementation that gives me the right update for my vector w. You can see that the code is running for 1000 iterations in 9s. I had a first implementation (draw_w_fw) that was quiet good because, well-vectorized, that is an order of magnitude quicker by 0.8s. Unfortunately, this implementation is wrong because the cache[0] is not updated correctly.

I would like to know if someone knows a way to speed up draw_w from what I've wrote.

import scipy.sparse as sps
import numpy as np
import timeit
def draw_w(X, w, cache):
 X = X.tocsr()
 for i in xrange(w.shape[0]):
 x_li = X.getrow(i)
 Y = w[i] * x_li
 Y.data -= np.take(cache[0], Y.indices) 
 h = x_li.multiply(-Y)
 w_mean = h.sum()
 w_sigma_sqr = (x_li.multiply(x_li)).sum() 
 w_sigma_sqr = 1.0 / w_sigma_sqr
 w_mean = - w_sigma_sqr * w_mean
 # update w:
 w_old = np.copy(w[i])
 if np.isinf(w[i]):
 w[i] = 0
 elif np.isnan(w[i]):
 w[i] = 0
 else:
 w[i] = w_mean
 # update error:
 cache[0] -= (w_old - w[i]) * x_li
 #print 'draw_w', w
 #True result w [ 2.34125626 2.37726446 4.00792293 3.71059779 4.00792293 0.11100713
 # -0.28899287 -0.04393113 0.21429929]
####################
def draw_w_fw(X, w, cache):
 x_li = X.tocsr()
 nnz_per_row = np.diff(x_li.indptr)
 Y = sps.csr_matrix((x_li.data * np.repeat(w, nnz_per_row), x_li.indices, x_li.indptr), shape=x_li.shape)
 Y.data -= np.take(cache[0], Y.indices) #not good because cache[0] is updated...
 h = x_li.multiply(-Y)
 w_mean = np.asarray(h.sum(axis=1).transpose())[0]
 w_sigma_sqr = np.asarray(( x_li.multiply(x_li) ).sum(axis=1).transpose())[0]
 w_sigma_sqr = 1.0 / w_sigma_sqr
 w_mean = - w_sigma_sqr * w_mean
 # update w:
 w_old = np.copy(w)
 w[~np.isnan(w) & ~np.isinf(w)] = w_mean[~np.isnan(w) & ~np.isinf(w)]
 w[np.isinf(w)] = 0.0
 w[np.isnan(w)] = 0.0
 # update error:
 cache[0] -= (w_old - w) * x_li
 #print 'draw_w_fw', w
##########################
def test():
 data = [ 1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.]
 cols = [ 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14]
 rows = [0,5,1,5,2,5,3,5,4,5,0,6,1,6,2,6,3,6,4,6,1,7,3,7,0,8,2,8,4,8]
 X = sps.coo_matrix((data, (rows, cols)), shape = (9,15))
 X = X.tocsr()
 w = np.array([ 0.16243454, -0.06117564, -0.05281718, -0.10729686, 0.08654076, -0.23015387, 0.17448118, -0.07612069, 0.03190391])
 cache = np.zeros((2, 15))
 cache[0] = np.asarray([-5.06771933, -5.29132951, -4.28297104, -1.33745073, -2.14361311, -0.66308429,
 -0.88669446, -2.878336, -4.93281569, -4.73897806, -1.13729633, -5.18341755,
 -0.80566155, -5.02091327, -4.88155533])
 #draw_w(X, w, cache)
 draw_w_fw(X, w, cache)
if __name__ == '__main__':
 print(timeit.timeit("test()", setup="from __main__ import test", number=1000))
 #draw_w: 9.26 for 1000 iterations
 #draw_w_fw: 0.80 for 1000 iterations
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Oct 14, 2013 at 4:18
\$\endgroup\$

3 Answers 3

2
+100
\$\begingroup\$

Before streamlining the code, it almost always pays off to streamline the math. The h you calculate is the dot product of two vectors, X[i] and Y[i] = cache[0] - w[i]*X[i], so we can rewrite it as

h = X[i].cache[0] - w[i]*X[i].X[i]

When you go on to calculate w_mean, you are actually negating h and then dividing it by a number that turns out to be X[i].X[i], so you can calculate w_mean as

w_mean = w[i] - X[i].cache[0] / X[i].X[i]

Setting aside the np.isnan and np.isinf checks you are doing, you then update cache[0] with a value where w[i] cancels out, and you are basically doing:

cache[0] -= X[i] * (X[i].cache[0] / X[i].X[i])

Putting it all together into a single function, I got this very compact code:

from numpy.lib.stride_tricks import as_strided
def draw_w_bis(X, w, cache):
 X = X.tocsr()
 x_rows_sqr = np.add.reduceat(X.data*X.data, X.indptr[:-1])
 rows, cols = X.shape
 row_start_stop = as_strided(X.indptr, shape=(rows, 2),
 strides=2*X.indptr.strides)
 for row, (start, stop) in enumerate(row_start_stop):
 data = X.data[start:stop]
 cols = X.indices[start:stop]
 delta = np.dot(data, cache[0, cols]) / x_rows_sqr[row]
 w[row] -= delta
 cache[0, cols] -= delta * data

If you are not comfortable with calling as_strided you could replace the definition of row_start_stop with:

row_start_stop = np.column_stack((X.indptr[:-1], X.indptr[1:]))

On my system your test code runs the 1000 iterations over an order of magnitude faster (6 sec vs. 0.4 sec) which seems to be what you were after. There still is a for loop in there, so if you are going to run this on large arrays, it is still going to be slow, and you may want to consider using things like Cython.

answered Nov 2, 2013 at 15:40
\$\endgroup\$
1
  • \$\begingroup\$ There are 2 parts to your speedup. 1) the last 3 lines that streamline the inner loop. 2) iteration using as_strided(X.indptr...). The streamlining saves about 20% when plugged into my fastest pure python version. Using strided instead of lil in my cython code drops the time from 0.43 to 0.41. \$\endgroup\$ Commented Nov 17, 2013 at 23:33
1
\$\begingroup\$

At least with your test X, draw_x_fw works with 2 calls. The rows of X fall into two groups, each with a unique set of columns. I don't know if that's a structural part of the problem, or just a random characteristic your chosen test.

 n = 5
 draw_w_fw(X[:n,:], w[:n], cache) # n items
 draw_w_fw(X[n:,:], w[n:], cache)

With simple slicing like this, w[:n] is a view, so changes to w inside the function are visible outside it. If we do a more general row selection, we have to explicitly set the new w values.

answered Oct 30, 2013 at 18:48
\$\endgroup\$
1
  • \$\begingroup\$ This part is a coincidence. On real dataset, I can't just find some slicing like that but good point. \$\endgroup\$ Commented Oct 31, 2013 at 11:33
1
\$\begingroup\$

This is a simplified version of draw_w, working with the data for one row.

def draw_np(Xdata, Xindices, w, cache):
 # Xdata, Xindices are from a csr row
 # w scalar
 # cache vector
 Ydata = w * Xdata
 Ydata -= np.take(cache, Xindices)
 h = - Xdata * Ydata
 w_mean = h.sum()
 w_sigma_sqr = (Xdata*Xdata).sum()
 w_mean = - w_mean / w_sigma_sqr
 cache[Xindices] -= (w - w_mean) * Xdata
 return w_mean
for i in xrange(w.size):
 xli = X.getrow(i)
 w[i] = draw_np(xli.data, xli.indices , w[i], cache[0])

Compared to the 9 sec for draw_w, this is 1.6 sec. I've omitted the isInf and isNan tests (which should apply to w_mean), but I don't think that's significant. My guess is the speed comes from working with the simpler numpy data vectors rather than the full sparse matrix.

I wrote this simplified Python version as a step toward a faster (hopefully) cython function:

def draw_cy(Xdata, Xindices, w, cache):
 cdef double w_mean
 cdef double w_sigma_sqr
 cdef double ydata
 cdef double value
 cdef int size
 cdef int ind
 size = Xdata.shape[0]
 w_mean = 0.0
 w_sigma_sqr = 0.0
 for i in range(size):
 value = Xdata[i]
 ind = Xindices[i]
 ydata = w * value
 ydata -= cache[ind]
 ydata = - value * ydata
 w_mean += ydata
 w_sigma_sqr += value * value
 w_mean = - w_mean / w_sigma_sqr
 for i in range(size):
 value = Xdata[i]
 ind = Xindices[i]
 cache[ind] -= (w - w_mean) * value
 return w_mean
# requires a compile to .c and .so and import
for i in xrange(w.size):
 xli = X.getrow(i)
 w[i] = draw_cy(xli.data, xli.indices , w[i], cache[0])

This times at 1.1 sec, nearly as good as the fully vectorized (but erroneous) draw_w_fw. But the speedup compared to draw_np isn't that great.


more on timings. I pulled out various pieces from this last iteration, and deduced that

bare loop (pass) took 0.27 sec (that includes setup)
9 calls to getrow took 0.64 sec
9 calls to draw_cy took 0.10 sec
9 calls to draw_np took 0.60 sec

It's significant that the X.getrow() took as long as the barebones numpy draw_np calculation. My impression is that scipy.sparse is a convenient way of setting up sparse matricies, and a good way of doing large linear algebra problems (e.g. FEM), but it is not a fast way of doing general matrix operations.

I tried a dense numpy iterator:

Xa = X.A
for i in xrange(w.size):
 x1 = Xa[i,:]; nz = x1.nonzero()[0]; x1=x1[nz]
 w[i] = cy.draw_cy(x1, nz, w[i], cache[0])

The time for this is 0.62 sec., 0.5 less than the getrow version. Using draw_np gives a similar time savings.

In other words, if X isn't so large that you can't store it as a dense array, it might be faster to skip the sparse module entirely. But I haven't explored how these timings vary with the size of X and w.


lil iteration is even a bit faster, since its data and rows are matching lists:

Xl = X.tolil()
for i, (xd, xr) in enumerate(zip(Xl.data, Xl.rows)):
 #w[i] = cy.draw_cy(np.array(xd), xr, w[i], cache[0])
 w[i] = cpy.draw_py(np.array(xd), xr , w[i], cache[0])

If the X.data are all 1 (as in the test case), the calculations can be simplified even more (only X.rows has useful information).

answered Oct 31, 2013 at 4:01
\$\endgroup\$
2
  • \$\begingroup\$ I tested it on my computer and don't have the same speed up. I have 3.75 so it's definitely an improve but still too slow for what I need. I'm impress by the speedup for so little change. I'll definitely take a look at Cython and give you the bounty if nobody else answer. \$\endgroup\$ Commented Oct 31, 2013 at 16:13
  • \$\begingroup\$ Further testing shows that X.getrow() takes a nontrivial amount of time. \$\endgroup\$ Commented Nov 1, 2013 at 4:26

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.