I have an optimization issue and I'm not sure if I can improve the overall speed of my function.
The function draw_w
is my actual implementation that gives me the right update for my vector w
. You can see that the code is running for 1000 iterations in 9s.
I had a first implementation (draw_w_fw
) that was quiet good because, well-vectorized, that is an order of magnitude quicker by 0.8s. Unfortunately, this implementation is wrong because the cache[0]
is not updated correctly.
I would like to know if someone knows a way to speed up draw_w
from what I've wrote.
import scipy.sparse as sps
import numpy as np
import timeit
def draw_w(X, w, cache):
X = X.tocsr()
for i in xrange(w.shape[0]):
x_li = X.getrow(i)
Y = w[i] * x_li
Y.data -= np.take(cache[0], Y.indices)
h = x_li.multiply(-Y)
w_mean = h.sum()
w_sigma_sqr = (x_li.multiply(x_li)).sum()
w_sigma_sqr = 1.0 / w_sigma_sqr
w_mean = - w_sigma_sqr * w_mean
# update w:
w_old = np.copy(w[i])
if np.isinf(w[i]):
w[i] = 0
elif np.isnan(w[i]):
w[i] = 0
else:
w[i] = w_mean
# update error:
cache[0] -= (w_old - w[i]) * x_li
#print 'draw_w', w
#True result w [ 2.34125626 2.37726446 4.00792293 3.71059779 4.00792293 0.11100713
# -0.28899287 -0.04393113 0.21429929]
####################
def draw_w_fw(X, w, cache):
x_li = X.tocsr()
nnz_per_row = np.diff(x_li.indptr)
Y = sps.csr_matrix((x_li.data * np.repeat(w, nnz_per_row), x_li.indices, x_li.indptr), shape=x_li.shape)
Y.data -= np.take(cache[0], Y.indices) #not good because cache[0] is updated...
h = x_li.multiply(-Y)
w_mean = np.asarray(h.sum(axis=1).transpose())[0]
w_sigma_sqr = np.asarray(( x_li.multiply(x_li) ).sum(axis=1).transpose())[0]
w_sigma_sqr = 1.0 / w_sigma_sqr
w_mean = - w_sigma_sqr * w_mean
# update w:
w_old = np.copy(w)
w[~np.isnan(w) & ~np.isinf(w)] = w_mean[~np.isnan(w) & ~np.isinf(w)]
w[np.isinf(w)] = 0.0
w[np.isnan(w)] = 0.0
# update error:
cache[0] -= (w_old - w) * x_li
#print 'draw_w_fw', w
##########################
def test():
data = [ 1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.]
cols = [ 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14]
rows = [0,5,1,5,2,5,3,5,4,5,0,6,1,6,2,6,3,6,4,6,1,7,3,7,0,8,2,8,4,8]
X = sps.coo_matrix((data, (rows, cols)), shape = (9,15))
X = X.tocsr()
w = np.array([ 0.16243454, -0.06117564, -0.05281718, -0.10729686, 0.08654076, -0.23015387, 0.17448118, -0.07612069, 0.03190391])
cache = np.zeros((2, 15))
cache[0] = np.asarray([-5.06771933, -5.29132951, -4.28297104, -1.33745073, -2.14361311, -0.66308429,
-0.88669446, -2.878336, -4.93281569, -4.73897806, -1.13729633, -5.18341755,
-0.80566155, -5.02091327, -4.88155533])
#draw_w(X, w, cache)
draw_w_fw(X, w, cache)
if __name__ == '__main__':
print(timeit.timeit("test()", setup="from __main__ import test", number=1000))
#draw_w: 9.26 for 1000 iterations
#draw_w_fw: 0.80 for 1000 iterations
3 Answers 3
Before streamlining the code, it almost always pays off to streamline the math. The h
you calculate is the dot product of two vectors, X[i]
and Y[i] = cache[0] - w[i]*X[i]
, so we can rewrite it as
h = X[i].cache[0] - w[i]*X[i].X[i]
When you go on to calculate w_mean
, you are actually negating h
and then dividing it by a number that turns out to be X[i].X[i]
, so you can calculate w_mean
as
w_mean = w[i] - X[i].cache[0] / X[i].X[i]
Setting aside the np.isnan
and np.isinf
checks you are doing, you then update cache[0]
with a value where w[i]
cancels out, and you are basically doing:
cache[0] -= X[i] * (X[i].cache[0] / X[i].X[i])
Putting it all together into a single function, I got this very compact code:
from numpy.lib.stride_tricks import as_strided
def draw_w_bis(X, w, cache):
X = X.tocsr()
x_rows_sqr = np.add.reduceat(X.data*X.data, X.indptr[:-1])
rows, cols = X.shape
row_start_stop = as_strided(X.indptr, shape=(rows, 2),
strides=2*X.indptr.strides)
for row, (start, stop) in enumerate(row_start_stop):
data = X.data[start:stop]
cols = X.indices[start:stop]
delta = np.dot(data, cache[0, cols]) / x_rows_sqr[row]
w[row] -= delta
cache[0, cols] -= delta * data
If you are not comfortable with calling as_strided
you could replace the definition of row_start_stop
with:
row_start_stop = np.column_stack((X.indptr[:-1], X.indptr[1:]))
On my system your test code runs the 1000 iterations over an order of magnitude faster (6 sec vs. 0.4 sec) which seems to be what you were after. There still is a for loop in there, so if you are going to run this on large arrays, it is still going to be slow, and you may want to consider using things like Cython.
-
\$\begingroup\$ There are 2 parts to your speedup. 1) the last 3 lines that streamline the inner loop. 2) iteration using
as_strided(X.indptr...)
. The streamlining saves about 20% when plugged into my fastest pure python version. Usingstrided
instead oflil
in mycython
code drops the time from0.43
to0.41
. \$\endgroup\$hpaulj– hpaulj2013年11月17日 23:33:18 +00:00Commented Nov 17, 2013 at 23:33
At least with your test X
, draw_x_fw
works with 2 calls. The rows of X
fall into two groups, each with a unique set of columns. I don't know if that's a structural part of the problem, or just a random characteristic your chosen test.
n = 5
draw_w_fw(X[:n,:], w[:n], cache) # n items
draw_w_fw(X[n:,:], w[n:], cache)
With simple slicing like this, w[:n]
is a view, so changes to w
inside the function are visible outside it. If we do a more general row selection, we have to explicitly set the new w
values.
-
\$\begingroup\$ This part is a coincidence. On real dataset, I can't just find some slicing like that but good point. \$\endgroup\$ThiS– ThiS2013年10月31日 11:33:05 +00:00Commented Oct 31, 2013 at 11:33
This is a simplified version of draw_w
, working with the data for one row.
def draw_np(Xdata, Xindices, w, cache):
# Xdata, Xindices are from a csr row
# w scalar
# cache vector
Ydata = w * Xdata
Ydata -= np.take(cache, Xindices)
h = - Xdata * Ydata
w_mean = h.sum()
w_sigma_sqr = (Xdata*Xdata).sum()
w_mean = - w_mean / w_sigma_sqr
cache[Xindices] -= (w - w_mean) * Xdata
return w_mean
for i in xrange(w.size):
xli = X.getrow(i)
w[i] = draw_np(xli.data, xli.indices , w[i], cache[0])
Compared to the 9 sec for draw_w
, this is 1.6 sec. I've omitted the isInf
and isNan
tests (which should apply to w_mean
), but I don't think that's significant. My guess is the speed comes from working with the simpler numpy
data vectors rather than the full sparse
matrix.
I wrote this simplified Python version as a step toward a faster (hopefully) cython
function:
def draw_cy(Xdata, Xindices, w, cache):
cdef double w_mean
cdef double w_sigma_sqr
cdef double ydata
cdef double value
cdef int size
cdef int ind
size = Xdata.shape[0]
w_mean = 0.0
w_sigma_sqr = 0.0
for i in range(size):
value = Xdata[i]
ind = Xindices[i]
ydata = w * value
ydata -= cache[ind]
ydata = - value * ydata
w_mean += ydata
w_sigma_sqr += value * value
w_mean = - w_mean / w_sigma_sqr
for i in range(size):
value = Xdata[i]
ind = Xindices[i]
cache[ind] -= (w - w_mean) * value
return w_mean
# requires a compile to .c and .so and import
for i in xrange(w.size):
xli = X.getrow(i)
w[i] = draw_cy(xli.data, xli.indices , w[i], cache[0])
This times at 1.1 sec, nearly as good as the fully vectorized (but erroneous) draw_w_fw
. But the speedup compared to draw_np
isn't that great.
more on timings. I pulled out various pieces from this last iteration, and deduced that
bare loop (pass) took 0.27 sec (that includes setup)
9 calls to getrow took 0.64 sec
9 calls to draw_cy took 0.10 sec
9 calls to draw_np took 0.60 sec
It's significant that the X.getrow()
took as long as the barebones numpy draw_np
calculation. My impression is that scipy.sparse
is a convenient way of setting up sparse matricies, and a good way of doing large linear algebra problems (e.g. FEM), but it is not a fast way of doing general matrix operations.
I tried a dense numpy
iterator:
Xa = X.A
for i in xrange(w.size):
x1 = Xa[i,:]; nz = x1.nonzero()[0]; x1=x1[nz]
w[i] = cy.draw_cy(x1, nz, w[i], cache[0])
The time for this is 0.62 sec., 0.5 less than the getrow
version. Using draw_np
gives a similar time savings.
In other words, if X
isn't so large that you can't store it as a dense array, it might be faster to skip the sparse
module entirely. But I haven't explored how these timings vary with the size of X
and w
.
lil
iteration is even a bit faster, since its data
and rows
are matching lists:
Xl = X.tolil()
for i, (xd, xr) in enumerate(zip(Xl.data, Xl.rows)):
#w[i] = cy.draw_cy(np.array(xd), xr, w[i], cache[0])
w[i] = cpy.draw_py(np.array(xd), xr , w[i], cache[0])
If the X.data
are all 1
(as in the test case), the calculations can be simplified even more (only X.rows
has useful information).
-
\$\begingroup\$ I tested it on my computer and don't have the same speed up. I have 3.75 so it's definitely an improve but still too slow for what I need. I'm impress by the speedup for so little change. I'll definitely take a look at Cython and give you the bounty if nobody else answer. \$\endgroup\$ThiS– ThiS2013年10月31日 16:13:24 +00:00Commented Oct 31, 2013 at 16:13
-
\$\begingroup\$ Further testing shows that
X.getrow()
takes a nontrivial amount of time. \$\endgroup\$hpaulj– hpaulj2013年11月01日 04:26:44 +00:00Commented Nov 1, 2013 at 4:26