I have written the same code in Python (NumPy) and in Matlab, and I tried to use the same structure for both language and follow the same procedure. Now my problem is that when I run my code in Python it's very very slow but it runs fast in Matlab.
How can I improve my Python code? I want to work on the Python version of my code.
In the code I am trying to create an MxM matrix (M can vary between 100 to 1000) and trying to fill this matrix 16 times, each time I add the previous value of M[i,j] with the current value. The operation inside the code is simple just some add and divide, and now I wonder why Matlab performs better than Python.
Python:
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
startTime = datetime.now()
bb = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
bb = ['{0:04b}'.format(x) for x in bb]
step_prob=0.01
step = int(1/step_prob)
out_t = np.zeros((step,step), dtype=float)
out_f = np.zeros((step,step), dtype=float)
s = np.zeros((step,step), dtype=float)
improve = np.empty((step,step))
for ii in range(1, step+1):
print ii
pti=ii*step_prob
for jj in range(1, step+1):
pfi=jj*step_prob
improve[ii-1,jj-1]=0
si=(1-pfi+pti)/2
for k in range(len(bb)):
f1 = int(bb[k][0])
f2 = int(bb[k][1])
f3 = int(bb[k][2])
f4 = int(bb[k][3])
for i in range(1, step+1):
for j in range(1, step+1):
ptj=i*step_prob
pfj=j*step_prob
out_t[i-1,j-1] = f1*pti*ptj+f2*pti*(1-ptj)+f3*(1-pti)*ptj+f4*(1-pti)*(1-ptj)
out_f[i-1,j-1] = f1*pfi*pfj+f2*pfi*(1-pfj)+f3*(1-pfi)*pfj+f4*(1-pfi)*(1-pfj)
sj=(1-pfj+ptj)/2;
s[i-1,j-1] = ((1-out_f[i-1,j-1]+out_t[i-1,j-1])/2)-max(si,sj);
# temp=s*(s>0)*tril(ones(size(s)));
temp = s*(s>0)*np.tril(np.ones((len(s),len(s))).astype(int))
improve[ii-1,jj-1]=improve[ii-1,jj-1]+sum(sum(temp))
im = plt.imshow(improve*np.tril(np.ones((len(improve),len(improve))).astype(int)),interpolation='bilinear', aspect='auto') # s*[s>0]
# im = plt.imshow(improve, interpolation='bilinear', aspect='auto') # s*[s>0]
plt.colorbar(im, orientation='vertical')
plt.show()
Matlab:
close all
clear all
clc
bb = de2bi([0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15],4,'left-msb');
% bb = de2bi([1],4);
step_prob=0.01;
for ii=1:1*(1/step_prob)
ii
pti=ii*step_prob;
for jj=1:1*(1/step_prob)
pfi=jj*step_prob;
improve(ii,jj)=0;
si=(1-pfi+pti)/2;
for k=1:size(bb,1)
f1=bb(k,1);
f2=bb(k,2);
f3=bb(k,3);
f4=bb(k,4);
for i=1:1*(1/step_prob)
for j=1:1*(1/step_prob)
ptj=i*step_prob;
pfj=j*step_prob;
out_t(i,j)=f1*pti*ptj+f2*pti*(1-ptj)+f3*(1-pti)*ptj+f4*(1-pti)*(1-ptj);
out_f(i,j)=f1*pfi*pfj+f2*pfi*(1-pfj)+f3*(1-pfi)*pfj+f4*(1-pfi)*(1-pfj);
sj=(1-pfj+ptj)/2;
s(i,j)=((1-out_f(i,j)+out_t(i,j))/2)-max(si,sj);
end
end
temp=s.*(s>0).*tril(ones(size(s)));
improve(ii,jj)=improve(ii,jj)+sum(sum(temp));
clear f1 f2 f3 f4
end
end
end
scale=step_prob:step_prob:1;
imagesc(scale,scale,improve.*tril(ones(size(improve))))
xlabel('Pfj')
ylabel('Ptj')
axis xy
-
3\$\begingroup\$ Your title should state what your code is for, not that it is slow! What is it's purpose? \$\endgroup\$Adam– Adam2014年04月28日 22:00:02 +00:00Commented Apr 28, 2014 at 22:00
-
4\$\begingroup\$ To make life easier for reviewers, please add sufficient context to your question. The more you tell us about what your code does and what the purpose of doing that is, the easier it will be for reviewers to help you. \$\endgroup\$Simon Forsberg– Simon Forsberg2014年04月28日 23:03:20 +00:00Commented Apr 28, 2014 at 23:03
-
\$\begingroup\$ Coincidentally, I have the exact same problem, except the speeds of the two languages are switched for me... \$\endgroup\$Vedaad Shakib– Vedaad Shakib2015年07月14日 08:57:13 +00:00Commented Jul 14, 2015 at 8:57
2 Answers 2
You use very short variable names, so its hard to follow what your code is doing.
For speed in numpy, you need to use vector operations. So instead of writing
for ii in range(step):
out[ii] = ii * x + b
You should write:
out = numpy.arange(step)*x + b
Individually accessing the elements of numpy array is actually quite slow. You need to write your code to do everything in vector operations.
Here is my rewrite of your inner loop to be vectorized. You could vectorize the outer loop as well, but this should make the biggest difference.
f1 = int(bb[k][0])
f2 = int(bb[k][1])
f3 = int(bb[k][2])
f4 = int(bb[k][3])
ptj = np.arange(step_prob, 1.0 + step_prob, step_prob)[:, None]
pfj = np.arange(step_prob, 1.0 + step_prob, step_prob)[None, :]
out_t = f1*pti*ptj+f2*pti*(1-ptj)+f3*(1-pti)*ptj+f4*(1-pti)*(1-ptj)
out_f = f1*pfi*pfj+f2*pfi*(1-pfj)+f3*(1-pfi)*pfj+f4*(1-pfi)*(1-pfj)
sj = (1 - pfj + ptj) / 2
the_max = np.maximum(sj, [si])
s = ((1-out_f+out_t)/2)-the_max
temp = s*(s>0)*np.tril(np.ones((len(s),len(s))).astype(int))
improve[ii-1,jj-1]=improve[ii-1,jj-1]+temp.sum()
-
\$\begingroup\$ it's kinda impossible for me to do this, since I need to compute some "expected value" individually for different threshold. anyway I don't understand why this way of coding works find in Matlab but not in Numpy \$\endgroup\$Am1rr3zA– Am1rr3zA2014年04月30日 00:41:46 +00:00Commented Apr 30, 2014 at 0:41
-
5\$\begingroup\$ @Am1rr3zA, Python and Matlab are different languages. In Python, manual loops like you've written are rather slow. They used to be slow in Matlab too, but the implementation was improved. Furthermore, I doubt its actually impossible to vectorize, but since I have no idea what you are doing in this code it's hard to give you suggestions. \$\endgroup\$Winston Ewert– Winston Ewert2014年04月30日 00:52:21 +00:00Commented Apr 30, 2014 at 0:52
-
\$\begingroup\$ If I want to explain my code in a simple word I am trying to fill a MxM matrix (improve) each value of matrix is computed as follows: improve[i,j] is sum of 16 times (for all the element in bb array) calculating my formula ( out_t(i,j) and out_f(i,j) ), and my formula try to compute expected value of pti and pfi (outter loop) for all possible range of ptj and pfj (inner loop). \$\endgroup\$Am1rr3zA– Am1rr3zA2014年04月30日 01:02:59 +00:00Commented Apr 30, 2014 at 1:02
-
3\$\begingroup\$ @Am1rr3zA, the third parameter to linspace is a count, not a step size. If you want to use linspace, it should be np.linspace(0, 1, steps). But that's not quite what you were doing because your code skipped zero. You do need the [:, None] to make it work. I may well have not gotten it quite right, but hopefully it helps you see what you could do to speed up the poythn. \$\endgroup\$Winston Ewert– Winston Ewert2014年04月30日 02:59:17 +00:00Commented Apr 30, 2014 at 2:59
-
1\$\begingroup\$ @Am1rr3zA, as I noted, I may not have gotten it quite correct, but the general principle should work. If you fix my arange to refer to step_prob instead of step, I get the correct results at least for the first iteration. \$\endgroup\$Winston Ewert– Winston Ewert2014年05月01日 21:23:00 +00:00Commented May 1, 2014 at 21:23
My observations, in terms of implementing this in Python (without regards to improve the algorithm, that will be later):
- You are doing a LOT of unnecessary type conversions in Python that you aren't doing in MATLAB. Type conversions are always slow, so this is going to hurt your performance a lot.
- You shouldn't do
for k in range(len(bb)):
, always iterate over the sequence directly. You can use unpacking to get the values without needing to index in this case. - You are iterating over
range(1, foo+1)
, then subtracting one to put it back into Python 0-based indexing. This is an unnecessary mathematical operation, better to dorange(foo)
, or in your caseenumerate(range(1, foo+1))
. Better yet, pre-compute your values, then enumerate over those. numpy.unpackbits
is the equivalent ofde2bi
and will be much faster.- Python has in-place operations, which will be faster. So
a += 1
instead ofa = a + 1
. - Numpy sums work of flattened arrays by default. Further, summing is a method of numpy arrays, which is simpler and probably faster.
- Numpy arrays are floats by default if created using
ones
orzeros
, but there is adtype
argument that can set it to whatever you want. - Numpy has a
ones_like
andzeros_like
to create an array of the same shape as another. By default it also creates one with the same dtype, but you can override that. - PEP8 is the recommended style for python code.
Now, in terms of improving the algorithm:
si
andsj
are not very big (on the order of tens of megabytes, tops). You can vectorize those and re-use the values. In fact, they are identical, so you only need one.- You not only never use the other values of
out_t
andout_f
besides the current one, you overwrite them repeatedly. So these are better off as scalars. - You can vectorize many of the values for the two innermost loops.
- If you put the third loop as the outer loop, you can vectorize
out_t
andout_f
.
So here is my partially-vectorized version:
bb0 = np.arange(16, dtype='uint8')
bb = np.unpackbits(bb0[None], axis=0)[-4:, :].T
step_prob = 0.01
step = int(1/step_prob)
pxx = np.linspace(step_prob, 1, step)
pxxt = pxx[:, None]
sx = (1-pxx+pxxt)/2
improve = np.zeros((step,step))
for f4, f3, f2, f1 in bb:
out_x = f1*pxx*pxxt + f2*pxx*(1-pxxt) + f3*(1-pxx)*pxxt + f4*(1-pxx)*(1-pxxt)
for ii, (out_t, sxi) in enumerate(izip(out_x, sx)):
for jj, (out_f, si) in enumerate(izip(out_x, sxi)):
s = (1-out_f[:, None]+out_t) - np.maximum(si, sxi)
temp = s*(s>0)*np.tril(np.ones_like(s))
improve[ii, jj] = temp.sum()
It is orders of magnitude faster than your version.
-
\$\begingroup\$ your points all are on point \$\endgroup\$Am1rr3zA– Am1rr3zA2016年05月09日 14:35:02 +00:00Commented May 9, 2016 at 14:35
Explore related questions
See similar questions with these tags.