Creation of MxM matrix: Python version much slower than Matlab version

Question 1

I have written the same code in Python (NumPy) and in Matlab, and I tried to use the same structure for both language and follow the same procedure. Now my problem is that when I run my code in Python it's very very slow but it runs fast in Matlab.

How can I improve my Python code? I want to work on the Python version of my code.

In the code I am trying to create an MxM matrix (M can vary between 100 to 1000) and trying to fill this matrix 16 times, each time I add the previous value of M[i,j] with the current value. The operation inside the code is simple just some add and divide, and now I wonder why Matlab performs better than Python.

Python:

import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
startTime = datetime.now()
bb = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
bb = ['{0:04b}'.format(x) for x in bb]
step_prob=0.01
step = int(1/step_prob)
out_t = np.zeros((step,step), dtype=float)
out_f = np.zeros((step,step), dtype=float)
s = np.zeros((step,step), dtype=float)
improve = np.empty((step,step))
for ii in range(1, step+1):
 print ii
 pti=ii*step_prob
 for jj in range(1, step+1):
 pfi=jj*step_prob
 improve[ii-1,jj-1]=0
 si=(1-pfi+pti)/2
 for k in range(len(bb)):
 f1 = int(bb[k][0])
 f2 = int(bb[k][1]) 
 f3 = int(bb[k][2])
 f4 = int(bb[k][3])
 for i in range(1, step+1):
 for j in range(1, step+1):
 ptj=i*step_prob
 pfj=j*step_prob
 out_t[i-1,j-1] = f1*pti*ptj+f2*pti*(1-ptj)+f3*(1-pti)*ptj+f4*(1-pti)*(1-ptj)
 out_f[i-1,j-1] = f1*pfi*pfj+f2*pfi*(1-pfj)+f3*(1-pfi)*pfj+f4*(1-pfi)*(1-pfj)
 sj=(1-pfj+ptj)/2;
 s[i-1,j-1] = ((1-out_f[i-1,j-1]+out_t[i-1,j-1])/2)-max(si,sj);
# temp=s*(s>0)*tril(ones(size(s)));
 temp = s*(s>0)*np.tril(np.ones((len(s),len(s))).astype(int))
 improve[ii-1,jj-1]=improve[ii-1,jj-1]+sum(sum(temp))
im = plt.imshow(improve*np.tril(np.ones((len(improve),len(improve))).astype(int)),interpolation='bilinear', aspect='auto') # s*[s>0]
# im = plt.imshow(improve, interpolation='bilinear', aspect='auto') # s*[s>0]
plt.colorbar(im, orientation='vertical')
plt.show()

Matlab:

close all
clear all
clc
bb = de2bi([0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15],4,'left-msb');
% bb = de2bi([1],4);
step_prob=0.01;
for ii=1:1*(1/step_prob)
 ii
 pti=ii*step_prob;
 for jj=1:1*(1/step_prob)
 pfi=jj*step_prob;
 improve(ii,jj)=0;
 si=(1-pfi+pti)/2;
 for k=1:size(bb,1)
 f1=bb(k,1);
 f2=bb(k,2);
 f3=bb(k,3);
 f4=bb(k,4);
 for i=1:1*(1/step_prob)
 for j=1:1*(1/step_prob)
 ptj=i*step_prob;
 pfj=j*step_prob;
 out_t(i,j)=f1*pti*ptj+f2*pti*(1-ptj)+f3*(1-pti)*ptj+f4*(1-pti)*(1-ptj);
 out_f(i,j)=f1*pfi*pfj+f2*pfi*(1-pfj)+f3*(1-pfi)*pfj+f4*(1-pfi)*(1-pfj);
 sj=(1-pfj+ptj)/2;
 s(i,j)=((1-out_f(i,j)+out_t(i,j))/2)-max(si,sj);
 end
 end
 temp=s.*(s>0).*tril(ones(size(s)));
 improve(ii,jj)=improve(ii,jj)+sum(sum(temp));
 clear f1 f2 f3 f4
 end
 end
end
scale=step_prob:step_prob:1;
imagesc(scale,scale,improve.*tril(ones(size(improve))))
xlabel('Pfj')
ylabel('Ptj')
axis xy

Question 2

Your title should state what your code is for, not that it is slow! What is it's purpose?

Question 3

To make life easier for reviewers, please add sufficient context to your question. The more you tell us about what your code does and what the purpose of doing that is, the easier it will be for reviewers to help you.

Question 4

Coincidentally, I have the exact same problem, except the speeds of the two languages are switched for me...

Question 5

You use very short variable names, so its hard to follow what your code is doing.

For speed in numpy, you need to use vector operations. So instead of writing

for ii in range(step):
 out[ii] = ii * x + b

You should write:

out = numpy.arange(step)*x + b

Individually accessing the elements of numpy array is actually quite slow. You need to write your code to do everything in vector operations.

Here is my rewrite of your inner loop to be vectorized. You could vectorize the outer loop as well, but this should make the biggest difference.

 f1 = int(bb[k][0])
 f2 = int(bb[k][1]) 
 f3 = int(bb[k][2])
 f4 = int(bb[k][3])
 ptj = np.arange(step_prob, 1.0 + step_prob, step_prob)[:, None]
 pfj = np.arange(step_prob, 1.0 + step_prob, step_prob)[None, :]
 out_t = f1*pti*ptj+f2*pti*(1-ptj)+f3*(1-pti)*ptj+f4*(1-pti)*(1-ptj)
 out_f = f1*pfi*pfj+f2*pfi*(1-pfj)+f3*(1-pfi)*pfj+f4*(1-pfi)*(1-pfj)
 sj = (1 - pfj + ptj) / 2
 the_max = np.maximum(sj, [si])
 s = ((1-out_f+out_t)/2)-the_max
 temp = s*(s>0)*np.tril(np.ones((len(s),len(s))).astype(int))
 improve[ii-1,jj-1]=improve[ii-1,jj-1]+temp.sum()

Question 6

it's kinda impossible for me to do this, since I need to compute some "expected value" individually for different threshold. anyway I don't understand why this way of coding works find in Matlab but not in Numpy

Question 7

@Am1rr3zA, Python and Matlab are different languages. In Python, manual loops like you've written are rather slow. They used to be slow in Matlab too, but the implementation was improved. Furthermore, I doubt its actually impossible to vectorize, but since I have no idea what you are doing in this code it's hard to give you suggestions.

Question 8

If I want to explain my code in a simple word I am trying to fill a MxM matrix (improve) each value of matrix is computed as follows: improve[i,j] is sum of 16 times (for all the element in bb array) calculating my formula ( out_t(i,j) and out_f(i,j) ), and my formula try to compute expected value of pti and pfi (outter loop) for all possible range of ptj and pfj (inner loop).

Question 9

@Am1rr3zA, the third parameter to linspace is a count, not a step size. If you want to use linspace, it should be np.linspace(0, 1, steps). But that's not quite what you were doing because your code skipped zero. You do need the [:, None] to make it work. I may well have not gotten it quite right, but hopefully it helps you see what you could do to speed up the poythn.

Question 10

@Am1rr3zA, as I noted, I may not have gotten it quite correct, but the general principle should work. If you fix my arange to refer to step_prob instead of step, I get the correct results at least for the first iteration.

Question 11

My observations, in terms of implementing this in Python (without regards to improve the algorithm, that will be later):

You are doing a LOT of unnecessary type conversions in Python that you aren't doing in MATLAB. Type conversions are always slow, so this is going to hurt your performance a lot.
You shouldn't do for k in range(len(bb)):, always iterate over the sequence directly. You can use unpacking to get the values without needing to index in this case.
You are iterating over range(1, foo+1), then subtracting one to put it back into Python 0-based indexing. This is an unnecessary mathematical operation, better to do range(foo), or in your case enumerate(range(1, foo+1)). Better yet, pre-compute your values, then enumerate over those.
numpy.unpackbits is the equivalent of de2bi and will be much faster.
Python has in-place operations, which will be faster. So a += 1 instead of a = a + 1.
Numpy sums work of flattened arrays by default. Further, summing is a method of numpy arrays, which is simpler and probably faster.
Numpy arrays are floats by default if created using ones or zeros, but there is a dtype argument that can set it to whatever you want.
Numpy has a ones_like and zeros_like to create an array of the same shape as another. By default it also creates one with the same dtype, but you can override that.
PEP8 is the recommended style for python code.

Now, in terms of improving the algorithm:

si and sj are not very big (on the order of tens of megabytes, tops). You can vectorize those and re-use the values. In fact, they are identical, so you only need one.
You not only never use the other values of out_t and out_f besides the current one, you overwrite them repeatedly. So these are better off as scalars.
You can vectorize many of the values for the two innermost loops.
If you put the third loop as the outer loop, you can vectorize out_t and out_f.

So here is my partially-vectorized version:

bb0 = np.arange(16, dtype='uint8')
bb = np.unpackbits(bb0[None], axis=0)[-4:, :].T
step_prob = 0.01
step = int(1/step_prob)
pxx = np.linspace(step_prob, 1, step)
pxxt = pxx[:, None]
sx = (1-pxx+pxxt)/2
improve = np.zeros((step,step))
for f4, f3, f2, f1 in bb:
 out_x = f1*pxx*pxxt + f2*pxx*(1-pxxt) + f3*(1-pxx)*pxxt + f4*(1-pxx)*(1-pxxt)
 for ii, (out_t, sxi) in enumerate(izip(out_x, sx)):
 for jj, (out_f, si) in enumerate(izip(out_x, sxi)):
 s = (1-out_f[:, None]+out_t) - np.maximum(si, sxi)
 temp = s*(s>0)*np.tril(np.ones_like(s))
 improve[ii, jj] = temp.sum()

It is orders of magnitude faster than your version.

Question 12

your points all are on point

score 9 · Answer 1 · 2014-04-29 23:59:24Z

9

\$\begingroup\$

You use very short variable names, so its hard to follow what your code is doing.

For speed in numpy, you need to use vector operations. So instead of writing

for ii in range(step):
 out[ii] = ii * x + b

You should write:

out = numpy.arange(step)*x + b

Individually accessing the elements of numpy array is actually quite slow. You need to write your code to do everything in vector operations.

Here is my rewrite of your inner loop to be vectorized. You could vectorize the outer loop as well, but this should make the biggest difference.

 f1 = int(bb[k][0])
 f2 = int(bb[k][1]) 
 f3 = int(bb[k][2])
 f4 = int(bb[k][3])
 ptj = np.arange(step_prob, 1.0 + step_prob, step_prob)[:, None]
 pfj = np.arange(step_prob, 1.0 + step_prob, step_prob)[None, :]
 out_t = f1*pti*ptj+f2*pti*(1-ptj)+f3*(1-pti)*ptj+f4*(1-pti)*(1-ptj)
 out_f = f1*pfi*pfj+f2*pfi*(1-pfj)+f3*(1-pfi)*pfj+f4*(1-pfi)*(1-pfj)
 sj = (1 - pfj + ptj) / 2
 the_max = np.maximum(sj, [si])
 s = ((1-out_f+out_t)/2)-the_max
 temp = s*(s>0)*np.tril(np.ones((len(s),len(s))).astype(int))
 improve[ii-1,jj-1]=improve[ii-1,jj-1]+temp.sum()

Share

edited May 1, 2014 at 21:22

answered Apr 29, 2014 at 23:59

Winston Ewert's user avatar

Winston Ewert Winston EwertWinston Ewert

30.7k4 gold badges52 silver badges79 bronze badges

\$\endgroup\$

10

\$\begingroup\$ it's kinda impossible for me to do this, since I need to compute some "expected value" individually for different threshold. anyway I don't understand why this way of coding works find in Matlab but not in Numpy \$\endgroup\$

Am1rr3zA
– Am1rr3zA

2014年04月30日 00:41:46 +00:00
Commented Apr 30, 2014 at 0:41
5

\$\begingroup\$ @Am1rr3zA, Python and Matlab are different languages. In Python, manual loops like you've written are rather slow. They used to be slow in Matlab too, but the implementation was improved. Furthermore, I doubt its actually impossible to vectorize, but since I have no idea what you are doing in this code it's hard to give you suggestions. \$\endgroup\$

Winston Ewert
– Winston Ewert

2014年04月30日 00:52:21 +00:00
Commented Apr 30, 2014 at 0:52
\$\begingroup\$ If I want to explain my code in a simple word I am trying to fill a MxM matrix (improve) each value of matrix is computed as follows: improve[i,j] is sum of 16 times (for all the element in bb array) calculating my formula ( out_t(i,j) and out_f(i,j) ), and my formula try to compute expected value of pti and pfi (outter loop) for all possible range of ptj and pfj (inner loop). \$\endgroup\$

Am1rr3zA
– Am1rr3zA

2014年04月30日 01:02:59 +00:00
Commented Apr 30, 2014 at 1:02
3

\$\begingroup\$ @Am1rr3zA, the third parameter to linspace is a count, not a step size. If you want to use linspace, it should be np.linspace(0, 1, steps). But that's not quite what you were doing because your code skipped zero. You do need the [:, None] to make it work. I may well have not gotten it quite right, but hopefully it helps you see what you could do to speed up the poythn. \$\endgroup\$

Winston Ewert
– Winston Ewert

2014年04月30日 02:59:17 +00:00
Commented Apr 30, 2014 at 2:59
1

\$\begingroup\$ @Am1rr3zA, as I noted, I may not have gotten it quite correct, but the general principle should work. If you fix my arange to refer to step_prob instead of step, I get the correct results at least for the first iteration. \$\endgroup\$

Winston Ewert
– Winston Ewert

2014年05月01日 21:23:00 +00:00
Commented May 1, 2014 at 21:23

| Show 5 more comments

TheBlackCat TheBlackCatTheBlackCat 2,44412 silver badges10 bronze badges · Answer 2 · 2016-05-06 19:13:47Z

My observations, in terms of implementing this in Python (without regards to improve the algorithm, that will be later):

You are doing a LOT of unnecessary type conversions in Python that you aren't doing in MATLAB. Type conversions are always slow, so this is going to hurt your performance a lot.
You shouldn't do for k in range(len(bb)):, always iterate over the sequence directly. You can use unpacking to get the values without needing to index in this case.
You are iterating over range(1, foo+1), then subtracting one to put it back into Python 0-based indexing. This is an unnecessary mathematical operation, better to do range(foo), or in your case enumerate(range(1, foo+1)). Better yet, pre-compute your values, then enumerate over those.
numpy.unpackbits is the equivalent of de2bi and will be much faster.
Python has in-place operations, which will be faster. So a += 1 instead of a = a + 1.
Numpy sums work of flattened arrays by default. Further, summing is a method of numpy arrays, which is simpler and probably faster.
Numpy arrays are floats by default if created using ones or zeros, but there is a dtype argument that can set it to whatever you want.
Numpy has a ones_like and zeros_like to create an array of the same shape as another. By default it also creates one with the same dtype, but you can override that.
PEP8 is the recommended style for python code.

Now, in terms of improving the algorithm:

si and sj are not very big (on the order of tens of megabytes, tops). You can vectorize those and re-use the values. In fact, they are identical, so you only need one.
You not only never use the other values of out_t and out_f besides the current one, you overwrite them repeatedly. So these are better off as scalars.
You can vectorize many of the values for the two innermost loops.
If you put the third loop as the outer loop, you can vectorize out_t and out_f.

So here is my partially-vectorized version:

bb0 = np.arange(16, dtype='uint8')
bb = np.unpackbits(bb0[None], axis=0)[-4:, :].T
step_prob = 0.01
step = int(1/step_prob)
pxx = np.linspace(step_prob, 1, step)
pxxt = pxx[:, None]
sx = (1-pxx+pxxt)/2
improve = np.zeros((step,step))
for f4, f3, f2, f1 in bb:
 out_x = f1*pxx*pxxt + f2*pxx*(1-pxxt) + f3*(1-pxx)*pxxt + f4*(1-pxx)*(1-pxxt)
 for ii, (out_t, sxi) in enumerate(izip(out_x, sx)):
 for jj, (out_f, si) in enumerate(izip(out_x, sxi)):
 s = (1-out_f[:, None]+out_t) - np.maximum(si, sxi)
 temp = s*(s>0)*np.tril(np.ones_like(s))
 improve[ii, jj] = temp.sum()

It is orders of magnitude faster than your version.

\$\begingroup\$ your points all are on point \$\endgroup\$

Am1rr3zA
– Am1rr3zA

2016年05月09日 14:35:02 +00:00
Commented May 9, 2016 at 14:35

Stack Exchange Network

Creation of MxM matrix: Python version much slower than Matlab version

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Creation of MxM matrix: Python version much slower than Matlab version

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions