
I am writing a time-consuming program. To reduce the runtime, I have tried my best to use numpy.dot instead of for loops.

However, I found the vectorized version to have much worse performance than the for-loop version:

import numpy as np
import datetime

kpt_list = np.zeros((10000, 20), dtype='float')
rpt_list = np.zeros((1000, 20), dtype='float')
h_r = np.zeros((20, 20, 1000), dtype='complex')
r_ndegen = np.zeros(1000, dtype='float')
r_ndegen.fill(1)
# setup completed

# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end - start).total_seconds())
# the result is 19.302483

# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end - start).total_seconds())
# the result is 7.74583

What is happening here?

asked Feb 23, 2016 at 12:53
  • Shouldn't you be outputting the loopy version into a different variable than kpt_data? It might not affect the runtimes greatly, but it would make the comparison between the two versions clearer. Commented Feb 23, 2016 at 12:59
  • Doesn't change performance much when the variable changes though Commented Feb 23, 2016 at 13:16
  • I really can't find a reason for such a jump. In my opinion, numpy just does not support multi-dimensional arrays of dim > 2 using optimized code, but my best bet is that you should directly ask the numpy developers. Commented Feb 23, 2016 at 13:58
  • Have you tried this with actual values (not zeros and 1s)? Are the results exactly the same? Otherwise, I agree with @FlavianHautbois that the dot source for multidimensional arrays is completely different and not optimized from the 1D or 2D cases. Commented Feb 23, 2016 at 14:56
  • @daveydave400, yeah, I was using actual values and much larger array sizes in my program, and the for loops were still clearly faster. My opinion is that numpy.dot should have at least the same performance as for loops. This is not what I expected, and it will confuse me when I choose my algorithms. Commented Feb 23, 2016 at 15:23

2 Answers


The first thing I suggest you do is break your script down into separate functions to make profiling and debugging easier:

def setup(n1=10000, n2=1000, n3=20, seed=None):
    gen = np.random.RandomState(seed)
    kpt_list = gen.randn(n1, n3).astype(np.float64)
    rpt_list = gen.randn(n2, n3).astype(np.float64)
    h_r = (gen.randn(n3, n3, n2) + 1j * gen.randn(n3, n3, n2)).astype(np.complex128)
    r_ndegen = gen.randn(n2).astype(np.float64)
    return kpt_list, rpt_list, h_r, r_ndegen

def original_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)
    return kpt_data

def original_loop(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)
    return kpt_data

I would also highly recommend using random data rather than all-zero or all-one arrays, unless that's what your actual data looks like (!). This makes it much easier to check the correctness of your code - for example, if your last step is to multiply by a matrix of zeros then your output will always be all-zeros, regardless of whether or not there is a mistake earlier on in your code.


Next, I would run these functions through line_profiler to see where they are spending most of their time. In particular, for original_vec:

In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s

Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12

Line #      Hits         Time   Per Hit   % Time  Line Contents
==============================================================
    12                                            def original_vec(*args, **kwargs):
    13
    14         1        86498   86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15
    16         1        69700   69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947 1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0    93.7      kpt_data = h_r.dot(phase)
    19
    20         1            4      4.0      0.0      return kpt_data

You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We're computing a sum product over the last dimension of h_r and the first dimension of phase (you could write this in einsum notation as ijk,kl->ijl).
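As a quick sanity check of that interpretation, you can compare against np.einsum on a scaled-down problem (small sizes chosen so the slow >2D dot path finishes instantly):

import numpy as np

kpt_list, rpt_list, h_r, _ = setup(n1=100, n2=50, n3=4, seed=0)
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))      # shape (50, 100)

# ijk,kl->ijl: contract h_r's last axis with phase's first axis
assert np.allclose(h_r.dot(phase), np.einsum('ijk,kl->ijl', h_r, phase))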


The first two dimensions of h_r don't really matter here - we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement:

In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop
In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop

I'm not entirely sure why this should be the case - I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, suggesting that it might not be calling multithreaded BLAS routines.
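If you want to see what's happening on your own machine, you can inspect which BLAS implementation your numpy is linked against (the exact output depends on your installation; the 2D fast path only pays off with an optimized, multithreaded BLAS such as OpenBLAS or MKL):

import numpy as np

# prints the BLAS/LAPACK build configuration numpy was compiled with
np.show_config()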


Here's a vectorized version that incorporates the reshaping operation:

def new_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
    kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)
    return kpt_data.reshape(h_r.shape[:2] + (-1,))

The -1 indices tell numpy to infer the size of those dimensions according to the other dimensions and the number of elements in the array. I've also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.
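If the broadcasting step is unfamiliar, here's a quick check that dividing by the (1000, 1) column gives exactly the same result as the tiled version, without materializing the large intermediate array:

import numpy as np

kpt_list, rpt_list, h_r, r_ndegen = setup(seed=0)
num = np.exp(1j * np.dot(rpt_list, kpt_list.T))          # shape (1000, 10000)

# np.tile copies r_ndegen into a full (1000, 10000) array...
tiled = num / np.tile(r_ndegen.reshape(-1, 1), num.shape[1])
# ...whereas broadcasting stretches the (1000, 1) column with no copy
broadcast = num / r_ndegen[:, None]

assert np.allclose(tiled, broadcast)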

By using the same random input data, we can check that the new version gives the same result as the original:

In [4]: ans1 = original_loop(seed=0)
In [5]: ans2 = new_vec(seed=0) 
In [6]: np.allclose(ans1, ans2)
Out[6]: True

Some performance benchmarks:

In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop
In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop
In [9]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop

Update:

I was curious about why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be shockingly simple:

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and if so it immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible, and if so treat them as 2D and perform *gemm matrix-matrix multiplication on them. In fact there's an open feature request for this dating back to 2012, if any numpy devs are reading...

In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
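If you find yourself using the trick in several places, you could wrap it in a small helper. This is just a sketch (dot2d is a name I've made up, not a numpy function), and it assumes the second argument is 2D:

def dot2d(a, b):
    """Contract the last axis of `a` with the first axis of a 2D `b`,
    flattening `a`'s leading axes so numpy dispatches to BLAS *gemm."""
    a2 = a.reshape(-1, a.shape[-1])                # collapse leading axes
    return a2.dot(b).reshape(a.shape[:-1] + b.shape[1:])

# e.g. dot2d(h_r, phase) matches h_r.dot(phase), but runs ~20x faster
# in the benchmark above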


Update 2:

I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations:

In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
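As before, it's easy to check the equivalence on a scaled-down problem (small sizes so the slow h_r.dot(phase) path finishes quickly):

kpt_list, rpt_list, h_r, _ = setup(n1=100, n2=50, n3=4, seed=0)
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))

# axes=1 contracts h_r's last axis with phase's first axis, like dot
assert np.allclose(np.tensordot(h_r, phase, axes=1), h_r.dot(phase))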
answered Feb 23, 2016 at 18:43

4 Comments

Interesting find, that one! I was writing a lengthy answer with a hybrid (vectorized + loopy) approach for some minor improvement, but this is much better! Just to give it a bit of a further boost, numexpr could be incorporated to get phase: A = 1j * np.dot(rpt_list, kpt_list.T); phase = ne.evaluate("exp(A)") / r_ndegen[:, None] (see the sketch after these comments).
@ali_m, My opinion is that np.dot should have at least the same performance as for loops. Otherwise I have to check every implementation to get a good performance.
I agree that np.dot could do with improvement, but for me the take-home message is not to make unfounded assumptions about the efficiency of library functions. There will never be any guarantee that np.dot is faster than a for loop under all circumstances. If performance matters then you should get in the habit of profiling your code and testing different implementations.
Another point I forgot to mention - you could also use np.tensordot(h_r, phase, axes=1), which is just as fast as reshaping to 2D and using np.dot (it calls the same underlying BLAS routine), but avoids all the ugly reshaping operations.
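For completeness, the numexpr variant suggested in the first comment might look like this (a sketch assuming the numexpr package is installed; it parallelizes the expensive complex exponential):

import numexpr as ne
import numpy as np

kpt_list, rpt_list, h_r, r_ndegen = setup(seed=0)
A = 1j * np.dot(rpt_list, kpt_list.T)
# numexpr evaluates exp(A) multithreaded, in cache-friendly chunks
phase = ne.evaluate("exp(A)") / r_ndegen[:, None]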

I suspect the first operation is hitting a resource limit. Maybe you can benefit from these two questions: Efficient dot products of large memory-mapped arrays, and Dot product of huge arrays in numpy.

answered Feb 23, 2016 at 16:58

