
I know numba creates some overhead, and in some situations (non-intensive computation) it becomes slower than pure Python. But what I don't know is where to draw the line. Is it possible to use the order of algorithmic complexity to figure out where?

For example, for adding two arrays (~O(n)) shorter than 5 elements, pure Python is faster in this code:

import numpy as np
import numba

def sum_1(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@numba.jit('float64[:](float64[:],float64[:])')
def sum_2(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result
# try 100
a = np.linspace(1.0,2.0,5)
b = np.linspace(1.0,2.0,5)
print("pure python: ")
%timeit -o sum_1(a,b)
print("\n\n\n\npython + numba: ")
%timeit -o sum_2(a,b)

UPDATE: what I am looking for is a guideline similar to this one:

"A general guideline is to choose different targets for different data sizes and algorithms. The "cpu" target works well for small data sizes (approx. less than 1KB) and low compute intensity algorithms. It has the least amount of overhead. The "parallel" target works well for medium data sizes (approx. less than 1MB). Threading adds a small delay. The "cuda" target works well for big data sizes (approx. greater than 1MB) and high compute intensity algorithms. Transfering memory to and from the GPU adds significant overhead."

asked Mar 29, 2019 at 12:18
4 Comments
  • When measuring numba's performance you need to call the function once before measuring, because during the first call the function is compiled, which takes a lot of time. Also, you may want to try @numba.autojit. Commented Mar 29, 2019 at 13:02
  • On my machine python: 3.31 µs, numba: 589 ns, not a huge improvement though. As for your question, I really think this is not related to the complexity; it will probably depend on the kind of operations you are doing. On the other hand, you can still plot a python/numba comparison to see where the shift happens for a given function. I've done a few tests on a toy example too (adding one to all elements in an array), and numba is always faster than python, and similar to cython. Commented Mar 29, 2019 at 13:08
  • @cglacet thanks for replying. Yes, I run it before the benchmark (it is just-in-time compilation, after all :) ). About your second comment: @numba.autojit does not reproduce what you got; for arrays of length 5 their running times were almost the same. Maybe I did not put it in the right place. I tried this: @numba.jit('float64[:](float64[:],float64[:])') Commented Mar 29, 2019 at 13:19
  • Do you really want to declare non-contiguous arrays explicitly, while providing contiguous arrays for testing? Also, declaring an array as output is odd, since result is a scalar. For this type of problem you would normally let Numba determine the declarations and just add a fastmath flag, @numba.njit(fastmath=True) (which makes SIMD vectorization possible), and maybe cache=True for caching the compiled function. With this corrected, the function should run faster under all circumstances than the numpy_methods method by @MSeifert. Commented Apr 1, 2019 at 8:08

3 Answers


It's hard to draw the line where numba becomes effective. However, there are a few indicators when it might not be effective:

  • If you cannot use jit with nopython=True - whenever you cannot compile it in nopython mode, you are either trying to compile too much or it won't be significantly faster.

  • If you don't use arrays - when you pass lists or other non-array types to the numba function (except from other numba functions), numba needs to copy these, which incurs a significant overhead (see the sketch after this list).

  • If there is already a NumPy or SciPy function that does it - even if numba can be significantly faster for short arrays, the NumPy/SciPy function will almost always be as fast for longer arrays (and you might easily neglect some common edge cases that these would handle).
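A minimal sketch of the list-copying point, assuming a simple summation function (exact timings vary by machine, and passing reflected lists is deprecated in newer numba versions):

import numpy as np
import numba

@numba.njit
def summed(seq):
    total = 0.0
    for x in seq:
        total += x
    return total

arr = np.arange(10**5, dtype=np.float64)
lst = arr.tolist()

summed(arr)  # warm-up: compiles the array specialization
summed(lst)  # warm-up: compiles the (reflected) list specialization

%timeit summed(arr)  # operates directly on the array buffer
%timeit summed(lst)  # the list is copied on every call - much slower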

There's also another reason why you might not want to use numba in cases where it's just "a bit" faster than other solutions: numba functions have to be compiled, either ahead of time or when first called, and in some situations the compilation will cost far more than your gain, even if you call the function hundreds of times. The compilation times also add up: numba is slow to import, and compiling the numba functions adds further overhead. It doesn't make sense to shave off a few milliseconds if the import overhead increases by 1-10 seconds.
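A minimal sketch showing that first-call cost (absolute numbers depend on the machine and numba version):

import time
import numpy as np
import numba

@numba.njit
def add_all(a, b):
    result = 0.0
    for i in range(a.shape[0]):
        result += a[i] + b[i]
    return result

a = b = np.zeros(10)

start = time.perf_counter()
add_all(a, b)  # first call: type inference + LLVM compilation
print("first call :", time.perf_counter() - start)

start = time.perf_counter()
add_all(a, b)  # later calls: run the already-compiled machine code
print("second call:", time.perf_counter() - start)

Passing cache=True to njit stores the compiled code on disk, so the compilation cost is paid only once per signature, but the import overhead remains.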

Also, numba is complicated to install (without conda, at least), so if you want to share your code it becomes a really "heavy" dependency.


Your example lacks a comparison with NumPy methods and with a highly optimized pure-Python version. I added some more comparison functions and did a benchmark (using my library simple_benchmark):

import numpy as np
import numba as nb
from itertools import chain

def python_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit
def numba_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

def numpy_methods(a, b):
    return a.sum() + b.sum()

def python_sum(a, b):
    return sum(chain(a.tolist(), b.tolist()))

from simple_benchmark import benchmark, MultiArgument

arguments = {
    2**i: MultiArgument([np.zeros(2**i), np.zeros(2**i)])
    for i in range(2, 17)
}
b = benchmark([python_loop, numba_loop, numpy_methods, python_sum],
              arguments, warmups=[numba_loop])

%matplotlib notebook
b.plot()

(benchmark plot: runtime versus array size for each of the four functions)

Yes, the numba function is fastest for small arrays; however, the NumPy solution will be slightly faster for longer arrays. The Python solutions are slower, but the "faster" alternative (python_sum) is already significantly faster than your originally proposed solution.

In this case I would simply use the NumPy solution because it's short, readable and fast, except when you're dealing with lots of short arrays and call the function a lot of times - then the numba solution would be significantly better.

answered Mar 31, 2019 at 15:15 by MSeifert

2 Comments

Great answer! Why is Numba faster than NumPy for shorter arrays?
@Dahn NumPy functions tend to have more overhead, which is noticeable when using them on short arrays. As you can see in the graph, the NumPy function takes roughly the same time for 1-element arrays as for 100-element arrays. This overhead depends on the exact NumPy function, but it's typically checking/converting input parameters, covering edge cases, and handling array subclasses (or classes that can be used like arrays). But to be plain: it rarely matters, because it's still really fast even for short arrays.
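A quick way to see that roughly constant per-call overhead yourself (a sketch; the absolute numbers don't matter):

import numpy as np

tiny = np.zeros(1)
small = np.zeros(100)

# Both calls are dominated by NumPy's fixed dispatch/validation cost,
# so they should take roughly the same time:
%timeit tiny.sum()
%timeit small.sum()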

If you do not know exactly what the consequences of explicit input and output declarations are, let numba decide them. With your inputs you may want to use 'float64(float64[::1],float64[::1])' (scalar output, contiguous input arrays). If you call the explicitly declared function with strided inputs, it will fail; if you let Numba do the job, it will simply recompile. Without fastmath=True it is also not possible to use SIMD, because vectorizing the reduction changes the precision of the result.
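A minimal sketch of the two declaration styles described above (the function body is just the summation from the question):

import numpy as np
import numba as nb

# Explicit: scalar output, C-contiguous 1d inputs. Calling it with
# strided arrays raises a TypeError (no matching definition).
@nb.njit('float64(float64[::1], float64[::1])', fastmath=True)
def sum_explicit(a, b):
    result = 0.0
    for i in range(min(a.shape[0], b.shape[0])):
        result += a[i] + b[i]
    return result

# Lazy: numba infers the types and compiles one specialization per
# argument layout, so strided inputs simply trigger a recompile.
@nb.njit(fastmath=True)
def sum_lazy(a, b):
    result = 0.0
    for i in range(min(a.shape[0], b.shape[0])):
        result += a[i] + b[i]
    return result

a = np.linspace(1.0, 2.0, 10)
print(sum_explicit(a, a), sum_lazy(a[::2], a[::2]))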

Calculating at least 4 partial sums (one 256-bit vector) and then summing those partial sums is preferable here (NumPy also doesn't calculate a naive sum), as sketched below.
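A hand-rolled sketch of that idea (with fastmath=True, LLVM can keep the four independent accumulators in one 256-bit register; the remainder loop handles lengths not divisible by 4):

import numba as nb

@nb.njit(fastmath=True)
def sum_partial(a, b):
    n = min(a.shape[0], b.shape[0])
    s0 = 0.0
    s1 = 0.0
    s2 = 0.0
    s3 = 0.0
    trunc = n - n % 4
    for i in range(0, trunc, 4):  # four independent partial sums
        s0 += a[i] + b[i]
        s1 += a[i + 1] + b[i + 1]
        s2 += a[i + 2] + b[i + 2]
        s3 += a[i + 3] + b[i + 3]
    result = (s0 + s1) + (s2 + s3)
    for i in range(trunc, n):  # scalar remainder
        result += a[i] + b[i]
    return result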

Example using MSeifert's excellent benchmark utility:

import numpy as np
import numba as nb
from itertools import chain

def python_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit
def numba_loop_zip(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

# Your version with suboptimal input and output declaration
# (prevents njit compilation)
@nb.jit('float64[:](float64[:],float64[:])')
def numba_your_func(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit(fastmath=True)
def numba_loop_zip_fastmath(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit(fastmath=True)
def numba_loop_fastmath_single(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    for i in range(size):
        result += a[i] + b[i]
    return result

@nb.njit(fastmath=True, parallel=True)
def numba_loop_fastmath_multi(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    for i in nb.prange(size):
        result += a[i] + b[i]
    return result

# just for fun... single-threaded for small arrays,
# multithreaded for larger arrays
@nb.njit(fastmath=True, parallel=True)
def numba_loop_fastmath_combined(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    if size > 2 * 10**4:
        result = numba_loop_fastmath_multi(a, b)
    else:
        result = numba_loop_fastmath_single(a, b)
    return result

def numpy_methods(a, b):
    return a.sum() + b.sum()

def python_sum(a, b):
    return sum(chain(a.tolist(), b.tolist()))

from simple_benchmark import benchmark, MultiArgument

arguments = {
    2**i: MultiArgument([np.zeros(2**i), np.zeros(2**i)])
    for i in range(2, 19)
}
funcs = [python_loop, numba_loop_zip, numpy_methods, numba_your_func,
         python_sum, numba_loop_zip_fastmath, numba_loop_fastmath_single,
         numba_loop_fastmath_multi, numba_loop_fastmath_combined]
warmups = [numba_loop_zip, numba_loop_zip_fastmath, numba_your_func,
           numba_loop_fastmath_single, numba_loop_fastmath_multi,
           numba_loop_fastmath_combined]
b = benchmark(funcs, arguments, warmups=warmups)

%matplotlib notebook
b.plot()

Results

Please note that using numba_loop_fastmath_multi or numba_loop_fastmath_combined(a,b) is recommended only in some special cases. More often, such a simple function is just one part of a larger problem that can be parallelized more efficiently (starting threads has some overhead).

answered Apr 1, 2019 at 11:40

1 Comment

Please help me to figure out this error. stackoverflow.com/questions/65629130/…

Running this code leads to a ~6x speedup on my machine:

import numba

@numba.autojit
def sum_2(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

Python: 3.31 μs, numba: 589 ns.
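(Note: numba.autojit has since been deprecated and removed; in current numba versions the same lazy, type-inferred compilation is a bare decorator. A minimal equivalent:)

import numba

@numba.njit  # types are inferred and compiled on the first call
def sum_2(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result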

As for your question, I really think this is not related to algorithmic complexity; it will probably depend mostly on the kind of operations you are doing. On the other hand, you can still plot a python/numba comparison to see where the shift happens for a given function, as sketched below.
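A minimal sketch of such a comparison plot (the function and sizes are illustrative):

import timeit
import numpy as np
import matplotlib.pyplot as plt
import numba

def sum_py(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += i + j
    return result

sum_nb = numba.njit(sum_py)
sum_nb(np.zeros(2), np.zeros(2))  # warm-up compile

sizes = [2**i for i in range(2, 16)]
t_py, t_nb = [], []
for n in sizes:
    a = np.random.rand(n)
    b = np.random.rand(n)
    t_py.append(min(timeit.repeat(lambda: sum_py(a, b), number=50, repeat=3)))
    t_nb.append(min(timeit.repeat(lambda: sum_nb(a, b), number=50, repeat=3)))

plt.loglog(sizes, t_py, label="pure python")
plt.loglog(sizes, t_nb, label="numba")
plt.xlabel("array length")
plt.ylabel("time for 50 calls [s]")
plt.legend()
plt.show()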

answered Mar 29, 2019 at 13:33

3 Comments

Dear @cglacet, thanks for the answer. I think what I am looking for is a guideline like the one in the next comment, but for operation intensity or something like that.
(The quote from numba.pydata.org/numba-doc/latest/user/vectorize.html reproduced in the question's UPDATE above.)
BTW, your answer was helpful, but let us see what else can be learned. Live long and prosper ;-)
