I'm a bit new to working with Numba, but I've got the gist of it. I wonder if there are any more advanced tricks to make four nested for loops even faster than what I have now. In particular, I need to calculate the following integral:
$$ G_B(\mathbf X, T) = \Lambda \int_\Omega G(\mathbf X, \mathbf X', T) W(\mathbf X', T) \ d\mathbf X' \\ G(\mathbf X, \mathbf X', T) = \frac{1}{2\pi S_0^2} \exp\left[-\frac{\left| \mathbf X -\mathbf X'\right|^2}{2[S_0(1+EB(\mathbf X, T))]^2}\right] $$
Where \$B\$ is a 2D array, and \$S_0\$ and \$E\$ are certain parameters. My code is the following:
```python
import numpy as np
from numba import njit, double

def calc_gb_gauss_2d(b, s0, e, dx):
    n, m = b.shape
    norm = 1.0/(2*np.pi*s0**2)
    gb = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            # Gaussian width at this point, broadened by the local field b[i, j]
            sigma = 2.0*(s0*(1.0 + e*b[i, j]))**2
            for ii in range(n):
                for jj in range(m):
                    gb[i, j] += np.exp(-(((i - ii)*dx)**2 + ((j - jj)*dx)**2)/sigma)
            gb[i, j] *= norm
    return gb

calc_gb_gauss_2d_nb = njit(double[:, :](double[:, :], double, double, double))(calc_gb_gauss_2d)
```
For an input array of size 256×256 the calculation speed is:
```python
In [4]: a = np.random.random((256, 256))

In [5]: %timeit calc_gb_gauss_2d_nb(a, 0.1, 1.0, 0.5)
The slowest run took 8.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 1min 1s per loop
```
Comparison between pure Python and Numba calculation speed gives me this picture:

[Comparative performance plot]
Is there any way to optimize my code for better performance?
1 Answer
When using `njit` as a decorator you have access to multiple options that you might want to try, such as:

- `fastmath`
- `parallel`
- `nogil`

I will only focus on `fastmath` and `parallel`. Someone more competent will have to deal with the `nogil` argument.
Fastmath

This option lets you relax some floating-point operations (trading a little precision) for shorter computation time. It can simply be set to `True`/`False`, but you can also be more selective about which unsafe optimizations you want to enable.
Parallel

This doesn't give you the freedom of conventional parallel frameworks such as OpenMP/MPI, but it supports the `prange` directive. You have to explicitly choose the loop you want to parallelize. You might also need to modify the code slightly - for example, introducing a reduction variable that is "private" to each thread/process.
In addition to that, you should not only measure times, but also test your integration scheme on cases where you know the exact solution, or at least a good approximation of it.
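For instance, with \$B \equiv 0\$ the kernel reduces to a fixed normalized Gaussian, so the Riemann sum multiplied by the quadrature weight `dx**2` should approach 1 at the grid centre when the domain is much wider than \$S_0\$. A plain-NumPy sketch of such a check (function name and parameters are mine):

```python
import numpy as np

def gb_centre_b0(n, dx, s0):
    # Discretized blur at the grid centre for B == 0:
    # norm * sum over all grid points of the plain Gaussian kernel.
    x = (np.arange(n) - n//2)*dx
    xx, yy = np.meshgrid(x, x, indexing="ij")
    norm = 1.0/(2*np.pi*s0**2)
    return norm*np.exp(-(xx**2 + yy**2)/(2.0*s0**2)).sum()

# Multiplied by dx**2 this approximates the integral of a normalized
# 2-D Gaussian over the plane, which is exactly 1.
val = gb_centre_b0(101, 0.05, 0.5)*0.05**2
```

Here the grid spans \$\pm 2.5 = 5 S_0\$, so the truncation error is negligible and `val` should agree with 1 to several digits.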