
I'm a bit new to working with Numba, but I've got the gist of it. I wonder if there are any more advanced tricks to make four nested for loops even faster than what I have now. In particular, I need to calculate the following integral:

$$ G_B(\mathbf X, T) = \Lambda \int_\Omega G(\mathbf X, \mathbf X', T) W(\mathbf X', T) \ d\mathbf X' \\ G(\mathbf X, \mathbf X', T) = \frac{1}{2\pi S_0^2} \exp\left[-\frac{\left| \mathbf X -\mathbf X'\right|^2}{2[S_0(1+EB(\mathbf X, T))]^2}\right] $$

Where \$B\$ is a 2D array, and \$S_0\$ and \$E\$ are certain parameters. My code is the following:

import numpy as np
from numba import njit, double

def calc_gb_gauss_2d(b, s0, e, dx):
    n, m = b.shape
    norm = 1.0 / (2 * np.pi * s0**2)
    gb = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            sigma = 2.0 * (s0 * (1.0 + e * b[i, j]))**2
            for ii in range(n):
                for jj in range(m):
                    gb[i, j] += np.exp(-(((i - ii) * dx)**2 + ((j - jj) * dx)**2) / sigma)
            gb[i, j] *= norm
    return gb

calc_gb_gauss_2d_nb = njit(double[:, :](double[:, :], double, double, double))(calc_gb_gauss_2d)

For an input array of size 256×ばつ256, the calculation speed is:

In [4]: a=random.random((256,256))
In [5]: %timeit calc_gb_gauss_2d_nb(a,0.1,1.0,0.5)
The slowest run took 8.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 1min 1s per loop

Comparison between pure Python and Numba calculation speed gives me this picture: [Comparative performance plot]

Is there any way to optimize my code for better performance?

asked May 9, 2018 at 15:34

1 Answer


When using njit as a decorator you have access to multiple options that you might want to try, such as:

  • fastmath
  • parallel
  • nogil

I will focus only on fastmath and parallel; someone more competent than me should cover the nogil argument.

Fastmath

This option lets you relax some floating-point operations (trading a little precision) for shorter computation time. It can be set to True or False, but you can also be more selective and pass only the specific relaxations you want.
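
A minimal sketch of both forms, applied to the question's function (the particular flag set below is only an illustration of the selective form; see the Numba documentation for the full list of LLVM fast-math flags):

from numba import njit

# Relax all floating-point strictness:
calc_fast = njit(fastmath=True)(calc_gb_gauss_2d)

# Or pass a set of specific LLVM fast-math flags, e.g. allow
# reassociation and fused multiply-adds only:
calc_picky = njit(fastmath={'reassoc', 'contract'})(calc_gb_gauss_2d)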

Parallel

It doesn't offer the flexibility of conventional parallel frameworks such as OpenMP/MPI, but it supports the prange directive. You have to explicitly choose which loop to parallelize, and you may need to modify the code slightly, for example by introducing a scalar accumulator for the reduction that stays "private" to each thread, as in the sketch below.
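
Here is one way the question's function could be adapted (a sketch under the assumption that the outer loop carries enough work per iteration; calc_gb_gauss_2d_par is a name made up for illustration):

import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def calc_gb_gauss_2d_par(b, s0, e, dx):
    n, m = b.shape
    norm = 1.0 / (2 * np.pi * s0**2)
    gb = np.empty((n, m))
    for i in prange(n):  # outer loop is split across threads
        for j in range(m):
            sigma = 2.0 * (s0 * (1.0 + e * b[i, j]))**2
            acc = 0.0  # scalar accumulator, local to this (i, j)
            for ii in range(n):
                for jj in range(m):
                    acc += np.exp(-(((i - ii) * dx)**2 + ((j - jj) * dx)**2) / sigma)
            gb[i, j] = acc * norm
    return gb

Each thread writes to its own rows of gb, so no synchronization is needed beyond what prange provides.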

In addition to timing, you should also test your integration scheme on cases where you know the exact solution, or at least a good approximation.
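
For example, a quick consistency check of the parallel sketch against the question's pure-Python version on a small input (the tolerance is a judgment call; loosen it if fastmath introduces drift):

import numpy as np

a = np.random.random((32, 32))  # small case, so the O(n^2 * m^2) loop stays cheap
expected = calc_gb_gauss_2d(a, 0.1, 1.0, 0.5)    # pure-Python reference
result = calc_gb_gauss_2d_par(a, 0.1, 1.0, 0.5)  # parallel/fastmath version
assert np.allclose(result, expected, rtol=1e-10)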

answered Feb 21, 2024 at 21:52
