I am new to this community, so please pardon me if I haven't provided information as intended. This code creates a custom model to be used with lmfit for curve fitting. `hc(q, radius, pd)` is the function. It is one of the simplest functions I will be using, but even this one takes considerably longer in Python than in Matlab. The `q` values, created here for the sake of simplicity, will be replaced by experimental values, and the model will be used to fit the intensity from 1D SAXS results. The vectorized version and the for-loop version differ in runtime by almost a factor of two; although the documentation suggests using a for loop, it gives slower results.
I am trying to learn Python, coming from Matlab. Here is a very simple piece of code for starting purposes:
```python
from numpy import vectorize
from scipy import integrate
from scipy.special import j1
from math import sqrt, exp, pi, log
import matplotlib.pyplot as plt
import numpy as np
from numpy import empty


def plot_hc(radius, pd):
    q = np.linspace(0.008, 1.0, num=500)
    y = hc(q, radius, pd)
    plt.loglog(q, y)
    plt.show()


def hc_formfactor(q, radius):
    y = (1.0 / q) * (radius * j1(q * radius))
    y = y ** 2
    return y


def g_distribution(z, radius, pd):
    return (1 / (sqrt(2 * pi) * pd)) * exp(
        -((z - radius) / (sqrt(2) * pd)) ** 2)


def ln_distribution(z, radius, pd):
    return (1 / (sqrt(2 * pi) * pd * z / radius)) * exp(
        -(log(z / radius) / (sqrt(2) * pd)) ** 2)


# Dist = 1 (for G_Distribution)
# Dist = 2 (for LN Distribution)
Dist = 1


@vectorize
def hc(x, radius, pd):
    global d
    if Dist == 1:
        nmpts = 4
        va = radius - nmpts * pd
        vb = radius + nmpts * pd
        if va < 0:
            va = 0
        d = integrate.quad(lambda z: g_distribution(z, radius, pd), va, vb)
    elif Dist == 2:
        nmpts = 4
        va = radius - nmpts * pd
        vb = radius + nmpts * pd
        if va < 0:
            va = 0
        d = integrate.quad(lambda z: ln_distribution(z, radius, pd), va, vb)
    else:
        d = 1

    def fun(z, x, radius, pd):
        if Dist == 1:
            return hc_formfactor(x, z) * g_distribution(z, radius, pd)
        elif Dist == 2:
            return hc_formfactor(x, z) * ln_distribution(z, radius, pd)
        else:
            return hc_formfactor(x, z)

    y = integrate.quad(lambda z: fun(z, x, radius, pd), va, vb)[0]
    return y / d[0]


if __name__ == '__main__':
    plot_hc(radius=40, pd=0.5)
```
As suggested in the documentation, I should use a for loop, but that reduced the speed even more. The version using a for loop is as follows:
```python
from numpy import vectorize
from scipy import integrate
from scipy.special import j1
from math import sqrt, exp, pi, log
import matplotlib.pyplot as plt
import numpy as np
from numpy import empty


def plot_hc(radius, pd):
    q = np.linspace(0.008, 1.0, num=500)
    y = hc(q, radius, pd)
    plt.loglog(q, y)
    plt.show()


def hc_formfactor(q, radius):
    y = (1.0 / q) * (radius * j1(q * radius))
    y = y ** 2
    return y


def g_distribution(z, radius, pd):
    return (1 / (sqrt(2 * pi) * pd)) * exp(
        -((z - radius) / (sqrt(2) * pd)) ** 2)


def ln_distribution(z, radius, pd):
    return (1 / (sqrt(2 * pi) * pd * z / radius)) * exp(
        -(log(z / radius) / (sqrt(2) * pd)) ** 2)


# Dist = 1 (for G_Distribution)
# Dist = 2 (for LN Distribution)
Dist = 1


def hc(q, radius, pd):
    if Dist == 1:
        nmpts = 4
        va = radius - nmpts * pd
        vb = radius + nmpts * pd
        if va < 0:
            va = 0
        d = integrate.quad(lambda z: g_distribution(z, radius, pd), va, vb)
    elif Dist == 2:
        nmpts = 4
        va = radius - nmpts * pd
        vb = radius + nmpts * pd
        if va < 0:
            va = 0
        d = integrate.quad(lambda z: ln_distribution(z, radius, pd), va, vb)
    else:
        d = 1

    def fun(z, q, radius, pd):
        if Dist == 1:
            return hc_formfactor(q, z) * g_distribution(z, radius, pd)
        elif Dist == 2:
            return hc_formfactor(q, z) * ln_distribution(z, radius, pd)
        else:
            return hc_formfactor(q, z)

    y = empty([len(q)])
    for n in range(len(q)):
        y[n] = integrate.quad(lambda z: fun(z, q[n], radius, pd), va, vb)[0]
    return y / d[0]


if __name__ == '__main__':
    plot_hc(radius=40, pd=0.5)
```
I don't understand what I should do to optimise this code. If I run the same program with the same values in Matlab, it is very fast, so I don't know what mistake I made here. Some have also suggested using JIT from Numba to speed up the integration, but I am not sure how to implement it.
- Welcome to Code Review! We need to know what the code is intended to achieve. To help reviewers give you better answers, please add sufficient context to your question, including a title that summarises the purpose of the code. We want to know why much more than how. The more you tell us about what your code is for, the easier it will be for reviewers to help you. The title needs an edit to simply state the task, rather than your concerns about the code. – Toby Speight, Feb 26, 2020 at 9:28
- @TobySpeight Thanks a lot for your reply and for correcting me. Being a beginner, it's always good to learn from the experts in the community. Thanks for the suggestions. I have edited the title as well as provided information in the text. Please let me know if I need to provide more information. – Shankar_Dutt, Feb 26, 2020 at 9:56
1 Answer
Your first approach doesn't run, because there's a scalar/vector mismatch.
Your second approach should be changed based on the following points:
- In a Numpy context, you should essentially never use the scalar `math` module; use `np` instead.
- `loglog` is not an appropriate choice for these data; use `semilogy` instead.
- Several of your arithmetic expressions can be simplified to remove redundant parentheses and roots; for example, both of your `sqrt(2)` factors can be squared away, becoming a coefficient of `0.5` in the exponent.
- Don't make `Dist` a global, and don't leave it as a bare integer: use an `enum`.
- The two `Dist` cases only need to vary in one thing: the distribution callable.
- Don't use lambdas. Most Scipy functions that accept function references also accept an arguments tuple.
- `integrate.quad_vec` is a faster drop-in replacement for `integrate.quad` for vector-valued functions.
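To see the last two points in isolation, here is a minimal sketch (the toy integrand and values are illustrative, not from the post): a single `quad_vec` call with an `args` tuple replaces both the per-point Python loop and the lambda.

```python
import numpy as np
from scipy import integrate


def integrand(z, q):
    # vector-valued in q, scalar in the integration variable z
    return np.exp(-z * q)


q = np.linspace(0.1, 1.0, 5)

# one quad_vec call integrates over z for every q value at once;
# args replaces a lambda closure (requires scipy >= 1.8 for quad_vec's args)
y_vec, err = integrate.quad_vec(integrand, 0.0, 10.0, args=(q,))

# equivalent per-point loop with scalar quad
y_loop = np.array([integrate.quad(integrand, 0.0, 10.0, args=(qi,))[0]
                   for qi in q])

assert np.allclose(y_vec, y_loop)
```

Both paths agree; the `quad_vec` path avoids 500 separate adaptive-quadrature runs when `q` has 500 points.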
All together,
```python
import enum
import typing

import scipy
import matplotlib.pyplot as plt
import numpy as np


class Distribution(enum.IntEnum):
    G = 1
    LN = 2


def hc_formfactor(q: np.ndarray, radius: float) -> np.ndarray:
    y = radius/q * scipy.special.j1(q * radius)
    return y**2


def g_distribution(z: float, radius: float, pd: float) -> float:
    return np.exp(
        -0.5*(
            (z - radius)/pd
        )**2
    ) / np.sqrt(2*np.pi) / pd


def ln_distribution(z: float, radius: float, pd: float) -> float:
    return np.exp(
        -0.5*(
            np.log(z / radius)/pd
        )**2
    ) / np.sqrt(2*np.pi) / pd / z * radius


def integrand(
    z: float, q: np.ndarray, radius: float, pd: float,
    dist_call: typing.Callable[[float, float, float], float],
) -> np.ndarray:
    return hc_formfactor(q=q, radius=z) * dist_call(z, radius, pd)


def hc(q: np.ndarray, radius: float, pd: float, dist: Distribution):
    nmpts = 4
    va = np.clip(radius - nmpts*pd, min=0)
    vb = radius + nmpts*pd
    dist_call = g_distribution if dist == Distribution.G else ln_distribution
    d, err = scipy.integrate.quad(func=dist_call, a=va, b=vb, args=(radius, pd))
    y, err = scipy.integrate.quad_vec(
        f=integrand, a=va, b=vb, args=(q, radius, pd, dist_call),
    )
    return y/d


def regression_test(radius: float, pd: float) -> None:
    q = np.linspace(start=0.008, stop=1, num=500)

    y = hc(q, radius, pd, dist=Distribution.G)
    assert y.shape == (500,)
    assert np.isclose(y[0], 624351.5074680314, rtol=1e-14, atol=0)
    assert np.isclose(y[-1], 20.43871887263733, rtol=1e-14, atol=0)
    assert np.isclose(y.mean(), 22846.79848431491, rtol=1e-14, atol=0)

    y = hc(q, radius, pd, dist=Distribution.LN)
    assert y.shape == (500,)
    assert np.isclose(y[0], 624739.6241089568, rtol=1e-14, atol=0)
    assert np.isclose(y[-1], 10.3404639914693, rtol=1e-14, atol=0)
    assert np.isclose(y.mean(), 22831.015200474, rtol=1e-14, atol=0)


def plot_hc(radius: float, pd: float) -> None:
    q = np.linspace(start=0.008, stop=1, num=500)
    y = hc(q, radius, pd, dist=Distribution.G)
    plt.semilogy(q, y)
    plt.show()


if __name__ == '__main__':
    regression_test(radius=40, pd=0.5)
    plot_hc(radius=40, pd=0.5)
```
Pre-calculations
You can do some limited pre-calculation of terms in the two distribution functions, and curry them so that the integrand doesn't need to be passed those terms:
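The currying idea can be sketched on its own with `functools.partial` (the toy numbers here are illustrative): the factors that depend only on `pd` are computed once, and the curried callable then takes only `z`.

```python
import functools

import numpy as np


def gaussian(z, base, coef, radius):
    # base and coef are pre-computed once from pd, outside the integrand
    return coef * base ** ((z - radius) ** 2)


pd = 0.5
base = np.exp(-0.5 / pd**2)          # hoisted out of the per-call arithmetic
coef = 1 / np.sqrt(2 * np.pi) / pd

# the curried callable only takes z, so no extra args reach the integrand
dist_call = functools.partial(gaussian, base=base, coef=coef, radius=40.0)

# identical to evaluating the full normal pdf directly
z = 40.3
direct = np.exp(-0.5 * ((z - 40.0) / pd) ** 2) / (np.sqrt(2 * np.pi) * pd)
assert np.isclose(dist_call(z), direct)
```

The algebra relies on `base ** ((z - radius) ** 2)` equalling `exp(-0.5 * ((z - radius) / pd) ** 2)`, so the quadrature sees a cheaper expression without changing the result.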
```python
import enum
import functools
import typing

import scipy
import matplotlib.pyplot as plt
import numpy as np


class Distribution(enum.IntEnum):
    G = 1
    LN = 2


def hc_formfactor(q: np.ndarray, radius: float) -> np.ndarray:
    y = radius/q * scipy.special.j1(q * radius)
    return y**2


def g_distribution(z: float, base: float, coef: float, radius: float) -> float:
    return coef*base**((z - radius)**2)


def ln_distribution(z: float, base: float, coef: float, radius: float) -> float:
    return coef/z * base**(np.log(z/radius)**2)


def integrand(
    z: float, q: np.ndarray,
    dist_call: typing.Callable[[float], float],
) -> np.ndarray:
    return hc_formfactor(q=q, radius=z) * dist_call(z)


def hc(q: np.ndarray, radius: float, pd: float, dist: Distribution):
    nmpts = 4
    va = np.clip(radius - nmpts*pd, min=0)
    vb = radius + nmpts*pd
    base = np.exp(-0.5/pd**2)
    coef = 1/np.sqrt(2*np.pi)/pd

    match dist:
        case Distribution.G:
            dist_call = functools.partial(
                g_distribution, radius=radius, base=base, coef=coef,
            )
        case Distribution.LN:
            dist_call = functools.partial(
                ln_distribution, radius=radius, base=base, coef=coef*radius,
            )
        case _:
            raise ValueError('Invalid distribution')

    d, err = scipy.integrate.quad(func=dist_call, a=va, b=vb)
    y, err = scipy.integrate.quad_vec(f=integrand, a=va, b=vb, args=(q, dist_call))
    return y/d


def regression_test(radius: float, pd: float) -> None:
    q = np.linspace(start=0.008, stop=1, num=500)

    y = hc(q, radius, pd, dist=Distribution.G)
    assert y.shape == (500,)
    assert np.isclose(y[0], 624351.5074680314, rtol=1e-14, atol=0)
    assert np.isclose(y[-1], 20.43871887263733, rtol=1e-14, atol=0)
    assert np.isclose(y.mean(), 22846.79848431491, rtol=1e-14, atol=0)

    y = hc(q, radius, pd, dist=Distribution.LN)
    assert y.shape == (500,)
    assert np.isclose(y[0], 624739.6241089568, rtol=1e-14, atol=0)
    assert np.isclose(y[-1], 10.3404639914693, rtol=1e-14, atol=0)
    assert np.isclose(y.mean(), 22831.015200474, rtol=1e-14, atol=0)


def plot_hc(radius: float, pd: float) -> None:
    q = np.linspace(start=0.008, stop=1, num=500)
    y = hc(q, radius, pd, dist=Distribution.G)
    plt.semilogy(q, y)
    plt.show()


if __name__ == '__main__':
    regression_test(radius=40, pd=0.5)
    plot_hc(radius=40, pd=0.5)
```