Monte Carlo Methods

The basic idea of Monte Carlo methods is to use random numbers and statistics to solve complicated problems that are hard to attack in any other way. Typically this is achieved through stochastic simulation, for example with Markov chains. The most straightforward application, however, is computing high-dimensional integrals by throwing random points into a volume and estimating the result. There are numerous other applications, such as finding global minima through simulated annealing, which draws its inspiration from statistical physics.

Here we will cover only a small fraction of classical Monte Carlo algorithms; the full family of Monte Carlo methods is far too large to discuss here.

First we need to understand what random numbers on a computer actually are.

Random Numbers

It is very hard to implement a good random number generator, because a sequence of truly random numbers cannot be generated by a deterministic computer; only pseudo-random number generators can be coded. There are several excellent pseudo-random number generators available in various libraries, which give very satisfactory results in combination with Monte Carlo methods or multidimensional integration. For every Monte Carlo application it is crucial to use a high-quality random number generator. In practice, it is best to select a few generators with a good reputation and make sure that the results do not depend on the choice of generator.

NumPy's default generator is now PCG64 (from the PCG family), which has a very good reputation and is relatively new. Older generators with a good reputation are the Mersenne Twister generators, which usually go by the name MT19937. For example, the GNU Scientific Library implements them as gsl_rng_mt19937, and the Intel MKL library calls them VSL_BRNG_MT19937 and VSL_BRNG_SFMT19937.
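With NumPy the underlying bit generator can be swapped with a one-line change, which makes the sanity check suggested above (re-running the calculation with a different generator) very easy. A minimal sketch, assuming a NumPy version recent enough to have the Generator API; the helper estimate_pi is purely illustrative:

import numpy as np

def estimate_pi(rng, N=1_000_000):
 # simple hit-or-miss estimate of pi, used here only to compare generators
 x = rng.uniform(-1, 1, size=N)
 y = rng.uniform(-1, 1, size=N)
 return 4 * np.count_nonzero(x**2 + y**2 <= 1) / N

rng_pcg = np.random.default_rng(42) # PCG64 bit generator (the default)
rng_mt = np.random.Generator(np.random.MT19937(42)) # Mersenne Twister bit generator

print(estimate_pi(rng_pcg), estimate_pi(rng_mt)) # should agree within ~1/sqrt(N)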

How do random number generators work?

The simplest and fastest random number generators, which are unfortunately not of high quality, are linear congruential generators (LCG): $$x_{n+1} = (a\,x_n + c) \bmod m,$$ for example $a=16843009$, $c=3014898611,ドル and $m=2^{32}$ in the cc65 generator. (A minimal sketch follows the list below.)

Problems with this generator:

  • low-dimensional correlations (points fall on hyperplanes),
  • short periods unless m is large,
  • and poor randomness in the low-order bits.
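As a concrete illustration, here is a minimal LCG with the cc65 parameters quoted above, together with a look at its lowest-order bit; the function lcg is only a didactic sketch, not a generator one should actually use. Since $a,ドル $c$ are odd and $m$ is a power of two, the lowest bit satisfies $x_{n+1} \equiv x_n + 1 \pmod 2$ and simply alternates:

def lcg(seed, n, a=16843009, c=3014898611, m=2**32):
 """Generate n numbers from the linear congruential recurrence x -> (a*x + c) mod m."""
 x = seed
 out = []
 for _ in range(n):
 x = (a * x + c) % m
 out.append(x)
 return out

xs = lcg(seed=1, n=16)
print([x % 2 for x in xs]) # lowest-order bit: a strictly alternating 0,1,0,1,... pattern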

PCG has two steps:

 1. A linear congruential state update: $x_{n+1} = a\,x_n + c \pmod{2^{64}}$
  2. A non-linear permutation output function: "scrambles" the bits of $x_{n+1}$ before returning them as the random number.

The first step can be understood as a linear drift in phase space, while the second step, the permutation, acts as a non-linear shear or rotation, scrambling coordinates so that any observable (bit pattern) looks random.
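For concreteness, the 32-bit member of the PCG family (the XSH-RR variant) can be sketched in a few lines. The code below roughly follows the published minimal reference implementation of pcg32, translated to Python; it is meant only to illustrate the "LCG state update + output permutation" structure, not for production use, and the seed/stream constants are just example values:

MASK64 = (1 << 64) - 1
MULT = 6364136223846793005 # 64-bit LCG multiplier used by PCG

def pcg32(state, inc):
 """One step of PCG32 (XSH-RR): returns (new_state, 32-bit output)."""
 oldstate = state
 state = (oldstate * MULT + inc) & MASK64 # step 1: linear congruential state update
 xorshifted = (((oldstate >> 18) ^ oldstate) >> 27) & 0xFFFFFFFF # step 2: xorshift ...
 rot = oldstate >> 59 # ... plus a data-dependent rotation
 out = ((xorshifted >> rot) | (xorshifted << ((32 - rot) & 31))) & 0xFFFFFFFF
 return state, out

state, inc = 0x853c49e6748fea9b, 0xda3e39cb94b95bdb # example seed/stream constants; inc must be odd
for _ in range(5):
 state, r = pcg32(state, inc)
 print(r)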

Modern generators like PCG are of course much better, but one still needs to remember that these are pseudo-random numbers. Before you spend millions of CPU hours, you should test that the results do not depend on the choice of random number generator within the statistical error.

Multidimensional integration

For multidimensional numerical integration in more than about four dimensions, Monte Carlo is more appropriate than products of one-dimensional quadratures. If the function is smooth enough, or we know how to transform the integral to make it smooth, the integration can be performed with MC. The reason for the success of MC is that its error, according to the central limit theorem, is always proportional to 1ドル/\sqrt{N},ドル independent of dimension.

The error of one-dimensional quadratures can be estimated as follows. If the number of points used in each dimension is $N_1,ドル the number of all points used in $d$ dimensions is $N=(N_1)^d$. The error of the trapezoid rule was estimated to be 1ドル/(N_1)^2,ドル therefore the error in $d$ dimensions is 1ドル/N^{2/d}$. It is therefore clear that for $d=4$ the Monte Carlo error 1ドル/\sqrt{N}$ and the trapezoid-rule error 1ドル/N^{2/4}$ are equal, and for $d>4$ Monte Carlo wins.
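As a concrete illustration, take $N=10^{8}$ points in $d=10$ dimensions: a product quadrature gets only $N_1=10^{0.8}\approx 6$ points per dimension, so the trapezoid error is of order $N^{-2/d}=10^{-1.6}\approx 2.5\times 10^{-2},ドル while the Monte Carlo error is of order $N^{-1/2}=10^{-4}$.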

Straightforward MC

For more or less flat functions, the integration is straightforward: \begin{eqnarray} \frac{1}{V}\int f dV \approx \langle f\rangle \pm \sqrt{\frac{\langle f^2\rangle-\langle f\rangle^2}{N}} \end{eqnarray} where \begin{eqnarray} \langle f\rangle\equiv \frac{1}{N}\sum_{i=0}^{N-1} f(x_i)\\ \langle f^2\rangle\equiv \frac{1}{N}\sum_{i=0}^{N-1} f^2(x_i) \end{eqnarray} and the points $x_i$ are uniformly distributed in the volume $V$.

If the function $f$ is rapidly varying, the variance will be large and the precision of the integral poor. The 1ドル/\sqrt{N}$ convergence is quite slow! If one has a lot of computer time and patience, one might still use this method, because it is so straightforward to implement.

If the integration region $V$ is complicated and it is hard to generate points uniformly inside $V,ドル one can simply find a larger and simpler volume $W$ which contains $V$. One then samples over $W$ and defines the function $f$ to be zero outside $V$. Of course the error increases, because the number of "good" points is smaller.

See also the Wikipedia article on Monte Carlo integration.

Example 1: Computing $\pi$ by integrating the function $f=1$ inside the unit circle and $f=0$ outside, sampling points uniformly in the box $x\in[-1,1],ドル $y\in[-1,1]$.

In [1]:
from numpy import random
batch = 1000 # number of points generated per iteration
num_inside = 0 # number of points falling inside the unit circle
Ntot = 0 # total number of points generated
for i in range(1000):
 X = random.uniform(low=-1, high=1, size=batch)
 Y = random.uniform(low=-1, high=1, size=batch)
 num_inside += sum(X**2+Y**2<=1) # count the hits inside the circle
 Ntot += batch
V = 4 # area of the enclosing box [-1,1]x[-1,1]
print(V * num_inside/Ntot)
3.141948
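The statistical error follows directly from the formula above; since $f$ is an indicator function, $\langle f^2\rangle=\langle f\rangle$. A short continuation of the cell above, reusing num_inside, Ntot and V:

from numpy import sqrt

p = num_inside / Ntot # <f> = <f^2> for an indicator function
error = V * sqrt((p - p**2) / Ntot) # V * sqrt((<f^2> - <f>^2)/N)
print(V * p, '+-', error)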

Example 2: Integrate a Gaussian of width 1 inside the unit circle: $f=\exp(-x^2-y^2)$ for $x^2+y^2\le 1$. The exact result is $$\int_{\mathrm{circle}} \exp(-x^2-y^2)\, dx\, dy = 2\pi \int_0^1 r\, e^{-r^2}\, dr = \pi (1-1/e).$$

In [3]:
from numpy import *
batch = 1000 # number of points per iteration
faver = 0 # accumulates the sum of f over sampled points
Ntot = 0
for i in range(1000):
 X = random.uniform(low=-1, high=1, size=batch)
 Y = random.uniform(low=-1, high=1, size=batch)
 R2 = X**2+Y**2
 is_inside = (R2<=1).astype(int) # f is set to zero outside the unit circle
 faver += sum(exp(-R2)* is_inside)
 Ntot += batch
V = 4 # area of the enclosing box
print(V * faver/Ntot, 'exact=', pi*(1-exp(-1)))
1.9862011126318213 exact= 1.9858653037988714

Importance sampling

The usefulness of Monte Carlo becomes much more apparent when an importance sampling strategy is implemented. Of course one needs to know something about the function to implement the strategy, but one is rewarded with much higher accuracy.

It is simplest to illustrate the idea in 1D. If one knows that the function $f(x)$ to be integrated is roughly proportional to another function $w(x)$ in the region that contains most of the weight of the integral, we may rewrite the integral as \begin{eqnarray} \int f(x)dx = \int \frac{f(x)}{w(x)} w(x) dx. \end{eqnarray} If the weight function $w(x)$ is a simple analytic function, which can be integrated analytically, and obeys the following constraints

  • $w(x)>0$ for every $x$
  • $\int w(x)dx = 1$

we can define \begin{eqnarray} W(x) = \int_{-\infty}^x w(t)dt \qquad \rightarrow\qquad dW(x) = w(x)dx \end{eqnarray} and rewrite \begin{eqnarray} \int f(x)dx = \int \frac{f(x)}{w(x)} dW(x) = \int \frac{f(x(W))}{w(x(W))} dW\rightarrow \left\langle \frac{f(x(W))}{w(x(W))}\right\rangle_{W\; uniform\;\in[0,1]} \nonumber \end{eqnarray}

If the function $f/w$ on the mesh $W$ is reasonably flat, it can be efficiently integrated by MC. The error is now proportional to $$\sqrt{\frac{\langle(f/w)^2\rangle-\langle f/w\rangle^2}{N}},$$ which is greatly reduced when $f/w$ is nearly constant.

To implement the algorithm, we generate uniform random numbers $r$ in the interval $r\in[0,1],ドル which correspond to the variable $W$. We then solve the equation $x=W^{-1}(r)$ to get $x$ and use it to evaluate $f(x)/w(x)$. The random numbers are therefore uniformly distributed on the mesh $W,ドル while they are non-uniformly distributed on the mesh $x$.

This can also be written as \begin{eqnarray} \int f(x)dx =\left\langle \frac{f(x)}{w(x)}\right\rangle_{\frac{dP}{dx}=w(x)} \nonumber \end{eqnarray} because the distribution of the points $x$ is $$\frac{dP}{dx}=\frac{dP}{dW}\frac{dW}{dx}$$ and since the distribution $\frac{dP}{dW}$ is uniform and $\frac{dW}{dx}=w(x),ドル we have $$\frac{dP}{dx}=w(x).$$

The classic example is the exponential weight function \begin{eqnarray} w(x) = \frac{1}{\lambda}e^{-x/\lambda}\qquad \mathrm{for}\; x>0. \end{eqnarray} This is equivalent to our exponentially distributed mesh points: most of them are going to be close to 0ドル$ and only a few at large $x$.

The cumulative integral is $W(x)=1-e^{-x/\lambda},ドル which gives for the inverse $x=-\lambda\ln(1-W)$. The integral \begin{equation} \int_0^\infty f(x)dx=\int_0^1 \frac{f(-\lambda\ln(1-W))}{w(-\lambda\ln(1-W))} dW = \int_0^1 f(-\lambda\ln(1-W))\frac{\lambda\, dW}{1-W} \end{equation} is easily evaluated with MC if $f(x)$ is an exponentially falling function: \begin{eqnarray} \int_0^\infty f(x)dx\rightarrow \left\langle \lambda \frac{f(-\lambda\ln(1-W))}{1-W}\right\rangle_{W\;\mathrm{uniform}\in[0,1]} \end{eqnarray}

Since $\frac{dP}{dW}=1$ ($W$ is uniformly distributed), the probability density for $x$ is $\frac{dP}{dx}=w(x)$; therefore $x$ is an exponentially distributed random number.

We could also write \begin{eqnarray} \int_0^\infty f(x)dx\rightarrow \left\langle\frac{f(x)}{\frac{e^{-x/\lambda}}{\lambda}}\right\rangle_{\frac{dP}{dx}=e^{-x/\lambda}/\lambda} \end{eqnarray}

Example 1: $\int_0^\infty \exp(-x)\frac{\sin(x)}{x} dx = 0.785398$

First let us check the distribution of points after the transformation $x=-\ln(1-W)$.

In [15]:
from numpy import *
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Npoints=100_000
Nbins=100
L=5
W = random.uniform(low=0, high=1, size=Npoints)
X = -log(1-W) # inverse of W(x)=1-exp(-x), i.e. exponential with lambda=1
hist, bin_edges = histogram(X, bins=linspace(0,L,Nbins))
xn=0.5*(bin_edges[:-1]+bin_edges[1:]) # bin centers
plt.bar(xn, hist*(Nbins-1)/(L*Npoints), width=0.04) # normalize counts to a density
plt.plot(xn, exp(-xn),'r') # compare with the density w(x)=exp(-x)
plt.show()
[Figure: histogram of the transformed points compared with the exponential density $e^{-x}$]
In [42]:
def f(x):
 return exp(-x)*sin(x)/x
 
batch = 100
faver = 0 # accumulates the sum of f/w over sampled points
Nall = 0
f2 = 0 # accumulates the sum of (f/w)^2
for i in range(100):
 W = random.uniform(low=0, high=1, size=batch)
 X = -log(1-W) # points distributed with density w(x)=exp(-x)
 faver += sum(f(X)/exp(-X)) # f/w at the sampled points
 Nall += batch
 f2 += sum((f(X)/exp(-X))**2)
f2 /= Nall
faver /= Nall
print(faver, '+-',sqrt((f2-faver**2)/Nall),'exact=', 0.785398)
0.7861767875860837 +- 0.0029654492885685083 exact= 0.785398

Compare with naive Monte Carlo

In [41]:
batch = 100
faver = 0
Nall = 0
L=100.
f2 = 0
for i in range(100):
 X = random.uniform(low=0, high=L, size=batch)
 faver += sum(f(X))
 Nall += batch
 f2 += sum(f(X)**2)
f2 /= Nall
faver /= Nall
print(L*faver,'+-', L*sqrt((f2-faver**2)/Nall), 'exact=', 0.785398)
0.7550766998739598 +- 0.06401388537192822 exact= 0.785398

Another very useful weight function is the Gaussian distribution \begin{eqnarray} w(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-x_0)^2}{2\sigma^2}}. \end{eqnarray} How do we get a random number $x$ distributed according to this distribution? The cumulative integral gives the $\mathrm{erf}$ function, and its inverse is not simple to evaluate.

The trick is to use two random numbers to get a Gaussian distribution (the Box–Muller transform). Consider the following algorithm: \begin{eqnarray} x_1 = x_0 + \sqrt{-2\sigma^2\ln r_1}\cos(2\pi r_2)\\ x_2 = x_0 + \sqrt{-2\sigma^2\ln r_1}\sin(2\pi r_2) \end{eqnarray} The distribution of $r_1$ and $r_2$ is uniform in the interval $[0,1]$. The distribution of $x_1$ and $x_2$ is \begin{eqnarray} \frac{d^2 P}{dx_1 dx_2} = \frac{d^2 P}{dr_1 dr_2} \left|\frac{\partial(r_1,r_2)}{\partial(x_1,x_2)}\right|= \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x_1-x_0)^2/(2\sigma^2)}\, \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x_2-x_0)^2/(2\sigma^2)}, \end{eqnarray} therefore Gaussian. In this way we obtain $x$ distributed according to a Gaussian, and we can evaluate \begin{eqnarray} \int f(x)dx = \left\langle \frac{f(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-x_0)^2/(2\sigma^2)}} \right\rangle_{\frac{dP}{dx}=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-x_0)^2}{2\sigma^2}}} \end{eqnarray}

In [56]:
from numpy import *
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
r1 = random.uniform(low=0, high=1, size=10_000)
r2 = random.uniform(low=0, high=1, size=10_000)
s2 = 1.0 # s^2
t = sqrt(-2*s2*log(1-r1)) # note log(1-r1) is safe, while log(r1) is not!
X1 = t*cos(2*pi*r2)
X2 = t*sin(2*pi*r2)
X = hstack((X1,X2))
Npts = len(X)
Nbins = 100
L2 = 5 
hist, bin_edges = histogram(X, bins=linspace(-L2,L2,Nbins))
xn=0.5*(bin_edges[:-1]+bin_edges[1:]) # bin centers
plt.bar(xn, hist*(Nbins-1)/(2*L2*Npts), width=0.05) # normalize counts to a density
plt.plot(xn, exp(-xn**2/(2*s2))/(sqrt(2*pi*s2)), 'r') # expected Gaussian density
plt.show()
[Figure: histogram of Box–Muller points compared with the Gaussian density]

Example 2: $$\int_{-\infty}^{\infty} \exp(-x^2)\, \frac{\sin(x)}{x}\, dx = \pi\,\mathrm{erf}(1/2) \approx 1.6352$$

In [67]:
def f(x):
 return exp(-x**2)*sin(x)/x
 
batch = 100
faver = 0
Nall = 0
f2 = 0
spi = sqrt(pi) # normalization of the weight: w(x) = exp(-x^2)/sqrt(pi), i.e. sigma^2 = 1/2
for i in range(100):
 r1 = random.uniform(low=0, high=1, size=batch)
 r2 = random.uniform(low=0, high=1, size=batch)
 t = sqrt(-log(1-r1)) # Box-Muller radius for sigma^2 = 1/2
 X1 = t*cos(2*pi*r2)
 X2 = t*sin(2*pi*r2)
 X = hstack((X1,X2)) # points distributed with density w(x)
 faver += sum(f(X) * spi/exp(-X**2)) # f/w at the sampled points
 f2 += sum( (f(X) * spi/exp(-X**2))**2 )
 Nall += len(X)
faver /= Nall
f2 /= Nall
print(faver, '+-',sqrt((f2-faver**2)/Nall),'exact=', 1.6352)
1.6351722239796684 +- 0.0012851133445292995 exact= 1.6352

Compare with naive Monte Carlo

In [70]:
batch = 100
faver = 0
Nall = 0
L = 6
f2 = 0
for i in range(100):
 X = random.uniform(low=-L, high=L, size=2*batch)
 faver += sum( f(X) )
 f2 += sum( f(X)**2 )
 Nall += len(X)
faver /= Nall
f2 /= Nall
print(2*L*faver,'+-',2*L*sqrt((f2-faver**2)/Nall), 'exact=', 1.6352)
1.6224885398880498 +- 0.023618273343361695 exact= 1.6352

What is the best choice for weight function $w$?

If the function $f$ is positive, clearly the best $w$ is simply proportional to $f$. What if $f$ is not positive everywhere? It turns out that the best choice is the absolute value of $f,ドル i.e., \begin{eqnarray} w=\frac{|f|}{\int|f|dV}. \end{eqnarray} The proof is simple. MC importance sampling evaluates \begin{eqnarray} \int f dV = \int \frac{f}{w}w dV\approx \left\langle\frac{f}{w}\right\rangle\pm \sqrt{\frac{\langle(\frac{f}{w})^2\rangle-\langle\frac{f}{w}\rangle^2}{N}} \end{eqnarray} and the error is minimal when \begin{eqnarray} \delta\left(\langle(\frac{f}{w})^2\rangle-\langle\frac{f}{w}\rangle^2 +\lambda\left(\int w\, dV -1\right)\right)=0. \end{eqnarray} This is a constrained minimization of the variance: we vary the weight function $w,ドル and the Lagrange multiplier $\lambda$ enforces the normalization of $w$.

We rewrite the averages as integrals and take the functional derivative with respect to $w$: \begin{eqnarray} \delta\left( \int \frac{f^2}{w^2} w\, dV - \left( \int \frac{f}{w}w\, dV \right)^2+\lambda\left(\int w\, dV -1\right) \right)=0 \end{eqnarray} The second term does not depend on $w$ at all, so we obtain \begin{eqnarray} \int \left(-\frac{f^2}{w^2}+\lambda\right)\delta w\, dV=0\qquad\rightarrow\qquad w\propto|f|.\end{eqnarray}
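As a quick check of how good this choice is, inserting $w=|f|/\int|f|dV$ back into the variance gives \begin{eqnarray} \left\langle\left(\frac{f}{w}\right)^2\right\rangle-\left\langle\frac{f}{w}\right\rangle^2 = \int \frac{f^2}{w}\, dV - \left(\int f\, dV\right)^2 = \left(\int |f|\, dV\right)^2 - \left(\int f\, dV\right)^2, \end{eqnarray} which vanishes whenever $f$ does not change sign: in that ideal case every sample of $f/w$ equals the same constant $\int f\, dV,ドル and a single point already gives the exact answer. In practice $w$ can only approximate $|f|,ドル so the variance is reduced but not eliminated.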

If we know a good approximation for the function $f,ドル we can use this information to sample the same function $f$ to higher accuracy with importance sampling. The solution can thus be improved iteratively. This idea is implemented in the Vegas algorithm, which will be covered in the next chapter.

The Central Limit Theorem (CLT)

If you take many independent samples from any distribution with finite mean and variance, and average them, the distribution of those averages tends to a Gaussian as the sample size $n$ increases.

That’s why sums of random variables (e.g. energies, noise, measurement errors) in physics often look Gaussian — regardless of the microscopic distribution.

We perform 10,000 experiments; in each experiment we average $n$ draws from a uniform distribution. The histogram of the sample means should be centered at the mean of the distribution (here 1ドル/2$) and have width $\sigma_{\mathrm{combined}} = \sigma/\sqrt{n}$. For the uniform distribution on $[0,1],ドル $\sigma=1/\sqrt{12}$.

In [133]:
N_samples = 10_000 # number of experiments
n = 100 # number of draws per experiment
# Choose a highly non-Gaussian distribution — e.g. uniform or exponential
#x = random.exponential(size=(N_samples, n))
x = random.uniform(size=(N_samples, n))
# Compute the mean of each experiment
means = mean(x, axis=1)
Nbins=1000
hist, bin_edges = histogram(means, bins=linspace(0,1,Nbins))
xn=0.5*(bin_edges[:-1]+bin_edges[1:])
#plt.hist(means, bins=50, density=True)
plt.bar(xn, hist*Nbins/N_samples, width=0.001)
s20 = 1./12 # variance of uniform distribution between 0-1
s2 = s20/n # variance of mean over all distributions
plt.plot(xn, exp(-(xn-0.5)**2/(2*s2))/sqrt(2*pi*s2),'r')
plt.title(f"Central Limit Theorem demonstration (n={n})")
plt.xlabel("Sample mean")
plt.ylabel("Probability density")
plt.show()
[Figure: histogram of the sample means compared with the predicted Gaussian]

Miser algorithm

There is another class of algorithms for improving the precision of Monte Carlo sampling. The idea is to divide the volume into smaller subregions and check how rapidly the function $f$ varies in each subregion; a quantitative measure is the spread of the function in each subregion, $\sqrt{\langle f^2\rangle-\langle f\rangle^2}$. One then increases the number of points in those regions where the spread is large. The algorithm is called stratified sampling and is used in the Miser integration routine. The idea is simple and powerful, but it is not very useful for high-dimensional integration, because the number of subregions grows exponentially with the number of dimensions. It is therefore useful only if we have some idea how to construct a small number of subregions in which the variance of $f$ is large. This last trick is also used in the Vegas algorithm, which is probably the best general-purpose algorithm available at the moment. A minimal one-dimensional sketch of stratified sampling is shown below.
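Here is a minimal one-dimensional sketch of the idea (not the actual Miser algorithm; stratified_mc is a made-up helper name): the interval is split into equal subintervals, a small pilot run estimates the spread of $f$ in each, and the remaining points are allocated proportionally to that spread.

from numpy import *

def stratified_mc(f, a=0.0, b=1.0, M=10, n_pilot=50, n_total=10_000):
 """Two-stage stratified MC estimate of int_a^b f(x) dx using M equal subintervals."""
 edges = linspace(a, b, M+1)
 width = (b - a) / M
 # pilot run: estimate the spread of f in each subinterval
 spread = array([std(f(random.uniform(edges[j], edges[j+1], size=n_pilot)))
 for j in range(M)])
 # allocate the remaining points proportionally to the estimated spread
 n_j = maximum((n_total * spread / sum(spread)).astype(int), 2)
 total, err2 = 0.0, 0.0
 for j in range(M):
 fx = f(random.uniform(edges[j], edges[j+1], size=n_j[j]))
 total += width * mean(fx) # contribution of subinterval j
 err2 += width**2 * var(fx) / n_j[j] # its statistical error squared
 return total, sqrt(err2)

# the integrand varies rapidly near x=0, so most points end up in the leftmost subintervals
res, err = stratified_mc(lambda x: 1/sqrt(x + 1e-3))
print(res, '+-', err)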

Vegas algorithm

If we know a good approximation for function $f,ドル we can use this information to sample the same function $f$ to higher accuracy with importance sampling. The solution can thus be improved iteratively. This idea is implemented in Vegas algorithm (see next chapter).

