I need to calculate a Bayesian likelihood in its negative logarithmic form:
$$-\ln L = N\ln M -\sum\limits_{i=1}^{N} \ln\left\{P_i \sum\limits_{j=1}^M \frac{\exp\left[ -\frac{1}{2}\left(\sum_{k=1}^D \frac{(f_{ik}-g_{jk})^2}{\sigma_{ik}^2 + \rho_{jk}^2}\right) \right]}{\prod_{k=1}^D \sqrt{\sigma_{ik}^2 + \rho_{jk}^2}} \right\}$$
where $P_i$, $f_{ik}$, $\sigma_{ik}^2$, $g_{jk}$, and $\rho_{jk}^2$ are all known floats. This is my code:
import numpy as np

# Define random data with correct shapes.
N, M = np.random.randint(10, 1000), np.random.randint(10, 1000)
D = np.random.randint(2, 5)
f_N, g_M = np.random.rand(N, D, 2), np.random.rand(M, D, 2)
P_i = np.random.uniform(0., 1., N)

# Sum over N: store each 'i' element of the sum over M here.
sum_Mi = []
for f_i in f_N:
    # Sum over M.
    sum_M_j = 0.
    for g_j in g_M:
        # Obtain the ijk element inside the sum over M.
        D_sum, sigma_ijk = 0., 1.
        for k, f_ik in enumerate(f_i):
            # Where: f_ik[0] = f_{ik}        ; g_j[k][0] = g_{jk}
            # and:   f_ik[1] = \sigma_{ik}^2 ; g_j[k][1] = \rho_{jk}^2
            D_sum += np.square(f_ik[0] - g_j[k][0]) / (f_ik[1] + g_j[k][1])
            sigma_ijk = sigma_ijk * (f_ik[1] + g_j[k][1])
        sum_M_j += np.exp(-0.5 * D_sum) / np.sqrt(sigma_ijk)
    sum_Mi.append(sum_M_j)

sum_Mi_Pi = P_i * np.array(sum_Mi)
# Remove 0. elements before applying the logarithm below.
sum_N = sum_Mi_Pi[sum_Mi_Pi != 0.]

# Final negative log-likelihood.
lik = len(f_N) * np.log(len(g_M)) - sum(np.log(sum_N))
print(lik)
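A minimal way to time a candidate implementation with timeit (a sketch; the name negloglik_loops is hypothetical, and its body is just the loop code above packaged as a function):

import timeit

def negloglik_loops(f_N, g_M, P_i):
    # Same computation as above, wrapped so it can be run repeatedly.
    sum_Mi = []
    for f_i in f_N:
        sum_M_j = 0.
        for g_j in g_M:
            D_sum, sigma_ijk = 0., 1.
            for k, f_ik in enumerate(f_i):
                D_sum += np.square(f_ik[0] - g_j[k][0]) / (f_ik[1] + g_j[k][1])
                sigma_ijk *= f_ik[1] + g_j[k][1]
            sum_M_j += np.exp(-0.5 * D_sum) / np.sqrt(sigma_ijk)
        sum_Mi.append(sum_M_j)
    sum_Mi_Pi = P_i * np.array(sum_Mi)
    sum_N = sum_Mi_Pi[sum_Mi_Pi != 0.]
    return len(f_N) * np.log(len(g_M)) - np.log(sum_N).sum()

# Absolute times are machine-dependent; only relative comparisons matter.
n_runs = 3
t = timeit.timeit(lambda: negloglik_loops(f_N, g_M, P_i), number=n_runs)
print(t / n_runs, 's per call')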
I'm after an optimization at the code level, i.e. not using multi-threading. I've tried NumPy broadcasting on the innermost loop:
# Obtain ijk element inside the M summatory
D_sum = sum(np.square((f_i[:, 0] - g_j[:, 0])) / (f_i[:, 1] + g_j[:, 1]))
sigma_ijk = np.prod(f_i[:, 1] + g_j[:, 1])
but that actually takes longer to complete than the explicit for loop!
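In context, that replacement sits inside the otherwise unchanged i and j loops (a sketch of the attempt):

sum_Mi = []
for f_i in f_N:
    sum_M_j = 0.
    for g_j in g_M:
        # Broadcast over k; note that the builtin sum() over a small numpy
        # array adds Python-level overhead on every (i, j) pair.
        D_sum = sum(np.square(f_i[:, 0] - g_j[:, 0]) / (f_i[:, 1] + g_j[:, 1]))
        sigma_ijk = np.prod(f_i[:, 1] + g_j[:, 1])
        sum_M_j += np.exp(-0.5 * D_sum) / np.sqrt(sigma_ijk)
    sum_Mi.append(sum_M_j)

Because D is tiny here (2 to 4), the per-pair cost of building temporary arrays outweighs the loop it removes, which is the likely cause of the slowdown.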
I also tried the logsumexp function from scipy, combined with the broadcast arrays:
from scipy.special import logsumexp

sum_N = 0.
# Sum over N: accumulate each 'i' contribution here.
for f_i in f_N:
    # Sum over M: collect exponents (A) and scale factors (B) first.
    A, B = [], []
    for g_j in g_M:
        # Obtain the ijk element inside the sum over M.
        A.append(sum(
            np.square(f_i[:, 0] - g_j[:, 0]) / (f_i[:, 1] + g_j[:, 1])))
        B.append(np.prod(f_i[:, 1] + g_j[:, 1]))
    sum_N += logsumexp(-.5 * np.array(A), b=1. / np.sqrt(B))

# Final negative log-likelihood. Note: this assumes no P_i is exactly 0.
# (true for the uniform draws above); otherwise the matching logsumexp
# terms in sum_N would have to be dropped as well.
lik = len(f_N) * np.log(len(g_M)) - sum(np.log(P_i[P_i != 0.])) - sum_N
print(lik)
This resulted in a marginal improvement over just using broadcasting, but still slower than the original code.
1 Answer
This is much faster. I added a seed so that results can be compared across runs.
import numpy as np
# Define random data with correct shapes.
np.random.seed(0)
N, M = np.random.randint(10, 1000), np.random.randint(10, 1000)
D = np.random.randint(2, 5)
f_N, g_M = np.random.rand(N, D, 2), np.random.rand(M, D, 2)
P_i = np.random.uniform(0., 1., N)
# Replace the loops with broadcast array operations, followed by sums.
fgsum = f_N[:, None, :, 1] + g_M[None, :, :, 1]  # sigma^2 + rho^2, shape (N, M, D)
fgdif = f_N[:, None, :, 0] - g_M[None, :, :, 0]  # f - g, shape (N, M, D)
print(fgsum.shape)  # e.g. (694, 569, 3)
Dsum = (fgdif**2 / fgsum).sum(axis=-1)      # inner sum over k, shape (N, M)
sigma_ijk = np.prod(fgsum, axis=-1)         # product over k, shape (N, M)
sum_M_j = np.exp(-0.5 * Dsum) / np.sqrt(sigma_ijk)
sum_Mi = np.sum(sum_M_j, axis=-1)           # sum over j, shape (N,)
# as before
sum_Mi_Pi = P_i * sum_Mi
# Remove 0. elements before applying the logarithm below.
sum_N = sum_Mi_Pi[sum_Mi_Pi != 0.]
# Final negative logarithmic likelihood
lik = len(f_N) * np.log(len(g_M)) - sum(np.log(sum_N))
print(lik)
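The same broadcast arrays also combine with the logsumexp idea from the question, which avoids the exp underflowing to 0 when the exponents get very negative; a sketch (the names log_inner, mask, and lik_stable are mine):

from scipy.special import logsumexp

# Stable log of the inner sum over j, for each i:
#   log sum_j exp(-0.5 * Dsum_ij) / sqrt(sigma_ijk_ij)
log_inner = logsumexp(-0.5 * Dsum, b=1. / np.sqrt(sigma_ijk), axis=-1)

# P_i enters through its logarithm; drop zero weights first, as before.
mask = P_i != 0.
lik_stable = len(f_N) * np.log(len(g_M)) - np.sum(np.log(P_i[mask]) + log_inner[mask])
print(lik_stable)  # matches lik up to floating-point error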
- Excellent! This code is between 80 and 100 times faster on my machine, thank you very much hpaulj! (Gabriel, Mar 17, 2017)
- You can get another few percent faster by replacing sum(np.log(sum_N)) with np.log(sum_N).sum() or np.sum(np.log(sum_N)). (Graipher, Mar 18, 2017)
- Okay... nice that this is faster, but... WHY??? (Vogel612, Mar 18, 2017)
- Because it replaces Python loops with compiled numpy code. Sorry, I normally answer on SO where this kind of numpy 'vectorization' is an everyday question. (hpaulj, Mar 18, 2017)
- More about increasing performance with numpy: codereview.stackexchange.com/questions/17702/… (hpaulj, Mar 18, 2017)
- D is so small compared to N and M, e.g. 3 vs. 500 and 500; it's just a few loops of a costly expression.
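To illustrate that last comment: since D is tiny, you can also broadcast over i and j while keeping an explicit Python loop over k, so each iteration is one large (N, M) vectorized operation and the (N, M, D) intermediates are never materialized. A sketch, using the same arrays as in the answer (the names Dsum_k, sigma_k, and sum_Mi_k are hypothetical):

# Accumulate over the small D axis with a Python loop; the loop overhead
# is negligible because each pass does one big (N, M) array operation.
Dsum_k = np.zeros((len(f_N), len(g_M)))
sigma_k = np.ones((len(f_N), len(g_M)))
for k in range(f_N.shape[1]):
    s = f_N[:, None, k, 1] + g_M[None, :, k, 1]  # (N, M) variance sums
    d = f_N[:, None, k, 0] - g_M[None, :, k, 0]  # (N, M) differences
    Dsum_k += d**2 / s
    sigma_k *= s

sum_Mi_k = np.sum(np.exp(-0.5 * Dsum_k) / np.sqrt(sigma_k), axis=-1)
# Should match sum_Mi from the answer up to floating-point error.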