I have some code that is slow because it loops over every row in an input matrix `Y`. Obviously, this code does not scale with the number of rows. I don't think it's critical to understand what the code does, but for the curious, it is a Gibbs sampling step for sampling the dispersion parameter `r` of the negative binomial distribution.

Is there anything I can do to make this faster? I've tried `for y in Y_j[Y_j > 0]` in the `crt_sum` function (sketched after the code below), but that's not actually faster according to a simple speed test. I suspect this is because in many cases the masking already loops over every row of the `j`th column `Y_j` of `Y` in C, and then the Python `for` loop does it again.
import numpy as np

def sample_r(Y, P, R, e0=1e-2, f0=1e-2):
    """Sample negative binomial dispersion parameter `r` based on
    (Zhou 2020). See:

    - http://people.ee.duke.edu/~mz1/Papers/Mingyuan_NBP_NIPS2012.pdf
    - https://mingyuanzhou.github.io/Softwares/LGNB_Regression_v0.zip
    """
    J = Y.shape[1]
    for j in range(J):
        L = crt_sum(Y, R, j)
        A = e0 + L
        # `maximum` is element-wise, while `max` is not.
        maxes = np.maximum(1 - P[:, j], -np.inf)
        B = 1. / (f0 - np.sum(np.log(maxes)))
        R[j] = np.random.gamma(A, B)
    # `R` cannot be zero.
    inds = np.isclose(R, 0)
    R[inds] = 0.0000001
    return R

def crt_sum(Y, R, j):
    """Sum independent Chinese restaurant table random variables.
    """
    Y_j = Y[:, j]
    r = R[j]
    L = 0
    tbl = r / (r + np.arange(Y_j.max()))
    for y in Y_j:
        if y > 0:
            u = np.random.uniform(0, 1, size=y)
            inds = np.arange(y)
            L += (u <= tbl[inds]).sum()
    return L
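
For reference, the `Y_j[Y_j > 0]` variant I mentioned above only changes the inner loop of `crt_sum`; roughly this (the function name here is just for illustration):

def crt_sum_nonzero(Y, R, j):
    """Variant of `crt_sum` that filters out zero counts up front."""
    Y_j = Y[:, j]
    r = R[j]
    L = 0
    tbl = r / (r + np.arange(Y_j.max()))
    # NumPy filters the zeros in C, but Python still loops over every remaining row.
    for y in Y_j[Y_j > 0]:
        u = np.random.uniform(0, 1, size=y)
        L += (u <= tbl[:y]).sum()
    return L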
Synthetic data should be sufficient. Basically, `Y` is a matrix of count data (non-negative integers), `P` is a matrix of numbers in the range [0, 1], and `R` is a vector of positive (nonzero) real numbers. This should be sufficient to generate realistic data:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

N = 100
J = 10
Y = np.arange(N*J).reshape(N, J)
P = sigmoid(np.random.random((N, J)))
R = np.arange(J, dtype=float)+1

sample_r(Y, P, R)
I currently don't have any tests (sorry, lonely researcher code), but assuming my implementation is correct, this should pass:

assert(sample_r(Y, P, R) == sample_r_fast(Y, P, R))

where `sample_r_fast` is a faster implementation. Thanks in advance.
1 Answer
Yes, it is possible to vectorize the code. Does that make it faster? That depends on the spread of values in your data `Y`.

Computing the `scale` parameter can be vectorized without issues and should be faster.

For the `shape` parameter - I haven't read the papers - you seem to use something that looks quite similar to rejection sampling. You compute a probability distribution `tbl` over NxJ, then draw `Y[n, j]` many samples from a uniform distribution and check how often you land within `tbl[n, j]`. `shape` is then the sum of "hits" along N.

The naive approach to vectorizing this would be to draw the same number of samples for each `Y[n, j]`, in this case at least `max(Y)` many. This can create a lot of overhead if your data has a few really large values but is otherwise small. If the values are fairly close together, this will make things faster. If you have a way to quickly (pre-)generate large quantities of uniform numbers, this limitation doesn't matter.
Here is the code:
def sample_r_vec(Y, P, R, e0=1e-2, f0=1e-2, random_numbers=None):
    """Sample negative binomial dispersion parameter `r` based on
    (Zhou 2020). See:

    - http://people.ee.duke.edu/~mz1/Papers/Mingyuan_NBP_NIPS2012.pdf
    - https://mingyuanzhou.github.io/Softwares/LGNB_Regression_v0.zip
    """
    if random_numbers is None:
        random_numbers = np.random.uniform(0, 1, size=(*Y.shape, np.max(Y) + 1))

    # compute shape
    Y_max_vec = np.arange(np.max(Y) + 1).reshape((-1, 1))
    R_vec = R.reshape((1, -1))
    tbl = (R_vec / (R_vec + Y_max_vec))
    tbl = tbl.reshape((*tbl.shape, 1))

    N_vec = np.arange(Y.shape[0]).reshape(-1, 1)
    J_vec = np.arange(Y.shape[1]).reshape(1, -1)
    sum_hits = np.cumsum(random_numbers <= tbl.T, axis=2)[N_vec, J_vec, Y - 1]
    sum_hits[Y == 0] = 0
    shape = e0 + np.sum(sum_hits, axis=0)

    # compute scale
    maxes = np.maximum(1 - P, -np.inf)
    scale = 1. / (f0 - np.sum(np.log(maxes), axis=0))

    # sample
    R = np.random.gamma(shape, scale)
    R[R < 1e-7] = 1e-7

    return R
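
If you do pre-generate the uniform numbers, passing them in looks roughly like this (a minimal sketch; `default_rng` assumes a reasonably recent numpy, otherwise `np.random.uniform` works the same way):

rng = np.random.default_rng(1337)
# One uniform number per (row, column, possible table index), reusable across calls.
random_numbers = rng.uniform(0, 1, size=(*Y.shape, np.max(Y) + 1))
R_new = sample_r_vec(Y, P, R, random_numbers=random_numbers)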
Edit: Based on the comment, I did find a different bug that is now fixed.
Edit: Demonstration that both versions perform equally.
Firstly, we need a different test to assert equality of the code; try `assert(sample_r(Y, P, R) == sample_r(Y, P, R))` and you will quickly see the problem. There are many reasons why a different test is needed:

1. `sample_r` and `sample_r_fast` return vectors, not scalars. The truth value of a vector is ambiguous, and numpy raises an error for it (see the short snippet after this list).
2. Your code (`sample_r`) modifies `R` in place, which means that the input to `sample_r_fast` will be different from the input to `sample_r`. Logically, the outputs will differ due to this unwanted side effect.
3. Both functions are expected to create random samples and compute results based on them. `assert` tests for exact equality and will hence fail even if both versions are correct. Providing the same random seed won't be enough either, because the order in which the samples are used may differ, which will change the results.
4. This is a numerics problem; even the deterministic part of the code is only accurate up to a tolerance.
5. The code estimates parameters for a distribution and then samples once from the estimated distribution; these samples are then compared. If we want to know whether or not the two versions estimate the same distributions, it seems much more efficient to compare the parameters directly.
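
A short illustration of point (1):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([1.0, 2.0])

# `==` on arrays is element-wise and returns an array of booleans, not a single bool.
print(a == b)  # [ True  True]

# Using that comparison as a truth value therefore raises an error.
try:
    assert (a == b)
except ValueError as err:
    print(err)  # "The truth value of an array with more than one element is ambiguous..."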
To fix all of this, I modified the code in the following way:

- Added an optional parameter to the function that can be used as the random numbers; they are generated if `None`.
- Made sure the same random numbers are used for comparisons each time.
- Removed the sampling of `R` at the end and instead return the estimated shape and scale directly.
- It is also nice to test if the new code is actually faster, so I added a speed test (excluding RNG).
My full script looks like this:
import numpy as np

def sample_r(Y, P, R, e0=1e-2, f0=1e-2, random_numbers=None):
    """Sample negative binomial dispersion parameter `r` based on
    (Zhou 2020). See:

    - http://people.ee.duke.edu/~mz1/Papers/Mingyuan_NBP_NIPS2012.pdf
    - https://mingyuanzhou.github.io/Softwares/LGNB_Regression_v0.zip
    """
    if random_numbers is None:
        random_numbers = np.random.uniform(0, 1, size=(*Y.shape, np.max(Y) + 1))

    A_vec = np.zeros_like(R)
    B_vec = np.zeros_like(R)

    J = Y.shape[1]
    for j in range(J):
        L = crt_sum(Y, R, j, random_numbers)
        A = e0 + L
        A_vec[j] = A
        # `maximum` is element-wise, while `max` is not.
        maxes = np.maximum(1 - P[:, j], -np.inf)
        B = 1. / (f0 - np.sum(np.log(maxes)))
        B_vec[j] = B
        # R[j] = np.random.gamma(A, B)

    # `R` cannot be zero.
    # inds = np.isclose(R, 0)
    # R[inds] = 0.0000001

    return A_vec, B_vec

def crt_sum(Y, R, j, random_numbers):
    """Sum independent Chinese restaurant table random variables.
    """
    Y_j = Y[:, j]
    r = R[j]
    L = 0
    tbl = r / (r + np.arange(Y_j.max()))
    for n_idx, y in enumerate(Y_j):
        if y > 0:
            relevant_numbers = random_numbers[n_idx, j, :y]
            inds = np.arange(y)
            L += (relevant_numbers <= tbl[inds]).sum()
    return L

def sample_r_vec(Y, P, R, e0=1e-2, f0=1e-2, random_numbers=None):
    """Sample negative binomial dispersion parameter `r` based on
    (Zhou 2020). See:

    - http://people.ee.duke.edu/~mz1/Papers/Mingyuan_NBP_NIPS2012.pdf
    - https://mingyuanzhou.github.io/Softwares/LGNB_Regression_v0.zip
    """
    if random_numbers is None:
        random_numbers = np.random.uniform(0, 1, size=(*Y.shape, np.max(Y) + 1))

    # compute shape
    Y_max_vec = np.arange(np.max(Y) + 1).reshape((-1, 1))
    R_vec = R.reshape((1, -1))
    tbl = (R_vec / (R_vec + Y_max_vec))
    tbl = tbl.reshape((*tbl.shape, 1))

    N_vec = np.arange(Y.shape[0]).reshape(-1, 1)
    J_vec = np.arange(Y.shape[1]).reshape(1, -1)
    sum_hits = np.cumsum(random_numbers <= tbl.T, axis=2)[N_vec, J_vec, Y - 1]
    sum_hits[Y == 0] = 0
    shape = e0 + np.sum(sum_hits, axis=0)

    # compute scale
    maxes = np.maximum(1 - P, -np.inf)
    scale = 1. / (f0 - np.sum(np.log(maxes), axis=0))

    return shape, scale

if __name__ == "__main__":
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    np.random.seed(1337)

    N = 100
    J = 10
    Y = np.arange(N*J, dtype=np.int32).reshape(N, J)
    P = sigmoid(np.random.random((N, J)))
    # use test case from comments
    R = np.ones(J, dtype=np.float32); R[J-1] = 5000
    random_numbers = np.random.uniform(0, 1, size=(*Y.shape, np.max(Y) + 1))

    shape_normal, scale_normal = sample_r(Y.copy(), P.copy(), R.copy(), random_numbers=random_numbers)
    shape_vec, scale_vec = sample_r_vec(Y.copy(), P.copy(), R.copy(), random_numbers=random_numbers)

    assert np.all(np.isclose(scale_normal, scale_vec))
    assert np.all(np.isclose(shape_normal, shape_vec))

    # speed test
    import timeit
    t1 = timeit.timeit(lambda: sample_r(Y.copy(), P.copy(), R.copy(), random_numbers=random_numbers), number=100)
    t2 = timeit.timeit(lambda: sample_r_vec(Y.copy(), P.copy(), R.copy(), random_numbers=random_numbers), number=100)
    print(f"Original version total time {t1:.2f}. Vector Version total time {t2:.2f}.")

    N = 1000
    J = 10
    Y = 100*np.ones(N*J, dtype=np.int32).reshape(N, J)
    P = sigmoid(np.random.random((N, J)))
    R = np.arange(J)+1
    random_numbers = np.random.uniform(0, 1, size=(*Y.shape, np.max(Y) + 1))

    t1 = timeit.timeit(lambda: sample_r(Y.copy(), P.copy(), R.copy(), random_numbers=random_numbers), number=100)
    t2 = timeit.timeit(lambda: sample_r_vec(Y.copy(), P.copy(), R.copy(), random_numbers=random_numbers), number=100)
    print(f"Original version total time {t1:.2f}. Vector Version total time {t2:.2f}.")
Note that the test script now has a Python 3.6+ dependency due to the f-strings in the functional test; the relevant code itself does not have that dependency.
Output:
Original version total time 1.29. Vector Version total time 1.05.
Original version total time 8.55. Vector Version total time 0.98.
I made the following modifications to your code: To return the estimated parameters, I stack them up in a vector as they are being computed. To deal with the random sampling, I generate a whole bunch of random numbers (`Y.max()` many) for each `Y[n, j]` and then select the first `y = Y[n, j]` of them. This is the same idea I used to vectorize the code; it 'wastes' `Y.max() - Y[n, j]` many random samples at each step, because the code generates them but then doesn't use them; it's a trick to match the shapes for vectorization. This is the main reason why the vectorized version will only be faster if you either (1) pre-generate the random numbers, or (2) have a situation where the different `Y[n, j]` don't differ too much, so that generating the 'waste' doesn't take more time than is saved by hardware acceleration.
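
To get a rough feel for how much 'waste' this padding causes for a given `Y`, a quick back-of-the-envelope check (my own heuristic, not part of the code above) is the fraction of pre-generated uniforms that are never read:

# Generated: Y.size * (Y.max() + 1) uniforms; actually used: Y.sum() of them.
waste_fraction = 1.0 - Y.sum() / (Y.size * (Y.max() + 1))
print(f"{waste_fraction:.1%} of the pre-generated uniforms are padding")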
I hope this explains what is going on.
PS: Please, please use informative variable names for code in the future. `L`, `n`, `j`, `A`, `B`, etc. are not wise choices if others are to read your code or if you try to understand it 3 years from now.
- +1 because this is close; however, the code has a bug. Note that we do not compute `Y.max()` but rather `Y[:, j].max()`. Hence, `random_numbers` is wrong, `tbl` is wrong, etc. This is actually the main thing I don't know how to vectorize, since it would result in arrays with inconsistently sized rows. The reason your code approximates mine is that in the toy data, all the largest values are in the last row. This means that each column has approximately the same value for `Y[:, j].max()`. Try this pathological case to see the approximation error grow: `R = np.ones(J); R[J-1] = 5000`. – jds, Mar 30, 2020 at 21:00
- @gwg I probably didn't explain my idea well. I can start by computing `tbl = r / (r + np.arange(Y[:, j].max()))`, which is an array of length `Y[:, j].max()`. Note that `np.arange(Y.max())[:Y[:,j]] == np.arange(Y[:,j])` (note the leading `:`). The values after `Y[:,j]` are unused padding to make the shapes match across `j`. As long as I don't access the padding, that is okay. Same idea for `random_numbers`. If the comment is not clear, I will elaborate in the answer. I will also check for other bugs to see why the results differ. – FirefoxMetzger, Mar 31, 2020 at 3:59
- I didn't get a notification for your edit, but it looks great. Thank you. I haven't had time to verify it yet, as this is an active research project with a lot of moving parts, but I'll accept it so I don't forget. You're correct that the `assert(...)` was conceptual, not a real test. I'll test using `np.isclose` and by verifying that the model learns the `R` parameter in a controlled setting. – jds, Apr 7, 2020 at 14:48
- @gwg for such a controlled setting, you can check the second half of my answer :) – FirefoxMetzger, Apr 7, 2020 at 15:36