I wrote the following code in Rcpp, i.e. C++ code that is compiled in R for faster calculations. The Rcpp code computes, in a recursive way, the likelihood function of a hidden Markov model; the likelihood function in this case is `Like_Fun_acoustic_FAST`. First I provide the Rcpp code:
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
#include <RcppArmadilloExtensions/sample.h>
#include <omp.h>
#include <random>
#include <iostream>
#include <cmath>
#include <vector>
#include <numeric>
#include <algorithm>
#include <stdio.h>
#include <math.h>
#include <Rmath.h>
using namespace Rcpp;

static double const log2pi = std::log(2.0 * M_PI); // used by dmvnrm_arma_mc()
// Row-wise log-sum-exp: for each row, max + log(sum(exp(row - max))).
// [[Rcpp::export]]
arma::vec log_sum_exp2_cpp_apply(arma::mat x){
  int n = x.n_rows;
  arma::vec res(n);
  double M;
  for(int j = 0; j < n; ++j){
    M = x.row(j).max();
    res[j] = M + log(sum(exp(x.row(j) - M)));
  }
  return res;
}
// Collects the last column of each slice of the cube into an n x k matrix.
// [[Rcpp::export]]
arma::mat SummarizeLastCol(arma::cube Q){
  int n = Q.n_rows;
  int k = Q.n_slices;
  arma::mat cd(n, k);
  for(int j = 0; j < k; ++j){
    cd.col(j) = Q.slice(j).tail_cols(1);
  }
  return cd;
}
// Multiplies the row vector x in place by the upper-triangular matrix trimat,
// exploiting the triangular structure to halve the work.
void inplace_tri_mat_mult(arma::rowvec &x, arma::mat const &trimat){
  arma::uword const n = trimat.n_cols;
  for(unsigned j = n; j-- > 0;){
    double tmp(0.);
    for(unsigned i = 0; i <= j; ++i)
      tmp += trimat.at(i, j) * x[i];
    x[j] = tmp;
  }
}
// Multivariate normal (log-)density for each row of x, parallelized with OpenMP.
// [[Rcpp::export]]
arma::vec dmvnrm_arma_mc(arma::mat const &x,
                         arma::rowvec const &mean,
                         arma::mat const &sigma,
                         bool const logd = false,
                         int const cores = 1) {
  using arma::uword;
  omp_set_num_threads(cores);
  uword const n = x.n_rows,
           xdim = x.n_cols;
  arma::vec out(n);
  arma::mat const rooti = arma::inv(trimatu(arma::chol(sigma)));
  double const rootisum = arma::sum(log(rooti.diag())),
             constants = -(double)xdim / 2.0 * log2pi,
           other_terms = rootisum + constants;
  arma::rowvec z;
  #pragma omp parallel for schedule(static) private(z)
  for (uword i = 0; i < n; i++) {
    z = x.row(i) - mean;
    inplace_tri_mat_mult(z, rooti);
    out(i) = other_terms - 0.5 * arma::dot(z, z);
  }
  if (logd)
    return out;
  return exp(out);
}
// Forward-algorithm evaluation of the HMM log-likelihood, in log space.
// [[Rcpp::export]]
double Like_Fun_acoustic_FAST(int n, int TT, int k, List Data, arma::mat mu,
                              arma::cube Sigma, arma::vec pi, arma::mat PI,
                              int num_feat){
  arma::mat cb(n, 2);
  arma::cube q(n, TT, k);
  arma::vec res(n);
  // time point 1: log emission density plus log initial probability
  for(int j = 0; j < k; ++j){
    q.slice(j).col(0) = dmvnrm_arma_mc(as<arma::mat>(Data[0]), mu.row(j),
                                       Sigma.slice(j), true, 4) + log(pi[j]);
  }
  // later time points
  for(int t = 1; t < TT; ++t){
    for(int j = 0; j < k; ++j){
      if(k > 1){
        // log-sum-exp over the k previous states, two columns at a time
        q.slice(j).col(t) = q.slice(0).col(t - 1) + log(PI(0, j));
        for(int d = 1; d < k; ++d){
          cb.col(0) = q.slice(j).col(t);
          cb.col(1) = q.slice(d).col(t - 1) + log(PI(d, j));
          q.slice(j).col(t) = log_sum_exp2_cpp_apply(cb);
        }
        q.slice(j).col(t) +=
          dmvnrm_arma_mc(as<arma::mat>(Data[t]), mu.row(j),
                         Sigma.slice(j), true, 4);
      } else { // k == 1: only one previous state to carry forward
        q.slice(j).col(t) = q.slice(0).col(t - 1) + log(PI(0, j)) +
          dmvnrm_arma_mc(as<arma::mat>(Data[t]), mu.submat(j, 0, j, num_feat - 1),
                         Sigma.slice(j), true, 4);
      }
    }
  }
  if(k > 1){
    cb = SummarizeLastCol(q);
    res = log_sum_exp2_cpp_apply(cb);
    return accu(res);
  }
  // k == 1: the last column of the single slice holds the per-observation log-likelihoods
  return accu(q.slice(0).submat(0, TT - 1, n - 1, TT - 1));
}
I tried to optimize all the functions shown by avoiding any cumbersome operations; however, the time needed for `Like_Fun_acoustic_FAST` to run is still too long. The first function, `log_sum_exp2_cpp_apply`, takes a matrix as input and calculates the log-sum-exp of each row. The function `SummarizeLastCol` takes a cube and creates a matrix whose columns are the last columns of the matrices that make up the cube. `dmvnrm_arma_mc` calculates the multivariate normal density, using parallelization.
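For reference, the identity `log_sum_exp2_cpp_apply` implements is the standard max-shift stabilization of log-sum-exp: with \$M = \max_i x_i\$,
\$\$\log \sum_i e^{x_i} = M + \log \sum_i e^{x_i - M},\$\$
which avoids overflow and underflow when the \$x_i\$ are log-densities of large magnitude.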
The likelihood function `Like_Fun_acoustic_FAST` recursively integrates out the latent states of a hidden Markov chain. It does so using the cube `q`, which stores `K` slices corresponding to `K` latent states; for each latent state there is a matrix with `n` rows and `TT` columns, corresponding to observations and time respectively. In particular, we have a time series of `TT` points, and at each point we calculate the multivariate normal density conditional on each possible latent state.
For the number of cores used in the parallelization I chose the number that gave me the fastest calculations. So I am looking for any possible way to make the algorithm faster, because to my eyes all the functions are already written as tightly as possible in terms of time complexity.
Here are inputs, in case someone wants to reproduce the example in R:
library(MASS)
library(extraDistr)
# compile the C++ code above first, e.g. with Rcpp::sourceCpp() on the file containing it

n_dim = 10
TT = 15
n_feat = 3
K = 3

m = matrix(NA, K, n_feat)
for(i in 1:K){
  m[i,] = rnorm(n_feat, 0, 1)
}

Sigma = array(NA, c(n_feat, n_feat, K))
for(i in 1:K){
  # rWishart() from base stats; the degrees of freedom must be >= n_feat
  Sigma[,,i] = rWishart(1, 3, diag(1, n_feat))[,,1]
}

data = list()
for(i in 1:TT){
  data[[i]] = mvrnorm(n_dim, m[1,], Sigma[,,1])
}

prob_pi = rdirichlet(1, rep(1, K))

prob_P = matrix(NA, K, K)
for(i in 1:K){
  prob_P[i,] = rdirichlet(1, rep(1, K))
}

Like_Fun_acoustic_FAST(n_dim, TT, K, data, m, Sigma,
                       prob_pi, prob_P, n_feat)
1 Answer
I tried to look for performance issues, but unfortunately I can't, because your code has a major problem:
Naming things
Your code is very hard to read because of the inconsistent way you are naming things. Also, while you are apparently not afraid to use long names for certain things, in most cases you are unnecessarily abbreviating things.
Let's begin with `Like_Fun_acoustic_FAST()`. I also like fun! But I wouldn't have known it was a "likelihood function" if it wasn't for your description in the question. It's not mentioned anywhere in the code. But "likelihood" is just a value that you want to calculate; there is no "likelihood function" in the mathematical problem. Sure, you wrote a C++ function to calculate it, but why put `Fun` in this function's name? You didn't do that for the other functions. Why are `Like` and `Fun` capitalized, but not `acoustic`? Why is `FAST` in all caps? That makes it either look like an acronym, or like you are shouting.
Why did you add `_FAST` to the function name? Sure you want this to go fast, but typically you want all your functions to run fast. It might make sense to add this to the function if there is also a `_SLOW`, or perhaps an `_ACCURATE` if you have a trade-off between speed and accuracy. Otherwise, just leave it off.
What does `acoustic` mean here? Is there something like an "acoustic likelihood"? That doesn't sound very likely to me. Maybe the problem involves sound waves, but what exactly is it you want to calculate the likelihood of? If you are trying to detect the presence of some feature in a sound signal, and want to know the likelihood of it actually being there, then I would name this function `calculate_likelihood_of_feature()`, where `feature` is replaced by the actual name of the feature.
Go over all the function names, make sure they use the same style of naming (I recommend using snake_case, as that is the most commonly used style for function and variable names in C++ code), and ensure each name clearly explains what the function does.
Do the same for variable names: make sure they are consistent, and convey clearly what information they hold.
Some abbreviations are fine, but only if they are widely used ones, and are unambiguous in the context where they are used. For example, `i`, `j` and `k` are often used as loop indices, so that might be fine. However, compare:
for(int j=0; j<n; ++j){
With:
for (int row = 0; row < num_rows; ++row) {
Sure, it's a bit more typing, but now I instantly know what is being looped over, whereas with `j` and `n` I have to look elsewhere to see what is actually stored in those variables.
Another thing you can do is imagine you are talking with someone about your code and have to pronounce your variable and function names. Will they understand you if you say "dmvnrm_arma_mc"? What about "pi" and "PI"? Won't they think you are talking about the constant \$\pi\$?
Performance
I don't even know what exactly this code is doing, so I can't tell whether your algorithm can be improved or not. There are a few things that I did notice that you might want to look at:
- Do you really need `double` precision? Often you get a speedup of more than 2 by using `float` instead.
- Avoid unnecessary allocations. `dmvnrm_arma_mc()` allocates a vector and a matrix each time it is called, and those will have exactly the same size on every call. It might be better to ensure these are allocated once and then reused; see the sketch after this list.
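To make the second point concrete: `Like_Fun_acoustic_FAST()` calls `dmvnrm_arma_mc()` with the same `Sigma.slice(j)` at every time point, so the Cholesky factorization, the inversion and the normalizing constant are redone `TT` times per state. Here is a minimal sketch of hoisting them out; the names `MvnormConstants`, `precompute_mvnorm` and `dmvnrm_precomputed` are mine, the OpenMP loop is omitted for brevity, and it reuses the question's `inplace_tri_mat_mult()`:

// Per-state constants of the multivariate normal log-density.
struct MvnormConstants {
  arma::mat rooti;     // inv(trimatu(chol(sigma)))
  double other_terms;  // log normalizing constant
};

MvnormConstants precompute_mvnorm(arma::mat const &sigma){
  MvnormConstants c;
  c.rooti = arma::inv(arma::trimatu(arma::chol(sigma)));
  c.other_terms = arma::sum(log(c.rooti.diag()))
                - 0.5 * (double)sigma.n_cols * log2pi;
  return c;
}

// Same result as dmvnrm_arma_mc(x, mean, sigma, true, ...), but without
// refactorizing sigma on every call.
arma::vec dmvnrm_precomputed(arma::mat const &x, arma::rowvec const &mean,
                             MvnormConstants const &c){
  arma::vec out(x.n_rows);
  for(arma::uword i = 0; i < x.n_rows; ++i){
    arma::rowvec z = x.row(i) - mean;
    inplace_tri_mat_mult(z, c.rooti);
    out(i) = c.other_terms - 0.5 * arma::dot(z, z);
  }
  return out;
}

In the likelihood you would then fill a `std::vector<MvnormConstants>` of length `k` once, before the loop over `t`, so each covariance matrix is factorized once instead of `TT` times.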
- Do you have any references for using `float` being more than twice as fast as using `double`? I've seen many instances where the opposite is true, and from what I can tell, it seems to be very hardware dependent. Here is the result of a lazy search: stackoverflow.com/a/4584707/4408538 – Joseph Wood, Mar 23, 2024 at 2:01
- There's throughput and latency. A single `float` operation might take just as long as a `double`, but since you can pack twice the amount of `float`s in an SSE register, usually you get double the throughput. There are also situations where you are more limited by memory bandwidth than by CPU power, in which case again `float`s could have twice the performance of a `double` by virtue of being half the size. Something that could slow `float`s down is if you have to convert between `float` and `double`, so make sure all your operations use and return the same type. – G. Sliepen, Mar 23, 2024 at 8:08
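To illustrate the single-precision suggestion concretely (a sketch of my own, not benchmarked on this workload): Armadillo's `float` types mirror the `double` ones, so the row-wise log-sum-exp could be written, for example, as

// arma::fmat / arma::fvec are the single-precision counterparts of
// arma::mat / arma::vec; the algorithm itself is unchanged.
arma::fvec log_sum_exp_rows_f(arma::fmat const &x){
  arma::fvec res(x.n_rows);
  for(arma::uword j = 0; j < x.n_rows; ++j){
    float const M = x.row(j).max();  // row maximum, stays in float
    res[j] = M + std::log(arma::accu(arma::exp(x.row(j) - M)));
  }
  return res;
}

As the comment above notes, the important part is that nothing in the pipeline silently promotes back to `double`; the data, `mu`, `Sigma` and the cube `q` would all need to switch to `fmat`/`fcube` as well.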
- "the time needed for the function `Like_Fun_acoustic_FAST` to run is still too much." How much is too much? How much should it be in your opinion? On what kind of device is your code running, and how often is it called with how much data each time? Did you enable all relevant compiler optimizations?
- … `K` matrices, where each matrix is `n` times `TT`, and those matrices are used 10 times on each iteration of my MCMC, i.e. they are used each time that `Like_Fun_acoustic_FAST` is used.
- … `-march` processor-specific optimizations would be relevant? For extra credit, mention a link to godbolt.org that shows how your source code is translated to assembly.