I'm looking for some fresh ideas on how to optimize a couple of functions in my C++ code. Running the code through a profiler shows that they take 20% and 40% of the time respectively. The functions are called many times, as they live inside a triple nested loop (which cannot be avoided).
blip3_inner(args...) (20% of the time) is as follows:
void PairBlip::blip3_inner(int &i, int &j, int &k, int &e, double g[][2][3],
                           double gp[][2][3], double &blip_ijk, double coef[2])
{
    double f[3][2][3];
    for (int d=0; d<3; ++d)
        for (int a=0; a<2; ++a)
            for (int b=0; b<3; ++b)
                f[d][a][b] = gp[d][a][b] * blip_ijk - g[d][a][b] * coef[a];
    blip3_dump(i,j,k,e,f);
}
and blip3_dump(args...) (40% of the time) is:
void PairBlip::blip3_dump(int &i, int &j, int &k, int &e, double f[][2][3])
{
    for (int d=0; d<3; ++d)
    {
        for (int b=0; b<3; ++b)
        {
            force_fp[i][d][e+size2b+b*size_grid_3b] += f[d][0][b] + f[d][1][b];
            force_fp[j][d][e+size2b+b*size_grid_3b] -= f[d][0][b];
            force_fp[k][d][e+size2b+b*size_grid_3b] -= f[d][1][b];
        }
    }
}
The code is compiled with the following flags:
-g -std=c++11 -O3 -ffast-math -march=native
The code will be used on a range of architectures, and the choice of compiler flags is ultimately down to the end user.
The entire code is parallelized with MPI, so the functions above are each executed on a single core.
For the second function, I tried replacing force_fp, which is allocated on the heap, with an intermediate variable declared on the stack. While this improves things when compiling with -O, any benefit disappears as soon as -O3 is used.
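A minimal sketch of the kind of stack-intermediate variant I mean (not the exact code I tried; the name blip3_dump_stacked and the layout of the temporaries are just illustrative):

void PairBlip::blip3_dump_stacked(int &i, int &j, int &k, int &e, double f[][2][3])
{
    // Accumulate into stack temporaries first ...
    double acc_i[3][3], acc_j[3][3], acc_k[3][3];
    for (int d=0; d<3; ++d)
        for (int b=0; b<3; ++b) {
            acc_i[d][b] = f[d][0][b] + f[d][1][b];
            acc_j[d][b] = f[d][0][b];
            acc_k[d][b] = f[d][1][b];
        }
    // ... then write back to the heap-allocated force_fp in one pass.
    for (int d=0; d<3; ++d)
        for (int b=0; b<3; ++b) {
            int offset = e + size2b + b*size_grid_3b;
            force_fp[i][d][offset] += acc_i[d][b];
            force_fp[j][d][offset] -= acc_j[d][b];
            force_fp[k][d][offset] -= acc_k[d][b];
        }
}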
I would appreciate any ideas.
Background
I'm developing a machine-learning interatomic pair potential to work with https://lammps.sandia.gov/. The structure of the code is very similar to classical three-body potentials such as Tersoff or Stillinger-Weber. The code above is called inside a nested i, j, k loop, where i goes over all atoms in a box and j and k go over all nearest neighbours of i. It is what it is and there is no room for improvement here.
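For orientation, a rough sketch of the surrounding loop nest; the neighbour-list names nlocal, numneigh and firstneigh are placeholders, and the computation of g, gp, blip_ijk and coef is elided:

for (int i = 0; i < nlocal; ++i) {                  // all atoms in the box
    for (int jj = 0; jj < numneigh[i]; ++jj) {      // neighbours of i
        int j = firstneigh[i][jj];
        for (int kk = 0; kk < numneigh[i]; ++kk) {  // neighbours of i
            int k = firstneigh[i][kk];
            // ... compute g, gp, blip_ijk and coef for this (i, j, k) triplet ...
            blip3_inner(i, j, k, e, g, gp, blip_ijk, coef);
        }
    }
}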
Update
I tried reducing the number of nested loops and improving caching in blip3_inner() by flattening the arrays to f[18], g[18], gp[18], but it does not improve speed.
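The flattening looked roughly like the following (the index convention idx = d*6 + a*3 + b is illustrative, not necessarily the exact layout I used):

void PairBlip::blip3_inner_flat(int &i, int &j, int &k, int &e,
                                double g[18], double gp[18],
                                double &blip_ijk, double coef[2])
{
    double f[18];
    for (int idx = 0; idx < 18; ++idx) {
        int a = (idx / 3) % 2;   // recover the 'a' index from the flat index
        f[idx] = gp[idx] * blip_ijk - g[idx] * coef[a];
    }
    // ... dump f into force_fp as before ...
}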
1 Answer
Introduce a variable for blip3_dump()'s offset expression:
for (int b=0, e_size2b = e + size2b; b<3; ++b) {
    int offset = e_size2b + b * size_grid_3b;
    for (int d=0; d<3; ++d) {
        force_fp[i][d][offset] += f[d][0][b] + f[d][1][b];
        force_fp[j][d][offset] -= f[d][0][b];
        force_fp[k][d][offset] -= f[d][1][b];
    }
}
Note how this already has "the a-loop" unrolled; it becomes more obvious if the force_fp[i] updates are written out separately, as in the sketch below.
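For illustration only, the same loop body with the a = 0 and a = 1 contributions to force_fp[i] split into separate updates (not part of the suggested change, just the unroll made explicit):

for (int b=0, e_size2b = e + size2b; b<3; ++b) {
    int offset = e_size2b + b * size_grid_3b;
    for (int d=0; d<3; ++d) {
        force_fp[i][d][offset] += f[d][0][b];   // a = 0 contribution
        force_fp[i][d][offset] += f[d][1][b];   // a = 1 contribution
        force_fp[j][d][offset] -= f[d][0][b];
        force_fp[k][d][offset] -= f[d][1][b];
    }
}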
In PairBlip::blip3_inner(), unroll the a-loop and swap the iteration order of the first and last indices:
for (int b=0; b<3; ++b)
    for (int d=0; d<3; ++d) {
        f[d][0][b] = gp[d][0][b] * blip_ijk - g[d][0][b] * coef[0];
        f[d][1][b] = gp[d][1][b] * blip_ijk - g[d][1][b] * coef[1];
    }
This suggests replacing f[][][] and the associated indexing with plain scalars f0 and f1:
void blip3_inner(int i, int j, int k, int e_size2b,
                 double g[][2][3], double gp[][2][3],
                 double &blip_ijk, double coef[2])
{
    for (int b=0; b<3; ++b) {
        int offset = e_size2b + b * size_grid_3b;
        for (int d=0; d<3; ++d) {
            double f0 = gp[d][0][b] * blip_ijk - g[d][0][b] * coef[0],
                   f1 = gp[d][1][b] * blip_ijk - g[d][1][b] * coef[1];
            force_fp[i][d][offset] += f0 + f1;
            force_fp[j][d][offset] -= f0;
            force_fp[k][d][offset] -= f1;
        }
    }
}
Comments
ad1v7 (Mar 19, 2020): @greybeard In fact couldn't wait till tomorrow. There is 6.5% gain in speed plus cleaner code. Thanks for that.
greybeard: compute g, gp, coeff (all depend on i, j, k) didn't cut it, anyway: blip_ijk and coef don't change in blip3_whatever() - why aren't they already incorporated into gp[][][] and g[][][]? I could take educated guesses seeing the code.
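To illustrate what that comment is asking: if blip_ijk and coef are fixed for a given (i, j, k), they could in principle be folded into the arrays at the point where g and gp are filled. That code is not shown, so the names gp_scaled and g_scaled below are hypothetical:

// Hypothetical pre-scaling where g and gp are built, so blip3_inner()
// would reduce to f[d][a][b] = gp_scaled[d][a][b] - g_scaled[d][a][b].
double gp_scaled[3][2][3], g_scaled[3][2][3];
for (int d=0; d<3; ++d)
    for (int a=0; a<2; ++a)
        for (int b=0; b<3; ++b) {
            gp_scaled[d][a][b] = gp[d][a][b] * blip_ijk;
            g_scaled[d][a][b]  = g[d][a][b] * coef[a];
        }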