I'm looking for some fresh ideas on how to optimize a couple of functions in my C++ code. Running the code through a profiler shows that they take 20% and 40% of the time respectively. The functions are called many times, as they live inside a triple nested loop (which cannot be avoided).
blip3_inner(args...) (20% of the time) is as follows:
void PairBlip::blip3_inner(int &i, int &j, int &k, int &e, double g[][2][3],
                           double gp[][2][3], double &blip_ijk, double coef[2])
{
    double f[3][2][3];
    for (int d=0; d<3; ++d)
        for (int a=0; a<2; ++a)
            for (int b=0; b<3; ++b)
                f[d][a][b] = gp[d][a][b] * blip_ijk - g[d][a][b] * coef[a];
    blip3_dump(i,j,k,e,f);
}
and blip3_dump(args...) (40% of the time) is:
void PairBlip::blip3_dump(int &i, int &j, int &k, int &e, double f[][2][3])
{
    for (int d=0; d<3; ++d)
    {
        for (int b=0; b<3; ++b)
        {
            force_fp[i][d][e+size2b+b*size_grid_3b] += f[d][0][b] + f[d][1][b];
            force_fp[j][d][e+size2b+b*size_grid_3b] -= f[d][0][b];
            force_fp[k][d][e+size2b+b*size_grid_3b] -= f[d][1][b];
        }
    }
}
The code is compiled with the following flags:
-g -std=c++11 -O3 -ffast-math -march=native
The code will be used on a range of architectures, and the choice of compiler flags is ultimately down to the end user.
The entire code is parallelized with MPI, so the functions above are each executed on a single core.
For the second function, I tried replacing force_fp, which is allocated on the heap, with an intermediate variable declared on the stack. While this improves things when compiling with -O, any benefit disappears as soon as -O3 is used.
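A minimal sketch of the kind of stack-intermediate variant I mean (not the exact code I tried; the name blip3_dump_stacked and the layout of the temporaries are just illustrative):

void PairBlip::blip3_dump_stacked(int &i, int &j, int &k, int &e, double f[][2][3])
{
    // Accumulate into stack temporaries first ...
    double acc_i[3][3], acc_j[3][3], acc_k[3][3];
    for (int d=0; d<3; ++d)
        for (int b=0; b<3; ++b) {
            acc_i[d][b] = f[d][0][b] + f[d][1][b];
            acc_j[d][b] = f[d][0][b];
            acc_k[d][b] = f[d][1][b];
        }
    // ... then write back to the heap-allocated force_fp in one pass.
    for (int d=0; d<3; ++d)
        for (int b=0; b<3; ++b) {
            int offset = e + size2b + b*size_grid_3b;
            force_fp[i][d][offset] += acc_i[d][b];
            force_fp[j][d][offset] -= acc_j[d][b];
            force_fp[k][d][offset] -= acc_k[d][b];
        }
}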
I would appreciate any ideas.
Background
I'm developing a machine-learning interatomic pair potential to work with https://lammps.sandia.gov/. The structure of the code is very similar to classical three-body potentials such as Tersoff or Stillinger-Weber. The code above is called inside a nested i, j, k loop, where i goes over all atoms in a box and j and k go over all nearest neighbours of i. It is what it is and there is no room for improvement here.
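For orientation, a rough sketch of the surrounding loop nest; the neighbour-list names nlocal, numneigh and firstneigh are placeholders, and the computation of g, gp, blip_ijk and coef is elided:

for (int i = 0; i < nlocal; ++i) {                  // all atoms in the box
    for (int jj = 0; jj < numneigh[i]; ++jj) {      // neighbours of i
        int j = firstneigh[i][jj];
        for (int kk = 0; kk < numneigh[i]; ++kk) {  // neighbours of i
            int k = firstneigh[i][kk];
            // ... compute g, gp, blip_ijk and coef for this (i, j, k) triplet ...
            blip3_inner(i, j, k, e, g, gp, blip_ijk, coef);
        }
    }
}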
Update
I tried reducing the number of nested loops and improving caching in blip3_inner() by flattening the arrays to f[18], g[18], gp[18], but it does not improve speed.
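The flattening looked roughly like the following (the index convention idx = d*6 + a*3 + b is illustrative, not necessarily the exact layout I used):

void PairBlip::blip3_inner_flat(int &i, int &j, int &k, int &e,
                                double g[18], double gp[18],
                                double &blip_ijk, double coef[2])
{
    double f[18];
    for (int idx = 0; idx < 18; ++idx) {
        int a = (idx / 3) % 2;   // recover the 'a' index from the flat index
        f[idx] = gp[idx] * blip_ijk - g[idx] * coef[a];
    }
    // ... dump f into force_fp as before ...
}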
1 Answer
Introduce a variable for blip3_dump()'s offset expression:
for (int b=0, e_size2b = e + size2b; b<3; ++b) {
    int offset = e_size2b + b * size_grid_3b;
    for (int d=0; d<3; ++d) {
        force_fp[i][d][offset] += f[d][0][b] + f[d][1][b];
        force_fp[j][d][offset] -= f[d][0][b];
        force_fp[k][d][offset] -= f[d][1][b];
    }
}
Note how this already has "the a-loop" unrolled; it becomes more obvious if the force_fp[i] updates are written out separately, as in the sketch below.
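For illustration only, the same loop body with the a = 0 and a = 1 contributions to force_fp[i] split into separate updates (not part of the suggested change, just the unroll made explicit):

for (int b=0, e_size2b = e + size2b; b<3; ++b) {
    int offset = e_size2b + b * size_grid_3b;
    for (int d=0; d<3; ++d) {
        force_fp[i][d][offset] += f[d][0][b];   // a = 0 contribution
        force_fp[i][d][offset] += f[d][1][b];   // a = 1 contribution
        force_fp[j][d][offset] -= f[d][0][b];
        force_fp[k][d][offset] -= f[d][1][b];
    }
}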
In PairBlip::blip3_inner(), unroll the a-loop and swap the iteration order of the first and last indices:
for (int b=0; b<3; ++b)
    for (int d=0; d<3; ++d) {
        f[d][0][b] = gp[d][0][b] * blip_ijk - g[d][0][b] * coef[0];
        f[d][1][b] = gp[d][1][b] * blip_ijk - g[d][1][b] * coef[1];
    }
This suggests replacing f[][][] and the associated indexing with plain scalars f0 and f1:
void blip3_inner(int i, int j, int k, int e_size2b,
                 double g[][2][3], double gp[][2][3],
                 double &blip_ijk, double coef[2])
{
    for (int b=0; b<3; ++b) {
        int offset = e_size2b + b * size_grid_3b;
        for (int d=0; d<3; ++d) {
            double f0 = gp[d][0][b] * blip_ijk - g[d][0][b] * coef[0],
                   f1 = gp[d][1][b] * blip_ijk - g[d][1][b] * coef[1];
            force_fp[i][d][offset] += f0 + f1;
            force_fp[j][d][offset] -= f0;
            force_fp[k][d][offset] -= f1;
        }
    }
}
Comments
ad1v7 (Mar 19, 2020): @greybeard In fact couldn't wait till tomorrow. There is 6.5% gain in speed plus cleaner code. Thanks for that.
greybeard: compute g, gp, coeff (all depend on i, j, k) didn't cut it, anyway: blip_ijk and coef don't change in blip3_whatever() - why aren't they already incorporated into gp[][][] and g[][][]? I could take educated guesses seeing the code.
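To illustrate what that comment is asking: if blip_ijk and coef are fixed for a given (i, j, k), they could in principle be folded into the arrays at the point where g and gp are filled. That code is not shown, so the names gp_scaled and g_scaled below are hypothetical:

// Hypothetical pre-scaling where g and gp are built, so blip3_inner()
// would reduce to f[d][a][b] = gp_scaled[d][a][b] - g_scaled[d][a][b].
double gp_scaled[3][2][3], g_scaled[3][2][3];
for (int d=0; d<3; ++d)
    for (int a=0; a<2; ++a)
        for (int b=0; b<3; ++b) {
            gp_scaled[d][a][b] = gp[d][a][b] * blip_ijk;
            g_scaled[d][a][b]  = g[d][a][b] * coef[a];
        }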