I have this C++ code that implements rectangular numerical integration. It evaluates the \$K\$-dimensional integral
$$\int_{u_K = 0}^{\gamma}\int_{u_{K-1} = 0}^{\gamma-u_K}\cdots\int_{u_2 = 0}^{\gamma-u_K-\cdots-u_3}F_U\!\left(\gamma-\sum_{k=2}^K u_k\right)f_U(u_2)\cdots f_U(u_K)\,du_2\cdots du_K$$ where \$F_U\$ is the cdf and \$f_U\$ is the pdf of \$U\$.
#include <iostream>
#include <cmath>
using namespace std;

float pdf(float u){
    return (1/(pow(1+u, 2)));
}

float cdf(float u){
    return (1 - 1/(u+1));
}

// The main function that implements the numerical integration,
// and it is a recursive function
float integ(float h, int k, float du){
    float res = 0;
    if (k == 1){
        res = cdf(h);
    } else {
        float u = 0;
        while (u < h){
            res += integ(h - u, k - 1, du)*pdf(u)*du;
            u += du;
        }
    }
    return res;
}

int main(){
    float du = 0.0001;
    int K = 3;
    float gamma[4] = {0.31622777, 0.79432823,
                      1.99526231, 5.01187234};
    int G = 50;
    int Q = 2;
    for (int i = 0; i < 4; i++){
        if ((G-Q*(K-1)) > 0){
            float gammath = (gamma[i]/Q)*(G-Q*(K-1));
            cout << 1-integ(gammath, K, du) << endl;
        }
    }
    return 0;
}
I am facing a speed problem, even though I switched from Python and MATLAB to C++ because C++ is faster. The problem is that I need a small step size du to get an accurate evaluation of the integral.
Basically, I want to evaluate the integral at 4 different points defined by gammath, which is a function of other defined parameters.
Is there any way I can speed up this program? I already have a 25x+ speedup over the same code in Python, but the code still takes too long (I ran it all night, and it wasn't finished in the morning). And this is only for K=3 and G=50. In other cases I want to test K=10, and G=100 or 300.
3 Answers
I see a few things that may help you improve your program.
Avoid pow for float
The use of pow converts the argument to a double and returns a double. If you're casting the result to a float anyway, std::powf would be more appropriate. But see the next suggestion for an even better approach.
Prefer multiplication to pow with small, fixed integral powers
The content of the pdf function could be rewritten like this instead:
return 1.0f/((u+1)*(u+1));
Use a little more whitespace
With everything crowded together on the line, it's hard to read and understand some lines like this:
float gammath = (gamma[i]/Q)*(G-Q*(K-1));
I'd write it like this instead:
float gammath{(gamma[i] / Q) * (G - Q * (K - 1))};
This also uses the C++11 uniform initialization syntax, which I prefer.
Use constexpr where appropriate
Any time you can make a numerical constant or expression constexpr is one more opportunity for compiler optimization. In this case, virtually everything in main, including du, K, gamma, G, and Q, can be made constexpr simply by adding that keyword in front of the declaration. With that, the compiler is free to evaluate expressions like this:
G - Q * (K - 1)
once at compile-time rather than at runtime. It can save quite a bit of time if you use it.
Move invariants out of the loop
The value of the expression listed above does not change within the loop (or within the program at all, as presented). Smart compilers can usually optimize that by themselves, but doing it manually may give you insight. For example, we can make a constant of that expression outside the loop:
constexpr auto m{G - Q * (K - 1)};
Now the inner if boils down to if (m > 0), but since it's invariant within the loop, move it outside. Now we have the encapsulating if outside and the for loop inside, which makes a lot more sense.
Use C++11 threading features
The calculations for each value of gamma are completely independent. For that reason, they could easily be calculated in parallel. Here's a way to do that:
int main() {
    constexpr float du = 0.0001;
    constexpr int K = 3;
    constexpr float gamma[4] = {0.31622777, 0.79432823, 1.99526231, 5.01187234};
    constexpr int G = 50;
    constexpr int Q = 2;
    constexpr auto m{G - Q * (K - 1)};
    std::vector<std::future<float>> chunks;
    if (m > 0) {
        for (int i = 0; i < 4; i++) {
            float gammath{gamma[i] / Q * m};
            chunks.push_back(std::async(integ, gammath, K, du));
        }
    }
    std::mutex cout_mutex;
    for (int i = 0; i < 4; ++i) {
        std::lock_guard<std::mutex> lock(cout_mutex);
        std::cout << gamma[i] << '\t' << 1 - chunks[i].get() << '\n';
    }
}
This requires <future>, <mutex>, and <vector> as #includes.
Parallelize the recursive integrals
If there are 100,000 steps in the integral, why do that with a single core? If you have a multicore machine like I do, you could divide that work among multiple cores by using the same kind of std::async technique shown above. This would dramatically speed up the results (and the work can be divided among as many cores as you have). I did not implement that, but the code above should give a hint as to how to do such a thing generically.
Use static for functions which can have file scope
If the pdf, cdf and integ functions are never used outside the file in which they're defined, the compiler can get quite clever about inlining code and other optimizations. For that reason, I'd recommend making those functions static if at all possible.
Results
I didn't have the patience to run this at the du value in the program (0.0001), so I ran it at 50 times that, 0.005, to speed up testing. Obviously, that's an option if you can accept the lower resolution, but I just did it to speed up timing tests.
The original (with the modified value of du) took 1.06 seconds, while the modified version took 0.723 s. At full resolution (du = 0.0001) the results took 30 minutes and 3 seconds on the same 64-bit 8-core Linux machine.
- Why would you use the lock_guard? (JVApen, Jul 25, 2018)
- The lock_guard is used to prevent the results from being intermixed as they are printed. With two gamma values very numerically close, they could finish at nearly the same time, and without the lock_guard the output might be interleaved. (Edward, Jul 25, 2018)
- std::pow is overloaded for float, so it won't promote the argument or result to double. It's still a poor choice when the exponent is a small integer, of course. (Toby Speight, Jul 25, 2018)
- @BlackMath: you'll need to have compiler support for at least C++11 (that is, the 2011 version of the standard and libraries) and also to have a line that says #include <future>. On Linux (where I'm testing) one also needs to link with the pthread library. (Edward, Jul 25, 2018)
- @Edward I have a C++14 compiler, MinGW, I think. I included all you have mentioned: <future>, <mutex>, and <vector>. OK, now I saw you did the full resolution. 30 minutes is actually very good. My machine is an i5 with 4 cores (I suppose), so I think it would be slower. I will need to have the code working first to test it. (BlackMath, Jul 25, 2018)
Improve robustness and readability
- Don't use using namespace with namespaces not designed for it.
- Use const and constexpr to make your intent clear and prevent silly accidents.
- Make your internal functions static.
- Use more whitespace around operators.
Improve performance
I started with your code as a baseline, but increased du tenfold (to .001) to be able to test in a reasonable amount of time. I compiled with g++ -std=c++17 -O3 -march=native to allow the optimiser to use its full range of substitutions and the full capabilities of the machine.
std::pow() normally works by multiplying logarithms. For a power of 2, it's much simpler to multiply a number by itself. I got about a 10% speed improvement by re-writing pdf() thus:
static float pdf(float u)
{
return 1 / ((1+u)*(1+u));
}
Note that the seemingly equivalent return 1 / (1+u) / (1+u) provided exactly zero improvement, because the cost of division is so high. (Which obviously suggests std::pow(1+u, -2) as an alternative, but I recommend the version shown above, as it's more accurate.)
Are you sure you want to use float? If double is slower on your platform, it may still be a worthwhile trade-off if it allows you to use a larger value of du.
You can parallelize the loop in main(), like this (add -fopenmp to your g++ command):
float result[4] = {};
constexpr auto gqk = G - Q*(K-1);
#pragma omp parallel for
for (int i = 0; i < 4; ++i) {
    if (gqk > 0) {
        result[i] = 1 - integ(gamma[i]*gqk/Q, K, du);
    }
}
for (auto r: result) {
    std::cout << r << std::endl;
}
That saves me another 25% or so in wall-clock time (at the expense of using more CPU cores).
However, you'll get a much better improvement by parallelizing the loop in integ instead:
static float integ(float h, int k, float du)
{
    if (k == 1) {
        return cdf(h);
    }
    float res = 0;
    const int iterations = h / du;
    #pragma omp parallel for reduction(+:res)
    for (int i = 0; i < iterations; ++i) {
        const float u = i * du;
        res += integ(h - u, k - 1, du) * pdf(u) * du;
    }
    return res;
}
(We need the reduction clause so that all the threads can contribute to res without needing to synchronize for that step; OpenMP applies the reduction after the threads have each produced their own local sum.)
That gained about an 80% improvement over the original on this 8-core machine (9 seconds elapsed with du = .001, and 15m 6s with du = .0001). You're still going to have severe scaling problems when K goes to 10; you'll probably need to apply some mathematical insight to reduce the complexity at that point.
- Thanks for your answer. I am not a professional programmer, so thanks for clarifying how to make my code clearer. I am learning. 15m 6s for du=0.0001 is fantastic. What do I need to use OpenMP? I use the MinGW compiler, and it seems it doesn't support multi-threading because it is cross-platform or something. Also, I have to say I am using it on Windows, not Linux. (BlackMath, Jul 25, 2018)
- I'm using GCC on Linux, and for that I just needed to add -fopenmp to my command line, as shown. I don't know how MinGW differs, so I'm not qualified to help there, I'm afraid (other than to recommend that you check you have a recent version, but I expect you've already thought of that!). I don't know whether you could write an on-topic Stack Overflow question (with a minimal C++ & OpenMP program and a "How do I make MinGW compile this to parallel code?"). You'd have to be careful with the wording there. (Toby Speight, Jul 25, 2018)
- Would using res += (((k - 1) == 1) ? cdf(h) : integ(h - u, k - 1, du)) * pdf(u) * du; add anything? This saves the extra call to integ when it's just going to use cdf. If K is always guaranteed to be greater than 1, the if (k == 1) at the top could be eliminated as well. (Craig Estey, Jul 25, 2018)
- You're probably right @Craig. I have to leave now, but perhaps you could try your suggestion and post it as another answer if it turns out to be helpful? (Toby Speight, Jul 25, 2018)
- @Toby I actually have Ubuntu as a dual boot on my laptop. I think gcc is already installed and part of the Linux OS, right? @Craig, k is always greater than or equal to 1. (BlackMath, Jul 25, 2018)
I got some speedup by replacing pow(u,2) with u * u. Time for i == 0 went from 33 sec to 26 sec.
It can be noted that pdf(u) is h-invariant [unlike cdf(h - u)]. So, at each integral level, pdf(u) can be cached and only calculated once. This is akin to memoization in dynamic programming.
By doing this, I reduced the runtime to 12 sec. That is, it is 2.75x faster than the original.
Note that on the first call to pdf(u), u will be 0. It follows the sequence 0, du, 2*du, ..., iter*du, where iter is the iteration index of the loop. This is true regardless of what level of integral we're calculating. The pdf sequence will always be the same.
So, we can keep track of the largest iter index we've seen. If the current index is greater than the largest, we do pdf_cache[iter] = pdf(u). Otherwise, we use pdf_cache[iter].
See the code below for full details.
As a possible alternative, the pdf_cache array could be fully precalculated once before starting anything. Then, all usage of pdf(u) could be replaced with pdf_cache[i].
This could be done in main before its for loop, as pdf_cache will be the same no matter what main's values for i or gammath (aka h) are.
Further, this precalculation could be split up amongst multiple cores if needed.
Also, if openmp is used within integ [as others have suggested], the full precalculation would probably be necessary to prevent interthread write contention.
Edit: I added this version as an update at the bottom.
Here is the code. Please pardon the C-like stuff; this is also at prototype level.
#include <iostream>
#include <cmath>
#include <stdio.h>
#include <time.h>

// return the current time in fractional seconds
double
tvgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME, &ts);
    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

#define inline_always static inline __attribute__((__always_inline__))

// NOTE: the size of pdf_cache is _hardwired_. The max size _can_ be calculated
// and this can become a pointer to an allocated area
int pdf_max;
float pdf_cache[10000000];

// pdf, with the +1 shift folded into the caller's u
inline_always float
pdf(float u)
{
    return (1 / (u * u));
}

inline_always float
cdf(float u)
{
    u += 1;
    return (1 - 1 / (u));
}

// cached pdf: compute on first use, reuse thereafter
inline_always float
pdfx(float u, int i)
{
    if (i > pdf_max) {
        pdf_cache[i] = pdf(u);
        pdf_max = i;
    }
    return pdf_cache[i];
}
// The recursive function that implements the numerical integration
float
integ(float h, int k, float du)
{
    float res = 0;
    float u = 1;
    int i;
    int iter = h / du;

    k -= 1;
    h += 1;

    if (k == 1) {
        for (i = 0; i < iter; ++i) {
            res += cdf(h - u) * pdfx(u, i) * du;
            u += du;
        }
    }
    else {
        for (i = 0; i < iter; ++i) {
            res += integ(h - u, k, du) * pdfx(u, i) * du;
            u += du;
        }
    }

    return res;
}

// top-level wrapper: resets the cache high-water mark for each call
float
integ_top(float h, int k, float du)
{
    float res = 0;

    pdf_max = -1;

    if (k == 1)
        res = cdf(h);
    else
        res = integ(h, k, du);

    return res;
}

int
main()
{
    float du = 0.0001;
    int K = 3;
    float gamma[4] = { 0.31622777, 0.79432823,
        1.99526231, 5.01187234
    };
    int G = 50;
    int Q = 2;

    for (int i = 0; i < 4; i++) {
        if ((G - Q * (K - 1)) > 0) {
            float gammath = (gamma[i] / Q) * (G - Q * (K - 1));
            double tvbeg = tvgetf();
            double rtn = 1 - integ_top(gammath, K, du);
            //std::cout << 1 - integ(gammath, K, du) << endl;
            double tvdif = tvgetf() - tvbeg;
            printf("i=%d rtn=%f tvdif=%.9f\n", i, rtn, tvdif);
            fflush(stdout);
        }
    }

    return 0;
}
This technique may have some roundoff error as I ran the original and got:
i=0 rtn=0.418665 tvdif=27.172003746
i=1 rtn=0.183092 tvdif=168.498732328
The cached version output was:
i=0 rtn=0.418630 tvdif=13.691759109
i=1 rtn=0.183040 tvdif=85.953905582
i=2 rtn=0.070858 tvdif=526.217260361
This might be alleviated by using double for the pdf_cache array.
UPDATE:
Here is a version that precalculates the pdf_cache values (again, please pardon the C-like stuff):
#include <iostream>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// return the current time in fractional seconds
double
tvgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME, &ts);
    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

#define inline_always static inline __attribute__((__always_inline__))

// the cache is now a pointer to an allocated area, sized in main
int pdf_max;
float *pdf_cache;

// pdf, with the +1 shift folded into the caller's u
inline_always float
pdf(float u)
{
    return (1 / (u * u));
}

inline_always float
cdf(float u)
{
    u += 1;
    return (1 - 1 / (u));
}

// The recursive function that implements the numerical integration,
// using the precalculated pdf_cache
float
integ(float h, int k, float du)
{
    float res = 0;
    float u = 1;
    int i;
    int iter = h / du;

    k -= 1;
    h += 1;

    if (k == 1) {
        for (i = 0; i < iter; ++i) {
            res += cdf(h - u) * pdf_cache[i] * du;
            u += du;
        }
    }
    else {
        for (i = 0; i < iter; ++i) {
            res += integ(h - u, k, du) * pdf_cache[i] * du;
            u += du;
        }
    }

    return res;
}

float
integ_top(float h, int k, float du)
{
    float res = 0;

    if (k == 1)
        res = cdf(h);
    else
        res = integ(h, k, du);

    return res;
}

int
main()
{
    float du = 0.0001;
    int K = 3;
    float gamma[4] = { 0.31622777, 0.79432823,
        1.99526231, 5.01187234
    };
    int G = 50;
    int Q = 2;
    float maxgam = 0;
    int initflg = 1;

    // find the largest gammath to size the cache
    for (int i = 0; i < 4; i++) {
        if ((G - Q * (K - 1)) > 0) {
            float gammath = (gamma[i] / Q) * (G - Q * (K - 1));
            printf("i=%d gammath=%f\n", i, gammath);
            if (initflg) {
                maxgam = gammath;
                initflg = 0;
                continue;
            }
            if (gammath > maxgam)
                maxgam = gammath;
        }
    }

    // precalculate the full pdf_cache once, before any integration
    pdf_max = maxgam / du;
    printf("pdf_max=%d\n", pdf_max);
    pdf_cache = (float *) malloc(sizeof(*pdf_cache) * pdf_max);

    float u = 1;
    for (int i = 0; i < pdf_max; ++i) {
        pdf_cache[i] = pdf(u);
        u += du;
    }

    for (int i = 0; i < 4; i++) {
        if ((G - Q * (K - 1)) > 0) {
            float gammath = (gamma[i] / Q) * (G - Q * (K - 1));
            double tvbeg = tvgetf();
            double rtn = 1 - integ_top(gammath, K, du);
            double tvdif = tvgetf() - tvbeg;
            printf("i=%d rtn=%f tvdif=%.9f\n", i, rtn, tvdif);
            fflush(stdout);
        }
    }

    return 0;
}
- Interesting observations. I am not familiar with C, but do you have a rough total speedup factor compared to the original? du=0.01 should run fast, but will give some insights. (BlackMath, Jul 27, 2018)
- This is compiled as c++ [a superset of c], and at a lower level they overlap and are compatible [arguably]. I just added the malloc/printf [which works in c++ but is more idiomatic in c] (e.g. it's not the "pythonic" way but it works :-). Otherwise, all the other code here is interchangeable. And the bulk of your code, integ, cdf, pdf, is c/c++ agnostic. So, as above, the final benchmark is 2.75x faster than your original [where all I did was add benchmarking code] on a single core. My final version can be combined with the multicore stuff of others to get the best of both. (Craig Estey, Jul 27, 2018)
- For the largest gammath, the pdf_cache max size is about 1M elements. At 4 bytes per entry, this is about 4MB. This fits within the cache memory of most x86 CPUs, so most accesses to it will come from the faster cache memory rather than the slower DRAM. On each loop, the pdf_cache element fetch is faster than recalculating pdf (i.e. it eliminates a floating-point add, multiply, and divide on each iteration). (Craig Estey, Jul 27, 2018)