I have this C++ code that implements rectangular numerical integration. It evaluates the \$K\$-dimensional integral
$$\int_{u_K = 0}^{\gamma}\int_{u_{K-1} = 0}^{\gamma-u_K}\cdots\int_{u_2 = 0}^{\gamma-u_K-\cdots-u_3}F_U\!\left(\gamma-\sum_{k=2}^K u_k\right)f_U(u_2)\cdots f_U(u_K)\,du_2\cdots du_K$$ where \$F_U\$ is the cdf and \$f_U\$ is the pdf of \$U\$.
#include <iostream>
#include <cmath>
using namespace std;

float pdf(float u){
    return (1/(pow(1+u, 2)));
}

float cdf(float u){
    return (1 - 1/(u+1));
}

// The main function that implements the numerical integration,
// and it is a recursive function
float integ(float h, int k, float du){
    float res = 0;
    if (k == 1){
        res = cdf(h);
    } else {
        float u = 0;
        while (u < h){
            res += integ(h - u, k - 1, du)*pdf(u)*du;
            u += du;
        }
    }
    return res;
}

int main(){
    float du = 0.0001;
    int K = 3;
    float gamma[4] = {0.31622777, 0.79432823,
                      1.99526231, 5.01187234};
    int G = 50;
    int Q = 2;
    for (int i = 0; i < 4; i++){
        if ((G-Q*(K-1)) > 0){
            float gammath = (gamma[i]/Q)*(G-Q*(K-1));
            cout << 1-integ(gammath, K, du) << endl;
        }
    }
    return 0;
}
I am facing a speed problem, even though I switched from Python and MATLAB to C++ because C++ is faster. The problem is that I need a small step size du to get an accurate evaluation of the integral.
Basically, I want to evaluate the integral at 4 different points defined by gammath, which is a function of other defined parameters.
Is there any way I can speed up this program? I already have a 25x+ speedup over the same code in Python, but the code still takes too long (I ran it all night, and it wasn't finished in the morning). And this is only for K=3 and G=50. In other cases I want to test K=10, and G=100 or 300.
3 Answers
I see a few things that may help you improve your program.
Avoid pow for float
The use of pow converts the argument to a double and returns a double. If you're casting the result to a float anyway, std::powf would be more appropriate. But see the next suggestion for an even better approach.
Prefer multiplication to pow with small, fixed integral powers
The content of the pdf function could be rewritten like this instead:
return 1.0f/((u+1)*(u+1));
Use a little more whitespace
With everything crowded together on the line, it's hard to read and understand some lines like this:
float gammath = (gamma[i]/Q)*(G-Q*(K-1));
I'd write it like this instead:
float gammath{(gamma[i] / Q) * (G - Q * (K - 1))};
This also uses the C++11 uniform initialization syntax, which I prefer.
Use constexpr where appropriate
Any time you can make a numerical constant or expression constexpr is one more opportunity for compiler optimization. In this case, virtually everything in main, including du, K, gamma, G, and Q, can be made constexpr simply by adding that keyword in front of the declaration. With that, the compiler is free to evaluate expressions like this:
G - Q * (K - 1)
once at compile-time rather than at runtime. It can save quite a bit of time if you use it.
Move invariants out of the loop
The value of the expression listed above does not change within the loop (or within the program at all, as presented). Smart compilers can usually optimize that by themselves, but doing it manually may give you insight. For example, we can make a constant of that expression outside the loop:
constexpr auto m{G - Q * (K - 1)};
Now the inner if boils down to if (m > 0), but since it's invariant within the loop, move it outside. Now we have the encapsulating if outside and the for loop inside, which makes a lot more sense.
Use C++11 threading features
The calculations for each value of gamma are completely independent. For that reason, they could easily be calculated in parallel. Here's a way to do that:
int main() {
    constexpr float du = 0.0001;
    constexpr int K = 3;
    constexpr float gamma[4] = {0.31622777, 0.79432823, 1.99526231, 5.01187234};
    constexpr int G = 50;
    constexpr int Q = 2;
    constexpr auto m{G - Q * (K - 1)};
    std::vector<std::future<float>> chunks;
    if (m > 0) {
        for (int i = 0; i < 4; i++) {
            float gammath{gamma[i] / Q * m};
            chunks.push_back(std::async(integ, gammath, K, du));
        }
    }
    std::mutex cout_mutex;
    for (int i = 0; i < 4; ++i) {
        std::lock_guard<std::mutex> lock(cout_mutex);
        std::cout << gamma[i] << '\t' << 1 - chunks[i].get() << '\n';
    }
}
This requires <future>, <mutex>, and <vector> as #includes.
Parallelize the recursive integrals
If there are 100,000 steps in the integral, why do that with a single core? If you have a multicore machine like I do, you could divide that work among multiple cores by using the same kind of std::async technique shown above. This would dramatically speed up the results (and the work can be divided among as many cores as you have). I did not implement that, but the code above should give a hint as to how to do such a thing generically.
Use static for functions which can have file scope
If the pdf, cdf and integ functions are never used outside the file in which they're defined, the compiler can get quite clever about inlining code and other optimizations. For that reason, I'd recommend making those functions static if at all possible.
Results
I didn't have the patience to run this at the du value in the program (0.0001), so I ran it at 50 times that, 0.005, to speed up testing. Obviously, that's an option if you can accept the lower resolution, but I just did it to speed up timing tests.
The original (with the modified value of du) took 1.06 seconds, while the modified version took 0.723 s. At full resolution (du = 0.0001) the results took 30 minutes and 3 seconds on the same 64-bit 8-core Linux machine.
- Why would you use the lock_guard? (JVApen, Jul 25, 2018)
- The lock_guard is used to prevent the results from being intermixed as they are printed. With two gamma values very numerically close, they could finish at nearly the same time, and without the lock_guard the output might be interleaved. (Edward, Jul 25, 2018)
- std::pow is overloaded for float, so it won't promote the argument or result to double. It's still a poor choice when the exponent is a small integer, of course. (Toby Speight, Jul 25, 2018)
- @BlackMath: you'll need to have compiler support for at least C++11 (that is, the 2011 version of the standard and libraries) and also to have a line that says #include <future>. On Linux (where I'm testing) one also needs to link with the pthread library. (Edward, Jul 25, 2018)
- @Edward I have a C++14 compiler, MinGW, I think. I included all you have mentioned: <future>, <mutex>, and <vector>. OK, now I saw you did the full resolution. 30 minutes is actually very good. My machine is an i5 with 4 cores (I suppose), so I think it would be slower. I will need to have the code working first to test it. (BlackMath, Jul 25, 2018)
Improve robustness and readability
- Don't use using namespace with namespaces not designed for it.
- Use const and constexpr to make your intent clear and prevent silly accidents.
- Make your internal functions static.
- Use more whitespace around operators.
Improve performance
I started with your code as a baseline, but increased du tenfold (to .001) to be able to test in a reasonable amount of time. I compiled with g++ -std=c++17 -O3 -march=native to allow the optimiser to use its full range of substitutions and the full capabilities of the machine.
std::pow() normally works by multiplying logarithms. For a power of 2, it's much simpler to multiply a number by itself. I got about a 10% speed improvement by re-writing pdf() thus:
static float pdf(float u)
{
return 1 / ((1+u)*(1+u));
}
Note that the seemingly equivalent return 1 / (1+u) / (1+u) provided exactly zero improvement, because the cost of division is so high. (Which obviously suggests std::pow(1+u, -2) as an alternative, but I recommend the version shown above, as it's more accurate.)
Are you sure you want to use float? If double is slower on your platform, it may still be a worthwhile trade-off if it allows you to use a larger value of du.
You can parallelize the loop in main(), like this (add -fopenmp to your g++ command):
float result[4] = {};
constexpr auto gqk = G - Q*(K-1);
#pragma omp parallel for
for (int i = 0; i < 4; ++i) {
    if (gqk > 0) {
        result[i] = 1 - integ(gamma[i]*gqk/Q, K, du);
    }
}
for (auto r: result) {
    std::cout << r << std::endl;
}
That saves me another 25% or so in wall-clock time (at the expense of using more CPU cores).
However, you'll get a much better improvement by parallelizing the loop in integ instead:
static float integ(float h, int k, float du)
{
    if (k == 1) {
        return cdf(h);
    }
    float res = 0;
    const int iterations = h / du;
    #pragma omp parallel for reduction(+:res)
    for (int i = 0; i < iterations; ++i) {
        const float u = i * du;
        res += integ(h - u, k - 1, du) * pdf(u) * du;
    }
    return res;
}
(We need the reduction clause so that all the threads can contribute to res without needing to synchronize for that step; OpenMP applies the reduction after the threads have each produced their own local sum.)
That gained about an 80% improvement over the original on this 8-core machine (9 seconds elapsed with du = .001, and 15m 6s with du = .0001). You're still going to have severe scaling problems when K goes to 10; you'll probably need to apply some mathematical insight to reduce the complexity at that point.
- Thanks for your answer. I am not a professional programmer, so thanks for clarifying how to make my code clearer. I am learning. 15m 6s for du=0.0001 is fantastic. What do I need to use OpenMP? I use the MinGW compiler, and it seems it doesn't support multi-threading because it is cross-platform or something. Also, I have to say I am using it on Windows, not Linux. (BlackMath, Jul 25, 2018)
- I'm using GCC on Linux, and for that I just needed to add -fopenmp to my command line, as shown. I don't know how MinGW differs, so I'm not qualified to help there, I'm afraid (other than to recommend that you check you have a recent version, but I expect you've already thought of that!). I don't know whether you could write an on-topic Stack Overflow question (with a minimal C++ & OpenMP program and a "How do I make MinGW compile this to parallel code?"). You'd have to be careful with the wording there. (Toby Speight, Jul 25, 2018)
- Would using res += (((k - 1) == 1) ? cdf(h) : integ(h - u, k - 1, du)) * pdf(u) * du; add anything? This saves the extra call to integ when it's just going to use cdf. If K is always guaranteed to be greater than 1, the if (k == 1) at the top could be eliminated as well. (Craig Estey, Jul 25, 2018)
- You're probably right @Craig. I have to leave now, but perhaps you could try your suggestion and post it as another answer if it turns out to be helpful? (Toby Speight, Jul 25, 2018)
- @Toby I actually have Ubuntu as a dual boot on my laptop. I think gcc is already installed and part of the Linux OS, right? @Craig, k is always greater than or equal to 1. (BlackMath, Jul 25, 2018)
I got some speedup by replacing pow(u,2) with u * u. Time for i == 0 went from 33 sec to 26 sec.
It can be noted that pdf(u) is h-invariant [unlike cdf(h - u)]. So, at each integral level, pdf(u) can be cached and only calculated once. This is akin to memoization in dynamic programming.
By doing this, I reduced the runtime to 12 sec. That is, it is 2.75x faster than the original.
Note that on the first call to pdf(u), u will be 0. It follows the sequence 0, du, 2*du, ..., iter*du, where iter is the iteration index of the loop. This is true regardless of what level of integral we're calculating. The pdf sequence will always be the same.
So, we can keep track of the largest iter index we've seen. If the current index is greater than the largest, we do pdf_cache[iter] = pdf(u). Otherwise, we use pdf_cache[iter].
See the code below for full details.
As a possible alternative, the pdf_cache array could be fully precalculated once before starting anything. Then, all usage of pdf(u) could be replaced with pdf_cache[i].
This could be done in main before its for loop, as pdf_cache will be the same no matter what main's values for i or gammath (aka h) are.
Further, this precalculation could be split up amongst multiple cores if needed.
Also, if openmp is used within integ [as others have suggested], the full precalculation would probably be necessary to prevent interthread write contention.
Edit: I added this version as an update at the bottom.
Here is the code. Please pardon the C-like stuff; this is also at prototype level.
#include <iostream>
#include <cmath>
#include <stdio.h>
#include <time.h>

// return the current time in fractional seconds
double
tvgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME, &ts);
    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

#define inline_always static inline __attribute__((__always_inline__))

// NOTE: the size of pdf_cache is _hardwired_. The max size _can_ be calculated
// and this can become a pointer to an allocated area
int pdf_max;
float pdf_cache[10000000];

// pdf, with the +1 shift folded into the caller's u
inline_always float
pdf(float u)
{
    return (1 / (u * u));
}

inline_always float
cdf(float u)
{
    u += 1;
    return (1 - 1 / (u));
}

// cached pdf: compute on first use, reuse thereafter
inline_always float
pdfx(float u, int i)
{
    if (i > pdf_max) {
        pdf_cache[i] = pdf(u);
        pdf_max = i;
    }
    return pdf_cache[i];
}
// The recursive function that implements the numerical integration
float
integ(float h, int k, float du)
{
    float res = 0;
    float u = 1;
    int i;
    int iter = h / du;

    k -= 1;
    h += 1;

    if (k == 1) {
        for (i = 0; i < iter; ++i) {
            res += cdf(h - u) * pdfx(u, i) * du;
            u += du;
        }
    }
    else {
        for (i = 0; i < iter; ++i) {
            res += integ(h - u, k, du) * pdfx(u, i) * du;
            u += du;
        }
    }

    return res;
}

// top-level wrapper: resets the cache high-water mark for each call
float
integ_top(float h, int k, float du)
{
    float res = 0;

    pdf_max = -1;

    if (k == 1)
        res = cdf(h);
    else
        res = integ(h, k, du);

    return res;
}

int
main()
{
    float du = 0.0001;
    int K = 3;
    float gamma[4] = { 0.31622777, 0.79432823,
        1.99526231, 5.01187234
    };
    int G = 50;
    int Q = 2;

    for (int i = 0; i < 4; i++) {
        if ((G - Q * (K - 1)) > 0) {
            float gammath = (gamma[i] / Q) * (G - Q * (K - 1));
            double tvbeg = tvgetf();
            double rtn = 1 - integ_top(gammath, K, du);
            //std::cout << 1 - integ(gammath, K, du) << endl;
            double tvdif = tvgetf() - tvbeg;
            printf("i=%d rtn=%f tvdif=%.9f\n", i, rtn, tvdif);
            fflush(stdout);
        }
    }

    return 0;
}
This technique may have some roundoff error as I ran the original and got:
i=0 rtn=0.418665 tvdif=27.172003746
i=1 rtn=0.183092 tvdif=168.498732328
The cached version output was:
i=0 rtn=0.418630 tvdif=13.691759109
i=1 rtn=0.183040 tvdif=85.953905582
i=2 rtn=0.070858 tvdif=526.217260361
This might be alleviated by using double for the pdf_cache array.
UPDATE:
Here is a version that precalculates the pdf_cache values (again, please pardon the C-like stuff):
#include <iostream>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// return the current time in fractional seconds
double
tvgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME, &ts);
    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

#define inline_always static inline __attribute__((__always_inline__))

// the cache is now a pointer to an allocated area, sized in main
int pdf_max;
float *pdf_cache;

// pdf, with the +1 shift folded into the caller's u
inline_always float
pdf(float u)
{
    return (1 / (u * u));
}

inline_always float
cdf(float u)
{
    u += 1;
    return (1 - 1 / (u));
}

// The recursive function that implements the numerical integration,
// using the precalculated pdf_cache
float
integ(float h, int k, float du)
{
    float res = 0;
    float u = 1;
    int i;
    int iter = h / du;

    k -= 1;
    h += 1;

    if (k == 1) {
        for (i = 0; i < iter; ++i) {
            res += cdf(h - u) * pdf_cache[i] * du;
            u += du;
        }
    }
    else {
        for (i = 0; i < iter; ++i) {
            res += integ(h - u, k, du) * pdf_cache[i] * du;
            u += du;
        }
    }

    return res;
}

float
integ_top(float h, int k, float du)
{
    float res = 0;

    if (k == 1)
        res = cdf(h);
    else
        res = integ(h, k, du);

    return res;
}

int
main()
{
    float du = 0.0001;
    int K = 3;
    float gamma[4] = { 0.31622777, 0.79432823,
        1.99526231, 5.01187234
    };
    int G = 50;
    int Q = 2;
    float maxgam = 0;
    int initflg = 1;

    // find the largest gammath to size the cache
    for (int i = 0; i < 4; i++) {
        if ((G - Q * (K - 1)) > 0) {
            float gammath = (gamma[i] / Q) * (G - Q * (K - 1));
            printf("i=%d gammath=%f\n", i, gammath);
            if (initflg) {
                maxgam = gammath;
                initflg = 0;
                continue;
            }
            if (gammath > maxgam)
                maxgam = gammath;
        }
    }

    // precalculate the full pdf_cache once, before any integration
    pdf_max = maxgam / du;
    printf("pdf_max=%d\n", pdf_max);
    pdf_cache = (float *) malloc(sizeof(*pdf_cache) * pdf_max);

    float u = 1;
    for (int i = 0; i < pdf_max; ++i) {
        pdf_cache[i] = pdf(u);
        u += du;
    }

    for (int i = 0; i < 4; i++) {
        if ((G - Q * (K - 1)) > 0) {
            float gammath = (gamma[i] / Q) * (G - Q * (K - 1));
            double tvbeg = tvgetf();
            double rtn = 1 - integ_top(gammath, K, du);
            double tvdif = tvgetf() - tvbeg;
            printf("i=%d rtn=%f tvdif=%.9f\n", i, rtn, tvdif);
            fflush(stdout);
        }
    }

    return 0;
}
- Interesting observations. I am not familiar with C, but do you have a rough total speedup factor compared to the original? du=0.01 should run fast, but will give some insights. (BlackMath, Jul 27, 2018)
- This is compiled as c++ [a superset of c], and at a lower level they overlap and are compatible [arguably]. I just added the malloc/printf [which works in c++ but is more idiomatic in c] (e.g. it's not the "pythonic" way but it works :-). Otherwise, all the other code here is interchangeable. And the bulk of your code, integ, cdf, pdf, is c/c++ agnostic. So, as above, the final benchmark is 2.75x faster than your original [where all I did was add benchmarking code] on a single core. My final version can be combined with the multicore stuff of others to get the best of both. (Craig Estey, Jul 27, 2018)
- For the largest gammath, the pdf_cache max size is about 1M elements. At 4 bytes per entry, this is about 4MB. This fits within the cache memory of most x86 CPUs, so most accesses to it will come from the faster cache memory rather than the slower DRAM. On each loop, the pdf_cache element fetch is faster than recalculating pdf (i.e. it eliminates a floating-point add, multiply, and divide on each iteration). (Craig Estey, Jul 27, 2018)