I am solving the linear advection equation in order to learn parallel programming, and I am running this code on a Xeon Phi coprocessor. Below is a plot of how it scales with an increasing number of threads. I am still learning, this is my first "parallel" code, and I am not very familiar with architecture-specific optimization.
Comments on how to improve the performance of this code, and on what else I should look into, would be great. Links to write-ups or further tutorials to follow would also be really helpful.
The plot is for nx=100000 (scaling plot image not reproduced here).
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <omp.h>
#include <sys/time.h>
#include <unistd.h>

int64_t TimeInMicros() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec*1000000 + tv.tv_usec;
}
struct Grid{
    int nx;
    double *u, *f, *res;
    double a, cfl, dx, tf;
};

struct ThreadData{
    int tid, maxthreads;
    struct Grid *grid;
};
void *solver(void *args){
    struct ThreadData *td = (struct ThreadData *)args;
    int tid = td->tid;
    struct Grid *grid = td->grid;
    double *u = grid->u;
    double *f = grid->f;
    double *res = grid->res;
    int nx = grid->nx;
    double tf = grid->tf;
    double cfl = grid->cfl;
    double a = grid->a;
    double dx = grid->dx;
    double dt = cfl*dx/a;
    double t = 0.0;
    int chunk = nx/(td->maxthreads);
    int start = tid*chunk;
    while(t < tf){
        // wait until every thread has finished updating u from the previous step
        #pragma omp barrier
        // upwind flux; the sections below use Cilk Plus array notation:
        // f[start+1:chunk] is the chunk-element slice starting at f[start+1]
        if(start == 0){
            f[start+1:chunk] = a*u[start:chunk];
        }
        else{
            f[start:chunk+1] = a*u[start-1:chunk+1];
        }
        // wait until all fluxes have been computed
        #pragma omp barrier
        if(start == 0){
            // left boundary flux is copied from the right end of the domain
            f[start] = f[nx-1];
        }
        res[start:chunk] = -(f[start+1:chunk] - f[start:chunk])/dx;
        u[start:chunk] += res[start:chunk]*dt;
        t += dt;
    }
    return NULL;
}
int main(int argc, char *argv[]){
    int nx = atoi(argv[1]);
    int nthreads = atoi(argv[2]);
    if(nx%nthreads != 0){
        printf("ERROR: Number of cells should be integral multiple of number of threads \n");
        exit(1);
    }
    double *u = new double[nx]();
    double *res = new double[nx]();
    double *f = new double[nx+1]();
    double dx = 1.0/nx;
    double cfl = 0.9;
    double a = 1.0;
    double tf = 1.0;
    int i;
    // initialize
    u[0:nx] = 0.0;
    u[nx/4:nx/2] = 1.0;
    f[0:nx+1] = 0.0;
    res[0:nx] = 0.0;
    struct Grid grid;
    grid.nx = nx;
    grid.a = a;
    grid.cfl = cfl;
    grid.dx = dx;
    grid.u = u;
    grid.res = res;
    grid.f = f;
    grid.tf = tf;
    int64_t t1 = TimeInMicros();
    #pragma omp parallel num_threads(nthreads)
    {
        struct ThreadData td;
        td.tid = omp_get_thread_num();
        td.grid = &grid;
        td.maxthreads = omp_get_num_threads();
        if(nthreads != omp_get_num_threads()) {
            if(td.tid == 0) {
                printf("ERROR: Could not obtain the requested number of threads \n");
                exit(-1);
            }
        }
        else {
            solver(&td);
        }
    }
    int64_t t2 = TimeInMicros();
    printf("Execution time: %.10f\n", (t2-t1)*1e-6);
    FILE *outfile;
    outfile = fopen("results.txt", "w+");
    for(i = 0; i < nx; i++){
        fprintf(outfile, "%.10f %.10f\n", i*dx, grid.u[i]);
    }
    fclose(outfile);
    delete[] u;
    delete[] res;
    delete[] f;
}
2 Answers
You should be more explicit about affinity, and plot the performance by number of cores with separate lines for 1t/c, 2t/c, 3t/c and 4t/c (threads per core). You will get more performance from adding a whole core than from adding another HW thread on a core you're already using, so showing "2 threads" could have two different performance values (2c x 1t or 1c x 2t).

By using KMP_PLACE_THREADS, you can easily choose the appropriate configurations (and don't also set OMP_NUM_THREADS!).

Then also play with KMP_AFFINITY=compact and KMP_AFFINITY=scatter.
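Not part of the answer above, but a small self-contained sketch of how you could verify what a given KMP_AFFINITY or KMP_PLACE_THREADS setting actually does, assuming a Linux target (which the Phi is) so that sched_getcpu() is available:

    // Each OpenMP thread reports the logical CPU it is currently running on,
    // which makes the effect of the affinity environment variables visible.
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            printf("thread %d running on cpu %d\n",
                   omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }

Running this once with KMP_AFFINITY=compact and once with KMP_AFFINITY=scatter should show the threads packed onto few cores in the first case and spread across cores in the second.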
- Comment (syb0rg, Jun 19, 2014): You should cite a source or give a reason in your answer as to why you should not use OMP_NUM_THREADS. Do that and you will have my upvote.
- Comment (0b1100001, Jun 21, 2014): Hey Jim, I tried various affinities. compact gives the worst performance, and I think that is also the reason behind the poor performance when the number of threads exceeds the number of processors. Could you advise on how I should modify this code to make it more efficient for compactness?
You should first check for the number of command line arguments via argc. If there are too few, print an error message and then terminate the program.

    if (argc < 3) {
        // print an error message...
        return EXIT_FAILURE;
    }
In addition, your first existing check in main() should also return EXIT_FAILURE instead of calling exit(1). The latter is non-portable, unlike the former, and it's already safe to return from main() if the program cannot continue execution. More info about that here.

None of your error output should be printed with printf():

    printf("ERROR: Number of cells should be integral multiple of number of threads \n");

It should instead go to stderr via fprintf():

    fprintf(stderr, "Number of cells should be integral multiple of number of threads \n");
This comment is misleading:

    // initialize
    u[0:nx] = 0.0;
    u[nx/4:nx/2] = 1.0;
    f[0:nx+1] = 0.0;
    res[0:nx] = 0.0;

They're not actually being initialized; they're being assigned after declaration. Initialization involves giving a variable its type and its initial value at the point of declaration. You should do that here; it keeps variables as close in scope as possible, which is generally preferred for maintainability.

Either way, you don't really need that comment. It's already clear what's being done (even after making this correction), and comments shouldn't state the obvious.
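For illustration, the zero assignments are arguably redundant anyway: new double[nx]() already value-initializes every element to 0.0 in C++, so the only assignment that is actually needed is the step profile in u. A possible sketch:

    // new double[n]() value-initializes every element to 0.0,
    // so no separate zeroing pass is required.
    double *u   = new double[nx]();
    double *res = new double[nx]();
    double *f   = new double[nx + 1]();

    // Cilk Plus array notation: nx/2 elements starting at index nx/4
    u[nx/4:nx/2] = 1.0;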
Also regarding closest possible variable scope: you have i declared towards the top of main(), but you don't use it until the for loop towards the end. If you don't have C99 (and thus cannot initialize i within the loop statement instead), declare i right before the loop.

This line doesn't look right:

    td.maxthreads = omp_get_num_threads();

It appears that you're looking for omp_get_max_threads():

    td.maxthreads = omp_get_max_threads();

This will return the maximum number of available threads for a parallel environment. If this is not really your intent, then rename maxthreads to something more accurate.
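One way to apply that suggestion (just a sketch, and only one of several options) is to fix the thread count with omp_set_num_threads() before the region, so that omp_get_max_threads() and the actual team size agree:

    // Hypothetical restructuring of the parallel region in main():
    // the thread count is set once up front, so omp_get_max_threads()
    // returns the same value the team will actually use.
    omp_set_num_threads(nthreads);
    int maxthreads = omp_get_max_threads();

    #pragma omp parallel
    {
        struct ThreadData td;
        td.tid        = omp_get_thread_num();
        td.maxthreads = maxthreads;
        td.grid       = &grid;
        solver(&td);
    }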