I am solving the linear advection equation in order to learn parallel programming, and I am running this code on a Xeon Phi coprocessor. Below is a plot of how it scales with an increasing number of threads. I am still learning, this is my first "parallel" code, and I am not very familiar with architecture-specific optimization.
Comments on how to improve the performance of this code, and on what else I should look into, would be great. Links to write-ups or further tutorials to follow would also be really helpful.
The plot is for nx=100000 (scaling plot image not reproduced here).
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <omp.h>
#include <sys/time.h>
#include <unistd.h>

int64_t TimeInMicros() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec*1000000 + tv.tv_usec;
}
struct Grid{
    int nx;
    double *u, *f, *res;
    double a, cfl, dx, tf;
};

struct ThreadData{
    int tid, maxthreads;
    struct Grid *grid;
};
void *solver(void *args){
    struct ThreadData *td = (struct ThreadData *)args;
    int tid = td->tid;
    struct Grid *grid = td->grid;
    double *u = grid->u;
    double *f = grid->f;
    double *res = grid->res;
    int nx = grid->nx;
    double tf = grid->tf;
    double cfl = grid->cfl;
    double a = grid->a;
    double dx = grid->dx;
    double dt = cfl*dx/a;
    double t = 0.0;
    int chunk = nx/(td->maxthreads);
    int start = tid*chunk;
    while(t < tf){
        // wait until every thread has finished updating u from the previous step
        #pragma omp barrier
        // upwind flux; the sections below use Cilk Plus array notation:
        // f[start+1:chunk] is the chunk-element slice starting at f[start+1]
        if(start == 0){
            f[start+1:chunk] = a*u[start:chunk];
        }
        else{
            f[start:chunk+1] = a*u[start-1:chunk+1];
        }
        // wait until all fluxes have been computed
        #pragma omp barrier
        if(start == 0){
            // left boundary flux is copied from the right end of the domain
            f[start] = f[nx-1];
        }
        res[start:chunk] = -(f[start+1:chunk] - f[start:chunk])/dx;
        u[start:chunk] += res[start:chunk]*dt;
        t += dt;
    }
    return NULL;
}
int main(int argc, char *argv[]){
    int nx = atoi(argv[1]);
    int nthreads = atoi(argv[2]);
    if(nx%nthreads != 0){
        printf("ERROR: Number of cells should be integral multiple of number of threads \n");
        exit(1);
    }
    double *u = new double[nx]();
    double *res = new double[nx]();
    double *f = new double[nx+1]();
    double dx = 1.0/nx;
    double cfl = 0.9;
    double a = 1.0;
    double tf = 1.0;
    int i;
    // initialize
    u[0:nx] = 0.0;
    u[nx/4:nx/2] = 1.0;
    f[0:nx+1] = 0.0;
    res[0:nx] = 0.0;
    struct Grid grid;
    grid.nx = nx;
    grid.a = a;
    grid.cfl = cfl;
    grid.dx = dx;
    grid.u = u;
    grid.res = res;
    grid.f = f;
    grid.tf = tf;
    int64_t t1 = TimeInMicros();
    #pragma omp parallel num_threads(nthreads)
    {
        struct ThreadData td;
        td.tid = omp_get_thread_num();
        td.grid = &grid;
        td.maxthreads = omp_get_num_threads();
        if(nthreads != omp_get_num_threads()) {
            if(td.tid == 0) {
                printf("ERROR: Could not obtain the requested number of threads \n");
                exit(-1);
            }
        }
        else {
            solver(&td);
        }
    }
    int64_t t2 = TimeInMicros();
    printf("Execution time: %.10f\n", (t2-t1)*1e-6);
    FILE *outfile;
    outfile = fopen("results.txt", "w+");
    for(i = 0; i < nx; i++){
        fprintf(outfile, "%.10f %.10f\n", i*dx, grid.u[i]);
    }
    fclose(outfile);
    delete[] u;
    delete[] res;
    delete[] f;
}
2 Answers
You should be more explicit about affinity, and plot the performance by number of cores with separate lines for 1t/c, 2t/c, 3t/c and 4t/c (threads per core). You will get more performance from adding a whole core than from adding another HW thread on a core you're already using, so showing "2 threads" could have two different performance values (2c x 1t or 1c x 2t).

By using KMP_PLACE_THREADS, you can easily choose the appropriate configurations (and don't also set OMP_NUM_THREADS!).

Then also play with KMP_AFFINITY=compact and KMP_AFFINITY=scatter.
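Not part of the answer above, but a small self-contained sketch of how you could verify what a given KMP_AFFINITY or KMP_PLACE_THREADS setting actually does, assuming a Linux target (which the Phi is) so that sched_getcpu() is available:

    // Each OpenMP thread reports the logical CPU it is currently running on,
    // which makes the effect of the affinity environment variables visible.
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            printf("thread %d running on cpu %d\n",
                   omp_get_thread_num(), sched_getcpu());
        }
        return 0;
    }

Running this once with KMP_AFFINITY=compact and once with KMP_AFFINITY=scatter should show the threads packed onto few cores in the first case and spread across cores in the second.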
- Comment (syb0rg, Jun 19, 2014): You should cite a source or give a reason in your answer as to why you should not use OMP_NUM_THREADS. Do that and you will have my upvote.
- Comment (0b1100001, Jun 21, 2014): Hey Jim, I tried various affinities. compact gives the worst performance, and I think that is also the reason behind the poor performance when the number of threads exceeds the number of processors. Could you advise on how I should modify this code to make it more efficient for compactness?
You should first check for the number of command line arguments via argc. If there are too few, print an error message and then terminate the program.

    if (argc < 3) {
        // print an error message...
        return EXIT_FAILURE;
    }
In addition, your first existing check in main() should also return EXIT_FAILURE instead of calling exit(1). The latter is non-portable, unlike the former, and it's already safe to return from main() if the program cannot continue execution. More info about that here.

None of your error output should be printed with printf():

    printf("ERROR: Number of cells should be integral multiple of number of threads \n");

It should instead go to stderr via fprintf():

    fprintf(stderr, "Number of cells should be integral multiple of number of threads \n");
This comment is misleading:

    // initialize
    u[0:nx] = 0.0;
    u[nx/4:nx/2] = 1.0;
    f[0:nx+1] = 0.0;
    res[0:nx] = 0.0;

They're not actually being initialized; they're being assigned after declaration. Initialization involves giving a variable its type and its initial value at the point of declaration. You should do that here; it keeps variables as close in scope as possible, which is generally preferred for maintainability.

Either way, you don't really need that comment. It's already clear what's being done (even after making this correction), and comments shouldn't state the obvious.
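For illustration, the zero assignments are arguably redundant anyway: new double[nx]() already value-initializes every element to 0.0 in C++, so the only assignment that is actually needed is the step profile in u. A possible sketch:

    // new double[n]() value-initializes every element to 0.0,
    // so no separate zeroing pass is required.
    double *u   = new double[nx]();
    double *res = new double[nx]();
    double *f   = new double[nx + 1]();

    // Cilk Plus array notation: nx/2 elements starting at index nx/4
    u[nx/4:nx/2] = 1.0;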
Also regarding closest possible variable scope: you have i declared towards the top of main(), but you don't use it until the for loop towards the end. If you don't have C99 (and thus cannot initialize i within the loop statement instead), declare i right before the loop.

This line doesn't look right:

    td.maxthreads = omp_get_num_threads();

It appears that you're looking for omp_get_max_threads():

    td.maxthreads = omp_get_max_threads();

This will return the maximum number of available threads for a parallel environment. If this is not really your intent, then rename maxthreads to something more accurate.
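One way to apply that suggestion (just a sketch, and only one of several options) is to fix the thread count with omp_set_num_threads() before the region, so that omp_get_max_threads() and the actual team size agree:

    // Hypothetical restructuring of the parallel region in main():
    // the thread count is set once up front, so omp_get_max_threads()
    // returns the same value the team will actually use.
    omp_set_num_threads(nthreads);
    int maxthreads = omp_get_max_threads();

    #pragma omp parallel
    {
        struct ThreadData td;
        td.tid        = omp_get_thread_num();
        td.maxthreads = maxthreads;
        td.grid       = &grid;
        solver(&td);
    }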