I'm studying MPI and I'm trying to solve exercise 3.5 from Pacheco's book Introduction to Parallel Programming. Essentially the exercise is:

Write an MPI program that computes the matrix-vector product Ax using a block-column distribution for A. Look at what the function MPI_Reduce_scatter does. Assume that the number of processes p divides the sizes of the matrix.

Given an m by n matrix A and p processes, I have as usual that local_m = m/p and local_n = n/p. Then:
- I have p - 1 rectangular sub-matrices of size m by local_n.
- Each rank will multiply its m by local_n sub-matrix with the vector x.
- I'll end up with p - 1 vectors, one per process, whose sum gives the desired result y.
Here it's written that this MPI function applies a reduction on the vector and then scatters each segment to each process. My code follows; I tested it with different matrices and it seems to work just fine. The vector x lives on every process.
#include <assert.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  const int m = 12;
  const int n = 12;
  double matrix[m][n];
  double x[n]; // vector

  // Store the actual matrix and vector
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      matrix[i][j] = i + j;
    }
    x[i] = 1. + i;
  }

  int rank, p;
  int local_m, local_n;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  assert(n % p == 0 && m % p == 0);
  local_m = m / p;
  local_n = n / p;

  double local_matrix[m][local_n];
  double local_y[m]; // the result on each process
  double y[m];       // the vector that will store the correct result after
                     // summing from each process

  int recvcounts[p];
  for (int i = 0; i < p; i++) {
    recvcounts[i] = m;
  }

  // Store on each process the local (rectangular) matrix and the local part
  // of x
  for (int i = 0; i < m; i++) {         // get all the rows now
    for (int j = 0; j < local_n; j++) { // only local_n columns
      local_matrix[i][j] = matrix[i][j + rank * local_n];
    }
  }

  // Compute the local mat-vec product; the result is a vector of length m on
  // each process.
  for (int i = 0; i < m; i++) {
    local_y[i] = 0.;
    for (int j = 0; j < local_n; j++) {
      local_y[i] += local_matrix[i][j] * x[j + rank * local_n];
    }
  }

  MPI_Reduce_scatter(local_y, y, recvcounts, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD);

  if (rank == 0) {
    printf("The expected result is \n");
    double result[m];
    for (int i = 0; i < m; i++) {
      result[i] = 0.;
      for (int j = 0; j < n; j++) {
        result[i] += matrix[i][j] * x[j];
      }
      printf("result[%d] is %lf \n", i, result[i]);
    }
    printf("The result obtained is being printed on process %d: \n", rank);
    for (int i = 0; i < m; i++)
      printf("y[%d] = %lf \n", i, y[i]);
  }

  MPI_Finalize();
  return 0;
}
- Since m and n are exactly the same, why not use only 1 of them? – convert, Jan 4, 2023
1 Answer
Observe that each process only multiplies with a small part of x, so you should declare x[local_n].
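For instance, a minimal sketch (my own, not from the post) of how x could be distributed so that each rank stores only the local_n entries matching its block of columns; the names full_x and local_x are made up, and <stdlib.h> is assumed for malloc/free:

// Sketch: scatter x in blocks of local_n so each rank keeps only its slice.
// full_x and local_x are illustrative names, not from the original code.
double local_x[local_n];
double *full_x = NULL;
if (rank == 0) {
  full_x = malloc((size_t)n * sizeof(double));
  for (int i = 0; i < n; i++)
    full_x[i] = 1. + i;
}
MPI_Scatter(full_x, local_n, MPI_DOUBLE, local_x, local_n, MPI_DOUBLE, 0,
            MPI_COMM_WORLD);
free(full_x); // freeing NULL on the non-root ranks is harmless
// The local product then reads local_x[j] instead of x[j + rank * local_n].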
Next, think about context. If you are doing a power method or so, your output functions as the input for the next matrix-vector multiplication. So inductively, if you start out using only part of x as input, you need only part of y as output, because that will be your next input. (You could also make the argument that one should never store redundant information unless strictly necessary.) In other words: your recvcounts vector should be filled with local_m, not the global m. (This way you are actually doing more of a "reduce-bcast" than a reduce-scatter, which you could've done with an allreduce.)
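A sketch of that variant, keeping the original variable names where possible (local_y_part is a name I made up):

// Sketch: each rank receives only its local_m entries of y.
double local_y_part[local_m]; // this rank's slice of y (illustrative name)
int recvcounts[p];
for (int i = 0; i < p; i++)
  recvcounts[i] = local_m; // instead of the global m
// MPI_Reduce_scatter sums the p partial vectors of length m element-wise and
// hands rank r the segment y[r*local_m .. (r+1)*local_m - 1].
MPI_Reduce_scatter(local_y, local_y_part, recvcounts, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD);
// local_y_part can now serve directly as the input block for the next
// matrix-vector product, with no further communication.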
Aside: you are using static allocation, which goes on the stack. It is better to do dynamic allocation on the heap.
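For example, a sketch of heap allocation (assuming <stdlib.h> is included; a flat 1D array indexed as i * local_n + j keeps the block contiguous):

// Sketch: allocate the local block on the heap instead of as a stack VLA.
double *local_matrix = malloc((size_t)m * local_n * sizeof(double));
double *local_y = malloc((size_t)m * sizeof(double));
if (local_matrix == NULL || local_y == NULL)
  MPI_Abort(MPI_COMM_WORLD, 1);
// ... fill local_matrix[i * local_n + j], compute local_y, reduce-scatter ...
free(local_y);
free(local_matrix);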
But otherwise, nicely done. Finally, let me remark that Pacheco's book is quite old by now. May I suggest https://theartofhpc.com/pcse.html ?
- Thanks for your review. I'd like to be sure I understand how Reduce_scatter works in this case. I have p-1 vectors of size m whose sum is the solution, one per process. The reduction is then applied entry-wise on each vector, ending up with process 0 having the correct result, and then the result is split in p segments, where the i-th segment has size local_m and is sent to rank i. Is that correct? – bobinthebox, Jan 5, 2023
- That book is also in the bibliography of my course (thanks for this). I was referring to Pacheco's book "An Introduction to Parallel Programming", the one of 2011, not the original one. Do you think this one is old too? @VictorEijkhout – bobinthebox, Jan 5, 2023
- I think 2011 predates both MPI 3 and (of course) 4. But for the basics you can indeed use that book. – Victor Eijkhout, Jan 5, 2023
- Your interpretation is largely correct. It is indeed a big pointwise reduction, followed by a scatter. However, the intermediate result of the reduction is not stored on process zero or anywhere else: the only things you know are the inputs and the final output. Indeed, for efficiency the intermediate result is probably not assembled anywhere. By the way, you have p vectors, not p-1. – Victor Eijkhout, Jan 5, 2023
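To make the semantics in the comment above concrete, here is a sketch of what the call behaves like; a real implementation uses a smarter algorithm and never assembles the intermediate vector. This assumes the recvcounts-filled-with-local_m variant from the answer, and tmp/displs/local_y_part are made-up names:

// Semantic sketch of MPI_Reduce_scatter: an element-wise reduction of the p
// input vectors, followed by a scatter in recvcounts-sized segments.
double tmp[m]; // conceptual intermediate result; never actually assembled
int displs[p];
displs[0] = 0;
for (int i = 1; i < p; i++)
  displs[i] = displs[i - 1] + recvcounts[i - 1];
MPI_Reduce(local_y, tmp, m, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Scatterv(tmp, recvcounts, displs, MPI_DOUBLE, local_y_part,
             recvcounts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);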
- @bobinthebox For my own information, which course are you taking? Institution, number? – Victor Eijkhout, Jan 7, 2023