I'm studying MPI and I'm trying to solve exercise 3.5 from Pacheco's book Introduction to Parallel Programming. Essentially the exercise is:

Write an MPI program that computes the matrix-vector product Ax using a block-column distribution for A. Look at what the function MPI_Reduce_scatter does. Assume that the number of processes p divides the sizes of the matrix.

Given an m by n matrix A and p processes, I have as usual that local_m = m/p and local_n = n/p. Then:
- I have p - 1 rectangular sub-matrices of size m by local_n.
- Each rank will multiply its m by local_n sub-matrix with the vector x.
- I'll end up with p - 1 vectors, one per process, whose sum gives the desired result y.
Here it's written that this MPI function applies a reduction on the vector and then scatters each segment to each process. My code follows; I tested it with different matrices and it seems to work just fine. The vector x lives on every process.
#include <assert.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  const int m = 12;
  const int n = 12;
  double matrix[m][n];
  double x[n]; // vector

  // Store the actual matrix and vector
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      matrix[i][j] = i + j;
    }
    x[i] = 1. + i;
  }

  int rank, p;
  int local_m, local_n;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  assert(n % p == 0 && m % p == 0);
  local_m = m / p;
  local_n = n / p;

  double local_matrix[m][local_n];
  double local_y[m]; // the result on each process
  double y[m];       // the vector that will store the correct result after
                     // summing from each process

  int recvcounts[p];
  for (int i = 0; i < p; i++) {
    recvcounts[i] = m;
  }

  // Store on each process the local (rectangular) matrix and the local part
  // of x
  for (int i = 0; i < m; i++) {         // get all the rows now
    for (int j = 0; j < local_n; j++) { // only local_n columns
      local_matrix[i][j] = matrix[i][j + rank * local_n];
    }
  }

  // Compute the local mat-vec product; the result is a vector of length m on
  // each process.
  for (int i = 0; i < m; i++) {
    local_y[i] = 0.;
    for (int j = 0; j < local_n; j++) {
      local_y[i] += local_matrix[i][j] * x[j + rank * local_n];
    }
  }

  MPI_Reduce_scatter(local_y, y, recvcounts, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD);

  if (rank == 0) {
    printf("The expected result is \n");
    double result[m];
    for (int i = 0; i < m; i++) {
      result[i] = 0.;
      for (int j = 0; j < n; j++) {
        result[i] += matrix[i][j] * x[j];
      }
      printf("result[%d] is %lf \n", i, result[i]);
    }
    printf("The result obtained is being printed on process %d: \n", rank);
    for (int i = 0; i < m; i++)
      printf("y[%d] = %lf \n", i, y[i]);
  }

  MPI_Finalize();
  return 0;
}
- Since m and n are exactly the same, why not use only 1 of them? – convert, Jan 4, 2023
1 Answer
Observe that each process only multiplies with a small part of x, so you should declare x[local_n].
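For instance, a minimal sketch (my own, not from the post) of how x could be distributed so that each rank stores only the local_n entries matching its block of columns; the names full_x and local_x are made up, and <stdlib.h> is assumed for malloc/free:

// Sketch: scatter x in blocks of local_n so each rank keeps only its slice.
// full_x and local_x are illustrative names, not from the original code.
double local_x[local_n];
double *full_x = NULL;
if (rank == 0) {
  full_x = malloc((size_t)n * sizeof(double));
  for (int i = 0; i < n; i++)
    full_x[i] = 1. + i;
}
MPI_Scatter(full_x, local_n, MPI_DOUBLE, local_x, local_n, MPI_DOUBLE, 0,
            MPI_COMM_WORLD);
free(full_x); // freeing NULL on the non-root ranks is harmless
// The local product then reads local_x[j] instead of x[j + rank * local_n].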
Next, think about context. If you are doing a power method or so, your output functions as the input for the next matrix-vector multiplication. So inductively, if you start out using only part of x as input, you need only part of y as output, because that will be your next input. (You could also make the argument that one should never store redundant information unless strictly necessary.) In other words: your recvcounts vector should be filled with local_m, not the global m. (This way you are actually doing more of a "reduce-bcast" than a reduce-scatter, which you could've done with an allreduce.)
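A sketch of that variant, keeping the original variable names where possible (local_y_part is a name I made up):

// Sketch: each rank receives only its local_m entries of y.
double local_y_part[local_m]; // this rank's slice of y (illustrative name)
int recvcounts[p];
for (int i = 0; i < p; i++)
  recvcounts[i] = local_m; // instead of the global m
// MPI_Reduce_scatter sums the p partial vectors of length m element-wise and
// hands rank r the segment y[r*local_m .. (r+1)*local_m - 1].
MPI_Reduce_scatter(local_y, local_y_part, recvcounts, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD);
// local_y_part can now serve directly as the input block for the next
// matrix-vector product, with no further communication.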
Aside: you are using static allocation, which goes on the stack. It is better to do dynamic allocation on the heap.
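For example, a sketch of heap allocation (assuming <stdlib.h> is included; a flat 1D array indexed as i * local_n + j keeps the block contiguous):

// Sketch: allocate the local block on the heap instead of as a stack VLA.
double *local_matrix = malloc((size_t)m * local_n * sizeof(double));
double *local_y = malloc((size_t)m * sizeof(double));
if (local_matrix == NULL || local_y == NULL)
  MPI_Abort(MPI_COMM_WORLD, 1);
// ... fill local_matrix[i * local_n + j], compute local_y, reduce-scatter ...
free(local_y);
free(local_matrix);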
But otherwise, nicely done. Finally, let me remark that Pacheco's book is quite old by now. May I suggest https://theartofhpc.com/pcse.html ?
- Thanks for your review. I'd like to be sure I understand how Reduce_scatter works in this case. I have p-1 vectors of size m whose sum is the solution, one per process. The reduction is then applied entry-wise on each vector, ending up with process 0 having the correct result, and then the result is split in p segments, where the i-th segment has size local_m and is sent to rank i. Is that correct? – bobinthebox, Jan 5, 2023
- That book is also in the bibliography of my course (thanks for this). I was referring to Pacheco's book "An Introduction to Parallel Programming", the one of 2011, not the original one. Do you think this one is old too? @VictorEijkhout – bobinthebox, Jan 5, 2023
- I think 2011 predates both MPI 3 and (of course) 4. But for the basics you can indeed use that book. – Victor Eijkhout, Jan 5, 2023
- Your interpretation is largely correct. It is indeed a big pointwise reduction, followed by a scatter. However, the intermediate result of the reduction is not stored on process zero or anywhere else: the only things you know are the inputs and the final output. Indeed, for efficiency the intermediate result is probably not assembled anywhere. By the way, you have p vectors, not p-1. – Victor Eijkhout, Jan 5, 2023
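To make the semantics in the comment above concrete, here is a sketch of what the call behaves like; a real implementation uses a smarter algorithm and never assembles the intermediate vector. This assumes the recvcounts-filled-with-local_m variant from the answer, and tmp/displs/local_y_part are made-up names:

// Semantic sketch of MPI_Reduce_scatter: an element-wise reduction of the p
// input vectors, followed by a scatter in recvcounts-sized segments.
double tmp[m]; // conceptual intermediate result; never actually assembled
int displs[p];
displs[0] = 0;
for (int i = 1; i < p; i++)
  displs[i] = displs[i - 1] + recvcounts[i - 1];
MPI_Reduce(local_y, tmp, m, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPI_Scatterv(tmp, recvcounts, displs, MPI_DOUBLE, local_y_part,
             recvcounts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);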
- @bobinthebox For my own information, which course are you taking? Institution, number? – Victor Eijkhout, Jan 7, 2023