
I am trying to compute the matrix-matrix products of N stacked complex double-precision N x N matrices. For simplicity, I assume N = 512. I have written C++ code parallelized with OpenMP, using OpenBLAS for the matrix-matrix product. This is the code:

#include <iostream>
#include <complex>
#include <chrono>
#include <openblas/cblas.h>

typedef std::complex<double> cx;

int main() {
    const int NUM_REPS = 10;  // Number of speed test repetitions
    const int N = 512;
    const int N3 = N * N * N; // Elements per stack
    const int N2 = N * N;     // Distance between consecutive matrices
    cx *A = new cx[N3];
    cx *B = new cx[N3];
    cx *C = new cx[N3];
    cx alpha = 1.0;
    cx beta = 0.0;
    for (int i = 0; i < NUM_REPS; i++) {
        #pragma omp parallel for
        for (int j = 0; j < N3; j++) {
            A[j] = cx(1, 1);
            B[j] = cx(0, 1);
        }
        auto start = std::chrono::steady_clock::now();
        // One zgemm per matrix in the stack, distributed over the OpenMP threads
        #pragma omp parallel for
        for (int k = 0; k < N; k++) {
            cblas_zgemm(CblasRowMajor, CblasConjTrans, CblasNoTrans, N, N, N,
                        &alpha,
                        &A[k * N2], N,
                        &B[k * N2], N,
                        &beta,
                        &C[k * N2], N);
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << "Delta [ms] = "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << std::endl;
        cx checksum = 0.0;
        for (int j = 0; j < N3; j++)
            checksum += C[j];
        std::cout << checksum << std::endl;
    }
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

Note that in this example the first matrix is conjugate-transposed (CblasConjTrans) before the multiplication, and the memory layout is row-major.
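As an aside on this measurement: OpenBLAS may spawn its own threads inside each cblas_zgemm call, and nesting that under the outer #pragma omp parallel for can oversubscribe the cores. A minimal sketch of the alternative arrangement (one BLAS thread per call, all parallelism in the outer loop; openblas_set_num_threads is OpenBLAS's own API, and whether this actually helps here is untested):

#include <complex>
#include <openblas/cblas.h>
typedef std::complex<double> cx;

// Sketch: serialize OpenBLAS internally; the OpenMP loop over the stack
// supplies all of the parallelism. Assumes the same row-major layout as above.
void stacked_zgemm(const cx *A, const cx *B, cx *C, int N) {
    const int N2 = N * N;
    cx alpha = 1.0, beta = 0.0;
    openblas_set_num_threads(1); // one BLAS thread per zgemm call
    #pragma omp parallel for
    for (int k = 0; k < N; k++)
        cblas_zgemm(CblasRowMajor, CblasConjTrans, CblasNoTrans, N, N, N,
                    &alpha, &A[k * N2], N, &B[k * N2], N,
                    &beta, &C[k * N2], N);
}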

The "checksum" for the output I receive is (6.87195e+10,6.87195e+10). On my laptop, each iteration in i takes about 2.381s.

Since this feels a little long, I was wondering whether it can be improved by using the GPU for the computation. The main problem is that a 512³ matrix stack does not fit into GPU memory all at once: 512³ complex doubles take 512³ × 16 B = 2 GiB per stack, and my GPU (Nvidia GeForce MX450) has exactly 2 GB of memory, so that will not work. The only solution is to copy the stack into GPU memory in chunks, multiply each sub-stack there, and copy the results back into host memory, which makes it an "out of core" operation; see for example this answer, which describes splitting a large matrix into blocks and multiplying blockwise on the GPU.

For this, I am using the CLBlast library, more specifically the GemmStridedBatched method. Here is the full code:

#include <iostream>
#include <complex>
#include <chrono>
#include <vector>
#include <clblast.h>
#include "cl.hpp"

typedef std::complex<double> cx;

int main() {
    const int NUM_REPS = 10;   // Number of speed test repetitions
    const int N = 512;
    const int N3 = N * N * N;  // Elements per stack
    const int N2 = N * N;      // Distance between consecutive matrices
    const int block_len = 128; // Matrices per GPU block, must divide N
    const auto platform_id = 0;
    const auto device_id = 0;  // This selects the Nvidia GPU
    cx *A = new cx[N3];
    cx *B = new cx[N3];
    cx *C = new cx[N3];
    cx alpha = 1.0;
    cx beta = 0.0;
    // Get OpenCL platform and device
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    if (platforms.size() == 0 || platform_id >= platforms.size()) {
        std::cout << "No platform." << std::endl;
        return -1;
    }
    cl::Platform platform = platforms[platform_id];
    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    if (devices.size() == 0 || device_id >= devices.size()) {
        std::cout << "No device." << std::endl;
        return -1;
    }
    cl::Device device = devices[device_id];
    auto device_as_vector = std::vector<cl::Device>{device};
    auto context = cl::Context(device_as_vector);
    auto queue = cl::CommandQueue(context, device);
    cl_event event;
    // Device buffers hold one block of block_len matrices each
    auto device_a = cl::Buffer(context, CL_MEM_READ_ONLY, N2 * block_len * sizeof(cx));
    auto device_b = cl::Buffer(context, CL_MEM_READ_ONLY, N2 * block_len * sizeof(cx));
    auto device_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, N2 * block_len * sizeof(cx));
    // Repeat the speed test NUM_REPS times
    for (int i = 0; i < NUM_REPS; i++) {
        for (int j = 0; j < N3; j++) {
            A[j] = cx(1, 1);
            B[j] = cx(0, 1);
        }
        auto start = std::chrono::steady_clock::now();
        // Matrix-matrix products over blocks of block_len matrices in GPU memory
        for (int k = 0; k < N; k += block_len) {
            queue.enqueueWriteBuffer(device_a, CL_TRUE, 0, N2 * block_len * sizeof(cx), &A[k * N2]);
            queue.enqueueWriteBuffer(device_b, CL_TRUE, 0, N2 * block_len * sizeof(cx), &B[k * N2]);
            auto queue_plain = queue();
            auto status = clblast::GemmStridedBatched(
                clblast::Layout::kRowMajor,
                clblast::Transpose::kConjugate, clblast::Transpose::kNo,
                N, N, N,
                alpha,
                device_a(), 0, N, N2,
                device_b(), 0, N, N2,
                beta,
                device_c(), 0, N, N2,
                block_len,
                &queue_plain, &event);
            if (status == clblast::StatusCode::kSuccess) {
                // Wait for the batched GEMM to finish before reading back
                clWaitForEvents(1, &event);
                clReleaseEvent(event);
            } else {
                std::cout << "(1) FAILURE: " << (int)status << std::endl;
                return -1;
            }
            queue.enqueueReadBuffer(device_c, CL_TRUE, 0, N2 * block_len * sizeof(cx), &C[k * N2]);
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << "Delta [ms] = "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << std::endl;
        cx checksum = 0.0;
        for (int j = 0; j < N3; j++)
            checksum += C[j];
        std::cout << checksum << std::endl;
    }
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}

Here, block_len defines the number of matrices from the stack that live in GPU memory at the same time. This number must divide N so that no matrices are "missed". (A more general implementation can of course allow arbitrary stack sizes; see the sketch below.)
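A minimal sketch of that generalization, reusing the buffers and variables from the code above: shrink the final block instead of requiring that block_len divide N (needs #include <algorithm> for std::min).

// Replacement for the blocked loop above; the device buffers stay sized
// for a full block, the last block simply uses less of them.
for (int k = 0; k < N; k += block_len) {
    const int len = std::min(block_len, N - k); // matrices in this block
    queue.enqueueWriteBuffer(device_a, CL_TRUE, 0, N2 * len * sizeof(cx), &A[k * N2]);
    queue.enqueueWriteBuffer(device_b, CL_TRUE, 0, N2 * len * sizeof(cx), &B[k * N2]);
    // ... GemmStridedBatched exactly as before, but with batch_count = len ...
    queue.enqueueReadBuffer(device_c, CL_TRUE, 0, N2 * len * sizeof(cx), &C[k * N2]);
}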

The cl.hpp header can be found here. I have used the OpenCL code from this example from the CLBlast library as the basis for this code.

My problem is that this version actually takes 10.838 s per iteration in i. That is unexpected, and I honestly don't know whether I can then make the code faster than the roughly 2 s per iteration I get with OpenBLAS. However, since I am new to OpenCL, there may be some way to improve the speed of my code; that would be my first question. More generally, is there perhaps another way to improve the speed of this stacked matrix-matrix product? I might have missed something important.

Comment: You need to keep the GPU pipeline fed. GPU commands have very high latency, which is hidden by issuing many compute requests; doing this without a queue depth of at least 3-5 is very suboptimal.
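To illustrate that suggestion: with two buffer sets and two in-order queues, the transfers of one block can overlap the GEMM of the other on hardware with an independent copy engine. A rough, untested sketch, assuming the context, device, host arrays, and constants from the code above (with 2 GB of GPU memory, block_len would have to shrink, e.g. to 32, so that both buffer sets fit):

// Sketch: double-buffered pipeline over the stack. Two in-order queues,
// so the uploads for block k+1 can run while block k's GEMM executes.
cl::CommandQueue queues[2] = { cl::CommandQueue(context, device),
                               cl::CommandQueue(context, device) };
cl::Buffer dev_a[2], dev_b[2], dev_c[2];
for (int s = 0; s < 2; s++) {
    dev_a[s] = cl::Buffer(context, CL_MEM_READ_ONLY,  N2 * block_len * sizeof(cx));
    dev_b[s] = cl::Buffer(context, CL_MEM_READ_ONLY,  N2 * block_len * sizeof(cx));
    dev_c[s] = cl::Buffer(context, CL_MEM_WRITE_ONLY, N2 * block_len * sizeof(cx));
}
for (int k = 0, s = 0; k < N; k += block_len, s ^= 1) {
    // Non-blocking transfers: the host enqueues and moves on immediately.
    queues[s].enqueueWriteBuffer(dev_a[s], CL_FALSE, 0, N2 * block_len * sizeof(cx), &A[k * N2]);
    queues[s].enqueueWriteBuffer(dev_b[s], CL_FALSE, 0, N2 * block_len * sizeof(cx), &B[k * N2]);
    auto queue_plain = queues[s]();
    auto status = clblast::GemmStridedBatched(
        clblast::Layout::kRowMajor,
        clblast::Transpose::kConjugate, clblast::Transpose::kNo,
        N, N, N, alpha,
        dev_a[s](), 0, N, N2,
        dev_b[s](), 0, N, N2,
        beta,
        dev_c[s](), 0, N, N2,
        block_len, &queue_plain); // CLBlast's event argument defaults to nullptr
    if (status != clblast::StatusCode::kSuccess) return -1;
    // In-order queue: this read only starts after the GEMM on the same queue.
    queues[s].enqueueReadBuffer(dev_c[s], CL_FALSE, 0, N2 * block_len * sizeof(cx), &C[k * N2]);
}
queues[0].finish();
queues[1].finish(); // Drain both pipelines before using C on the host.

A deeper pipeline (the 3-5 requests in flight suggested above) follows the same pattern with more buffer sets, memory permitting.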
