I am searching for a more efficient way to calculate the so-called vec trick used in tensor algebra (see Wikipedia).
Introduction:
Suppose you have a matrix-vector multiplication where a matrix C of size (np x mq) is constructed as the Kronecker product of a matrix A of size (n x m) and a matrix B of size (p x q). The vector is denoted v, of size (mq x 1); its matrix form is X, of size (m x q), with v = vec(X). In two dimensions this operation can be performed with O(npq + qnm) operations instead of O(mqnp) operations.
Expensive variant (in terms of flops):

    out = kron(B,A) * v

Cheap variant (in terms of flops):

    out = vec(A * X * B')
See again Kronecker properties.
Main question:
I want to perform many of these operations at once, e.g. 2500000 of them. Example: n = m = p = q = 7, with A of size (7 x 7), B of size (7 x 7), and v of size (49 x 2500000).
I have implemented a MEX-C version of the cheap variant, which is noticeably slower than a MATLAB version of the expensive variant.
Is it possible to improve the performance of the C code in order to outperform MATLAB?
Note: the same question was asked some months ago in the MATLAB forum.
My current MEX-C file implementation:
/*************************************************
 * CALLING:
 *
 *     out = tensorproduct(A,B,vector)
 *
 *************************************************/
#include "mex.h"
#include "omp.h"

#define PP_A      prhs[0]
#define PP_B      prhs[1]
#define PP_vector prhs[2]
#define PP_out    plhs[0]

#define n 7
#define m 7
#define p 7
#define q 7

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const mwSize *dim;
    int i, j, k, s, l;
    double temp[n*q];
    double *A      = mxGetPr(PP_A);
    double *B      = mxGetPr(PP_B);
    double *vector = mxGetPr(PP_vector);

    dim = mxGetDimensions(PP_vector);
    l = dim[1];                               /* number of columns to process */

    PP_out = mxCreateDoubleMatrix(l*n*p, 1, mxREAL);
    double *out = mxGetPr(PP_out);

    #pragma omp parallel for private(i,j,k,s,temp)
    for (k = 0; k < l; k++) {
        /* zero the scratch buffer */
        for (i = 0; i < n; i++) {
            for (j = 0; j < q; j++) {
                temp[i + j*n] = 0;
            }
        }
        /* temp = A * X, where X is column k of `vector` viewed as (m x q) */
        for (s = 0; s < m; s++) {
            for (i = 0; i < n; i++) {
                for (j = 0; j < m; j++) {
                    temp[i + s*n] += A[i + j*n] * vector[j + s*m + k*m*p];
                }
            }
        }
        /* out(:,k) = vec(temp * B^T) */
        for (s = 0; s < n; s++) {
            for (i = 0; i < p; i++) {
                for (j = 0; j < q; j++) {
                    out[i*n + s + k*n*p] += B[i + j*p] * temp[j*n + s];
                }
            }
        }
    }
}
The code can be compiled with:
mex CFLAGS='$CFLAGS -fopenmp -Ofast -funroll-loops' ...
LDFLAGS='$LDFLAGS -fopenmp -Ofast -funroll-loops' ...
COPTIMFLAGS='$COPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
DEFINES='$DEFINES -fopenmp -Ofast -funroll-loops' ...
tensorproduct.c
Currently on my Notebook: (Ubuntu 18.04, GCC 7.5.0, 4 Cores)
Mex-C file implementation: Cheap variant with O(npq+qnm)
A = rand(7,7);
B = rand(7,7);
vector = rand(49,2500000);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(A,B,vector),size(vector));
end
toc
% Elapsed time is 26.209770 seconds.
A quite simple Matlab implementation: Expensive variant with O(mqnp)
C=kron(B,A);
tic
for i=1:n
vector_out = reshape(C*vector(:,:),size(vector));
end
toc
% Elapsed time is 38.670186 seconds.
Matlab improvement without memory copy: Expensive variant with O(mqnp)
tic
for i=1:n
vector_out = reshape(C*reshape(vector,49,[]),size(vector));
end
toc
% Elapsed time is 15.001515 seconds.
1 Answer
I compiled your MEX-file using just mex tensorproduct.c
(no OpenMP, no special optimization flags, nothing). Then I compared running time to the "expensive" variant:
n = 7;
m = 7;
p = 7;
q = 7;
A = rand(n, m);
B = rand(p, q);
N = 1e6;
X = rand(m, p, N);
B = B.'; % both methods expect a transposed B
R0 = kron(B, A) * reshape(X, m*p, N);
R1 = tensorproduct(A, B, reshape(X, m*p, N));
assert(all(abs(R0(:) - R1(:)) < 1e-13))
disp(timeit(@()kron(B.', A) * reshape(X, m*p, N)))
disp(timeit(@()tensorproduct(A, B, reshape(X, m*p, N))))
Note several things here:

- We're measuring running time using timeit. It is always better to do this than a plain loop with tic/toc; it is much more precise.
- Your version of the "expensive" code involves all sorts of unnecessary reshaping: reshape(C*reshape(vector,49,[]),size(vector)) is the same as C*vector, because the reshapes you perform don't actually change any shapes; the requested shapes are the same as the shapes the matrices already have.
- I created a matrix X as suggested in the math, and vectorized it using reshape, so I could compare the results with an explicit implementation in MATLAB.
The timing above yielded 0.1385 s and 0.5168 s. Indeed the C code is quite a bit slower; since I'm not using OpenMP or explicit loop unrolling, the difference I measure is larger than the difference the OP measured.
Simply swapping the order of the loops over i and j in the C code (for all three loops) improved processing time to 0.4618 s (~10% less). Removing the unnecessary initialization of temp to 0 further improves it to 0.4376 s. If you really want to zero out an array, use memset(), which is typically faster than a loop.
Next, the loops could be reordered so that the full temp = A*vector product is never needed: to produce one row of out, it suffices to compute one row of temp. Using smaller buffers usually leads to faster compute times.
Beyond that, it becomes really hard to improve computational efficiency. See for example the Q&A "Why is MATLAB so fast in matrix multiplication?" on Stack Overflow (especially the most upvoted answer, which is not the accepted answer at the top).
I did notice some confusion with m, n, p and q in the C code. I suggest you give each of them a different value and verify that the result is correct (and that you don't index out of bounds). If you don't care about getting that right, use a single size variable m=7 for all four sizes. This prevents you, or someone else in the future, from modifying the code for different sizes and getting unexpectedly wrong results, or even crashing MATLAB.
I would also suggest that you assert that the input matrices are double-precision floats and of the right sizes. If the wrong matrices are passed in, the MEX-file (and usually all of MATLAB) will crash.
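Such checks could look like the following fragment at the top of mexFunction (a sketch only; it depends on the MEX API in mex.h and is not compilable standalone):

```c
/* Sketch of input validation for mexFunction; the message IDs are made up. */
if (nrhs != 3)
    mexErrMsgIdAndTxt("tensorproduct:nrhs", "Three inputs required.");
if (!mxIsDouble(PP_A) || mxIsComplex(PP_A) ||
    !mxIsDouble(PP_B) || mxIsComplex(PP_B) ||
    !mxIsDouble(PP_vector) || mxIsComplex(PP_vector))
    mexErrMsgIdAndTxt("tensorproduct:type", "Inputs must be real doubles.");
if (mxGetM(PP_A) != n || mxGetN(PP_A) != m)
    mexErrMsgIdAndTxt("tensorproduct:sizeA", "A must be n-by-m.");
if (mxGetM(PP_B) != p || mxGetN(PP_B) != q)
    mexErrMsgIdAndTxt("tensorproduct:sizeB", "B must be p-by-q.");
if (mxGetM(PP_vector) != m * q)
    mexErrMsgIdAndTxt("tensorproduct:sizeV", "vector must have m*q rows.");
```

mexErrMsgIdAndTxt aborts the MEX call cleanly instead of letting a bad read crash MATLAB.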
Comment (ConvexHull, Aug 25, 2021): Thank you for the effort. I found that the vec trick is useful only for larger sizes of n, m, p, q. See my related topic on the MATLAB forum: de.mathworks.com/matlabcentral/answers/…
Comment: […] mex interprets as C++ code, but setting C compile flags. I would suggest you start by ensuring you are using an optimized build.