I am searching for a more efficient way to calculate the so-called vec trick used in tensor algebra (see Wikipedia).
Introduction:
Suppose you have a matrix-vector multiplication where a matrix C of size (np x mq) is constructed as the Kronecker product of a matrix A of size (n x m) and a matrix B of size (p x q). The vector is denoted v, of size (mq x 1); its matrix form is X, of size (m x q), with v = vec(X). In two dimensions this operation can be performed with O(npq + qnm) operations instead of O(mqnp) operations.
Expensive variant (in terms of flops):

    out = kron(B,A) * v

Cheap variant (in terms of flops):

    out = vec(A * X * B')
See again Kronecker properties.
Main question:
I want to perform many of these operations at once, e.g. 2500000 of them. Example: n = m = p = q = 7, with A of size (7 x 7), B of size (7 x 7), and v of size (49 x 2500000).
I have implemented a MEX-C version of the cheap variant, which is noticeably slower than a MATLAB version of the expensive variant.
Is it possible to improve the performance of the C code in order to outperform MATLAB?
Note: the same question was asked some months ago in the MATLAB forum.
My current MEX-C file implementation:
/*************************************************
 * CALLING:
 *
 *     out = tensorproduct(A,B,vector)
 *
 *************************************************/
#include "mex.h"
#include "omp.h"

#define PP_A      prhs[0]
#define PP_B      prhs[1]
#define PP_vector prhs[2]
#define PP_out    plhs[0]

#define n 7
#define m 7
#define p 7
#define q 7

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const mwSize *dim;
    int i, j, k, s, l;
    double temp[n*q];
    double *A      = mxGetPr(PP_A);
    double *B      = mxGetPr(PP_B);
    double *vector = mxGetPr(PP_vector);

    dim = mxGetDimensions(PP_vector);
    l = dim[1];                               /* number of columns to process */

    PP_out = mxCreateDoubleMatrix(l*n*p, 1, mxREAL);
    double *out = mxGetPr(PP_out);

    #pragma omp parallel for private(i,j,k,s,temp)
    for (k = 0; k < l; k++) {
        /* zero the scratch buffer */
        for (i = 0; i < n; i++) {
            for (j = 0; j < q; j++) {
                temp[i + j*n] = 0;
            }
        }
        /* temp = A * X, where X is column k of `vector` viewed as (m x q) */
        for (s = 0; s < m; s++) {
            for (i = 0; i < n; i++) {
                for (j = 0; j < m; j++) {
                    temp[i + s*n] += A[i + j*n] * vector[j + s*m + k*m*p];
                }
            }
        }
        /* out(:,k) = vec(temp * B^T) */
        for (s = 0; s < n; s++) {
            for (i = 0; i < p; i++) {
                for (j = 0; j < q; j++) {
                    out[i*n + s + k*n*p] += B[i + j*p] * temp[j*n + s];
                }
            }
        }
    }
}
The code can be compiled with:
mex CFLAGS='$CFLAGS -fopenmp -Ofast -funroll-loops' ...
LDFLAGS='$LDFLAGS -fopenmp -Ofast -funroll-loops' ...
COPTIMFLAGS='$COPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
LDOPTIMFLAGS='$LDOPTIMFLAGS -fopenmp -Ofast -funroll-loops' ...
DEFINES='$DEFINES -fopenmp -Ofast -funroll-loops' ...
tensorproduct.c
Currently on my Notebook: (Ubuntu 18.04, GCC 7.5.0, 4 Cores)
Mex-C file implementation: Cheap variant with O(npq+qnm)
A = rand(7,7);
B = rand(7,7);
vector = rand(49,2500000);
n = 50;
tic
for i=1:n
vector_out = reshape(tensorproduct(A,B,vector),size(vector));
end
toc
% Elapsed time is 26.209770 seconds.
A quite simple Matlab implementation: Expensive variant with O(mqnp)
C=kron(B,A);
tic
for i=1:n
vector_out = reshape(C*vector(:,:),size(vector));
end
toc
% Elapsed time is 38.670186 seconds.
Matlab improvement without memory copy: Expensive variant with O(mqnp)
tic
for i=1:n
vector_out = reshape(C*reshape(vector,49,[]),size(vector));
end
toc
% Elapsed time is 15.001515 seconds.
1 Answer
I compiled your MEX-file using just mex tensorproduct.c
(no OpenMP, no special optimization flags, nothing). Then I compared running time to the "expensive" variant:
n = 7;
m = 7;
p = 7;
q = 7;
A = rand(n, m);
B = rand(p, q);
N = 1e6;
X = rand(m, p, N);
B = B.'; % both methods expect a transposed B
R0 = kron(B, A) * reshape(X, m*p, N);
R1 = tensorproduct(A, B, reshape(X, m*p, N));
assert(all(abs(R0(:) - R1(:)) < 1e-13))
disp(timeit(@()kron(B.', A) * reshape(X, m*p, N)))
disp(timeit(@()tensorproduct(A, B, reshape(X, m*p, N))))
Note several things here:

- We're measuring running time using timeit. It is always better to do this than a plain loop with tic/toc; it is much more precise.
- Your version of the "expensive" code involves all sorts of unnecessary reshaping: reshape(C*reshape(vector,49,[]),size(vector)) is the same as C*vector, because the reshapes you perform don't actually change any shapes; the requested shapes are the same as the shapes the matrices already have.
- I created a matrix X as suggested in the math, and vectorized it using reshape, so I could compare the results with an explicit implementation in MATLAB.
The timing above yielded 0.1385 s and 0.5168 s. Indeed the C code is quite a bit slower; since I'm not using OpenMP or explicit loop unrolling, the difference I measure is larger than the difference the OP measured.
Simply swapping the order of the loops over i and j in the C code (for all three loops) improved processing time to 0.4618 s (~10% less). Removing the unnecessary initialization of temp to 0 further improves it to 0.4376 s. If you really want to zero out an array, use memset(), which is typically faster than a loop.
Next, the loops could be reordered so that the full temp = A*vector product is never needed: to produce one row of out, it suffices to compute one row of temp. Using smaller buffers usually leads to faster compute times.
Beyond that, it becomes really hard to improve computational efficiency. See for example the Q&A "Why is MATLAB so fast in matrix multiplication?" on Stack Overflow (especially the most upvoted answer, which is not the accepted answer at the top).
I did notice some confusion with m, n, p and q in the C code. I suggest you give each of them a different value and verify that the result is correct (and that you don't index out of bounds). If you don't care about getting that right, use a single size variable m=7 for all four sizes. This prevents you, or someone else in the future, from modifying the code for different sizes and getting unexpectedly wrong results, or even crashing MATLAB.
I would also suggest that you assert that the input matrices are double-precision floats and of the right sizes. If the wrong matrices are passed in, the MEX-file (and usually all of MATLAB) will crash.
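Such checks could look like the following fragment at the top of mexFunction (a sketch only; it depends on the MEX API in mex.h and is not compilable standalone):

```c
/* Sketch of input validation for mexFunction; the message IDs are made up. */
if (nrhs != 3)
    mexErrMsgIdAndTxt("tensorproduct:nrhs", "Three inputs required.");
if (!mxIsDouble(PP_A) || mxIsComplex(PP_A) ||
    !mxIsDouble(PP_B) || mxIsComplex(PP_B) ||
    !mxIsDouble(PP_vector) || mxIsComplex(PP_vector))
    mexErrMsgIdAndTxt("tensorproduct:type", "Inputs must be real doubles.");
if (mxGetM(PP_A) != n || mxGetN(PP_A) != m)
    mexErrMsgIdAndTxt("tensorproduct:sizeA", "A must be n-by-m.");
if (mxGetM(PP_B) != p || mxGetN(PP_B) != q)
    mexErrMsgIdAndTxt("tensorproduct:sizeB", "B must be p-by-q.");
if (mxGetM(PP_vector) != m * q)
    mexErrMsgIdAndTxt("tensorproduct:sizeV", "vector must have m*q rows.");
```

mexErrMsgIdAndTxt aborts the MEX call cleanly instead of letting a bad read crash MATLAB.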
Comment (ConvexHull, Aug 25, 2021): Thank you for the effort. I found that the vec trick is useful only for larger sizes of n, m, p, q. See my related topic on the MATLAB forum: de.mathworks.com/matlabcentral/answers/…
Comment: […] mex interprets as C++ code, but setting C compile flags. I would suggest you start by ensuring you are using an optimized build.