next up previous contents
Next: What to do if Up: Speeding up the Level Previous: More timing info Contents

Complex gemmK

Vainly hoping we were done, eh? Nope, complex codes are special. ATLAS actually uses the real matrix multiply generator in order to do complex multiplication. It needs some tricks to do this, obviously. The first thing to note is that if $X_r$ denotes the real elements of $X$, and $X_i$ indicates the imaginary components, then complex matrix-matrix multiply of the form $C \leftarrow A B + \beta C$ may be accomplished by the following four real matrix multiplies:
  1. $C_r \leftarrow A_i B_i - \beta C_r$
  2. $C_i \leftarrow A_i B_r + \beta C_i$
  3. $C_r \leftarrow A_r B_r - C_r$
  4. $C_i \leftarrow A_r B_i + C_i$

This works fine assuming $\beta$ is real, otherwise $\beta$ must be applied explicitly as a complex scalar, and then set $\beta=1$ in the above outline.

Therefore, in order to use this trick, upon copying $A$ and $B$ to block-major storage, ATLAS also splits the arrays into real and imaginary components. The only matrix not expressed as two real matrices is then $C$, and to fix this problem, ATLAS demands that the complex gemmK stride $C$ by 2. An example will solidify the confusion.

A simple 3-loop implementation of an ATLAS complex gemmK is:

void ATL_USERMM
 (const int M, const int N, const int K, const double alpha,
 const double *A, const int lda, const double *B, const int ldb,
 const double beta, double *C, const int ldc)
{
 int i, j, k;
 register double c00;
 for (j=0; j < N; j++)
 {
 for (i=0; i < M; i++)
 {
 #ifdef BETA0
 c00 = 0.0;
 #else
 c00 = C[2*(i+j*ldc)];
 #ifdef BETAX
 c00 *= beta;
 #endif
 #endif
 for (k=0; k < K; k++) c00 += A[k+i*lda] * B[k+j*ldb];
 C[2*(i+j*ldc)] = c00;
 }
 }
}

First we test that it produces the right answer:

make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=0
make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1
make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=8

Then we scope the awesome performance:

make cmmucase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251

Now, since we are clearly gluttons for punishment, we compare our masterwork to ATLAS's generated kernel:

speedy. cat res/zMMRES 
MULADD LAT NB MU NU KU FFTCH IFTCH NFTCH MFLOP
 0 5 36 2 1 36 0 2 1 186.42
 10
speedy. make mmcase muladd=0 lat=5 nb=36 mu=2 nu=1 ku=36 beta=1
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.500000, mflop=195.581952
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420


next up previous contents
Next: What to do if Up: Speeding up the Level Previous: More timing info Contents
Clint Whaley 2012年07月10日

AltStyle によって変換されたページ (->オリジナル) /