Complex gemmK

Vainly hoping we were done, eh? Nope, complex codes are special. ATLAS actually uses the real matrix multiply generator in order to do complex multiplication. It needs some tricks to do this, obviously. The first thing to note is that if $X_r$ denotes the real elements of $X$, and $X_i$ indicates the imaginary components, then complex matrix-matrix multiply of the form $C \leftarrow A B + \beta C$ may be accomplished by the following four real matrix multiplies:
  1. $C_r \leftarrow A_i B_i - \beta C_r$
  2. $C_i \leftarrow A_i B_r + \beta C_i$
  3. $C_r \leftarrow A_r B_r - C_r$
  4. $C_i \leftarrow A_r B_i + C_i$

This works fine assuming $\beta$ is real, otherwise $\beta$ must be applied explicitly as a complex scalar, and then set $\beta=1$ in the above outline.

Therefore, in order to use this trick, upon copying $A$ and $B$ to block-major storage, ATLAS also splits the arrays into real and imaginary components. The only matrix not expressed as two real matrices is then $C$, and to fix this problem, ATLAS demands that the complex gemmK stride $C$ by 2. An example will solidify the confusion.

A simple 3-loop implementation of an ATLAS complex gemmK is:

void ATL_USERMM
   (const int M, const int N, const int K, const double alpha,
    const double *A, const int lda, const double *B, const int ldb,
    const double beta, double *C, const int ldc)
{
   int i, j, k;
   register double c00;

   for (j=0; j < N; j++)
   {
      for (i=0; i < M; i++)
      {
         #ifdef BETA0
            c00 = 0.0;
         #else
            c00 = C[2*(i+j*ldc)];
            #ifdef BETAX
               c00 *= beta;
            #endif
         #endif
         for (k=0; k < K; k++) c00 += A[k+i*lda] * B[k+j*ldb];
         C[2*(i+j*ldc)] = c00;
      }
   }
}

First we test that it produces the right answer:

make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=0
make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1
make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=8

Then we scope the awesome performance:

make cmmucase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251

Now, since we are clearly gluttons for punishment, we compare our masterwork to ATLAS's generated kernel:

speedy. cat res/zMMRES 
MULADD  LAT  NB  MU  NU  KU  FFTCH  IFTCH  NFTCH    MFLOP
     0    5  36   2   1  36      0      2      1   186.42
     10

speedy. make mmcase muladd=0 lat=5 nb=36 mu=2 nu=1 ku=36 beta=1
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.500000, mflop=195.581952
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420

Clint Whaley 2012-07-10