This works fine assuming
is real, otherwise
must be applied
explicitly as a complex scalar, and then set
in the above outline.
Therefore, in order to use this trick, upon copying
and
to block-major
storage, ATLAS also splits the arrays into real and imaginary components.
The only matrix not expressed as two real matrices is then
, and to
fix this problem, ATLAS demands that the complex gemmK stride
by
2. An example will solidify the confusion.
A simple 3-loop implementation of an ATLAS complex gemmK is:
void ATL_USERMM
(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, const double *B, const int ldb,
const double beta, double *C, const int ldc)
{
int i, j, k;
register double c00;
for (j=0; j < N; j++)
{
for (i=0; i < M; i++)
{
#ifdef BETA0
c00 = 0.0;
#else
c00 = C[2*(i+j*ldc)];
#ifdef BETAX
c00 *= beta;
#endif
#endif
for (k=0; k < K; k++) c00 += A[k+i*lda] * B[k+j*ldb];
C[2*(i+j*ldc)] = c00;
}
}
}
First we test that it produces the right answer:
make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=0 make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1 make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=8
Then we scope the awesome performance:
make cmmucase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
Now, since we are clearly gluttons for punishment, we compare our masterwork to ATLAS's generated kernel:
speedy. cat res/zMMRES
MULADD LAT NB MU NU KU FFTCH IFTCH NFTCH MFLOP
0 5 36 2 1 36 0 2 1 186.42
10
speedy. make mmcase muladd=0 lat=5 nb=36 mu=2 nu=1 ku=36 beta=1
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.500000, mflop=195.581952
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420
Clint Whaley 2012-07-10