Testing and timing ger2_k

The rank-2 update kernel ger2_k performs the operation $A \leftarrow xy^T + wz^T + A$, and has the API:
void ATL_UGER2K
   (ATL_CINT M, ATL_CINT N, const TYPE *X, const TYPE *Y,
    const TYPE *W, const TYPE *Z, TYPE *A, ATL_CINT lda)
The rank-2 update kernel ger2_k uses exactly the same testing and timing methodology as described for ger_k in the previous section, with the following differences: the kernel must be stored in the R2CASES/ subdirectory, you substitute ``r2'' for ``r1'' in the testing and timing commands, and you substitute ``R2'' for ``R1'' in the compiler and flag macros.

A simple GER2 implementation would be:

#include "atlas_misc.h"  /* define TYPE macros */
void ATL_UGER2K
   (ATL_CINT M, ATL_CINT N, const TYPE *X, const TYPE *Y,
    const TYPE *W, const TYPE *Z, TYPE *A, ATL_CINT lda)
{
   register ATL_INT i, j;

   for (j=0; j < N; j++)
   {
      const register TYPE y0=Y[j], z0=Z[j];
      for (i=0; i < M; i++)
         A[i] += X[i]*y0 + W[i]*z0;
      A += lda;  /* finished with this column */
   }
}
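
Before handing a kernel to the ATLAS tester, it can be useful to sanity-check it in isolation. The following is only a minimal sketch of such a check, not part of ATLAS: it compares the simple kernel above against two separate rank-1 updates. The TYPE, ATL_INT and ATL_CINT definitions are stand-ins assumed here so the file compiles without atlas_misc.h, and the names fillval and ger2k_max_err are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for what atlas_misc.h would normally provide (assumptions). */
#define TYPE float
#define ATL_INT int
#define ATL_CINT const int

/* Kernel under test: same loop as the simple implementation above. */
void ATL_UGER2K
   (ATL_CINT M, ATL_CINT N, const TYPE *X, const TYPE *Y,
    const TYPE *W, const TYPE *Z, TYPE *A, ATL_CINT lda)
{
   ATL_INT i, j;

   for (j=0; j < N; j++)
   {
      const TYPE y0=Y[j], z0=Z[j];
      for (i=0; i < M; i++)
         A[i] += X[i]*y0 + W[i]*z0;
      A += lda;  /* finished with this column */
   }
}

/* Hypothetical helper: deterministic fill values in [-0.5, 0.5). */
static TYPE fillval(int k)
{
   return (TYPE)(k % 17) / 17.0f - 0.5f;
}

/*
 * Hypothetical checker: runs the kernel on an M x N column-major matrix
 * with leading dimension lda, and returns the largest absolute difference
 * from a reference computed as two separate rank-1 updates.
 * Returns -1.0 if memory allocation fails.
 */
double ger2k_max_err(int M, int N, int lda)
{
   TYPE *X, *Y, *W, *Z, *A, *R;
   double maxerr = 0.0;
   int i, j;

   assert(M > 0 && N > 0 && lda >= M);
   X = malloc(sizeof(TYPE) * (size_t)M);
   W = malloc(sizeof(TYPE) * (size_t)M);
   Y = malloc(sizeof(TYPE) * (size_t)N);
   Z = malloc(sizeof(TYPE) * (size_t)N);
   A = malloc(sizeof(TYPE) * (size_t)lda * (size_t)N);
   R = malloc(sizeof(TYPE) * (size_t)lda * (size_t)N);
   if (!X || !W || !Y || !Z || !A || !R)
   {
      free(X); free(W); free(Y); free(Z); free(A); free(R);
      return -1.0;
   }
   for (i=0; i < M; i++) { X[i] = fillval(3*i+1); W[i] = fillval(5*i+2); }
   for (j=0; j < N; j++) { Y[j] = fillval(7*j+3); Z[j] = fillval(11*j+4); }
   for (i=0; i < lda*N; i++) A[i] = R[i] = fillval(13*i+5);

   /* Reference: A += x*y^T, then A += w*z^T, one element at a time. */
   for (j=0; j < N; j++)
      for (i=0; i < M; i++)
      {
         R[i+j*lda] += X[i]*Y[j];
         R[i+j*lda] += W[i]*Z[j];
      }

   ATL_UGER2K(M, N, X, Y, W, Z, A, lda);

   for (j=0; j < N; j++)
      for (i=0; i < M; i++)
      {
         double d = (double)A[i+j*lda] - (double)R[i+j*lda];
         if (d < 0.0) d = -d;
         if (d > maxerr) maxerr = d;
      }
   free(X); free(W); free(Y); free(Z); free(A); free(R);
   return maxerr;
}
```

Calling ger2k_max_err(297, 177, 297) from a small driver should return a value near single precision rounding error. The result is not expected to be exactly zero, because the kernel and the reference round their intermediate sums differently. This check does not replace the ATLAS tester.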

Assuming I save the above file to R2CASES/r2k.c, I would test:

>make sr2ktest mu=1 nu=1 r2rout=r2k.c
.... bunch of compilation, etc ....
   TEST CONJ=0, M=297, N=177, lda=297, incY=1, STARTED
   TEST CONJ=0, M=297, N=177, lda=297, incY=1, PASSED

I would then time the single precision real kernel, without cache flushing, with:

>make sr2ktime mu=1 nu=1 r2rout=r2k.c
.... bunch of compilation, etc ....
GER2: M=1000, N=1000, lda=1000, AF=[16,16,16], AM=[0,0,0], alpha=1.000000e+00:
   M=1000, N=1000, lda=1000, nreps=1, time=9.489282e-04, mflop=4217.39
   M=1000, N=1000, lda=1000, nreps=1, time=9.714776e-04, mflop=4119.50
   M=1000, N=1000, lda=1000, nreps=1, time=9.486141e-04, mflop=4218.79
NREPS=3, MAX=4218.79, MIN=4119.50, AVG=4185.22, MED=4217.39
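
To relate the reported time and mflop columns, note that the kernel loop performs two multiplies and two adds per element of A, or about 4*M*N floating point operations in total. The sketch below computes an MFLOPS figure on that assumption. The helper name ger2_mflops is hypothetical, and the exact flop count used by the ATLAS timer is not stated here, so the result should only be expected to land close to the reported value.

```c
/*
 * Hypothetical helper: MFLOPS for an M x N rank-2 update that took
 * `seconds` to run, assuming 4*M*N floating point operations
 * (two multiplies and two adds per element of A).
 */
double ger2_mflops(int M, int N, double seconds)
{
   return 4.0 * (double)M * (double)N / seconds / 1.0e6;
}
```

For the first timing line above, ger2_mflops(1000, 1000, 9.489282e-04) gives about 4215, which is within a fraction of a percent of the reported 4217.39.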

Clint Whaley 2012-07-10