Testing and timing mvn_k

The API of this routine is given by:
void ATL_UGEMV(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda,
               const TYPE *X, TYPE *Y)
If the routine is compiled with the macro BETA0 defined, then it should perform the operation $y \leftarrow Ax$; if this macro is not defined, then it should perform $y \leftarrow Ax + y$. $A$ is a column-major $lda \times N$ contiguous array, where $lda \ge M$. $M$ specifies both the number of rows of $A$ and the length of the vector $Y$; $N$ gives the number of columns of $A$ and the length of the vector $X$.
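
To make the calling convention concrete, a minimal reference version of the operation is sketched below (the name ref_gemvN is made up, and TYPE is taken to be plain double); it simply spells out that element $(i,j)$ of $A$ lives at A[i + j*lda], that $Y$ has $M$ elements, and that $X$ has $N$:

/* Reference for y <- A*x + y (or y <- A*x when BETA0 is defined):
 * A is column-major with leading dimension lda >= M, so element (i,j)
 * of A is A[i + j*lda]; Y has length M and X has length N. */
void ref_gemvN(int M, int N, const double *A, int lda,
               const double *X, double *Y)
{
   int i, j;
   #ifdef BETA0
      for (i=0; i < M; i++)   /* BETA0: original contents of Y are ignored */
         Y[i] = 0.0;
   #endif
   for (j=0; j < N; j++)      /* walk the columns of A */
      for (i=0; i < M; i++)
         Y[i] += A[i + j*lda] * X[j];
}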

Since this is an $O(MN)$ operation, and $A$ is $M \times N$ in size, the algorithm is dominated by the cost of loading $A$. No reuse of $A$ is possible, so the best we can do is reuse the vectors (through techniques such as register and cache blocking).

Each element of $Y$ can obviously be computed by taking the dot product of the corresponding row of $A$ with $X$. However, this accesses the matrix along its non-contiguous dimension, which leads to terrible memory hierarchy usage for any reasonably sized matrix. Therefore, mvn_k cannot be written in this fashion.
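
For contrast, a sketch of this row-wise (dot-product) formulation is shown below; the routine name is invented, and as in the kernel given later, TYPE and ATL_CINT are assumed to come from atlas_misc.h. Note that the inner loop's successive loads of $A$ are lda elements apart:

#include "atlas_misc.h"  /* define TYPE macros */
/* Row-wise (dot-product) formulation -- shown only to illustrate the
 * problem: the inner loop accesses A with stride lda, so for large
 * matrices nearly every load of A misses in cache. */
void ATL_UGEMV_byrow(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda,
                     const TYPE *X, TYPE *Y)
{
   register int i, j;
   for (i=0; i != M; i++)
   {
      #ifdef BETA0
         register TYPE y0 = 0.0;
      #else
         register TYPE y0 = Y[i];
      #endif
      for (j=0; j != N; j++)
         y0 += A[i + j*lda] * X[j];   /* stride-lda access of A */
      Y[i] = y0;
   }
}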

In order to optimize the access of $A$ (the dominant algorithmic cost), we must access it in a column-major fashion. Therefore, instead of writing the kernel as a series of dot products (ddots), we can write it as a series of axpys (recall that axpy performs the operation $y \leftarrow \alpha x + y$).

In this axpy-based formulation, we access $A$ one column (the contiguous dimension) at a time, and each pass over a column updates all the elements of $Y$.

A simple implementation of this kernel (for real precisions) is:

#include "atlas_misc.h"  /* define TYPE macros */
void ATL_UGEMV(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda,
               const TYPE *X, TYPE *Y)
{
   register int j, i;

/*
 * Peel 1st iteration of N-loop if BETA0 is defined
 */
   #ifdef BETA0
   {
      const register TYPE x0 = *X;
      for (i=0; i != M; i++)
         Y[i] = A[i] * x0;
      j=1;
      A += lda;   /* done with this column of A */
   }
   #else
      j=0;
   #endif
   for (; j != N; j++)
   {
      const register TYPE x0 = X[j];   /* scalar multiplier for this axpy */

      for (i=0; i != M; i++) 
         Y[i] += A[i] * x0;
      A += lda;   /* done with this column of A */
   }
}
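
To illustrate the vector reuse mentioned earlier, the sketch below unrolls the column loop by two (presumably the kind of blocking the nu factor used in the make commands below describes). The routine name is made up and, for brevity, only the $\beta=1$ (no BETA0) case is handled; the point is that each element of $Y$ is now loaded and stored once per two columns of $A$ rather than once per column:

#include "atlas_misc.h"  /* define TYPE macros */
/* Two-column register-blocked sketch of the beta=1 case: each sweep
 * over Y applies two columns of A, so Y is loaded and stored half as
 * often as in the simple one-column version above. */
void ATL_UGEMV_nu2(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda,
                   const TYPE *X, TYPE *Y)
{
   const TYPE *A0 = A, *A1 = A + lda;
   ATL_CINT lda2 = lda + lda;
   register int i, j;

   for (j=0; j < N-1; j += 2, A0 += lda2, A1 += lda2)
   {
      const register TYPE x0 = X[j], x1 = X[j+1];
      for (i=0; i != M; i++)
         Y[i] += A0[i]*x0 + A1[i]*x1;  /* one load/store of Y[i] per 2 cols */
   }
   if (N & 1)                          /* cleanup column for odd N */
   {
      const register TYPE x0 = X[N-1];
      for (i=0; i != M; i++)
         Y[i] += A0[i] * x0;
   }
}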

To time and test the mvn_k kernel, its implementation must be stored in the
OBJdir/tune/blas/gemv/MVNCASES directory. Assuming I saved the above implementation as mvn.c in that directory, I can test the kernel from the OBJdir/tune/blas/gemv directory with the command make <pre>mvnktest, where <pre> specifies the type/precision and is one of: s, d, c, or z.

This target uses the following make variables, which you can change to vary the type of testing done; numbers in parentheses are the default values that will be used if no command-line override is given:

Therefore, to test the $\beta=1.0$ case of double precision real:

>make dmvnktest mu=1 nu=1 mvnrout=mvn.c
.... bunch of compilation, etc ....
   TEST M=997, N=177, lda=1111, STARTED
   TEST M=997, N=177, be=0.00, lda=1111, incXY=1,1 PASSED

To test single precision $\beta=0.0$ with a different shape:

>make smvnktest mu=1 nu=1 mvnrout=mvn.c beta=0 Mt=801 Nt=55 ldat=1000
.... bunch of compilation, etc ....
   TEST M=801, N=55, lda=1000, STARTED
   TEST M=801, N=55, be=0.00, lda=1000, incXY=1,1 PASSED

Now we are ready to time this masterpiece. We have two choices of timer. The first simply calls your newly written kernel directly. It does no cache flushing at all: it initializes the operands (bringing them into the fastest level of the memory hierarchy into which they fit) and then times the operation. This is the timer to use when you want to preload the problem to a given level of cache and see how fast your kernel is in isolation. The make target for this methodology is <pre>mvnktime.
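
For example, to time the above kernel in isolation for double precision, the invocation should look something like the line below (this assumes mu, nu, and mvnrout are accepted here just as they are by the test and full-GEMV timing targets, as suggested below; problem-size macros can be added in the same way):

>make dmvnktime mu=1 nu=1 mvnrout=mvn.c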

The second make target calls ATLAS's GEMV driver that builds the full GEMV from the kernels. This timer does cache flushing, and is what you should use to estimate how fast the complete GEMV will be. This make target is <pre>mvntime.

Both of these targets take largely the same make macros:

Therefore, to time the probable speed of a complete GEMV of size 1000x32 while flushing 16MB of memory, I would issue:

>make dmvntime mu=1 nu=1 mvnrout=mvn.c M=1000 N=32 flushKB=16384
GEMV: M=1000, N=32, lda=1008, AF=[16,16,16], AM=[0,0,0], beta=1.000000e+00, alpha=1.000000e+00:
   M=1000, N=32, lda=1008, nreps=57, time=2.011856e-05, mflop=3230.85
   M=1000, N=32, lda=1008, nreps=57, time=2.011042e-05, mflop=3232.15
   M=1000, N=32, lda=1008, nreps=57, time=2.006722e-05, mflop=3239.11
NREPS=3, MAX=3239.11, MIN=3230.85, AVG=3234.04, MED=3232.15
