Testing and timing mvt_k

The API of this routine is given by:
void ATL_UGEMV(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda,
               const TYPE *X, TYPE *Y)
If the routine is compiled with the macro BETA0 defined, then it should perform the operation $y \leftarrow A^Tx$; if this macro is not defined, then it should perform $y \leftarrow A^Tx + y$. $A$ is a column-major $lda \times N$ contiguous array where $lda \ge M$. $M$ specifies both the number of rows of $A$ and the length of the vector $X$. $N$ provides the number of columns in $A$, and the length of the vector $Y$.

Since this is an $O(MN)$ operation and $A$ is $M \times N$ in size, this algorithm is dominated by the load of $A$. Each element of $A$ is used exactly once, so no reuse of $A$ is possible; the best we can do is reuse the vectors (through techniques like register and cache blocking).
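As a rough illustration of reusing a vector through register blocking, the sketch below processes two columns of $A$ per pass, so each element of $X$ is loaded once and used for two dot products. It is written for the $y \leftarrow A^Tx + y$ case only. The name mvt_nu2 and the typedef of TYPE to double are assumptions made so the example stands alone; a real kernel would get TYPE from atlas_misc.h and use the kernel API given above.

```c
#include <assert.h>

typedef double TYPE;  /* assumption: double precision; ATLAS defines TYPE itself */

/*
 * Hypothetical kernel (not part of ATLAS): y <- A^T x + y, where A is
 * column-major with M rows, N columns and leading dimension lda >= M.
 * Two columns are handled per pass so that X[i] is loaded only once
 * for every two multiply-adds.
 */
void mvt_nu2(int M, int N, const TYPE *A, int lda, const TYPE *X, TYPE *Y)
{
   const int N2 = (N >> 1) << 1;   /* largest even number <= N */
   int i, j;

   assert(M >= 0 && N >= 0 && lda >= M);
   for (j=0; j < N2; j += 2)
   {
      const TYPE *A0 = A + (long)j*lda;   /* column j   */
      const TYPE *A1 = A0 + lda;          /* column j+1 */
      TYPE y0 = Y[j], y1 = Y[j+1];
      for (i=0; i < M; i++)
      {
         const TYPE x0 = X[i];            /* reused for both columns */
         y0 += A0[i] * x0;
         y1 += A1[i] * x0;
      }
      Y[j] = y0;
      Y[j+1] = y1;
   }
   if (j < N)                             /* cleanup when N is odd */
   {
      const TYPE *A0 = A + (long)j*lda;
      TYPE y0 = Y[j];
      for (i=0; i < M; i++)
         y0 += A0[i] * X[i];
      Y[j] = y0;
   }
}
```

The accumulators y0 and y1 stay in registers across the inner loop, so $Y$ is read and written only once per column. Unrolling over more columns follows the same pattern, limited by the number of registers available.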

Each element of $Y$ can be computed as the dot product of the corresponding column of $A$ (that is, the corresponding row of $A^T$) with $X$, and this gives us the simplest implementation possible:

#include "atlas_misc.h"  /* define TYPE macros */
void ATL_UGEMV(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda,
               const TYPE *X, TYPE *Y)
{
   register int j, i;
   for (j=0; j < N; j++)
   {
      #ifdef BETA0
         register TYPE y0 = 0.0;
      #else
         register TYPE y0 = Y[j];
      #endif
      for (i=0; i < M; i++)
         y0 += A[i] * X[i];
      Y[j] = y0;
      A += lda;  /* done with this column */
   }
}

To time and test an mvt_k kernel, its implementation must be stored in the OBJdir/tune/blas/gemv/MVTCASES directory. Assuming I saved the above implementation to mvt.c in that directory, I can test the kernel from the OBJdir/tune/blas/gemv directory with the command make <pre>mvtktest, where <pre> specifies the type/precision and is one of s, d, c, or z.

This target uses the following make variables, which you can change to vary the type of testing done; the numbers in parentheses are the default values used if no command-line override is given:

Therefore, to test the $\beta=1.0$ case of double precision real:

>make dmvtktest mu=1 nu=1 mvtrout=mvt.c
.... bunch of compilation, etc ....
   TEST M=997, N=177, lda=1111, STARTED
   TEST M=997, N=177, be=1.00, lda=1111, incXY=1,1 PASSED
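To give a feel for what such a test involves, here is a minimal sketch of a correctness check for the $y \leftarrow A^Tx + y$ case. It is not ATLAS's tester: the names check_mvt, ref_mvt and bad_mvt, the poison value, and the tolerance are all assumptions chosen for illustration. It does show why testing with lda larger than M (as in the lda=1111, M=997 run above) is useful, since a kernel that strides by M rather than lda will then read the padding rows and fail.

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>

typedef double TYPE;   /* assumption: double precision real */
typedef void (*mvt_fn)(int M, int N, const TYPE *A, int lda,
                       const TYPE *X, TYPE *Y);

/* Straightforward reference: y <- A^T x + y, A column-major, stride lda */
void ref_mvt(int M, int N, const TYPE *A, int lda, const TYPE *X, TYPE *Y)
{
   int i, j;
   for (j=0; j < N; j++, A += lda)
   {
      TYPE dot = 0.0;
      for (i=0; i < M; i++)
         dot += A[i] * X[i];
      Y[j] += dot;
   }
}

/* Deliberately wrong kernel: steps by M instead of lda between columns */
void bad_mvt(int M, int N, const TYPE *A, int lda, const TYPE *X, TYPE *Y)
{
   (void)lda;
   ref_mvt(M, N, A, M, X, Y);
}

static TYPE rnd(void)   /* pseudo-random value in [-0.5, 0.5] */
{
   return (TYPE)rand() / RAND_MAX - 0.5;
}

/* Returns the number of mismatched elements of Y, or -1 if malloc fails */
int check_mvt(mvt_fn kern, int M, int N, int lda)
{
   const TYPE tol = DBL_EPSILON * M * (TYPE)M;   /* deliberately loose */
   TYPE *A, *X, *Y, *Yref;
   int i, j, nbad = 0;

   assert(M > 0 && N > 0 && lda >= M);
   A = malloc(sizeof(TYPE) * lda * N);
   X = malloc(sizeof(TYPE) * M);
   Y = malloc(sizeof(TYPE) * N);
   Yref = malloc(sizeof(TYPE) * N);
   if (A && X && Y && Yref)
   {
      for (j=0; j < N; j++)
         for (i=0; i < lda; i++)   /* rows M..lda-1 are padding: poison them */
            A[i + (size_t)j*lda] = (i < M) ? rnd() : 1.0e30;
      for (i=0; i < M; i++) X[i] = rnd();
      for (j=0; j < N; j++) Y[j] = Yref[j] = rnd();
      kern(M, N, A, lda, X, Y);
      ref_mvt(M, N, A, lda, X, Yref);
      for (j=0; j < N; j++)
      {
         const TYPE diff = Y[j] - Yref[j];
         if (diff > tol || diff < -tol) nbad++;
      }
   }
   else nbad = -1;
   free(A); free(X); free(Y); free(Yref);
   return nbad;
}
```

Calling check_mvt(ref_mvt, 997, 177, 1111) compares the reference against itself and so reports no mismatches, while check_mvt(bad_mvt, 7, 5, 11) should report at least one because the faulty kernel picks up the poisoned padding.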

We have two choices for timers. The first timer simply calls your newly-written kernel directly. It does no cache flushing at all: it initializes the operands (bringing them into the fastest level of the memory hierarchy into which they fit), and then times the operation. This is the timer to use when you want to preload the problem to a given level of cache and see how fast your kernel is in isolation. The make target for this methodology is <pre>mvtktime.

The second timer calls ATLAS's GEMV driver, which builds the full GEMV from the kernels. This timer does cache flushing, and is what you should use to estimate how fast the complete GEMV will be. The make target for this methodology is <pre>mvttime.
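The difference between timing with preloaded operands and timing with flushed caches can be sketched as follows. This is only one common way to get the effect, and I am not claiming it is how ATLAS's timers are implemented: the routine allocates enough independent copies of the operands to cover the requested flush size and cycles through them, so that by the time a copy is reused it has probably been evicted. With a flush size of zero there is a single copy, which the initialization loop leaves in cache. The names time_mvt and simple_mvt are hypothetical, and clock() is used only because it is standard C; its resolution is coarse, so a serious timer would use something better.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef double TYPE;   /* assumption: double precision real */
typedef void (*mvt_fn)(int M, int N, const TYPE *A, int lda,
                       const TYPE *X, TYPE *Y);

static volatile TYPE sink;   /* results are stored here so they stay observable */

/* Simple kernel to time: y <- A^T x + y */
void simple_mvt(int M, int N, const TYPE *A, int lda, const TYPE *X, TYPE *Y)
{
   int i, j;
   for (j=0; j < N; j++, A += lda)
   {
      TYPE y0 = Y[j];
      for (i=0; i < M; i++)
         y0 += A[i] * X[i];
      Y[j] = y0;
   }
}

/*
 * Returns the average CPU seconds per kernel call, or -1.0 if malloc fails.
 * flushKB == 0: one operand set, preloaded by the initialization loop.
 * flushKB  > 0: enough operand sets to span at least flushKB kilobytes.
 */
double time_mvt(mvt_fn kern, int M, int N, int lda, int nreps, size_t flushKB)
{
   size_t setlen, setbytes, nsets, k;
   TYPE *buf;
   clock_t t0, t1;
   int r;

   assert(M > 0 && N > 0 && lda >= M && nreps > 0);
   setlen = (size_t)lda*N + M + N;          /* elements in one (A, X, Y) set */
   setbytes = setlen * sizeof(TYPE);
   nsets = 1 + (flushKB * 1024) / setbytes;
   buf = malloc(nsets * setbytes);
   if (!buf) return -1.0;
   for (k=0; k < nsets*setlen; k++)
      buf[k] = 0.5;

   t0 = clock();
   for (r=0; r < nreps; r++)
   {
      TYPE *A = buf + ((size_t)r % nsets) * setlen;   /* next operand set */
      TYPE *X = A + (size_t)lda*N;
      TYPE *Y = X + M;
      kern(M, N, A, lda, X, Y);
   }
   t1 = clock();

   for (k=0; k < nsets; k++)                /* consume one result per set */
      sink = buf[k*setlen + (size_t)lda*N + M];
   free(buf);
   return (double)(t1 - t0) / CLOCKS_PER_SEC / nreps;
}
```

For example, time_mvt(simple_mvt, 1000, 32, 1008, 57, 0) times the in-cache case, while passing 16384 as the last argument cycles through roughly 16MB of operands.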

Both of these targets take largely the same make macros:

Therefore, to time the probable speed of a complete GEMV of size 1000x32 while flushing 16MB of memory, I would issue:

>make dmvttime mu=1 nu=1 mvtrout=mvt.c M=1000 N=32 flushKB=16384
GEMV: M=1000, N=32, lda=1008, AF=[16,16,16], AM=[0,0,0], beta=1.000000e+00, alpha=1.000000e+00:
   M=1000, N=32, lda=1008, nreps=57, time=2.809835e-05, mflop=2313.30
   M=1000, N=32, lda=1008, nreps=57, time=2.812956e-05, mflop=2310.74
   M=1000, N=32, lda=1008, nreps=57, time=2.807517e-05, mflop=2315.21
NREPS=3, MAX=2315.21, MIN=2310.74, AVG=2313.08, MED=2313.30

Clint Whaley 2012-07-10