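`void ATL_UGEMV(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda, const TYPE *X, TYPE *Y)`

If the routine is compiled with the macro `BETA0` defined, the kernel overwrites $Y$ ($y \leftarrow A^T x$); otherwise it accumulates into $Y$ ($y \leftarrow A^T x + y$).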

Since this is an $O(N^2)$ operation, and $A$ is $O(N^2)$ in size, this algorithm is dominated by the load of $A$. No reuse of $A$ is possible, so the best we can do is reuse the vectors (through operations like register and cache blocking).
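As a rough worked count, assuming an $M \times N$ matrix and one multiply plus one add per element of $A$:

$$
\text{flops} = 2MN, \qquad \text{elements touched} \ge MN + M + N, \qquad \frac{\text{flops}}{\text{elements touched}} \le \frac{2MN}{MN + M + N} < 2 .
$$

Each element of $A$ takes part in exactly one multiply-add, so in double precision (8 bytes per element) the kernel performs fewer than 0.25 flops per byte moved. Only the $M + N$ vector elements are candidates for reuse.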

Each element of $Y$ can be computed as a dot product of the corresponding row of $A^T$ (a column of $A$ as stored) with $X$, and this gives us the simplest implementation possible:

    #include "atlas_misc.h"  /* define TYPE macros */
    void ATL_UGEMV(ATL_CINT M, ATL_CINT N, const TYPE *A, ATL_CINT lda,
                   const TYPE *X, TYPE *Y)
    {
       register int j, i;
       for (j=0; j < N; j++)
       {
          #ifdef BETA0
             register TYPE y0 = 0.0;
          #else
             register TYPE y0 = Y[j];
          #endif
          for (i=0; i < M; i++)
             y0 += A[i] * X[i];
          Y[j] = y0;
          A += lda;  /* done with this column */
       }
    }
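To illustrate what reusing the vectors through register blocking can look like, here is a minimal sketch that processes two columns per pass, so each element of X is loaded once and used for two multiply-adds. The names `gemvt_nu1` and `gemvt_nu2` are hypothetical, `TYPE` is fixed to `double` so the sketch stands alone without `atlas_misc.h`, and only the accumulate (non-`BETA0`) case is shown. It is not taken from ATLAS itself.

```c
#include <assert.h>

#define TYPE double   /* assumption: stand-in for the TYPE macro in atlas_misc.h */

/* One dot product per column, as in the simple kernel above (Y is accumulated). */
static void gemvt_nu1(int M, int N, const TYPE *A, int lda,
                      const TYPE *X, TYPE *Y)
{
    int i, j;
    assert(lda >= M);
    for (j = 0; j < N; j++, A += lda) {
        TYPE y0 = Y[j];
        for (i = 0; i < M; i++)
            y0 += A[i] * X[i];
        Y[j] = y0;
    }
}

/* Hypothetical variant: two columns per pass, so each X[i] is loaded once
 * and reused for two multiply-adds. */
static void gemvt_nu2(int M, int N, const TYPE *A, int lda,
                      const TYPE *X, TYPE *Y)
{
    const int N2 = N & ~1;   /* largest even number <= N */
    int i, j;
    assert(lda >= M);
    for (j = 0; j < N2; j += 2, A += 2 * lda) {
        const TYPE *A0 = A, *A1 = A + lda;   /* columns j and j+1 */
        TYPE y0 = Y[j], y1 = Y[j + 1];
        for (i = 0; i < M; i++) {
            const TYPE x0 = X[i];            /* single load, used twice */
            y0 += A0[i] * x0;
            y1 += A1[i] * x0;
        }
        Y[j] = y0;
        Y[j + 1] = y1;
    }
    if (j < N) {   /* cleanup column when N is odd */
        TYPE y0 = Y[j];
        for (i = 0; i < M; i++)
            y0 += A[i] * X[i];
        Y[j] = y0;
    }
}
```

Both routines produce the same Y. The second roughly halves the number of loads of X per flop, at the cost of a cleanup path for odd N.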

To time and test a `mvt_k` kernel, its implementation must be stored in
the `OBJdir/tune/blas/gemv/MVTCASES` directory. Assuming I saved the
above implementation to `mvt.c` in that directory, I can test
the kernel from the `OBJdir/tune/blas/gemv` directory with the
command `make <pre>mvtktest`, where `<pre>` specifies the
type/precision and is one of: `s`, `d`, `c`, or `z`.

This target uses the following `make` variables, which you can change
to vary the type of testing done; numbers in parentheses are the default values
used if no command-line override is given:

`mu` (1): unrolling on the M dimension

`nu` (1): unrolling on the N dimension

`mvtrout`: the filename of your kernel implementation

`Mt` (297): number of rows of $A$ (elements of $X$)

`Nt` (177): number of columns of $A$ (elements of $Y$)

`ldat` (`Mt`): leading dimension for the matrix $A$

`align` (`-Fx 16 -Fy 16 -Fa 16`): alignments that the matrix and vectors must adhere to

`<pre>MVCC` (ATLAS default compiler): compiler required to compile your kernel

`<pre>MVCFLAGS` (ATLAS default flags): compiler flags required to compile your kernel

`beta` (1): 0 or 1 (scale of $Y$)

Therefore, to test the case of double precision real:

    >make dmvtktest mu=1 nu=1 mvtrout=mvt.c
    .... bunch of compilation, etc ....
    TEST M=997, N=177, lda=1111, STARTED
    TEST M=997, N=177, be=1.00, lda=1111, incXY=1,1 PASSED
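The internals of the ATLAS tester are not described here, but the kind of check it performs can be sketched as follows: run the kernel, recompute each element of Y with plain column-major indexing, and count mismatches. Everything in this sketch (`gemvt_fn`, `check_gemvt`, the fill values, the poison value in the padding rows) is hypothetical and only illustrates the idea. Poisoning rows `M..lda-1` makes a kernel that strides by M instead of `lda` fail visibly.

```c
#include <stdlib.h>

typedef void (*gemvt_fn)(int M, int N, const double *A, int lda,
                         const double *X, double *Y);

/* Correct kernel: Y[j] += dot(column j of A, X). */
static void naive_gemvt(int M, int N, const double *A, int lda,
                        const double *X, double *Y)
{
    int i, j;
    for (j = 0; j < N; j++, A += lda) {
        double y0 = Y[j];
        for (i = 0; i < M; i++)
            y0 += A[i] * X[i];
        Y[j] = y0;
    }
}

/* Deliberately wrong kernel: advances by M instead of lda. */
static void bad_stride_gemvt(int M, int N, const double *A, int lda,
                             const double *X, double *Y)
{
    int i, j;
    (void)lda;   /* the bug: lda is ignored */
    for (j = 0; j < N; j++, A += M) {
        double y0 = Y[j];
        for (i = 0; i < M; i++)
            y0 += A[i] * X[i];
        Y[j] = y0;
    }
}

/* Count entries of Y that differ from an index-based reference by more than
 * tol. Requires M, N > 0 and lda >= M; returns -1 if allocation fails. */
static int check_gemvt(gemvt_fn kern, int M, int N, int lda, double tol)
{
    double *A = malloc(sizeof *A * (size_t)lda * (size_t)N);
    double *X = malloc(sizeof *X * (size_t)M);
    double *Y = malloc(sizeof *Y * (size_t)N);
    int i, j, bad = -1;

    if (A && X && Y) {
        for (i = 0; i < M; i++)
            X[i] = 1.0 + 0.5 * i;
        for (j = 0; j < N; j++) {
            for (i = 0; i < lda; i++)   /* rows >= M are padding: poison them */
                A[i + (size_t)j * lda] = (i < M) ? 0.25 * ((i + 3 * j) % 7) : 1.0e30;
            Y[j] = (double)j;
        }
        kern(M, N, A, lda, X, Y);
        bad = 0;
        for (j = 0; j < N; j++) {
            double ref = (double)j, d;
            for (i = 0; i < M; i++)
                ref += A[i + (size_t)j * lda] * X[i];
            d = Y[j] - ref;
            if (d < 0.0) d = -d;
            if (!(d <= tol)) bad++;     /* also counts NaN as a mismatch */
        }
    }
    free(A); free(X); free(Y);
    return bad;
}
```

A call such as `check_gemvt(naive_gemvt, 13, 7, 17, 1e-12)` returns 0, while the bad-stride kernel is flagged. The test run above uses `lda=1111` with `M=997`, so the padding between columns is exercised in the same way.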

We have two choices for timers.
The first simply calls your newly written kernel directly.
It does no cache flushing at all: it initializes the operands
(bringing them into the fastest level of the memory hierarchy into which they fit),
and times the operations. This is the timer to use when you want to preload
the problem to a given level of cache and see how fast your kernel is in
isolation. The make target for this methodology is `<pre>mvtktime`.
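The no-flush methodology can be sketched as a warm timing loop: touch the operands once, then time repeated back-to-back calls. This is an illustration only, not the ATLAS timer; `gemvt_fn`, `naive_gemvt`, and `time_warm` are hypothetical names, and `clock()` is used merely because it is portable.

```c
#include <assert.h>
#include <time.h>

typedef void (*gemvt_fn)(int M, int N, const double *A, int lda,
                         const double *X, double *Y);

/* Plain dot-product kernel (Y is accumulated), standing in for the kernel
 * under test. */
static void naive_gemvt(int M, int N, const double *A, int lda,
                        const double *X, double *Y)
{
    int i, j;
    for (j = 0; j < N; j++, A += lda) {
        double y0 = Y[j];
        for (i = 0; i < M; i++)
            y0 += A[i] * X[i];
        Y[j] = y0;
    }
}

/* Average CPU seconds per call over nreps back-to-back calls, with no cache
 * flushing. Note that Y keeps accumulating across calls. */
static double time_warm(gemvt_fn kern, int M, int N, const double *A, int lda,
                        const double *X, double *Y, int nreps)
{
    clock_t t0, t1;
    int r;
    assert(nreps > 0 && lda >= M);
    kern(M, N, A, lda, X, Y);   /* untimed call: pulls operands into cache */
    t0 = clock();
    for (r = 0; r < nreps; r++)
        kern(M, N, A, lda, X, Y);
    t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC / nreps;
}
```

If the operands fit in cache, every timed call after the first finds them already resident, which is why this style of timing shows the kernel in isolation rather than the speed of a complete GEMV.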

The second make target calls ATLAS's GEMV driver, which builds the full
GEMV from the kernels. This timer does cache flushing, and is what you
should use to estimate how fast the complete GEMV will be. This make target
is `<pre>mvttime`.
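How ATLAS flushes the cache is not described in this section. A common generic approach, sketched here under that caveat, is to sweep a scratch buffer larger than the cache between timed calls so that the operands are likely evicted. The function name `flush_cache` is hypothetical; its argument mirrors the kilobyte units of a `flushKB`-style setting.

```c
#include <stdlib.h>

/* Write and then read flushKB kilobytes of scratch memory so that previously
 * cached data is likely evicted. Returns the sum of the buffer so the
 * compiler cannot discard the reads, or -1.0 if allocation fails. */
static double flush_cache(size_t flushKB)
{
    const size_t n = flushKB * 1024 / sizeof(double);
    double *buf = malloc(n * sizeof *buf);
    double sum = 0.0;
    size_t i;

    if (!buf)
        return -1.0;
    for (i = 0; i < n; i++)   /* write sweep */
        buf[i] = 1.0;
    for (i = 0; i < n; i++)   /* read sweep */
        sum += buf[i];
    free(buf);
    return sum;
}
```

Calling something like `flush_cache(16384)` before each timed invocation would sweep 16MB, so the kernel must fetch A, X, and Y from memory again, which is closer to how a complete GEMV behaves on data that is not already cached.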

Both of these targets take largely the same make macros:

`mu` (1): unrolling on the M dimension

`nu` (1): unrolling on the N dimension

`mvtrout`: the filename of your kernel implementation

`M` (1000): number of rows of $A$ (elements of $X$)

`N` (1000): number of columns of $A$ (elements of $Y$)

`lda` (`M`): leading dimension for the matrix $A$

`align` (`-Fx 16 -Fy 16 -Fa 16`): alignments that the matrix and vectors must adhere to

`<pre>MVCC` (ATLAS default compiler): compiler required to compile your kernel

`<pre>MVCFLAGS` (ATLAS default flags): compiler flags required to compile your kernel; pass "-x assembler-with-cpp" (along with any architecture-specific flags) if your kernel is written in assembly (assuming your compiler is `gcc`)

`beta` (1): 0 or 1 (scale of $Y$)

`flushKB`: for `<pre>mvttime`, kilobytes of memory to flush

Therefore, to time the probable speed of a complete GEMV of size 1000x32 while flushing 16MB of memory, I would issue:

    >make dmvttime mu=1 nu=1 mvtrout=mvt.c M=1000 N=32 flushKB=16384
    GEMV: M=1000, N=32, lda=1008, AF=[16,16,16], AM=[0,0,0], beta=1.000000e+00, alpha=1.000000e+00:
       M=1000, N=32, lda=1008, nreps=57, time=2.809835e-05, mflop=2313.30
       M=1000, N=32, lda=1008, nreps=57, time=2.812956e-05, mflop=2310.74
       M=1000, N=32, lda=1008, nreps=57, time=2.807517e-05, mflop=2315.21
    NREPS=3, MAX=2315.21, MIN=2310.74, AVG=2313.08, MED=2313.30
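To relate the `time` and `mflop` columns, the helper below converts a flop count and a per-call time into MFLOPS. The flop count the timer uses is not stated here. Multiplying each reported `mflop` by its `time` gives about 65000 flops per call, slightly more than 2·M·N = 64000, so the value 65000 is an inference from the output above rather than a documented formula. The function name `mflops` is hypothetical.

```c
/* Millions of floating-point operations per second, given a flop count and
 * the elapsed time in seconds for one call. */
static double mflops(double flops, double seconds)
{
    return flops / seconds * 1.0e-6;
}
```

With the inferred count, `mflops(65000.0, 2.809835e-05)` agrees with the first reported value of 2313.30 to within rounding, and the other two lines match in the same way.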

Clint Whaley 2012-07-10