Testing and timing `ger_k`

The API of ger_k is given by:

void ATL_UGERK(ATL_CINT M, ATL_CINT N, const TYPE *X, const TYPE *Y, 
               TYPE *A, ATL_CINT lda)

This kernel performs the outer product operation $A \leftarrow xy^T + A$.

$A$ is a column-major $lda \times N$ contiguous array where $lda \ge M$. $M$ specifies both the number of rows of $A$ and the length of the vector $X$. $N$ provides the number of columns in $A$, and the length of the vector $Y$.

Since this operation performs only $O(MN)$ work, and $A$ is $M \times N$ in size, this algorithm is dominated by the load of $A$. No reuse of $A$ is possible, so the best we can do is reuse the vectors (through operations like register and cache blocking).

The simplest implementation of this operation performs an axpy operation for each column of the matrix:

#include "atlas_misc.h"  /* define TYPE macros */
void ATL_UGERK(ATL_CINT M, ATL_CINT N, const TYPE *X, const TYPE *Y, 
               TYPE *A, ATL_CINT lda)
{
   register int j, i;
   
   for (j=0; j < N; j++)
   {
      const register TYPE y0 = Y[j];
      for (i=0; i < M; i++)
         A[i] += X[i] * y0;
      A += lda;  /* done with this column */
   }
}
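As a rough illustration of the register blocking mentioned above, the sketch below handles two columns per pass, so each element of X is loaded once and used for two updates. This corresponds to what the nu variable describes (nu=2 here). The name ger_k_nu2 is hypothetical, and the TYPE and ATL_CINT definitions are stand-ins assumed so the example compiles without atlas_misc.h; it is a sketch under those assumptions, not a tuned kernel.

```c
/* Stand-ins for what atlas_misc.h would provide; assumed here so the
   sketch compiles on its own. */
#define TYPE double
#define ATL_CINT const int

/* Hypothetical name: inside ATLAS the kernel would be named ATL_UGERK. */
void ger_k_nu2(ATL_CINT M, ATL_CINT N, const TYPE *X, const TYPE *Y,
               TYPE *A, ATL_CINT lda)
{
   ATL_CINT N2 = N & ~1;   /* largest even number <= N (N assumed >= 0) */
   int i, j;

   for (j=0; j < N2; j += 2)
   {
      const TYPE y0 = Y[j], y1 = Y[j+1];
      TYPE *A0 = A, *A1 = A + lda;   /* columns j and j+1 */
      for (i=0; i < M; i++)
      {
         const TYPE x0 = X[i];       /* loaded once, used twice */
         A0[i] += x0 * y0;
         A1[i] += x0 * y1;
      }
      A += 2*lda;  /* done with these two columns */
   }
   if (j < N)      /* one leftover column when N is odd */
   {
      const TYPE y0 = Y[j];
      for (i=0; i < M; i++)
         A[i] += X[i] * y0;
   }
}
```

The cleanup loop keeps the routine correct for any N. Every element of A is still loaded and stored exactly once; only the traffic for X is halved.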

To time and test a ger_k kernel, its implementation must be stored in the OBJdir/tune/blas/ger/R1CASES directory. Assuming I saved the above implementation to ger.c in that directory, I can test the kernel from the OBJdir/tune/blas/ger directory with the command make <pre>r1ktest, where <pre> specifies the type/precision and is one of: s, d, c, or z.

This target uses the following make variables, which you can change to vary the type of testing done; the numbers in parentheses are the default values used if no command-line override is given:

mu (1): Unrolling on M dimension
nu (1): Unrolling on N dimension
r1rout: the filename of your kernel implementation
Mt (297): Number of rows of A (elements of X)
Nt (177): Number of columns of A (elements of Y)
ldat (Mt): leading dimension for the matrix A
align (-Fx 16 -Fy 16 -Fa 16): alignments that matrix and vectors must adhere to.
<pre>R1CC (ATLAS default compiler) : compiler required to compile your kernel
<pre>R1CFLAGS (ATLAS default flags) : compiler flags required to compile your kernel

To test this kernel with a 511x220 matrix, I would issue:

>make dr1ktest mu=1 nu=1 r1rout=ger.c Mt=511 Nt=220
.... bunch of compilation, etc ....
   TEST CONJ=0, M=511, N=220, lda=511, incY=1, STARTED
   TEST CONJ=0, M=511, N=220, lda=511, incY=1, PASSED

We have two choices for timers. The first simply calls your newly-written kernel directly. It does no cache flushing at all: it initializes the operands (bringing them into the fastest level of the memory hierarchy into which they fit), and times the operations. This is the timer to use when you want to preload the problem to a given level of cache and see how fast your kernel is in isolation. The make target for this methodology is <pre>r1ktime.

The second make target calls ATLAS's GER driver that builds the full GER from the kernels. This timer does cache flushing, and is what you should use to estimate how fast the complete GER will be. This make target is <pre>r1time.
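To make the difference between the two methodologies concrete, here is a minimal sketch of timing a kernel with and without a flush step. It is not ATLAS's timer: simple_ger, flush_cache, and time_ger are hypothetical names, double stands in for TYPE, and clock() is assumed to be an adequate timer for illustration.

```c
#include <stdlib.h>
#include <time.h>

/* Plain kernel with the same argument order as ATL_UGERK, using double
   in place of TYPE so the sketch compiles without atlas_misc.h. */
void simple_ger(const int M, const int N, const double *X, const double *Y,
                double *A, const int lda)
{
   int i, j;
   for (j=0; j < N; j++, A += lda)
      for (i=0; i < M; i++)
         A[i] += X[i] * Y[j];
}

/* Hypothetical helper: write and read flushKB kilobytes so previously
   cached operands are likely evicted.  The buffer is volatile so the
   accesses are really performed.  Returns the number of doubles touched
   (as a double), or -1.0 if allocation fails. */
double flush_cache(size_t flushKB)
{
   const size_t n = flushKB * 1024 / sizeof(double);
   volatile double *buf;
   double sum = 0.0;
   size_t i;

   if (n == 0)
      return 0.0;
   buf = malloc(n * sizeof(double));
   if (!buf)
      return -1.0;
   for (i=0; i < n; i++) buf[i] = 1.0;
   for (i=0; i < n; i++) sum += buf[i];
   free((void *)buf);
   return sum;
}

typedef void (*ger_fn)(const int, const int, const double *,
                       const double *, double *, const int);

/* Time one call of ger; flushKB=0 mimics the no-flush kernel timer,
   a nonzero value mimics flushing before the timed call. */
double time_ger(ger_fn ger, int M, int N, const double *X, const double *Y,
                double *A, int lda, size_t flushKB)
{
   clock_t t0, t1;

   if (flushKB)
      flush_cache(flushKB);
   t0 = clock();
   ger(M, N, X, Y, A, lda);
   t1 = clock();
   return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

With flushKB=0 the operands initialized just before the call may still be in cache, which is the situation the kernel timer measures; with a flush larger than the last-level cache, the timed call has to fetch A, X, and Y from memory.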

Both of these targets take largely the same make macros:

mu (1): Unrolling on M dimension
nu (1): Unrolling on N dimension
r1rout: the filename of your kernel implementation
M (1000): Number of rows of A (elements of X)
N (1000): Number of columns of A (elements of Y)
lda (M): leading dimension for the matrix A
align (-Fx 16 -Fy 16 -Fa 16): alignments that matrix and vectors must adhere to.
<pre>R1CC (ATLAS default compiler) : compiler required to compile your kernel
<pre>R1CFLAGS (ATLAS default flags) : compiler flags required to compile your kernel; pass "-x assembler-with-cpp" (along with any architecture-specific flags) if your kernel is written in assembly (assuming your compiler is gcc).
beta (1) : 0 or 1 (scale of )
flushKB : for <pre>r1time, kilobytes of memory to flush.

Therefore, to time the probable speed of a complete GER of size 800x1000 while flushing 16MB of memory, I would issue:

>make dr1time mu=1 nu=1 r1rout=ger.c M=800 N=1000 flushKB=16384
.... bunch of compilation, etc ....
GER1: M=800, N=1000, lda=800, AF=[16,16,16], AM=[0,0,0], alpha=1.000000e+00:
   M=800, N=1000, lda=800, nreps=3, time=8.589835e-04, mflop=1863.60
   M=800, N=1000, lda=800, nreps=3, time=8.340690e-04, mflop=1919.27
   M=800, N=1000, lda=800, nreps=3, time=8.746753e-04, mflop=1830.16
NREPS=3, MAX=1919.27, MIN=1830.16, AVG=1871.01, MED=1863.60
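As a sanity check on the mflop column, the reported figures are consistent with a flop count of $2MN + M$ divided by the measured time. That count is inferred from the numbers shown here, not taken from the timer's source, and ger_mflops is a hypothetical helper written only to reproduce the arithmetic.

```c
/* Hypothetical helper: MFLOPS for one GER call, assuming a flop count of
   2*M*N + M (inferred from the timer output, not from ATLAS's source). */
double ger_mflops(int M, int N, double seconds)
{
   const double flops = 2.0*M*N + M;
   return flops / seconds / 1.0e6;
}
```

For the first sample above, ger_mflops(800, 1000, 8.589835e-04) evaluates to roughly 1863.6, matching the reported value.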

To time the kernel alone without cache flushing:

>make dr1ktime mu=1 nu=1 r1rout=ger.c M=800 N=1000
.... bunch of compilation, etc ....
GER1: M=800, N=1000, lda=800, AF=[16,16,16], AM=[0,0,0], alpha=1.000000e+00:
   M=800, N=1000, lda=800, nreps=3, time=6.657494e-04, mflop=2404.51
   M=800, N=1000, lda=800, nreps=3, time=6.664957e-04, mflop=2401.82
   M=800, N=1000, lda=800, nreps=3, time=6.842208e-04, mflop=2339.60
NREPS=3, MAX=2404.51, MIN=2339.60, AVG=2381.97, MED=2401.82

Clint Whaley 2012-07-10
