Writing a GEMV kernel

There are several assumptions that need to hold true for a user-supplied GEMV primitive. First, the loop ordering must be that implied by the <flag> setting the user supplies in the primitive description file, as discussed in Section [*]. Each primitive makes assumptions about the arguments it handles, and these assumptions are reflected in the routine name. The function name of a GEMV primitive is:

   ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1
where:

For a given gemv primitive (either NoTranspose or Transpose), if the cpp macro Conj_ is defined we want the conjugate form of that transpose setting (i.e., Nc or C).

Each file is further compiled with differing cpp settings to generate the various beta cases. The beta macro settings and their meanings are:

CPP MACRO MEANING
BETA0 Primitive should provide $y \leftarrow Ax$
BETA1 Primitive should provide $y \leftarrow y + A x$
BETAX Primitive should provide $y \leftarrow \beta y + A x$
BETAXI0 For complex only, primitive should provide $y \leftarrow \beta y + A x$,
  where the imaginary component of beta is zero.

In terms of the BLAS API, the GEMV kernels additionally assume

Higher level ATLAS routines ensure these assumptions are true before calling the primitive.

Therefore, the routine:

   ATL_dgemvN_a1_x1_b0_y1
supplies a primitive doing notranspose gemv, on a column-major array with $\alpha = 1$, $\beta = 0$, incX = 1 and incY = 1. while:
   ATL_cgemvNc_a1_x1_bXi0_y1:
supplies a primitive doing notranspose gemv, on a column-major array whose elements should be conjugated before the multiplication, with $\alpha = 1$, incX = 1, incY = 1, and $\beta$ whose real component is unknown, but whose imaginary component is known to be zero.

For greater understanding of how these CPP macros are used to compile multiple primitives from one file, examine the provided CASES files.

The API of the primitive is:

   ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1
   (
      const int M,       /* length of Y vector */
      const int N,       /* length of X vector */
      const SCALAR alpha,/* ignored, assumed to be one */
      const TYPE *A,     /* pointer to column-major matrix */
      const int lda,     /* leading dimension of A, or row-stride */
      const TYPE *X,     /* vector to multiply A by */
      const int incX,    /* ignored, assumed to be one */
      const SCALAR beta, /* value of beta */
      TYPE *Y,           /* output vector */
      const int incY     /* ignored, assumed to be one */
   );

where,
<pre> : s d c z
SCALAR float double float* float*
TYPE float double float float

Note that the meaning of M and N are slightly different than that used by the Fortran77 API, in that they give the vector lengths, not array dimensions.

Clint Whaley 2012-07-10