### Writing a GEMV kernel

There are several assumptions that need to hold true for a user-supplied GEMV primitive. First, the loop ordering must be that implied by the <flag> setting the user supplies in the primitive description file, as discussed in Section . Each primitive makes assumptions about the arguments it handles, and these assumptions are reflected in the routine name. The function name of a GEMV primitive is:

   ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1

where:
• <pre> is replaced by the precision prefix: s, d, c, or z.
• <Trans> is replaced by transpose specifier:
• N : NoTranspose
• T : Transpose
• C : Conjugate Transpose
• Nc : NoTranspose, with conjugation
• <betanam> is replaced by the beta specifier this kernel supplies. All GEMV kernels must supply the following beta specifiers and names:
• b0 :
• b1 :
• bXi0 : for complex GEMV only, specifies when and , but the imaginary component is zero.
• bX : beta is a input variable without known characteristics.

For a given gemv primitive (either NoTranspose or Transpose), if the cpp macro Conj_ is defined we want the conjugate form of that transpose setting (i.e., Nc or C).

Each file is further compiled with differing cpp settings to generate the various beta cases. The beta macro settings and their meanings are:

 CPP MACRO MEANING BETA0 Primitive should provide BETA1 Primitive should provide BETAX Primitive should provide BETAXI0 For complex only, primitive should provide , where the imaginary component of beta is zero.

In terms of the BLAS API, the GEMV kernels additionally assume

• incX = 1
• incY = 1
• Column-major storage of A

Higher level ATLAS routines ensure these assumptions are true before calling the primitive.

Therefore, the routine:

   ATL_dgemvN_a1_x1_b0_y1

supplies a primitive doing notranspose gemv, on a column-major array with , , incX = 1 and incY = 1. while:
   ATL_cgemvNc_a1_x1_bXi0_y1:

supplies a primitive doing notranspose gemv, on a column-major array whose elements should be conjugated before the multiplication, with , incX = 1, incY = 1, and whose real component is unknown, but whose imaginary component is known to be zero.

For greater understanding of how these CPP macros are used to compile multiple primitives from one file, examine the provided CASES files.

The API of the primitive is:

   ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1
(
const int M,       /* length of Y vector */
const int N,       /* length of X vector */
const SCALAR alpha,/* ignored, assumed to be one */
const TYPE *A,     /* pointer to column-major matrix */
const int lda,     /* leading dimension of A, or row-stride */
const TYPE *X,     /* vector to multiply A by */
const int incX,    /* ignored, assumed to be one */
const SCALAR beta, /* value of beta */
TYPE *Y,           /* output vector */
const int incY     /* ignored, assumed to be one */
);


where,
 
:  s      d       c       z
SCALAR             float  double  float*  float*
TYPE               float  double  float   float

Note that the meaning of M and N are slightly different than that used by the Fortran77 API, in that they give the vector lengths, not array dimensions.

Clint Whaley 2012-07-10