The Kernel Description File

The most important file in ATLAS/tune/blas/gemv/CASES is the primitive description file, <pre>cases.dsc. Each precision has its own description file (as indicated by the precision prefix <pre>), and this file lists every routine that the search should time in order to find the best one. For instance, for double precision, we see:

speedy. cat CASES/dcases.dsc
9
  1  8  0  0 ATL_gemvN_mm.c     "R. Clint Whaley"
  2  0  1  1 ATL_gemvN_1x1_1.c  "R. Clint Whaley"
  3 16 32  1 ATL_gemvN_1x1_1a.c "R. Clint Whaley"
  4  0  4  2 ATL_gemvN_4x2_0.c  "R. Clint Whaley"
  5  0  4  4 ATL_gemvN_4x4_1.c  "R. Clint Whaley"
  6  0  8  4 ATL_gemvN_8x4_1.c  "R. Clint Whaley"
  7  0 16  2 ATL_gemvN_16x2_1.c "R. Clint Whaley"
  8  0 16  4 ATL_gemvN_16x4_1.c "R. Clint Whaley"
  9 16 32  4 ATL_gemvN_32x4_1.c "R. Clint Whaley"
6
101 8  0  0 ATL_gemvT_mm.c      "R. Clint Whaley"
102 0  2  8 ATL_gemvT_2x8_0.c   "R. Clint Whaley"
103 0  4  8 ATL_gemvT_4x8_1.c   "R. Clint Whaley"
104 0  4 16 ATL_gemvT_4x16_1.c  "R. Clint Whaley"
105 0  2 16 ATL_gemvT_2x16_1.c  "R. Clint Whaley"
106 0  1  1 ATL_gemvT_1x1_1.c   "R. Clint Whaley"

The first number (in this case 9) is the number of NoTranspose primitives to time. It is followed by that many primitive lines describing the NoTranspose primitives. Next comes the number of Transpose primitives to time (in this example, 6), followed by that many primitive lines describing the Transpose primitives.
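To make the two-section layout concrete, here is a minimal C sketch that walks a description file and recovers the two counts. The function name read_section_counts and the 512-byte line buffer are assumptions made for illustration; this is not the reader that the ATLAS search actually uses.

```c
#include <stdio.h>

/*
 * Hypothetical reader for the layout described above:
 *   <number of NoTranspose primitives>
 *   ...that many primitive lines...
 *   <number of Transpose primitives>
 *   ...that many primitive lines...
 * Returns 0 on success and -1 if the file ends early or a count is missing.
 * Assumes no line is longer than 511 characters.
 */
int read_section_counts(FILE *fp, int *n_notrans, int *n_trans)
{
    char line[512];
    int i;

    if (!fgets(line, sizeof line, fp) || sscanf(line, "%d", n_notrans) != 1)
        return -1;
    for (i = 0; i < *n_notrans; i++)      /* skip the NoTranspose lines */
        if (!fgets(line, sizeof line, fp))
            return -1;

    if (!fgets(line, sizeof line, fp) || sscanf(line, "%d", n_trans) != 1)
        return -1;
    for (i = 0; i < *n_trans; i++)        /* skip the Transpose lines */
        if (!fgets(line, sizeof line, fp))
            return -1;

    return 0;
}
```

Called on a file laid out as described for double precision, this sketch would report 9 NoTranspose and 6 Transpose primitives.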

As you can see, each line supplies four integers and a filename to the search routine, followed by a quoted author string. The filename names the source file of the primitive to time. The first integer provides a unique integer ID (which must be greater than zero) for each primitive line, and the other three supply the information that the higher level routines need in order to do blocking.

This is the first piece of important information about these primitive routines: no blocking should be done in them. The appropriate blocking is done by higher level ATLAS routines. Most primitives employ some kind of loop unrolling, and when these higher level routines block in order to reuse vectors or matrices, it is important that this blocking does not conflict with the primitives' unrolling factors. For instance, if the primitive unrolls a given dimension by 8, but ATLAS blocks that dimension to 3, ATLAS would always call the cleanup code. The three remaining integers convey the information needed to avoid such conflicts.
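The conflict between blocking and unrolling can be illustrated with a small C sketch. Both helpers below are hypothetical and are not ATLAS's blocking logic; they only show why a block size should be a multiple of the unrolling factor.

```c
/*
 * Hypothetical helper, not part of ATLAS: round a candidate block size
 * down to the nearest multiple of a primitive's unrolling factor.
 * An unroll of 0 or 1 is treated as placing no constraint on the block.
 */
int align_block_to_unroll(int block, int unroll)
{
    if (unroll <= 1)
        return block;
    if (block < unroll)
        return unroll;      /* smallest block the unrolled loop can handle */
    return (block / unroll) * unroll;
}

/*
 * Iterations that fall through to cleanup code when a block of the
 * given size is processed by a loop unrolled by the given factor.
 */
int cleanup_iterations(int block, int unroll)
{
    return (unroll <= 1) ? 0 : block % unroll;
}
```

With an unrolling of 8 and a block of 3, cleanup_iterations returns 3: every iteration of the block is handled by cleanup code, which is the situation described above.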

The form of a GEMV primitive line is:

<ID> <flag> <Yunroll> <Xunroll> <filename> "<author(s)>"

As mentioned previously, <filename> is the primitive source file. <Yunroll> is the unrolling used for the loop that loops over the Y vector, and <Xunroll> is the unrolling used for the loop that loops over the X vector. <flag> is a less obvious parameter which is used to tell the search script about special properties of a kernel.
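As an illustration of the line format, the following C sketch parses one primitive line into a record. The struct gemv_case, its field names, and the function parse_gemv_case are hypothetical names chosen to mirror the placeholders above; the real search code may read these lines differently.

```c
#include <stdio.h>

/* Hypothetical record for one primitive line (not an ATLAS type). */
struct gemv_case {
    int  id;             /* <ID>, must be greater than zero   */
    int  flag;           /* <flag>                            */
    int  yunroll;        /* <Yunroll>                         */
    int  xunroll;        /* <Xunroll>                         */
    char filename[256];  /* <filename>, no embedded spaces    */
    char author[256];    /* <author(s)>, without the quotes   */
};

/*
 * Parse one line of the form
 *   <ID> <flag> <Yunroll> <Xunroll> <filename> "<author(s)>"
 * Returns 1 on success and 0 on malformed input. This sketch rejects
 * an empty author string, since %[ must match at least one character.
 */
int parse_gemv_case(const char *line, struct gemv_case *c)
{
    int n = sscanf(line, " %d %d %d %d %255s \"%255[^\"]\"",
                   &c->id, &c->flag, &c->yunroll, &c->xunroll,
                   c->filename, c->author);
    return n == 6 && c->id > 0;
}
```

Note that a bare count line such as the one that opens each section does not match this format, so parse_gemv_case returns 0 for it.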

It is assumed that the user has supplied an "inner-product" based GEMV implementation (i.e., an implementation which basically does <Yunroll> simultaneous dot products). This default state is expressed to the search by a <flag> value of 0. However, since the inner-product formulation of NoTranspose GEMV loops across the non-contiguous dimension of the matrix, some architectures need to employ an "outer-product" based NoTranspose GEMV (i.e., a GEMV which is performed by doing <Xunroll> simultaneous AXPYs). This is indicated by a <flag> value of 16. Finally, since ATLAS's GEMM has a code generator which allows it to achieve very good portable performance, it is always worth seeing how good a GEMV can be obtained by simply making the appropriate call to GEMM. A <flag> of 8 indicates that this is what the kernel is doing.
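The difference between the two formulations is easiest to see in code. The C sketch below gives unrolled-by-one reference loops for a NoTranspose GEMV, under two assumptions that are not stated in the text: the matrix is stored column-major with leading dimension lda, and the operation is simplified to y += A*x with no scaling factors. The function names are illustrative and are not the signatures of ATLAS kernels.

```c
#include <stddef.h>

/*
 * Inner-product form: one dot product per element of y.
 * With column-major storage, the inner loop walks along a row of A,
 * so consecutive accesses are lda elements apart (non-contiguous).
 */
void gemvN_inner(int M, int N, const double *A, int lda,
                 const double *x, double *y)
{
    for (int i = 0; i < M; i++) {
        double dot = 0.0;
        for (int j = 0; j < N; j++)
            dot += A[i + (size_t)j * lda] * x[j];
        y[i] += dot;
    }
}

/*
 * Outer-product form: one AXPY per element of x.
 * The inner loop walks down a column of A with unit stride.
 */
void gemvN_outer(int M, int N, const double *A, int lda,
                 const double *x, double *y)
{
    for (int j = 0; j < N; j++) {
        const double xj = x[j];
        const double *col = A + (size_t)j * lda;
        for (int i = 0; i < M; i++)
            y[i] += xj * col[i];
    }
}
```

A real primitive would unroll the outer loop of the first routine by <Yunroll> (several dot products at once) or the outer loop of the second by <Xunroll> (several AXPYs at once); both routines compute the same result and differ only in memory access pattern.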

In summary:

0 Normal
8 GEMM-based primitive
16 Outer-product or AXPY-based primitive (only valid for NoTranspose GEMV)
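The summary can be captured in a short C sketch that also checks the one stated restriction. The enum and the function describe_gemv_flag are hypothetical names. The sketch treats <flag> as a single value exactly as listed above; whether values may be combined is not stated here, so any other value is rejected.

```c
#include <stddef.h>

/* Flag values from the summary above; the names are hypothetical. */
enum gemv_flag {
    GEMV_FLAG_NORMAL = 0,   /* inner-product (default) primitive */
    GEMV_FLAG_GEMM   = 8,   /* GEMM-based primitive              */
    GEMV_FLAG_AXPY   = 16   /* outer-product, NoTranspose only   */
};

/*
 * Returns a short description of the flag, or NULL if the value is not
 * one of the three listed or is not valid for the requested case.
 * is_transpose is nonzero for Transpose primitives.
 */
const char *describe_gemv_flag(int flag, int is_transpose)
{
    switch (flag) {
    case GEMV_FLAG_NORMAL:
        return "inner-product primitive";
    case GEMV_FLAG_GEMM:
        return "GEMM-based primitive";
    case GEMV_FLAG_AXPY:
        return is_transpose ? NULL : "outer-product (AXPY-based) primitive";
    default:
        return NULL;
    }
}
```

A check of this kind could be run on each parsed line before timing, so that a Transpose entry carrying a flag of 16 is reported rather than silently timed.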

Clint Whaley 2012-07-10