gemmK description file

In the install process, ATLAS first searches through the gemmK implementations provided by the ATLAS matmul generator. Once the best generated code is found, the user-contributed codes are timed to see whether they can beat it. The gemmK multiple implementation search script opens a description file for each precision (scases.dsc, dcases.dsc, ccases.dsc, zcases.dsc) in the BLDdir/tune/blas/gemm/ directory to see what user-contributed codes are available. This master index file is generated from several user-supplied files in ATLAS/tune/blas/gemm/CASES (see Section 2.2.4 for the names and definitions of these files). All of these files share the same format, which is described in the following paragraphs.

The first line of each file is a comment line, and is ignored. The next line indicates the number of user-contributed codes to search, and each subsequent line supplies information about a given user-supplied gemmK. The form of these lines is:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<author>"

<rout> and <author> are strings (the author string is enclosed in double quotes), and the rest of the parameters are signed integers.
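As an illustration of the line format above, here is a minimal sketch of how one such line could be split into typed fields. It is not ATLAS code: the names GemmKCase and parse_case_line are hypothetical, and the sketch assumes the author string contains no apostrophes or embedded quotes.

```python
import shlex
from dataclasses import dataclass

# Field order follows the line format given above. The class and function
# names are hypothetical and are not part of ATLAS.
@dataclass
class GemmKCase:
    id: int
    flag: int
    mb: int
    nb: int
    kb: int
    muladd: int
    lat: int
    mu: int
    nu: int
    ku: int
    rout: str
    author: str

def parse_case_line(line):
    # shlex.split honours the double quotes, so the quoted author string
    # comes back as a single field with the quotes removed.
    fields = shlex.split(line)
    if len(fields) != 12:
        raise ValueError("expected 12 fields, got %d: %r" % (len(fields), line))
    ints = [int(tok) for tok in fields[:10]]  # the ten signed integers
    return GemmKCase(*ints, rout=fields[10], author=fields[11])

case = parse_case_line(' 3 0 1 1 8 1 1 1 1 4 ATL_mm2.c "R. Clint Whaley"')
print(case.rout, case.kb, case.ku, case.author)
```

Using int() for the first ten fields accepts a leading minus sign, which matches the statement that these parameters are signed.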

The meanings of these parameters are:

Table 1 summarizes the presently defined flag values.

Table 1: Matmul index routine flag variables
FLAG  MEANING
   0  Normal
   8  Do not consider this kernel for cleanup
  16  Consider this kernel for cleanup only
  32  lda and ldb are not restricted to KB
  64  mb provides run-time constraint, not compile-time
 128  nb provides run-time constraint, not compile-time
 256  kb provides run-time constraint, not compile-time
 512  This kernel needs $4 N_b \le cacheelts$
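The flag values in Table 1 are all powers of two, which suggests that several of them can be added together in a single <flag> field. The text does not state this explicitly, so the sketch below treats it as an assumption. The names FLAG_MEANINGS and decode_flag are hypothetical.

```python
# Values and meanings copied from Table 1. Treating them as combinable
# bits is an assumption based on the values being powers of two.
FLAG_MEANINGS = {
    8: "do not consider this kernel for cleanup",
    16: "consider this kernel for cleanup only",
    32: "lda and ldb are not restricted to KB",
    64: "mb provides run-time constraint, not compile-time",
    128: "nb provides run-time constraint, not compile-time",
    256: "kb provides run-time constraint, not compile-time",
    512: "this kernel needs 4*NB <= cacheelts",
}

def decode_flag(flag):
    """Return the Table 1 meanings whose bits are set in flag."""
    if flag == 0:
        return ["normal"]
    meanings = [text for bit, text in FLAG_MEANINGS.items() if flag & bit]
    # Report any bits that Table 1 does not define rather than dropping them.
    leftover = flag & ~sum(FLAG_MEANINGS)
    if leftover:
        meanings.append("undefined bits: %d" % leftover)
    return meanings

print(decode_flag(0))
print(decode_flag(32 + 256))
```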


Here's an example:

<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<Contributer>"
3
 1 0 0 0 0 1 1 1 1 1 ATL_mm1x1x1.c "R. Clint Whaley"
 2 0 1 1 1 1 1 1 1 1 ATL_mm1x1x1b.c "R. Clint Whaley"
 3 0 1 1 8 1 1 1 1 4 ATL_mm2.c "R. Clint Whaley"

So, we have 3 user-supplied routines, all written by me. The first loops over $M$, $N$, and $K$, while the other two loop over the cpp macros MB, NB, and KB. The third routine insists that KB be a multiple of 8. The first two routines don't unroll any of the loops, while the third unrolls the K loop to a depth of 4. All three use a combined muladd style of programming and don't worry about latency.
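To make the file layout concrete, the following sketch reads the example above the way the text describes it: skip the comment line, read the count, then parse that many kernel lines. It is an illustration only, not the ATLAS search script. The names EXAMPLE, NAMES and read_index are hypothetical.

```python
import shlex

# The example index file from above, reproduced verbatim.
EXAMPLE = '''<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<Contributer>"
3
 1 0 0 0 0 1 1 1 1 1 ATL_mm1x1x1.c "R. Clint Whaley"
 2 0 1 1 1 1 1 1 1 1 ATL_mm1x1x1b.c "R. Clint Whaley"
 3 0 1 1 8 1 1 1 1 4 ATL_mm2.c "R. Clint Whaley"
'''

NAMES = ("id", "flag", "mb", "nb", "kb", "muladd", "lat", "mu", "nu", "ku")

def read_index(text):
    lines = text.splitlines()
    count = int(lines[1])            # lines[0] is the ignored comment line
    body = lines[2:2 + count]
    if len(body) != count:
        raise ValueError("file announces %d kernels, found %d" % (count, len(body)))
    cases = []
    for line in body:
        fields = shlex.split(line)   # keeps the quoted author as one field
        case = dict(zip(NAMES, map(int, fields[:10])))
        case["rout"], case["author"] = fields[10], fields[11]
        cases.append(case)
    return cases

for c in read_index(EXAMPLE):
    unroll = "K unrolled by %d" % c["ku"] if c["ku"] > 1 else "no K unrolling"
    print(c["rout"], "kb =", c["kb"], "-", unroll)
```

The printed summary mirrors the reading given in the paragraph above: only the third routine has a <ku> greater than 1, and only it carries a <kb> of 8.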

Clint Whaley 2012-07-10