ATLAS therefore normally generates a variable number of cleanup cases, with the number of generated codes minimally being , and the maximum number being . The number of generated codes can vary because the cleanup routines are special, sometimes requiring different codes to handle efficiently, as we will see below.
ATLAS splits the generated cleanup into these categories
So we see that K-cleanup is special in several ways. First, it is the most general cleanup routine, since it can handle multiple dimensions not being less than the blocking factor, whereas the M- and N-cleanup routines can only have their respective dimensions less than the blocking factor. Second, we compile only the most general BETA case for K-cleanup, because handling K-cleanup efficiently may require several different routines, and multiplying that number of routines by three seems counterproductive.
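To illustrate why a single general BETA case is sufficient for correctness, here is a minimal sketch in C. It assumes the three BETA cases are beta = 0, beta = 1, and arbitrary beta; the function names are hypothetical and are not ATLAS's generated kernel names. Each function updates one element of C from a length-K dot product.

```c
#include <stddef.h>

/* Hypothetical helper: dot product of two length-K vectors. */
static double kdot(int K, const double *a, const double *b)
{
    double sum = 0.0;
    for (int k = 0; k < K; k++)
        sum += a[k] * b[k];
    return sum;
}

/* Specialized case beta = 0: C is overwritten and never read. */
void kclean_beta0(int K, const double *a, const double *b, double *c)
{
    *c = kdot(K, a, b);
}

/* Specialized case beta = 1: C is accumulated into, no scaling. */
void kclean_beta1(int K, const double *a, const double *b, double *c)
{
    *c += kdot(K, a, b);
}

/* General case: arbitrary beta. For finite C this gives the same
 * result as the two specializations when beta is 0 or 1, at the
 * cost of an extra multiply per element of C. */
void kclean_betaX(int K, const double *a, const double *b,
                  double beta, double *c)
{
    *c = beta * (*c) + kdot(K, a, b);
}
```

The specialized versions only save a load or a multiply per element of C, so generating just the general version for K-cleanup trades a small amount of speed for a threefold reduction in the number of routines.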
The final difference is that K-cleanup potentially requires different routines to support it. This is due to several factors. First, in ATLAS the innermost loop in gemm is the K-loop, which makes it very important for performance. On systems without good loop handling, such as the x86, heavy K unrollings are critical. Second, the leading dimensions of the A and B matrices are fixed to KB by the data copy, which allows for more efficient indexing of these matrices. If a routine takes a run-time K (rather than a compile-time K, as when the dimension is fixed to KB), it must also take run-time lda and ldb, and this extra indexing is too costly on many architectures.
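The contrast between a compile-time and a run-time K can be sketched in C as follows. This is an illustration only, not ATLAS's generated code: the value of KB, the function names, and the storage layout (each row of A and each column of B stored contiguously along K, with C column-major) are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical blocking factor; ATLAS selects its own at install time. */
#define KB 4

/* Compile-time K: the K-loop trip count and both leading dimensions
 * are the constant KB, so the compiler can fully unroll the inner
 * loop and fold the index arithmetic into constants. */
void mm_fixed_k(int M, int N, const double *A, const double *B,
                double *C, int ldc)
{
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            const double *a = A + (size_t)i * KB;
            const double *b = B + (size_t)j * KB;
            double sum = 0.0;
            for (int k = 0; k < KB; k++)
                sum += a[k] * b[k];
            C[i + (size_t)j * ldc] += sum;
        }
    }
}

/* Run-time K: the trip count, lda, and ldb are only known when the
 * routine is called, so the unrolling depth and the strides cannot
 * be fixed when the kernel is compiled. */
void mm_runtime_k(int M, int N, int K,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            const double *a = A + (size_t)i * lda;
            const double *b = B + (size_t)j * ldb;
            double sum = 0.0;
            for (int k = 0; k < K; k++)
                sum += a[k] * b[k];
            C[i + (size_t)j * ldc] += sum;
        }
    }
}
```

Calling mm_runtime_k with K, lda, and ldb all equal to KB gives the same result as mm_fixed_k; the difference lies only in how much the compiler knows in advance. A dedicated routine compiled for one particular cleanup value of K recovers the compile-time advantages, which is why K-cleanup may call for several routines.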
Clint Whaley 2012-07-10