ATLAS therefore normally generates a variable number of cleanup cases, with the number of generated codes minimally being 7 (3 M-cleanup, 3 N-cleanup and 1 K-cleanup routine), and the maximum number being . The number of generated codes can vary because the cleanup routines are special, sometimes requiring different codes to handle efficiently, as we will see below.

ATLAS splits the generated cleanup into these categories:

**M-cleanup**: 3 routines, corresponding to `BETA` = 0, 1 and arbitrary

**N-cleanup**: 3 routines, corresponding to `BETA` = 0, 1 and arbitrary

**K-cleanup**: Only one `BETA` case (arbitrary), but may compile special case for each possible value, resulting in at least 1, and at most K-cleanup routines
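To illustrate why `BETA` = 0, 1 and arbitrary are worth three separate routines, here is a minimal sketch of the final update of C in each case. The function names and the flat `AB` buffer holding the already computed product are hypothetical and are not taken from the ATLAS sources; the point is only that each case does a different amount of work per element.

```c
#include <stddef.h>

/* Hypothetical sketch: AB[i] holds the already computed product term
 * for element i of C.  Names are illustrative, not ATLAS's own. */

/* BETA = 0: C is overwritten, so its old contents are never read. */
static void store_beta0(double *C, const double *AB, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = AB[i];
}

/* BETA = 1: plain accumulation, no multiply by beta needed. */
static void store_beta1(double *C, const double *AB, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] += AB[i];
}

/* Arbitrary BETA: one extra multiply per element of C. */
static void store_betaX(double beta, double *C, const double *AB, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = beta * C[i] + AB[i];
}
```

The arbitrary case also gives correct results for 0 and 1 when C holds finite values, which is consistent with K-cleanup getting by with that single variant at the cost of some extra arithmetic.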

So we see that K-cleanup is special in several ways. First, it is the
most general cleanup routine, since it can handle multiple dimensions not
being less than , whereas the M- and N-cleanup routines can only have
their respective dimensions less than . The second thing to note is
that we compile only the most general `BETA` case for K-cleanup; this
is due to the fact that we may need different routines to handle
K-cleanup efficiently, and multiplying this number of routines by three seems
counterproductive.

The final difference in the K-cleanup is the fact that it potentially requires
different routines to support. This is due to several factors.
Firstly, in ATLAS, the innermost loop in gemm is the K-loop, making it
very important for performance. On systems without good loop handling,
such as the x86, heavy K unrollings are critical.
Secondly, the leading dimensions of the A and B matrices are fixed
to `KB` due to the data copy, which allows for more efficient indexing
of these matrices. If a routine takes a run-time K (rather than a compile-time K,
as when the dimension is fixed to `KB`), it must also take run-time
`lda` and `ldb`, and this extra indexing is too costly on
many architectures.
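The difference between the two kinds of routine can be sketched in C as follows. This is a simplified illustration, not ATLAS's generated code: the block sizes `MB`, `NB`, `KB` = 4, the function names, and the fixed `BETA` = 1 update are all assumptions made for the example. Both kernels assume the copied layout in which A and B are each contiguous along K.

```c
#include <assert.h>

/* Hypothetical compile-time blocking factors, chosen small for illustration. */
#define MB 4
#define NB 4
#define KB 4

/* Compile-time K: the trip count and both leading dimensions are the
 * constant KB, so the compiler can fully unroll the K-loop and fold
 * i*KB and j*KB into constant offsets. */
static void kernel_fixed_k(const double *A, const double *B,
                           double *C, int ldc)
{
    for (int j = 0; j < NB; j++)
        for (int i = 0; i < MB; i++) {
            double sum = C[i + j * ldc];
            for (int k = 0; k < KB; k++)      /* innermost K-loop */
                sum += A[k + i * KB] * B[k + j * KB];
            C[i + j * ldc] = sum;
        }
}

/* Run-time K: the trip count is unknown at compile time, and lda/ldb
 * must be passed in and multiplied at run time. */
static void kernel_runtime_k(int K, const double *A, int lda,
                             const double *B, int ldb,
                             double *C, int ldc)
{
    assert(lda >= K && ldb >= K);             /* columns must hold K elements */
    for (int j = 0; j < NB; j++)
        for (int i = 0; i < MB; i++) {
            double sum = C[i + j * ldc];
            for (int k = 0; k < K; k++)
                sum += A[k + i * lda] * B[k + j * ldb];
            C[i + j * ldc] = sum;
        }
}
```

Calling `kernel_runtime_k(KB, A, KB, B, KB, C, ldc)` computes the same result as `kernel_fixed_k(A, B, C, ldc)`; the run-time version simply gives the compiler less to work with.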

Clint Whaley 2012-07-10