ATLAS and cleanup

As we have seen, ATLAS's main performance kernel is an L1-cache-contained matmul with all dimensions fixed to $N_B$, and when a given dimension of the general matmul is not a multiple of $N_B$, cleanup code must be called. It should be apparent that generating code with all dimensions fixed at compile time, as we do for the full kernel, is not a good idea for cleanup, since it would require roughly ${N_B}^3$ cleanup routines. Not only would this make the average executable huge, but it would also likely degrade performance through the constant loading of new instructions.
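
To make the trade-off concrete, here is a minimal sketch (illustrative only, not ATLAS's generated code; the function names, the row-major storage, and the tiny NB are assumptions of the sketch) contrasting a kernel whose loop bounds are all compile-time constants with the run-time-dimension call a cleanup routine must service:

```c
#define NB 4   /* tiny block size for illustration; a real N_B is chosen empirically */

/* Fully fixed kernel: every loop bound is the compile-time constant NB, so the
 * compiler can unroll completely and fold all offsets into constants.
 * Row-major, unit-stride storage is assumed purely to keep the sketch short. */
static void matmul_kernel_NB(const double *A, const double *B, double *C)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            double c = C[i*NB + j];
            for (int k = 0; k < NB; k++)
                c += A[i*NB + k] * B[k*NB + j];
            C[i*NB + j] = c;
        }
}

/* Cleanup must instead accept run-time M, N and K (each possibly < NB); freezing
 * all three at compile time would need one routine per (M,N,K) combination,
 * i.e. roughly NB^3 routines. */
static void matmul_cleanup(int M, int N, int K,
                           const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            double c = C[i*N + j];
            for (int k = 0; k < K; k++)
                c += A[i*K + k] * B[k*N + j];
            C[i*N + j] = c;
        }
}
```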

ATLAS therefore normally generates a variable number of cleanup cases, with the number of generated codes being at minimum $7$ and at most $6 + N_B$. The number varies because the $K$-cleanup routines are special, sometimes requiring $N_B$ different codes to handle efficiently, as we will see below (the minimum of $7$ is $3 + 3 + 1$: three M-cleanup codes, three N-cleanup codes, and a single K-cleanup code; the maximum of $6 + N_B$ arises when K-cleanup requires $N_B$ codes).

ATLAS splits the generated cleanup into the following categories:

  1. M-cleanup $M < N_B$ && $N = K = N_B$: 3 routines, corresponding to BETA = 0, 1 and arbitrary
  2. N-cleanup $N < N_B$ && $M = K = N_B$: 3 routines, corresponding to BETA = 0, 1 and arbitrary
  3. K-cleanup $K \leq N_B$ && $M \leq N_B$ && $N \leq N_B$: Only one BETA case (arbitrary), but may compile a special case for each possible $K$ value, resulting in at least 1, and at most $N_B$, K-cleanup routines (a classification sketch follows this list)
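
The way a partial block is routed to one of these three families can be sketched as follows; this is a hypothetical helper whose type and function names are invented for illustration, not ATLAS's internal interface:

```c
/* Hypothetical classification of a partial block (M, N, K <= nb) into the
 * three cleanup categories listed above. */
typedef enum { M_CLEANUP, N_CLEANUP, K_CLEANUP } cleanup_cat;

static cleanup_cat classify_cleanup(int M, int N, int K, int nb)
{
    /* K-cleanup is the catch-all: it is used whenever K < nb, and also
     * whenever more than one dimension is partial, since M- and N-cleanup
     * require the other two dimensions to equal nb. */
    if (K < nb || (M < nb && N < nb))
        return K_CLEANUP;
    if (M < nb)
        return M_CLEANUP;   /* N = K = nb: 3 routines (BETA = 0, 1, arbitrary) */
    return N_CLEANUP;       /* M = K = nb: 3 routines (BETA = 0, 1, arbitrary) */
}
```

Within the M- and N-cleanup families, the appropriate one of the three BETA variants is then selected.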

So we see that K-cleanup is special in several ways. First, it is the most general cleanup routine, since it can handle any combination of dimensions being less than $N_B$, whereas the M- and N-cleanup routines can only have their respective dimensions less than $N_B$ (with the other two dimensions equal to $N_B$). The second thing to note is that we compile only the most general BETA case for K-cleanup; this is because we may need $N_B$ different routines to handle K-cleanup efficiently, and multiplying that number of routines by three seems counterproductive.
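
The value of specializing on BETA is easy to see in the write-back of a computed element of $C$: BETA = 0 requires no read of $C$, BETA = 1 requires no multiply, and only the arbitrary case pays for both. The following is a minimal sketch of the three variants, assuming a scalar accumulator `acc` holding the computed alpha*A*B contribution for one element; the function names are illustrative:

```c
/* BETA = 0: C need not be read at all. */
static inline void update_b0(double *Cij, double acc)
{
    *Cij = acc;
}

/* BETA = 1: a plain accumulate, with no multiply by beta. */
static inline void update_b1(double *Cij, double acc)
{
    *Cij += acc;
}

/* Arbitrary BETA: both the read of C and the extra multiply are required. */
static inline void update_bX(double *Cij, double acc, double beta)
{
    *Cij = acc + beta * *Cij;
}
```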

The final difference is that K-cleanup potentially requires $N_B$ different routines to support it. This is due to several factors. Firstly, in ATLAS the innermost loop in gemm is the K-loop, making it very important for performance; on systems without good loop handling, such as the x86, heavy K unrollings are critical. Secondly, the leading dimensions of the copied $A$ and $B$ matrices are fixed to KB by the data copy, which allows for more efficient indexing of these matrices. If a routine takes a run-time $K$ (rather than a compile-time one, as when the dimension is fixed to KB), it must also take run-time lda and ldb, and this extra indexing is too costly on many architectures.
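
The indexing cost in question can be sketched as follows; KB here is a stand-in macro for the compile-time K blocking factor, and both kernels are simplified illustrations rather than ATLAS's generated code:

```c
#define KB 4    /* stand-in for the compile-time K blocking factor */
#define MB 4    /* tiny M/N block, for illustration only */

/* Compile-time K and leading dimensions: the K-loop trip count is a constant
 * (so it can be fully unrolled), and the copied A/B panels have leading
 * dimension KB, so every offset below is a compile-time constant. */
static void kernel_fixedK(const double *A, const double *B, double *C)
{
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++) {
            double s = 0.0;
            for (int k = 0; k < KB; k++)          /* constant trip count  */
                s += A[i*KB + k] * B[j*KB + k];   /* constant strides     */
            C[i*MB + j] += s;
        }
}

/* A run-time K forces run-time lda/ldb as well: every stride below lives in a
 * register and costs address arithmetic, which hurts on architectures with
 * weak loop/addressing support. */
static void kernel_runtimeK(int K, int lda, int ldb,
                            const double *A, const double *B, double *C)
{
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++) {
            double s = 0.0;
            for (int k = 0; k < K; k++)
                s += A[i*lda + k] * B[j*ldb + k];
            C[i*MB + j] += s;
        }
}
```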
