ATLAS and cleanup

As we have seen, ATLAS's main performance kernel is an L1-cache-contained matmul with all dimensions fixed to $N_B$, and when a given dimension of the general matmul is not a multiple of $N_B$, cleanup code must be called. It should be apparent that generating code with all dimensions fixed at compile time, as we do for the full kernel, is not a good idea for cleanup, since it would require roughly ${N_B}^3$ cleanup routines. Not only would this make the average executable huge, but it would also likely degrade performance through the constant loading of new instructions.
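
To make the trade-off concrete, here is a minimal sketch (illustrative only, not ATLAS's generated code; the function names, the row-major storage, and the tiny NB are assumptions of the sketch) contrasting a kernel whose loop bounds are all compile-time constants with the run-time-dimension call a cleanup routine must service:

```c
#define NB 4   /* tiny block size for illustration; a real N_B is chosen empirically */

/* Fully fixed kernel: every loop bound is the compile-time constant NB, so the
 * compiler can unroll completely and fold all offsets into constants.
 * Row-major, unit-stride storage is assumed purely to keep the sketch short. */
static void matmul_kernel_NB(const double *A, const double *B, double *C)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            double c = C[i*NB + j];
            for (int k = 0; k < NB; k++)
                c += A[i*NB + k] * B[k*NB + j];
            C[i*NB + j] = c;
        }
}

/* Cleanup must instead accept run-time M, N and K (each possibly < NB); freezing
 * all three at compile time would need one routine per (M,N,K) combination,
 * i.e. roughly NB^3 routines. */
static void matmul_cleanup(int M, int N, int K,
                           const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            double c = C[i*N + j];
            for (int k = 0; k < K; k++)
                c += A[i*K + k] * B[k*N + j];
            C[i*N + j] = c;
        }
}
```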

ATLAS therefore normally generates a variable number of cleanup cases, with the number of generated codes being at minimum $7$ and at most $6 + N_B$. The number varies because the $K$-cleanup routines are special, sometimes requiring $N_B$ different codes to handle efficiently, as we will see below (the minimum of $7$ is $3 + 3 + 1$: three M-cleanup codes, three N-cleanup codes, and a single K-cleanup code; the maximum of $6 + N_B$ arises when K-cleanup requires $N_B$ codes).

ATLAS splits the generated cleanup into the following categories:

  1. M-cleanup $M < N_B$ && $N = K = N_B$: 3 routines, corresponding to BETA = 0, 1 and arbitrary
  2. N-cleanup $N < N_B$ && $M = K = N_B$: 3 routines, corresponding to BETA = 0, 1 and arbitrary
  3. K-cleanup $K \leq N_B$ && $M \leq N_B$ && $N \leq N_B$: Only one BETA case (arbitrary), but may compile a special case for each possible $K$ value, resulting in at least 1, and at most $N_B$, K-cleanup routines (a classification sketch follows this list)
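
The way a partial block is routed to one of these three families can be sketched as follows; this is a hypothetical helper whose type and function names are invented for illustration, not ATLAS's internal interface:

```c
/* Hypothetical classification of a partial block (M, N, K <= nb) into the
 * three cleanup categories listed above. */
typedef enum { M_CLEANUP, N_CLEANUP, K_CLEANUP } cleanup_cat;

static cleanup_cat classify_cleanup(int M, int N, int K, int nb)
{
    /* K-cleanup is the catch-all: it is used whenever K < nb, and also
     * whenever more than one dimension is partial, since M- and N-cleanup
     * require the other two dimensions to equal nb. */
    if (K < nb || (M < nb && N < nb))
        return K_CLEANUP;
    if (M < nb)
        return M_CLEANUP;   /* N = K = nb: 3 routines (BETA = 0, 1, arbitrary) */
    return N_CLEANUP;       /* M = K = nb: 3 routines (BETA = 0, 1, arbitrary) */
}
```

Within the M- and N-cleanup families, the appropriate one of the three BETA variants is then selected.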

So we see that K-cleanup is special in several ways. First, it is the most general cleanup routine, since it can handle any combination of dimensions being less than $N_B$, whereas the M- and N-cleanup routines can only have their respective dimensions less than $N_B$ (with the other two dimensions equal to $N_B$). The second thing to note is that we compile only the most general BETA case for K-cleanup; this is because we may need $N_B$ different routines to handle K-cleanup efficiently, and multiplying that number of routines by three seems counterproductive.
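
The value of specializing on BETA is easy to see in the write-back of a computed element of $C$: BETA = 0 requires no read of $C$, BETA = 1 requires no multiply, and only the arbitrary case pays for both. The following is a minimal sketch of the three variants, assuming a scalar accumulator `acc` holding the computed alpha*A*B contribution for one element; the function names are illustrative:

```c
/* BETA = 0: C need not be read at all. */
static inline void update_b0(double *Cij, double acc)
{
    *Cij = acc;
}

/* BETA = 1: a plain accumulate, with no multiply by beta. */
static inline void update_b1(double *Cij, double acc)
{
    *Cij += acc;
}

/* Arbitrary BETA: both the read of C and the extra multiply are required. */
static inline void update_bX(double *Cij, double acc, double beta)
{
    *Cij = acc + beta * *Cij;
}
```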

The final difference is that K-cleanup potentially requires $N_B$ different routines to support it. This is due to several factors. Firstly, in ATLAS the innermost loop in gemm is the K-loop, making it very important for performance; on systems without good loop handling, such as the x86, heavy K unrollings are critical. Secondly, the leading dimensions of the copied $A$ and $B$ matrices are fixed to KB by the data copy, which allows for more efficient indexing of these matrices. If a routine takes a run-time $K$ (rather than a compile-time one, as when the dimension is fixed to KB), it must also take run-time lda and ldb, and this extra indexing is too costly on many architectures.
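
The indexing cost in question can be sketched as follows; KB here is a stand-in macro for the compile-time K blocking factor, and both kernels are simplified illustrations rather than ATLAS's generated code:

```c
#define KB 4    /* stand-in for the compile-time K blocking factor */
#define MB 4    /* tiny M/N block, for illustration only */

/* Compile-time K and leading dimensions: the K-loop trip count is a constant
 * (so it can be fully unrolled), and the copied A/B panels have leading
 * dimension KB, so every offset below is a compile-time constant. */
static void kernel_fixedK(const double *A, const double *B, double *C)
{
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++) {
            double s = 0.0;
            for (int k = 0; k < KB; k++)          /* constant trip count  */
                s += A[i*KB + k] * B[j*KB + k];   /* constant strides     */
            C[i*MB + j] += s;
        }
}

/* A run-time K forces run-time lda/ldb as well: every stride below lives in a
 * register and costs address arithmetic, which hurts on architectures with
 * weak loop/addressing support. */
static void kernel_runtimeK(int K, int lda, int ldb,
                            const double *A, const double *B, double *C)
{
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++) {
            double s = 0.0;
            for (int k = 0; k < K; k++)
                s += A[i*lda + k] * B[j*ldb + k];
            C[i*MB + j] += s;
        }
}
```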
