User-supplied cleanup

Users can supply cleanup code for only the following three cases, each of which comes in all three BETA variants:
  1. M-cleanup: $M < N_B$ and $N = K = N_B$
  2. N-cleanup: $N < N_B$ and $M = K = N_B$
  3. K-cleanup: $K < N_B$ and $M = N = N_B$

ATLAS's generated code handles all cleanup in which more than one dimension is smaller than the blocking factor $N_B$. This simplification allows ATLAS to avoid having to test ${N_B}^3$ cases when selecting user cleanup. Once the matrices in question are larger than $N_B$, cleanup with more than one dimension smaller than $N_B$ rapidly stops being a performance factor. Small matrices, where this cleanup does matter, are almost certainly going to be handled by ATLAS's small-case code anyway, so it seems unlikely that this simplification will hurt performance in practice. Section 2.7.5 shows this in a more formal way.
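As an illustration of the rule above, the following minimal sketch classifies a partial block by counting how many of its dimensions are smaller than $N_B$. The enum labels and the function name classify_cleanup are hypothetical and are not ATLAS identifiers; the sketch only restates the three cases listed above and assumes every dimension lies in the range 1 to $N_B$.

```c
#include <assert.h>

/* Hypothetical labels for illustration; these are not ATLAS identifiers. */
enum cleanup_kind {
    CLEAN_NONE,      /* M = N = K = NB: a full block, no cleanup needed   */
    CLEAN_M,         /* only M < NB: user M-cleanup may be used           */
    CLEAN_N,         /* only N < NB: user N-cleanup may be used           */
    CLEAN_K,         /* only K < NB: user K-cleanup may be used           */
    CLEAN_GENERATED  /* two or more dims < NB: left to the generated code */
};

/* Classify a block of size m x n x k against blocking factor nb.
 * Assumes each dimension is between 1 and nb inclusive. */
enum cleanup_kind classify_cleanup(int m, int n, int k, int nb)
{
    assert(nb > 0);
    assert(m > 0 && m <= nb);
    assert(n > 0 && n <= nb);
    assert(k > 0 && k <= nb);

    /* Count the dimensions that fall short of the blocking factor. */
    int short_dims = (m < nb) + (n < nb) + (k < nb);

    if (short_dims == 0) return CLEAN_NONE;
    if (short_dims > 1)  return CLEAN_GENERATED;
    if (m < nb) return CLEAN_M;
    if (n < nb) return CLEAN_N;
    return CLEAN_K;
}
```

For example, with $N_B = 40$, a 3 x 40 x 40 block would be classified as M-cleanup, while a 3 x 5 x 40 block would be left to the generated code.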

Users need to be very careful when supplying cleanup. If the user indicates that a dimension must be a compile-time variable, rather than a run-time variable, ATLAS will generate up to $N_B$ routines to handle user cleanup. Since user routines are compiled with all BETA variants, it is possible to generate $9 N_B$ cleanup cases (three cleanup cases, times three BETA variants, times up to $N_B$ routines each), in addition to ATLAS's generated cases. It is therefore recommended that the user supply cleanup that uses run-time arguments whenever possible, and mark kernels that take compile-time dimensions as not to be used for cleanup.
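The arithmetic behind the $9 N_B$ figure can be sketched as follows. The function name max_user_cleanup_routines and the constants are hypothetical, introduced only for this illustration. The count for the run-time case (one routine per cleanup case and BETA variant) is an assumption inferred from the text, not a statement of what ATLAS actually builds.

```c
#include <assert.h>

/* Hypothetical constants for illustration; these are not ATLAS identifiers. */
enum {
    NUM_CLEANUP_KINDS = 3,  /* M-, N-, and K-cleanup    */
    NUM_BETA_VARIANTS = 3   /* the three BETA variants  */
};

/* Upper bound on the number of user cleanup routines for blocking factor nb.
 * compile_time_dim != 0: the short dimension is fixed at compile time, so up
 *                        to nb routines are needed per case and BETA variant.
 * compile_time_dim == 0: the dimension is a run-time argument; we assume a
 *                        single routine per case and BETA variant suffices. */
int max_user_cleanup_routines(int nb, int compile_time_dim)
{
    assert(nb > 0);
    int per_variant = compile_time_dim ? nb : 1;
    return NUM_CLEANUP_KINDS * NUM_BETA_VARIANTS * per_variant;
}
```

Under these assumptions, $N_B = 40$ gives an upper bound of 360 routines when dimensions are compile-time variables, against 9 when they are run-time arguments, which is why run-time cleanup is the recommended choice.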

Clint Whaley 2012-07-10