1 480 4 4 1 1 1 4 4 2 ATL_mm4x4x2US.c "V. Nguyen & P. Strazdins"

OK, as always, we can read this to see that `MB` and `NB`
must be multiples of 4, and that `KB` can be any value. With
no flag modifiers, if we wanted to use the routine for K cleanup,
we would have to compile it into different routines, since
loop dimensions are compile-time parameters by default. However,
this routine is modified by a flag value of 480. What does this mean?
Consulting table 1, we see that
$480 = 32 + 64 + 128 + 256$,
which means `lda` and `ldb` are not restricted to `KB` (i.e., they are
run-time parameters to the routine), the M-loop is controlled by a run-time
variable, the N-loop is controlled by a run-time variable, and the K-loop
is controlled by a run-time variable. We therefore know that we can
use this routine for all cleanups (M-, N-, and K-cleanup), and we need only
one routine to do so (i.e., we do not have to compile routines to handle
all cases). However, it can only be used for M- and N-cleanup cases where
the respective dimension is a multiple of 4. Therefore, assuming this
kernel is superior to the generated code, it will be used for all K cleanup
routines. However, for M and N cleanup, there will be something corresponding
to the following pseudocode:

    if (M % 4 == 0) call ATL_mm4x4x2US
    else call generated M cleanup

It is clear that, without overloading the flag value to an even more ludicrous degree, cleanup will eventually need to have its own index file. It would be nice to be able to insist that a particular K-cleanup code be used only when , for instance, in addition to insisting it be a multiple of a particular value. The fact that cleanup does not already have such a separate file simply represents a design failure on my part; it was not until I had already produced the system working as it does now that I saw its shortcomings, and then it was too late to change for the release. Subsequent developer releases will probably address this shortcoming.

Clint Whaley 2012-07-10