1 480 4 4 1 1 1 4 4 2 ATL_mm4x4x2US.c "V. Nguyen & P. Strazdins"
OK, as always, we can read this to see that MB and NB
must be multiples of 4, and that KB can be any value. With
no flag modifiers, if we wanted to use the routine for K cleanup,
we would have to compile it into a separate routine for each value of K, since
loop dimensions are compile-time parameters by default. However,
this routine is modified by a flag value of 480. What does this mean?
Consulting table 1, we see that 480 = 32 + 64 + 128 + 256,
which means lda and ldb are not restricted to KB (i.e., they are
run-time parameters to the routine), the M-loop is controlled by a run-time
variable, the N-loop is controlled by a run-time variable, and the K-loop
is controlled by a run-time variable. We therefore know that we can
use this routine for all cleanups (M-, N-, and K-cleanup), and we need only
one routine to do so (i.e., we do not have to compile separate
routines to handle
all cases). However, it can only be used for M- and N-cleanup cases where
the respective dimension is a multiple of 4. Therefore, assuming this
kernel is superior to the generated code, it will be used for all K cleanup
routines. However, for M and N cleanup, there will be something corresponding
to the following pseudocode:
if (M % 4 == 0) call ATL_mm4x4x2US else call generated M cleanup
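
The selection logic above can be sketched in C as follows. This is only an illustration, not ATLAS code: the struct, the function names, and the flag constant names are hypothetical, and the assignment of each power of two to a particular property is an assumption that simply follows the order in which the four properties are listed in the text. The only fact relied on is that 480 decomposes as 32 + 64 + 128 + 256.

```c
#include <assert.h>

/* ASSUMED bit assignments (hypothetical names, not ATLAS identifiers):
 * 480 = 32 + 64 + 128 + 256, mapped onto the four properties in the
 * order the text lists them. */
enum {
    FLAG_LD_RUNTIME = 32,   /* lda/ldb not restricted to KB */
    FLAG_M_RUNTIME  = 64,   /* M-loop controlled by a run-time variable */
    FLAG_N_RUNTIME  = 128,  /* N-loop controlled by a run-time variable */
    FLAG_K_RUNTIME  = 256   /* K-loop controlled by a run-time variable */
};

/* Hypothetical summary of the fields of one index-file line. */
struct kern_desc {
    int flag;        /* 480 for ATL_mm4x4x2US.c */
    int mb, nb, kb;  /* each dimension must be a multiple of this value */
};

/* Return 1 if a cleanup dimension can be handed to the user kernel:
 * its loop must be run-time controlled, and the dimension must be a
 * multiple of the value given in the index file. Otherwise return 0. */
static int usable(int flag, int runtime_bit, int multiple, int dim)
{
    assert(multiple > 0);
    return (flag & runtime_bit) != 0 && dim % multiple == 0;
}

/* 1: call the user kernel; 0: fall back to the generated cleanup. */
int can_do_m_cleanup(const struct kern_desc *k, int M)
{
    return usable(k->flag, FLAG_M_RUNTIME, k->mb, M);
}

int can_do_n_cleanup(const struct kern_desc *k, int N)
{
    return usable(k->flag, FLAG_N_RUNTIME, k->nb, N);
}

int can_do_k_cleanup(const struct kern_desc *k, int K)
{
    return usable(k->flag, FLAG_K_RUNTIME, k->kb, K);
}
```

With the descriptor {480, 4, 4, 1} taken from the index line above, the K check succeeds for every K (any integer is a multiple of 1), while the M and N checks succeed only for multiples of 4, which is the behavior described in the text.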
It is clear that, without overloading the flag value to an even more
ludicrous degree, cleanup will eventually need to have its own
index file. For instance, it would be nice to be able to insist that
a particular K-cleanup code be used only when K meets some further condition,
in addition to insisting it be a multiple of a particular value. The fact
that cleanup does not already have such a separate file simply represents
a design failure on my part; it was not until I had already produced the
system working as it does now that I saw its shortcomings, and then it
was too late to change for the release. Subsequent developer releases
will probably address this shortcoming.
Clint Whaley 2012-07-10