ATLAS therefore normally generates a variable number of cleanup cases,
with the number of generated codes minimally being , and the maximum
number being
. The number of generated codes can vary because
the K-cleanup routines are special, sometimes requiring
different codes to handle efficiently, as we will see below.
ATLAS splits the generated cleanup into these categories: M-cleanup, N-cleanup, and K-cleanup.
So we see that K-cleanup is special in several ways. First, it is the
most general cleanup routine, since it can handle multiple dimensions not
being less than , whereas the M- and N-cleanup routines can only have
their respective dimensions less than
. The second thing to note is
that we compile only the most general BETA case for K-cleanup; this
is due to the fact that we may need
different routines to handle
K-cleanup efficiently, and multiplying this number of routines by three seems
counterproductive.
The final difference in the K-cleanup is the fact that it potentially requires
different routines to support. This is due to several factors.
Firstly, in ATLAS, the innermost loop in gemm is the K-loop, making it
very important for performance. On systems without good loop handling,
such as the x86, heavy K unrollings are critical.
Secondly, the leading dimensions of the A and B matrices are fixed
to KB due to the data copy, which allows for more efficient indexing
of these matrices. If a routine takes a run-time K (rather than a
compile-time K, as when the dimension is fixed to KB), it must also take
run-time lda and ldb, and this extra indexing is too costly on
many architectures.
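
The two factors above can be illustrated with a small C sketch. It is not
ATLAS's generated code: the function names, the value KB = 4, and the
storage layout (A and B copied so that the K dimension is contiguous) are
assumptions made only for illustration. The first kernel fixes K, lda and
ldb to the compile-time constant KB, so the K-loop can be fully unrolled
with constant offsets; the second takes K, lda and ldb at run time and
must keep a real loop with computed indices.

```c
#define KB 4  /* hypothetical compile-time blocking factor */

/* Fixed-K sketch: K, lda and ldb all equal the compile-time constant KB.
 * A holds M columns of length KB, B holds N columns of length KB.
 * Updates C (M x N, column-major, leading dimension ldc) with C += A^T * B. */
void kern_fixedK(int M, int N, const double *A, const double *B,
                 double *C, int ldc)
{
    for (int j = 0; j < N; j++) {
        const double *b = B + j * KB;      /* constant stride */
        for (int i = 0; i < M; i++) {
            const double *a = A + i * KB;  /* constant stride */
            /* K-loop fully unrolled by hand, valid because KB == 4 */
            C[i + j * ldc] += a[0] * b[0] + a[1] * b[1]
                            + a[2] * b[2] + a[3] * b[3];
        }
    }
}

/* Run-time-K sketch: K, lda and ldb are arguments, so the K-loop cannot
 * be unrolled to a known length and every index needs run-time strides. */
void kern_runtimeK(int M, int N, int K,
                   const double *A, int lda,
                   const double *B, int ldb,
                   double *C, int ldc)
{
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            double sum = 0.0;
            for (int k = 0; k < K; k++)
                sum += A[k + i * lda] * B[k + j * ldb];
            C[i + j * ldc] += sum;
        }
    }
}

/* Runs both kernels on the same small integer-valued data and returns the
 * largest absolute difference between the two results. */
double demo_max_diff(void)
{
    enum { M = 3, N = 2 };
    double A[M * KB], B[N * KB];
    double C1[M * N] = {0}, C2[M * N] = {0};
    double maxdiff = 0.0;

    for (int i = 0; i < M * KB; i++) A[i] = i + 1;
    for (int i = 0; i < N * KB; i++) B[i] = 2 * i - 3;

    kern_fixedK(M, N, A, B, C1, M);
    kern_runtimeK(M, N, KB, A, KB, B, KB, C2, M);

    for (int i = 0; i < M * N; i++) {
        double d = C1[i] - C2[i];
        if (d < 0.0) d = -d;
        if (d > maxdiff) maxdiff = d;
    }
    return maxdiff;
}
```

With K equal to KB the two kernels compute the same result; the difference
is only in how much the compiler knows at compile time. A K-cleanup routine
must behave like the second kernel unless a separate fixed-K routine is
compiled for each value of K it needs to handle.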
Clint Whaley 2012-07-10