The assumptions behind this kernel are that the input operands are
loaded to L1 only one time (i.e., the blocking guarantees that all
of the matrix accessed in the inner loop (A) plus the active panel of
the matrix in the outer loop (B) fits in L1). For very large caches, all
three operands may fit into cache, but this is typically not the
case. Because this gemmK is called by routines that place K as the
innermost loop, the output operand C will typically come from the L2
cache (except, obviously, on the first of the K/KB such calls).
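To make the cache behavior of C concrete, here is a minimal sketch of
such a calling routine; this is not ATLAS source, and the block sizes,
names, and row-major storage are assumptions (M, N, and K are assumed
to be multiples of the blocking factors):

    #include <stddef.h>

    #define MB 60  /* assumed blocking factors */
    #define NB 60
    #define KB 60

    /* Naive stand-in for the tuned kernel: C += A*B on one block,
     * with row-major operands and the given leading dimensions. */
    static void gemmK(const double *A, const double *B, double *C,
                      size_t lda, size_t ldb, size_t ldc)
    {
        for (size_t i = 0; i < MB; i++)
            for (size_t j = 0; j < NB; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < KB; k++)
                    sum += A[i*lda + k] * B[k*ldb + j];
                C[i*ldc + j] += sum;
            }
    }

    /* Because the K loop is innermost, the same MB x NB block of C is
     * updated K/KB times in a row; after the first of those calls it
     * is typically served from L2 rather than main memory. */
    static void gemm_blocked(size_t M, size_t N, size_t K,
                             const double *A, const double *B, double *C)
    {
        for (size_t j = 0; j < N; j += NB)
            for (size_t i = 0; i < M; i += MB)
                for (size_t k = 0; k < K; k += KB)  /* innermost: K */
                    gemmK(&A[i*K + k], &B[k*N + j], &C[i*N + j],
                          K, N, N);
    }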
ATLAS uses the JIK loop variant of on-chip multiply, and thus all of A
fits in cache, with nu columns of B. To take an example, say you are
using mu = nu = 4 with a blocking factor of NB; then the idea is that
the NB x NB piece of A, along with the NB x 4 piece of B (the active
panel of B), and the 4 x 4 section of C (roughly NB^2 + 4*NB + 16
elements in all) all fit into cache at once, with enough room for the
load of the next step, and any junk the algorithm
might have in L1. That panel of B is applied to all of A, and then
a new panel is loaded. Since the panel has been applied to all of A, it
will never be reloaded, and thus we see that B is loaded to L1 only
one time. Since all of A fits in L1, and we keep it there across all
panels of B, it is also loaded to L1 only one time.
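As a purely illustrative sketch of this JIK ordering (not ATLAS's
generated kernel, which differs in storage format, unrolling, and
prefetch; NB, mu = nu = 4, and row-major operands are assumptions):

    #include <stddef.h>

    #define NB 60   /* assumed blocking factor */
    #define MU 4    /* register blocking: rows of C */
    #define NU 4    /* register blocking: columns of C */

    /* C += A*B on NB x NB row-major blocks; NB assumed divisible
     * by MU and NU. */
    static void gemmK_jik(const double *A, const double *B, double *C)
    {
        for (size_t j = 0; j < NB; j += NU)       /* J: next panel of B */
            for (size_t i = 0; i < NB; i += MU) { /* I: sweep all of A  */
                double rC[MU][NU] = {{0.0}};      /* C in "registers"   */
                for (size_t k = 0; k < NB; k++)   /* K: innermost loop  */
                    for (size_t ii = 0; ii < MU; ii++)
                        for (size_t jj = 0; jj < NU; jj++)
                            rC[ii][jj] += A[(i+ii)*NB + k]
                                        * B[k*NB + (j+jj)];
                for (size_t ii = 0; ii < MU; ii++) /* write back C */
                    for (size_t jj = 0; jj < NU; jj++)
                        C[(i+ii)*NB + (j+jj)] += rC[ii][jj];
            }
    }

Note how the I loop sweeps all of A before the J loop advances to the
next panel of B, which is exactly why each panel is loaded only once.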
If written appropriately, loading all of B with a few rows
of A should theoretically be just as efficient (i.e., the IJK variant
of matmul). However, the variants where K is not the innermost loop
are unlikely to work well in ATLAS, if for no other reason than that
the transpose settings we have chosen militate against them.
Note that the beta == 0 case must not read C, since the memory may
legally be uninitialized.
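A minimal sketch of that write-back rule (a hypothetical helper, not
an ATLAS routine): when beta is zero the old contents of C are never
loaded, so the call is safe even if C points at uninitialized memory.

    #include <stddef.h>

    /* Write accumulated results acc[0..n) into C for the two beta
     * regimes of C = beta*C + acc. */
    static void store_c(double *C, const double *acc, size_t n,
                        double beta)
    {
        if (beta == 0.0)
            for (size_t i = 0; i < n; i++)
                C[i] = acc[i];               /* pure store: C never read */
        else
            for (size_t i = 0; i < n; i++)
                C[i] = beta*C[i] + acc[i];   /* read-modify-write */
    }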
Clint Whaley
2012-07-10