The performance kernel for the entire Level 3 BLAS is matrix multiply.
Matrix multiply is written in terms of a lower-level building block that
we call gemmK. gemmK is a special matrix multiply where
the input dimensions are fixed at M = N = K = NB, where the blocking
factor NB is chosen in order to maximize L1 cache reuse, for a loose
enough definition of L1 cache (typically, we use it to mean the first
level of cache accessible by the FPU, which may be the L2 cache on
some systems).
ATLAS actually has two different classes of GEMM kernels: one for copied
matrices (gemmK), and one that operates directly on the user's matrices
without a copy. For matrices of
sufficient size, ATLAS copies the input matrices into block-major storage.
In block-major storage, the NB x NB blocks operated on by the
gemmK are actually contiguous. This optimization prevents unnecessary
cache misses, cache conflicts, and TLB problems. However, for sufficiently
small matrices, the cost of this data copy is prohibitively expensive,
and thus ATLAS has kernels that operate on non-copied data. Without
the copy to simplify the process, however, multiple non-copy kernels
are required (differing kernels for differing transpose settings, for instance).
Since the non-copy kernels are typically only used for very small problems,
and they are much more complex, ATLAS presently accepts contributed code
only for the copy matmul kernel. For most problems, well over 98% of ATLAS
time is spent in the copy matmul kernel, so this should not be much of
a problem.