Speeding up the Level 3 BLAS

The performance kernel for the entire Level 3 BLAS is matrix multiply. Matrix multiply is written in terms of a lower-level building block that we call gemmK. gemmK is a special matrix multiply whose input dimensions are fixed at $M = N = K = N_B$, where the blocking factor $N_B$ is chosen to maximize L1 cache reuse, under a loose definition of "L1 cache" (we typically use the term to mean the first level of cache accessible by the FPU, which on some systems is actually the L2 cache).
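To make the shape of such a kernel concrete, here is a minimal sketch of a fixed-size $N_B \times N_B$ multiply in C. The names (`NB`, `gemmK_sketch`) and the plain triple loop are illustrative only; a real ATLAS kernel is heavily tuned (register blocking, unrolling, prefetch) and its $N_B$ is chosen empirically per platform.

```c
#include <assert.h>

/* Hypothetical blocking factor; ATLAS tunes the real N_B per platform. */
#define NB 4

/* Sketch of a gemmK-style kernel: C += A * B, where A, B, and C are
 * all NB x NB blocks stored column-major (as in the BLAS).  Because
 * every dimension is fixed at NB, the compiler can fully unroll and
 * the working set is sized to stay resident in L1 cache. */
static void gemmK_sketch(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j++)           /* columns of C */
        for (int i = 0; i < NB; i++) {     /* rows of C */
            double cij = C[i + j*NB];
            for (int k = 0; k < NB; k++)
                cij += A[i + k*NB] * B[k + j*NB];
            C[i + j*NB] = cij;
        }
}
```

In the full library, a general GEMM is reduced to repeated calls to such a fixed-size block multiply, which is why only this one kernel needs to run near peak speed.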

ATLAS actually has two different classes of GEMM kernels: one for copied matrices (gemmK), and one that operates directly on the user's matrices without a copy. For matrices of sufficient size, ATLAS copies the input matrices into block-major storage, in which the $N_B \times N_B$ blocks operated on by gemmK are contiguous in memory. This optimization avoids unnecessary cache misses, cache conflicts, and TLB problems. For sufficiently small matrices, however, the cost of this data copy is prohibitive, so ATLAS also has kernels that operate on non-copied data. Because there is no copy to standardize the data layout, multiple non-copy kernels are required (for instance, a different kernel for each transpose setting). Since the non-copy kernels are typically used only for very small problems, and are much more complex, ATLAS presently accepts contributed code only for the copy matmul kernel. For most problems, well over 98% of ATLAS time is spent in the copy matmul kernel, so this restriction should not be much of a problem.
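The block-major copy described above can be sketched as follows. This is an illustrative routine, not ATLAS's actual copy code: it assumes a column-major input whose dimensions are exact multiples of $N_B$ (the real library handles partial edge blocks), and the names `NB` and `copy_to_block_major` are hypothetical.

```c
#include <assert.h>

#define NB 2   /* hypothetical blocking factor, kept tiny for the example */

/* Copy a column-major M x N matrix A (leading dimension lda >= M) into
 * block-major storage Ablk: each NB x NB block is written contiguously,
 * blocks ordered down each block column, then across block columns.
 * After the copy, every block the kernel touches is one contiguous
 * NB*NB run, so block accesses cannot conflict-miss or thrash the TLB. */
static void copy_to_block_major(int M, int N, const double *A, int lda,
                                double *Ablk)
{
    int idx = 0;
    for (int jb = 0; jb < N; jb += NB)        /* block column */
        for (int ib = 0; ib < M; ib += NB)    /* block row    */
            for (int j = 0; j < NB; j++)      /* column within block */
                for (int i = 0; i < NB; i++)  /* row within block    */
                    Ablk[idx++] = A[(ib + i) + (jb + j)*lda];
}
```

The copy costs $O(MN)$ work against the $O(MNK)$ work of the multiply, which is why it pays off for large matrices but not for small ones.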

Clint Whaley 2012-07-10