Discussion of timing targets

At present, ATLAS mostly times kernel routines, which are used to build the higher-level routines that appear in the BLAS or LAPACK. kSelMM is the matrix multiply kernel used for large GEMM calls. It is the best kernel found by the generator and multiple implementation searches, so on some platforms it may be written in assembly. kGenMM is the fastest generated kernel that matches kSelMM, and it may be used for some types of cleanup. All generated kernels are written in ANSI C, so their peak performance depends strongly on the compiler being used.

kMM_NT and kMM_TN are two of the four generated kernels used for small-case GEMM, where we cannot afford to copy the input matrices. The last two characters give the transpose settings of the two input matrices. The performance of the other two kernels lies between these extremes: NT is typically the slowest kernel (all non-contiguous access), and TN is typically the fastest (all contiguous access).
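To make the contiguous versus non-contiguous distinction concrete, here is a minimal reference sketch of the two access patterns. It assumes column-major storage (the usual BLAS convention), and the names ref_mm_nt, ref_mm_tn and IDX are hypothetical, introduced only for this illustration. These are plain triple loops, not ATLAS's generated kernels, and they compute only C = op(A) * op(B) with no alpha or beta scaling.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i,j) in a column-major matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* "NT" case: C = A * B^T, with A stored as M x K and B stored as N x K.
 * The inner k loop steps through A by lda and through B by ldb,
 * so neither operand is read contiguously. */
void ref_mm_nt(int M, int N, int K,
               const double *A, int lda,
               const double *B, int ldb,
               double *C, int ldc)
{
    assert(lda >= M && ldb >= N && ldc >= M);
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            double sum = 0.0;
            for (int k = 0; k < K; k++)
                sum += A[IDX(i, k, lda)] * B[IDX(j, k, ldb)];
            C[IDX(i, j, ldc)] = sum;
        }
    }
}

/* "TN" case: C = A^T * B, with A stored as K x M and B stored as K x N.
 * The inner k loop walks down one column of A and one column of B,
 * so both operands are read with unit stride. */
void ref_mm_tn(int M, int N, int K,
               const double *A, int lda,
               const double *B, int ldb,
               double *C, int ldc)
{
    assert(lda >= K && ldb >= K && ldc >= M);
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            double sum = 0.0;
            for (int k = 0; k < K; k++)
                sum += A[IDX(k, i, lda)] * B[IDX(k, j, ldb)];
            C[IDX(i, j, ldc)] = sum;
        }
    }
}
```

In the NT loop every step of k jumps a full leading dimension in both A and B, while in the TN loop both operands advance by one element. That difference in stride is what the parenthetical remarks above refer to.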

BIG_MM is the only non-kernel timing we presently report: it is the speed achieved by a large GEMM call. "Large" varies by platform: it is typically $M=N=K=1600$, but it is smaller on systems where we were unable to allocate that much memory. On many machines, this line gives a rough asymptotic bound on BLAS performance.
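As a rough guide to what the typical problem size implies, the sketch below works out the operand storage and the floating point operation count for a real GEMM. It assumes the conventional 2*M*N*K flop count and counts only the three operand matrices; any extra workspace that ATLAS or the timer allocates is not included. The function names are hypothetical and are not part of ATLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional flop count for a real GEMM: one multiply and one add
 * per (i, j, k) triple. Computed in double to avoid integer overflow. */
double gemm_flops(int M, int N, int K)
{
    return 2.0 * M * N * K;
}

/* MFLOPS rate for a GEMM that took the given wall-clock time. */
double gemm_mflops(int M, int N, int K, double seconds)
{
    assert(seconds > 0.0);
    return gemm_flops(M, N, K) / (seconds * 1.0e6);
}

/* Bytes needed to hold A (M x K), B (K x N) and C (M x N),
 * ignoring any padding or workspace. */
size_t gemm_operand_bytes(int M, int N, int K, size_t elem_size)
{
    size_t elems = (size_t)M * (size_t)K
                 + (size_t)K * (size_t)N
                 + (size_t)M * (size_t)N;
    return elems * elem_size;
}
```

For double precision with M=N=K=1600, the three operands occupy 61,440,000 bytes (about 59 MiB) and the call performs 8.192e9 floating point operations. Under these assumptions, a run that took 2 seconds would correspond to 4096 MFLOPS.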

The next three lines report Level 2 BLAS kernel performance. The performance of the Level 2 BLAS follows these kernels in roughly the same way that the performance of the Level 3 BLAS follows the GEMM kernels.

See Appendix [*] for details on more extensive auto-benchmarking.

R. Clint Whaley 2016-07-28