kMM_NT and kMM_TN are two of the four generated kernels used for small-case GEMM, where we cannot afford to copy the input matrices. The last two characters give the transpose settings. NT is typically the slowest kernel (all non-contiguous access) and TN the fastest (all contiguous access); the performance of the other two kernels lies between these extremes.
BIG_MM is the only non-kernel timing we presently report; it is the speed found when doing a large GEMM call. ``Large'' can vary by platform: it is typically , except where we were unable to allocate that much memory, in which case it will be less. On many machines, this line gives you a rough asymptotic bound on BLAS performance.
The next three lines report Level 2 BLAS kernel performance (the Level 2 BLAS' performance will follow these kernels in roughly the same way that the Level 3 BLAS follow the GEMM kernels).
See Appendix for details on more extensive auto-benchmarking.
R. Clint Whaley 2016-07-28