kMM_NT and kMM_TN are two of the four generated kernels used for small-case GEMM when we cannot afford to copy the input matrices. The last two characters give the transpose settings of the two input matrices. NT is typically the slowest kernel (all non-contiguous access) and TN is typically the fastest (all contiguous access); the performance of the other two kernels lies between these extremes.
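To make the contiguous versus non-contiguous distinction concrete, here is a minimal sketch of the two access patterns, assuming column-major storage and a simple dot-product loop ordering. The function names gemm_tn and gemm_nt are hypothetical illustrations, not the generated ATLAS kernels, which are far more heavily optimized.

```c
#include <stddef.h>

/* Hypothetical reference loops, NOT the generated ATLAS kernels.
 * All matrices are column-major; C is M x N and is updated in place. */

/* TN case: C += A^T * B, with A stored as K x M and B stored as K x N.
 * The inner k loop walks both A and B with stride 1 (contiguous). */
void gemm_tn(size_t M, size_t N, size_t K,
             const double *A, size_t lda,
             const double *B, size_t ldb,
             double *C, size_t ldc)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < M; i++) {
            double sum = 0.0;
            for (size_t k = 0; k < K; k++)
                sum += A[k + i * lda] * B[k + j * ldb];
            C[i + j * ldc] += sum;
        }
}

/* NT case: C += A * B^T, with A stored as M x K and B stored as N x K.
 * The inner k loop walks A with stride lda and B with stride ldb
 * (non-contiguous for both operands). */
void gemm_nt(size_t M, size_t N, size_t K,
             const double *A, size_t lda,
             const double *B, size_t ldb,
             double *C, size_t ldc)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < M; i++) {
            double sum = 0.0;
            for (size_t k = 0; k < K; k++)
                sum += A[i + k * lda] * B[j + k * ldb];
            C[i + j * ldc] += sum;
        }
}
```

Both routines compute the same product when given the same logical operands in their respective storage layouts; only the memory stride of the inner loop differs, which is what separates the two timings.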
BIG_MM is the only non-kernel timing we presently report, and it is
the speed found when doing a large GEMM call.  ``Large'' can vary by platform:
it is typically  , except where we were unable to allocate that
much memory, where it will be less.  On many machines, this line gives you
a rough asymptotic bound on BLAS performance.
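As a rough illustration of how a large-GEMM timing becomes a speed figure, the sketch below assumes the speed is expressed in MFLOPS and uses the conventional 2*M*N*K flop count for GEMM. The helper name gemm_mflops is hypothetical and is not part of ATLAS.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ATLAS: convert the wall-clock time of
 * one M x N x K GEMM call into MFLOPS, counting 2*M*N*K floating-point
 * operations (one multiply and one add per inner-product term). */
double gemm_mflops(size_t M, size_t N, size_t K, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;  /* guard against a zero or invalid timing */
    return (2.0 * (double)M * (double)N * (double)K) / (seconds * 1.0e6);
}
```

For example, a 1000 x 1000 x 1000 GEMM that completes in one second corresponds to 2000 MFLOPS under this convention.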
The next three lines report Level 2 BLAS kernel performance. The Level 2 BLAS performance will follow these kernels in roughly the same way that the Level 3 BLAS performance follows the GEMM kernels.
See the appendix for details on more extensive
auto-benchmarking.
R. Clint Whaley 2016-07-28