Finding a good NB for GEMM

One of the things I do most frequently with user-submitted kernels is reduce the blocking factor that the user has chosen. I often choose a smaller NB than the one that is best for asymptotic GEMM performance, and even more often choose one that does not yield the best performance in the kernel timer. To understand why, you must understand the following points, each explained in turn below:
  1. Better kernel timing (e.g., make ummcase in your <OBJdir>/tune/blas/gemm/ directory) does not always yield better total GEMM performance
  2. Large NB means significantly more time in cleanup code
  3. Large NB means significantly more time in unblocked application code
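
To get a rough feel for point 2, the sketch below estimates what share of a GEMM's floating point work lands in full NB x NB x NB kernel calls; the rest falls to cleanup code. It is a back-of-the-envelope model, not ATLAS code: the function name full_block_fraction and the 200 x 200 x 200 problem size are made up for illustration, and it assumes every partial block in M, N, or K is handled entirely by cleanup.

```python
def full_block_fraction(m, n, k, nb):
    """Fraction of GEMM work (proportional to m*n*k) done in full nb^3 blocks.

    Hypothetical helper: assumes any partial block along M, N, or K
    is handled by cleanup code rather than the tuned kernel.
    """
    # Number of full blocks along each dimension, times the work per block.
    full = (m // nb) * (n // nb) * (k // nb) * nb**3
    return full / (m * n * k)


if __name__ == "__main__":
    # Illustrative problem size only; not taken from any ATLAS timing.
    for nb in (40, 80, 120):
        frac = full_block_fraction(200, 200, 200, nb)
        print(f"NB={nb:3d}: {frac:.3f} of the work is in the full-block kernel")
```

Under these assumptions, NB=40 divides the problem evenly and needs no cleanup, while NB=80 leaves only about half of the work in the tuned kernel and NB=120 leaves roughly a fifth. The larger the NB relative to the problem, the more the cleanup code's speed dominates.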


Clint Whaley 2012-07-10