One drawback of choosing a large NB is that applications will spend
more of their time in cleanup. Suppose you choose a block factor of
120. In this case, many applications will never even call your optimized
kernel, and will instead spend all their time in GEMM cleanup. Some
applications are statically blocked, and if their NB is smaller than
yours, they can spend their entire time in cleanup even for large problems.
Therefore, if you must choose a large NB in order to get adequate GEMM
performance, you must pay an unusual amount of attention to cleanup
optimization. However, as the next section will discuss, even if
cleanup ran at the same speed as your best kernel, many codes would
still see poor performance.
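
To make the effect of a large NB concrete, the following is a rough sketch, not ATLAS code. It assumes a simplified model in which only full NB x NB x NB blocks are handled by the optimized kernel and every partial block goes to cleanup; the function name kernel_fraction and the sample problem sizes are hypothetical, chosen only for illustration.

```python
def kernel_fraction(m, n, k, nb):
    """Fraction of GEMM flops that fall in full nb x nb x nb blocks.

    Simplified model (an assumption, not ATLAS's actual dispatch logic):
    any block with a partial dimension is counted as cleanup.
    """
    full = (m // nb) * (n // nb) * (k // nb) * nb ** 3
    return full / (m * n * k)


if __name__ == "__main__":
    nb = 120  # the block factor used as the example in the text
    # A small problem, a statically blocked caller with K=64, and a large problem.
    for m, n, k in [(100, 100, 100), (1000, 1000, 64), (1000, 1000, 1000)]:
        frac = kernel_fraction(m, n, k, nb)
        print(f"M={m} N={n} K={k}: "
              f"{frac:.1%} in kernel, {1 - frac:.1%} in cleanup")
```

Under this model, any problem with a dimension below 120 never reaches the kernel at all, and a caller that statically blocks K at 64 stays entirely in cleanup no matter how large M and N become.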
Clint Whaley
2012-07-10