Large NB means more time in unblocked application code

Probably the worst thing about choosing a large NB is that many applications use Level 1 and 2 BLAS for the unblocked part of the computation. These BLAS usually run at least an order of magnitude slower than GEMM, so as you increase NB, applications with unblocked portions spend a growing proportion of their time in this order-of-magnitude slower code. Therefore, even with perfect cleanup, a large NB may leave an application running at less than half speed, even though GEMM performance is quite good.
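To see how quickly the unblocked fraction grows, here is a rough back-of-the-envelope model (my own sketch, not ATLAS code) for an LU-like factorization: panel (DGETF2-style) work is roughly N^2*NB/2 flops out of 2N^3/3 total, and the panel code is assumed to run 10x slower than GEMM. The function name, flop counts, and rates are illustrative assumptions.

```c
/* Hypothetical model, not measured data: estimate what fraction of an
 * LU factorization's time is spent in the unblocked panel code,
 * given GEMM and unblocked (Level 1/2 BLAS) rates in Gflop/s. */
double lu_time_split(int N, int NB, double gemm_gflops,
                     double unblocked_gflops)
{
    double total_flops = 2.0 * N * (double)N * N / 3.0;
    double panel_flops = 0.5 * (double)N * N * NB; /* ~ sum of DGETF2 work */
    double t_panel = panel_flops / unblocked_gflops;
    double t_gemm  = (total_flops - panel_flops) / gemm_gflops;
    return t_panel / (t_panel + t_gemm);  /* fraction of time unblocked */
}
```

With GEMM at 10 Gflop/s and the unblocked code at 1 Gflop/s, this model puts an N=2000 factorization at roughly 13% unblocked time for NB=40, but over 60% for NB=400, which is where the "less than half speed" behavior comes from.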

To get an idea of this, simply look at the factorizations provided by LAPACK. These routines are statically blocked, so that the unblocked column factorizations (e.g., DGETF2 for LU) are used until NB is reached. If ILAENV returns a blocking factor smaller than your GEMM's NB, the routines will stay in cleanup code even for large problems. Even worse, some routines (e.g., QR) require workspace proportional to NB, and since dynamic memory allocation is not used, even if you hack ILAENV to return the correct blocking factor, an undersized workspace can force them down to a smaller one.
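The workspace constraint can be sketched like this (a simplified illustration, not the actual LAPACK source): a QR-style routine such as DGEQRF needs workspace of roughly N*NB elements, so a statically sized workspace caps the usable blocking factor at LWORK/N no matter what ILAENV returns. The function name here is my own.

```c
/* Simplified sketch of how a routine whose workspace requirement is
 * proportional to NB must shrink its blocking factor: if the supplied
 * workspace LWORK is smaller than N*NB, the effective NB is forced
 * down to what the workspace allows (LWORK/N). */
int effective_nb(int nb_ilaenv, int n, int lwork)
{
    int nb = nb_ilaenv;
    if (lwork < n * nb)     /* workspace too small for requested NB */
        nb = lwork / n;     /* forced to a smaller blocking factor  */
    return nb;
}
```

So even with ILAENV patched to return, say, NB=80, a caller that sized its workspace for NB=32 will silently run with the smaller blocking factor.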

Clint Whaley 2012-07-10