On machines with very large L1 caches, several blocking factors that fit into L1 often have roughly the same performance. In that case, you almost certainly want to choose the smallest NB that achieves roughly that performance, as it will allow more blocks to fit into the L2 blocking to be done later.
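The selection rule above can be sketched in C. This is only an illustration, not part of ATLAS: the nb_result type, the pick_smallest_nb function, and the tol parameter (how far below the best rate still counts as "roughly the same") are all hypothetical names, and the MFLOPS figures are assumed to come from your own kernel timings.

```c
#include <stddef.h>

/* Hypothetical record of one kernel timing: a blocking factor and the
 * MFLOPS rate measured for it. Not an ATLAS data structure. */
typedef struct {
    int nb;
    double mflops;
} nb_result;

/* Return the smallest NB whose measured rate is within a fraction `tol`
 * of the best rate seen (e.g. tol = 0.02 accepts anything within 2%).
 * Returns -1 if no results are supplied. */
int pick_smallest_nb(const nb_result *res, size_t n, double tol)
{
    double best;
    int pick = -1;
    size_t i;

    if (n == 0)
        return -1;

    /* First pass: find the best rate over all candidates. */
    best = res[0].mflops;
    for (i = 1; i < n; i++)
        if (res[i].mflops > best)
            best = res[i].mflops;

    /* Second pass: among candidates close enough to the best rate,
     * keep the one with the smallest NB. */
    for (i = 0; i < n; i++) {
        if (res[i].mflops >= (1.0 - tol) * best) {
            if (pick < 0 || res[i].nb < pick)
                pick = res[i].nb;
        }
    }
    return pick;
}
```

For example, with made-up rates of 950, 1000, 995 and 1002 MFLOPS for NB = 24, 40, 56 and 80, a 2% tolerance selects NB = 40 even though NB = 80 is nominally fastest. How much tolerance is reasonable depends on how noisy your timings are.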
If a kernel appears to get much better performance with a large NB, the best idea is to build a full GEMM with both the best-performing small NB and the best-performing large NB, and see what the gap truly is. Very often, the small-NB kernel will actually be better even asymptotically; if it is not, it will often be so much better for smaller problems that it makes sense to use it anyway.
Even beyond these explanations, the kernel timer sometimes predicts good performance that is not realized when the full GEMM is built. This is usually due to inadequate cache flushing: the timer overpredicts performance because operands stay in cache more during the timing than they do in practice. Therefore, I usually pump up the flushing mechanism (set L2SIZE in your Make.inc to a ridiculously large value). No matter what, actual full GEMM performance is the final arbiter. If it is not as high as the kernel timer predicted, it may be worthwhile to see whether other, smaller NB cases achieve the same full GEMM performance.
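To make the flushing issue concrete, here is a minimal sketch of the general idea behind a cache flush between timings: sweep through a buffer much larger than the cache so that the kernel's operands are evicted before the next timed call. This is not the code ATLAS's timers use; flush_cache and flush_bytes are hypothetical names, and the right buffer size for a given machine is an assumption you would have to tune.

```c
#include <stdlib.h>

/* Hypothetical flush helper (not ATLAS's implementation): write and then
 * read a buffer of flush_bytes bytes so that previously cached data is
 * likely to be evicted. The sum is returned so the compiler cannot
 * discard the sweep as dead code. Returns -1.0 if allocation fails. */
double flush_cache(size_t flush_bytes)
{
    size_t n = flush_bytes / sizeof(double);
    double *buf;
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;

    buf = malloc(n * sizeof(double));
    if (buf == NULL)
        return -1.0;

    /* Write every element, then read every element back. */
    for (i = 0; i < n; i++)
        buf[i] = 1.0;
    for (i = 0; i < n; i++)
        sum += buf[i];

    free(buf);
    return sum;
}
```

Calling something like this with a size several times larger than the largest cache before each timed kernel invocation mimics the effect of a very large L2SIZE: if the flush area is too small, operands survive in cache between calls and the timing looks better than the full GEMM will.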
Clint Whaley 2012-07-10