Better kernel timing does not always yield faster GEMM

The kernel timer (invoked by one of the make mmcase variants available in <OBJdir>/tune/blas/gemm/) tries to mimic the way ATLAS calls the kernel, but it does not do everything the same way. First, there is no cleanup, so it only ever calls the kernel itself. More importantly, CacheEdge has not yet been determined, so no Level 2 cache blocking is performed. As a result, the kernel timer can make it look as though you are better off blocking the kernel for the L2 cache. In fact, if you block for the Level 1 cache instead, CacheEdge will further speed things up later, so the smaller NB can achieve better GEMM performance even though it runs slower in the kernel timer.

For machines with very large L1 caches, several blocking factors that fit into L1 often have roughly the same performance. In such a case, you will very likely want to choose the smallest NB that achieves that performance, as it will allow more blocks to fit into the L2 blocking to be done later.
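The selection rule above can be sketched as a small helper. This is only an illustration and not part of ATLAS: the function name smallest_competitive_nb, the 3% tolerance, and the MFLOPS figures are all invented for the example, so substitute the numbers reported by your own kernel timer runs and whatever tolerance you consider "roughly the same".

```python
def smallest_competitive_nb(mflops_by_nb, tolerance=0.03):
    """Return the smallest NB whose performance is within `tolerance`
    (a fraction) of the best performance observed.

    mflops_by_nb: dict mapping blocking factor NB -> measured MFLOPS.
    The default tolerance is an arbitrary assumption, not an ATLAS setting.
    """
    if not mflops_by_nb:
        raise ValueError("no timings supplied")
    best = max(mflops_by_nb.values())
    threshold = best * (1.0 - tolerance)
    # Among all NB that reach the threshold, prefer the smallest one.
    return min(nb for nb, mf in mflops_by_nb.items() if mf >= threshold)


# Hypothetical kernel-timer results (NB -> MFLOPS), invented for illustration.
timings = {40: 9100.0, 56: 9820.0, 72: 9900.0, 80: 9950.0, 120: 9400.0}
chosen = smallest_competitive_nb(timings)
print(chosen)
```

With these made-up numbers, NB=56, 72 and 80 all land within 3% of the best result, so the helper picks 56, leaving more room for the later L2 blocking.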

If a kernel appears to get much better performance with a large NB, the best idea is to build a full GEMM with both the best-performing small NB and the best-performing large NB, and see what the gap truly is. Very often, the small kernel will actually be better even asymptotically. When it is not, it will often be so much better for smaller problems that it makes sense to use it anyway.
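One way to look at that gap is to tabulate full-GEMM results from the two builds side by side across problem sizes. The sketch below assumes you have already timed both builds yourself; the function name compare_full_gemm and every number in it are hypothetical, chosen only to show the shape of the comparison.

```python
def compare_full_gemm(small_nb_mflops, large_nb_mflops):
    """Compare full-GEMM results for two builds across problem sizes.

    Both arguments map problem size N -> MFLOPS. Returns a list of
    (N, gap) tuples sorted by N, where gap = large - small in MFLOPS,
    so a positive gap means the large-NB build is faster at that size.
    Only sizes timed in both builds are compared.
    """
    sizes = sorted(set(small_nb_mflops) & set(large_nb_mflops))
    return [(n, large_nb_mflops[n] - small_nb_mflops[n]) for n in sizes]


# Hypothetical full-GEMM timings (N -> MFLOPS), invented for illustration.
small_nb = {100: 6200.0, 500: 8900.0, 1000: 9400.0, 4000: 9700.0}
large_nb = {100: 4100.0, 500: 8200.0, 1000: 9300.0, 4000: 9750.0}

for n, gap in compare_full_gemm(small_nb, large_nb):
    winner = "large NB" if gap > 0 else "small NB"
    print(f"N={n:5d}  gap={gap:+8.1f} MFLOPS  faster: {winner}")
```

In this invented data set the large NB wins only at the largest size, and by a small margin, while the small NB wins by a wide margin on small problems. That is the situation in which using the small NB anyway makes sense.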

Even beyond these explanations, the kernel timer sometimes predicts good performance that is not realized when the full GEMM is built. This is usually due to inadequate cache flushing: operands are retained in the cache more than they are in practice, so performance is overpredicted. Therefore, I usually pump up the flushing mechanism (set L2SIZE to a ridiculously large value). No matter what, actual full GEMM performance is the final arbiter. If it is not as high as predicted by the kernel timer, it may be worthwhile to see whether other, smaller NB cases achieve the same full GEMM performance.

Clint Whaley 2012-07-10