Finding a good NB

I will call the first level of cache accessed by the floating point unit the Level 1 cache, regardless of whether it is the first level of cache of the machine (there are a number of machines, such as the P4 (Prescott) and the Itanium, where the FPU skips the Level 1 cache). Let $N_e$ be the number of elements of the data type of interest that fit in this cache. If this cache is write-through, then a rough guess for a good upper bound is $N_B \le \sqrt{N_e}$. If the cache is not write-through, this remains the upper bound, but larger caches often benefit from a smaller $N_B$, roughly $N_B < \frac{\sqrt{N_e}}{3}$. These bounds could be stated more precisely, but as given they are easy to compute during tuning.
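
As a concrete illustration, here is a minimal sketch of these bound computations, assuming a 32KB L1 data cache holding doubles (both values are examples, not taken from any particular machine):

\begin{verbatim}
/* A sketch of the NB bound computations above (not ATLAS code).
 * The cache size and element type are assumed example values. */
#include <math.h>
#include <stdio.h>

int main(void)
{
   const size_t l1_bytes = 32 * 1024;        /* assumed L1 data cache size */
   const size_t elt_size = sizeof(double);   /* double precision elements */
   const double Ne = (double)(l1_bytes / elt_size);

   const int nb_wt = (int)floor(sqrt(Ne));       /* write-through bound: NB <= sqrt(Ne) */
   const int nb_wb = (int)floor(sqrt(Ne)/3.0);   /* rough target for non-write-through caches */

   printf("Ne=%.0f  NB upper bound=%d  non-write-through starting guess=%d\n",
          Ne, nb_wt, nb_wb);
   return 0;
}
\end{verbatim}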

You should not choose an $N_B$ that is a power of 2, as this can occasionally cause nasty cache conflicts. There is often a small advantage to choosing an $N_B$ that is a multiple of the cache line size (measured in elements); depending on the architecture, this can sometimes be critical.
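
A small sketch of enumerating candidates under these constraints might look like the following; the 64-byte line and the upper bound of 64 are assumed example values carried over from the sketch above:

\begin{verbatim}
/* A sketch (not ATLAS's search code) listing candidate NB values that
 * are multiples of the cache line size in elements, skipping powers of
 * two, up to the sqrt(Ne) upper bound computed above. */
#include <stdio.h>

static int is_pow2(int n) { return (n & (n - 1)) == 0; }

int main(void)
{
   const int line_elts = 64 / 8;   /* assumed 64-byte line, double precision */
   const int nb_max = 64;          /* sqrt(Ne) for the assumed 32KB L1 of doubles */

   for (int nb = line_elts; nb <= nb_max; nb += line_elts)
      if (!is_pow2(nb))
         printf("candidate NB = %d\n", nb);
   return 0;
}
\end{verbatim}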

So, the basic idea is to start with the $N_B$ values suggested by the above two computations, and then try slightly smaller and larger values using the kernel timer. If two values tie for out-of-cache performance, always take the smaller. If best performance is achieved with a very large $N_B$ (say $N_B \ge 80$), always confirm that it yields better GEMM performance than a smaller $N_B$, and that application performance is not severely impacted, particularly for smaller problems.
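
In skeleton form, this search and its tie-breaking rule might look like the sketch below; pick_nb and the timer callback are hypothetical names, not ATLAS routines:

\begin{verbatim}
/* Skeleton of the search loop described above (not ATLAS's actual
 * search code).  The caller supplies a kernel timer, assumed to return
 * MFLOPS for a given blocking factor; candidates must be sorted in
 * increasing order. */
int pick_nb(const int *cands, int ncand, double (*timer)(int nb))
{
   int best_nb = cands[0];
   double best_mf = timer(cands[0]);
   for (int i = 1; i < ncand; i++)
   {
      double mf = timer(cands[i]);
      /* strictly greater: on a tie, the smaller NB already held wins */
      if (mf > best_mf)
      {
         best_mf = mf;
         best_nb = cands[i];
      }
   }
   return best_nb;
}
\end{verbatim}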

The way I usually time application performance is to time ATLAS's LU. This actually gives you a very rosy picture of how a large blocking factor will affect performance, in that ATLAS's LU uses recursion rather than static blocking. This means it has no unblocked code, and thus does not slow down the way LAPACK's LU will for large $N_B$. However, if even this code shows a performance loss for smaller sizes, you know your cleanup needs to get a lot better, or you need to reduce $N_B$, even if that results in a slight reduction in GEMM performance. If you want a better idea of how most applications will perform, time one of LAPACK's factorizations instead.
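
If LAPACKE is available, a minimal sketch of timing LAPACK's LU (the last suggestion above) might look like this; the problem sizes and the use of clock() are illustrative choices, and you would link against the tuned BLAS/LAPACK you are evaluating:

\begin{verbatim}
/* A sketch of timing LAPACK's LU (dgetrf) through the LAPACKE
 * interface to gauge application-level impact of NB choices. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void)
{
   for (lapack_int n = 200; n <= 1000; n += 200)
   {
      double *A = malloc((size_t)n * n * sizeof(double));
      lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
      for (size_t i = 0; i < (size_t)n * n; i++)
         A[i] = rand() / (double)RAND_MAX;

      clock_t t0 = clock();
      LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
      double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
      if (secs < 1e-3) secs = 1e-3;   /* crude guard against timer resolution */

      /* LU requires roughly 2/3 n^3 flops */
      printf("N=%5d  time=%.3fs  MFLOPS=%.1f\n", (int)n, secs,
             (2.0 * n * n * n / 3.0) / (secs * 1e6));
      free(A);
      free(ipiv);
   }
   return 0;
}
\end{verbatim}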

Under no circumstances should you choose a blocking factor much larger than 120. I confine the ATLAS search to a maximum of 80 for the above reasons, though I occasionally go a little higher for machines without effective L1 caches. Going beyond this can absolutely kill application performance. Further, it is never a good idea to completely fill a Level 2 cache with your block. It may look good in GEMM, but it will die in any application, both for the reasons above and for the following: the L2 cache is a shared instruction/data cache, and filling it with data will often lead to a cycle of instruction loading and flushing when GEMM is called from a larger application. Remember that GEMM is of interest because of all the applications built from it, not for its use in isolation.

If an $N_B$ larger than 60 gains you only a few percent, always choose the smaller one; only go above 80 for a significant advantage, and essentially never go above 120 unless absolutely necessary; even then you can expect slowdowns in many applications, even once you have fully optimized all cleanup cases.
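
Expressed as a sketch, this rule of thumb might look like the following; the percentage thresholds are illustrative guesses for "a few percent" and "significant advantage", not ATLAS constants:

\begin{verbatim}
/* A sketch encoding the rule of thumb above: prefer the smaller NB
 * unless the larger one buys a meaningful fraction.  The percentage
 * cutoffs are assumed for illustration only. */
int prefer_larger_nb(int nb_large, double mf_small, double mf_large)
{
   const double gain = (mf_large - mf_small) / mf_small;
   if (nb_large > 120) return 0;              /* essentially never */
   if (nb_large > 80)  return gain > 0.10;    /* only for significant advantage */
   if (nb_large > 60)  return gain > 0.03;    /* more than "a few percent" */
   return gain > 0.0;                         /* at or below 60, take any real win */
}
\end{verbatim}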

Clint Whaley 2012-07-10