One of the things I do most frequently with user-submitted kernels is reduce
the blocking factor that the user has chosen. I often choose an NB smaller
than the one that gives the best asymptotic GEMM performance, and even more
often choose one that does not yield the best performance in the kernel timer.
To understand why, you must understand the following points, explained in
turn below:
- Better kernel timing (e.g., make ummcase in your
<OBJdir>/tune/blas/gemm/ directory) does not always yield better
total GEMM performance
- Large NB means significantly more time in cleanup code
- Large NB means significantly more time in unblocked application code
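
To get a feel for the cleanup point, here is a rough sketch in C. It is a toy model under assumptions of my own, not ATLAS's actual accounting: the problem is a square N x N x N GEMM, only the part of each dimension covered by whole NB-sized blocks runs in the tuned kernel, and everything else counts as cleanup. The function names full_block_fraction and cleanup_fraction are hypothetical and do not appear anywhere in ATLAS.

```c
#include <assert.h>

/* Toy model (an assumption, not ATLAS's own accounting): for a square
 * n x n x n GEMM, only the part of each dimension covered by whole
 * blocks of size nb is handled by the tuned full-block kernel.
 * Returns the fraction of the n^3 multiply-adds done in that kernel. */
double full_block_fraction(int n, int nb)
{
    assert(n > 0 && nb > 0);
    int covered = (n / nb) * nb;          /* integer division drops the remainder */
    double r = (double)covered / (double)n;
    return r * r * r;                     /* same coverage in M, N and K */
}

/* Fraction of the multiply-adds left to cleanup code under the same model. */
double cleanup_fraction(int n, int nb)
{
    return 1.0 - full_block_fraction(n, nb);
}
```

Under this model, with N = 100, NB = 20 leaves nothing to cleanup, NB = 40 leaves about 0.488 of the work to cleanup, NB = 60 leaves about 0.784, and NB = 120 leaves all of it, since no full block fits. The model ignores how fast the cleanup code itself runs and says nothing about the time spent in unblocked application code, so treat it only as an illustration of the trend.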
Clint Whaley
2012-07-10