ATLAS presently empirically tunes four kernels to optimize the various L2BLAS routines, as shown in Table 2. This table shows matvec being used to tune SYMV and HEMV, and this is mostly true: on essentially any modern x86 or ARM system, GEMV will be used to speed up HEMV and SYMV, but on other systems ATLAS will simply use the reference implementation, which may be markedly slower.

The problem is that supporting HEMV/SYMV optimally requires a kernel which we have not yet empirically tuned. You can build SYMV and HEMV out of GEMV, but in doing so you bring the matrix into registers twice (once from memory, and once from cache). Since the load of the matrix is the dominant cost in these operations, that is not good news. However, on systems where vector instructions speed up memory access (modern x86) or where the compiler can't do a great job on its own (ARM), using a tuned GEMV in this fashion is still faster than just calling a reference version. So, for most systems, speeding up the GEMV kernels will cause a corresponding speedup in SYMV and HEMV, but these operations are the least well-tuned BLAS that ATLAS supports (as a percentage of achievable peak, not raw MFLOP, of course). All other L2BLAS operations should be well optimized, particularly for large problems.
In this section, we use some macros that are automatically defined
by the ATLAS build system. The first is ATL_CINT, which is
presently an alias for const int. The macros SCALAR
and TYPE are defined according to the precision being compiled:
<pre>
        |   s   |   d    |   c    |    z    |
 SCALAR | float | double | float* | double* |
 TYPE   | float | double | float  | double  |