Speeding up the Level 2 BLAS

ATLAS presently empirically tunes four kernels to optimize the various L2BLAS, as shown in Table 2. This table shows the matvec kernels being used to tune SYMV and HEMV, and this is mostly true: for essentially any modern x86 or ARM system, GEMV will be used to speed up HEMV and SYMV, but on other systems ATLAS will simply use the reference implementation, which may be faster. The problem is that supporting HEMV/SYMV optimally requires a kernel which we have not yet empirically tuned. You can build SYMV and HEMV out of GEMV, but in doing so you bring the matrix $A$ into registers twice (once from memory, and once from cache). Since the load of $A$ is the dominant cost in these operations, that is not good news. However, on systems where vector instructions speed up memory access (modern x86) or where the compiler cannot do a great job (ARM), using a tuned GEMV in this fashion is still faster than calling a reference implementation. So, for most systems, speeding up the GEMV kernels will cause a corresponding speedup in SYMV and HEMV, but these operations are the least well-tuned BLAS that ATLAS supports (as a percentage of achievable peak, not raw MFLOPS, of course). All other L2BLAS operations should be well optimized, particularly for large problems.

Table 2: Level-2 BLAS kernels, and the L2BLAS routines they improve
Mnemonic   Operation(s)                                  Used to support
mvn_k      $y \leftarrow Ax$, $y \leftarrow Ax + y$      GEMV, TRMV, TRSV, HEMV, SYMV
mvt_k      $y \leftarrow A^Tx$, $y \leftarrow A^Tx + y$  GEMV, TRMV, TRSV, HEMV, SYMV
ger_k      $A \leftarrow xy^T + A$                       GER, GERU, GERC, SYR, HER
ger2_k     $A \leftarrow xy^T + wz^T + A$                GER2, GER2U, GER2C, SYR2, HER2

In this section, we use some macros that are automatically defined by the ATLAS build system. The first is ATL_CINT, which is presently an alias for const int. The macros SCALAR and TYPE are defined according to the precision being compiled:
Precision:  s       d        c        z
SCALAR      float   double   float*   double*
TYPE        float   double   float    double

Clint Whaley 2012-07-10