ATLAS presently empirically tunes four kernels to optimize the various
L2BLAS, as shown in Table 2.
This table shows matvec being used to tune SYMV and HEMV, and this is
mostly true. On essentially any modern x86 or ARM, GEMV will be used to
speed up HEMV and SYMV, but on other systems ATLAS will simply use the
reference implementation, which may be faster. The problem is that to
support HEMV/SYMV optimally, we need a kernel which we have not yet
empirically tuned. You can build SYMV and HEMV out of GEMV, but in
doing so you bring the matrix into registers twice (once from
memory, and once from cache). Since the load of the matrix is the
dominant cost in these operations, that is not good news. However, on
systems where vector instructions speed up memory access (modern x86)
and where the compiler can't do a great job (ARM), using a tuned GEMV
in this fashion is still faster than just calling a reference version.
So, for most systems, speeding up the GEMV kernels will cause a
corresponding speedup in SYMV and HEMV, but these operations are the
least well-tuned BLAS that ATLAS supports (as a percentage of
achievable peak, not raw MFLOPS, of course). All other L2BLAS
operations should be well optimized, particularly for large problems.
In this section, we use some macros that are automatically defined
by the ATLAS build system. The first is ATL_CINT, which is
presently an alias for const int. The macros SCALAR
and TYPE are defined according to the precision being compiled:
<pre>
        |   s   |   d    |   c    |    z    |
 SCALAR | float | double | float* | double* |
 TYPE   | float | double | float  | double  |