ATLAS presently empirically tunes four kernels to optimize the various
L2BLAS, as shown in Table 2.
This table shows matvec being used to tune SYMV and HEMV, and this is
mostly true. On essentially any modern x86 or ARM, GEMV will be used to
speed up HEMV and SYMV, but on other systems ATLAS will simply use the
reference implementation, which may be faster. The problem is that to
support HEMV/SYMV optimally, we need a kernel which we have not yet
empirically tuned. You can build SYMV and HEMV out of GEMV, but in
doing so you bring the matrix into registers twice (once from
memory, and once from cache). Since the load of the matrix is the
dominant cost in these operations, that is not good news. However, on
systems where vector instructions speed up memory access (modern x86)
and where the compiler can't do a great job (ARM), using a tuned GEMV
in this fashion is still faster than just calling a reference version.
So, for most systems, speeding up the GEMV kernels will cause a
corresponding speedup in SYMV and HEMV, but these operations are the
least well-tuned BLAS that ATLAS supports (as a percentage of
achievable peak, not raw MFLOPS, of course). All other L2BLAS
operations should be well optimized, particularly for large problems.
In this section, we use some macros that are automatically defined
by the ATLAS build system. The first is ATL_CINT, which is
presently an alias for const int. The macros SCALAR
and TYPE are defined according to the precision being compiled:
<pre>
        |   s   |   d    |   c    |    z    |
 SCALAR | float | double | float* | double* |
 TYPE   | float | double | float  | double  |