GEMV kernel notes

All routines except SYMV call the GEMV kernel in the same fashion. For these routines, the load of $A$ cannot be reduced below $O(N^2)$, but the memory access of both $X$ and $Y$ can be reduced from $O(N^2)$ to $O(N)$. In general, the $Y$ access is reduced by register blocking in the GEMV kernel. The higher-level routines therefore block $X$ so that it is reused across kernel invocations in L1 (if you write an axpy-based NoTranspose GEMV kernel, $Y$ is blocked instead of $X$). In practice, this amounts to partitioning $X$ via $N_p = \frac{S_1 - R_y}{R_y+1}$, where $S_1$ is the size, in elements, of the Level 1 cache, $N_p$ is the partition length of $X$, and $R_y$ corresponds to the Yunroll of your input file. The actual equation is a little more complicated than this, since ATLAS may want to use less than the full $S_1$ to avoid cache thrashing and throwing useful sections away between kernel calls. However, this gives the user some idea of the importance of these parameters. In particular, it shows that Yunroll should not be allowed to grow too large, for fear of making the $X$ loop too short to support good optimization.

Also, note that after the first invocation of the kernel, $X$ will come from L1, leaving $A$ as the dominant data cost.

At present, SYMV uses a different blocking that blocks both $X$ and $Y$ (all other routines block only one dimension), so that $A$ is reused between calls to the Transpose and NoTranspose kernels. This may eventually change as greater sophistication is achieved (as you might imagine, you get two very different GEMV kernels depending on whether one expects $A$ from main memory or, as in this case, from L1; this means we may at some point generate a specialized L1-contained GEMV kernel).

Note that the $\beta = 0$ case must not read $Y$, since that memory may legally be uninitialized.

Clint Whaley 2012-07-10