Building the General Matrix Multiply From gemmK

This section describes the code necessary to build the BLAS's general matrix-matrix multiply using an L1 cache-contained matmul (hereafter referred to as gemmK).

For our present discussion, it is enough to know that ATLAS has at its disposal highly optimized routines for performing matrix multiplies whose dimensions are chosen such that cache blocking is not required (i.e., the hand-written code discussed in this section deals with cache blocking; the generated/user-supplied kernel assumes its operands fit in cache).

When the user calls GEMM, ATLAS must decide whether the problem is large enough to tolerate copying the input matrices $A$ and $B$. If the matrices are large enough to support this $O(N^2)$ overhead, ATLAS copies $A$ and $B$ into block-major format. ATLAS's block-major format breaks the input matrices into contiguous blocks of a fixed size $N_B$, where $N_B$ is chosen to maximize L1 cache reuse. Once in block-major format, the blocks are contiguous, which eliminates TLB problems, minimizes cache thrashing, and maximizes cache line use. The copy also allows ATLAS to apply $\alpha$ (if $\alpha \neq 1$) to the smaller of $A$ or $B$, minimizing this cost as well.

Finally, the package can use the copy to transform the problem to a particular transpose setting, which, for load and indexing optimization, is chosen so that $A$ is copied into transposed form and $B$ remains in normal (non-transposed) form. This means the L1-cache-contained code is of the form $C \leftarrow A^T B$, $C \leftarrow A^T B + C$, or $C \leftarrow A^T B + \beta C$, where all dimensions, including the non-contiguous stride, are known to be $N_B$. Knowing all of the loop dimensions allows arbitrary unrollings (i.e., if the instruction cache could support it, ATLAS could unroll all loops completely, so that the L1 cache-contained multiply had no loops at all). Further, when the code generator knows the leading dimension of the matrices (i.e., the row stride), all indexing can be done up front, without the need for expensive integer or pointer computations.
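As a rough illustration of the copy step, the following C sketch copies one $N_B \times N_B$ block of a column-major matrix $A$ into a contiguous buffer in transposed form while folding in $\alpha$; the routine name, the value of NB, and the use of double precision are illustrative assumptions, not ATLAS's internal interface.

    #include <stddef.h>

    #define NB 40  /* illustrative L1 blocking factor; ATLAS tunes this value at install time */

    /* Copy one NB x NB block of the column-major matrix A (leading dimension lda)
     * into a contiguous buffer in transposed form, scaling by alpha on the way.
     * After this copy, the kernel indexes the block with the fixed stride NB, so
     * no lda-dependent pointer arithmetic remains in its inner loops.            */
    static void copy_block_AT(const double *A, size_t lda, double alpha,
                              double *ablk)
    {
        for (size_t j = 0; j < NB; j++)        /* source column within the block */
            for (size_t i = 0; i < NB; i++)    /* source row within the block    */
                ablk[j + i*NB] = alpha * A[i + j*lda];  /* store transposed      */
    }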

If the matrices are too small, the $O(N^2)$ data copy cost can actually dominate the algorithm cost, even though the computation cost is $O(N^3)$. For these matrices, ATLAS calls a gemm kernel which operates on non-copied matrices (i.e., directly on the user's operands). The non-copy matmul kernels will generally not be as efficient as even the generated copy gemmK; at this problem size the main drawback is the additional pointer arithmetic required to support the user-supplied leading dimension, and its effect on the cost of the memory loads (which varies with the transpose settings, as well as architectural features).
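The indexing difference can be seen by comparing the inner products the two kernels must compute; the fragment below is only an illustration of that difference (the routine names and the value of NB are placeholders), not ATLAS code.

    #include <stddef.h>

    #define NB 40  /* illustrative blocking factor */

    /* Non-copy case: the user-supplied leading dimensions lda and ldb enter every
     * address computation, and A is walked with stride lda, typically touching a
     * new cache line on each iteration.                                          */
    static double dot_nocopy(const double *A, const double *B, size_t i, size_t j,
                             size_t K, size_t lda, size_t ldb)
    {
        double cij = 0.0;
        for (size_t l = 0; l < K; l++)
            cij += A[i + l*lda] * B[l + j*ldb];
        return cij;
    }

    /* Copied case: A^T and B are contiguous NB x NB blocks, so both operands are
     * walked with unit stride and the only stride, NB, is a compile-time constant
     * the code generator can fold into its unrollings.                           */
    static double dot_copied(const double *At, const double *Bb, size_t i, size_t j)
    {
        double cij = 0.0;
        for (size_t l = 0; l < NB; l++)
            cij += At[l + i*NB] * Bb[l + j*NB];
        return cij;
    }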

The point at which the copy pays off (and below which it is prohibitively expensive) is an AEOS parameter; this crossover point turns out to depend strongly on both the particular architecture and the shape of the operands (matrix shape effectively sets limits on which matrix dimensions can enjoy cache reuse). To handle this problem, ATLAS simply compares the speed of the copy and non-copy matmul kernels for variously shaped matrices, varying the problem size until the copy code provides a speedup (on some platforms, and with some shapes, this point is never reached). These crossover points are determined at install time, and then used to make the decision at runtime. Because it is the dominant case, this paper describes only the copied matmul algorithm in detail.
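A minimal sketch of how such a crossover point might be located at install time follows; the timing routines are placeholders for whatever probes the install framework supplies, and the square-shape search shown is only one of the shapes that would be examined.

    /* Placeholder install-time timers for the two code paths; they are assumed to
     * return the time taken by the copy and non-copy matmuls on a problem of the
     * given dimensions.                                                          */
    double time_copy_mm(int M, int N, int K);
    double time_nocopy_mm(int M, int N, int K);

    /* Scan increasing (square) problem sizes until the copy code wins; the size
     * found is recorded at install time and consulted at runtime.  On some
     * platforms and shapes no such size exists, and -1 is returned.              */
    int find_copy_crossover(int maxN, int step)
    {
        for (int n = step; n <= maxN; n += step)
            if (time_copy_mm(n, n, n) < time_nocopy_mm(n, n, n))
                return n;
        return -1;  /* copying never paid off within the probed range */
    }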

Figure 1 shows the necessary steps for computing an $N_B \times N_B$ section of $C$ using gemmK.

Figure 1: One step of matrix-matrix multiply

More formally, the following actions are performed in order to calculate the $N_B \times N_B$ block $C_{i,j}$, where $i$ and $j$ are in the range $0 \leq i < \lceil M/N_B \rceil $, $0 \leq j < \lceil N/N_B \rceil $:

  1. Call gemmK of the correct form based on the user-defined $\beta$ (e.g., if $\beta = 0$, use $C \leftarrow A B$) to multiply block $0$ of the row panel $i$ of $A$ with block $0$ of the column panel $j$ of $B$.
  2. Call gemmK of the form $C \leftarrow A B + C$ to multiply block $k$ of the row panel $i$ of $A$ with block $k$ of the column panel $j$ of $B$, $\forall k, 1 \leq k < \lceil K/N_B \rceil$. A code sketch of these two steps follows this list.
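Putting the two steps together, a skeleton for computing one $C_{i,j}$ block might look like the sketch below; the kernel names (gemmK_b0, gemmK_b1, gemmK_bX) and the assumption that each panel is stored as contiguous $N_B \times N_B$ blocks are illustrative, not ATLAS's actual internal interface.

    #include <stddef.h>

    #define NB 40  /* illustrative blocking factor */

    /* Assumed L1-contained kernels on NB x NB block-major operands:
     *   gemmK_b0 : C <- A^T B             (beta == 0)
     *   gemmK_b1 : C <- A^T B + C         (beta == 1)
     *   gemmK_bX : C <- A^T B + beta*C    (general beta)                        */
    void gemmK_b0(const double *At, const double *B, double *C);
    void gemmK_b1(const double *At, const double *B, double *C);
    void gemmK_bX(const double *At, const double *B, double *C, double beta);

    /* Compute one NB x NB block C(i,j) from row panel i of the copied A and
     * column panel j of the copied B; Apanel and Bpanel each hold Kb contiguous
     * NB x NB blocks.                                                            */
    static void compute_Cij(const double *Apanel, const double *Bpanel,
                            double *Cij, size_t Kb, double beta)
    {
        /* Step 1: the first block multiply absorbs the user's beta. */
        if (beta == 0.0)
            gemmK_b0(Apanel, Bpanel, Cij);
        else if (beta == 1.0)
            gemmK_b1(Apanel, Bpanel, Cij);
        else
            gemmK_bX(Apanel, Bpanel, Cij, beta);

        /* Step 2: the remaining block multiplies accumulate into Cij. */
        for (size_t k = 1; k < Kb; k++)
            gemmK_b1(Apanel + k*NB*NB, Bpanel + k*NB*NB, Cij);
    }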

As this example demonstrates, if a given dimension is not a multiple of the L1 blocking factor $N_B$, partial blocks result. ATLAS has special routines that handle cases where one or more dimensions are less than $N_B$; these routines are referred to as cleanup codes.
