R. Clint Whaley 1
This paper describes the method by which users can speed up ATLAS (Automatically Tuned Linear Algebra Software) for themselves, as well as contribute any such speedup to the ATLAS project.
ATLAS is an implementation of a new style of high performance software production/maintenance called Automated Empirical Optimization of Software (AEOS). In an AEOS-enabled library, many different ways of performing a given kernel operation are supplied, and timers are used to empirically determine which implementation is best for a given architectural platform. ATLAS uses two techniques for supplying different implementations of kernel operations: multiple implementation and code generation.
In code generation, a highly-parameterized program is written that can generate many different kernel implementations. The matrix multiply code generator is an example of this. The second method is multiple implementation, and this, as its name suggests, is simply supplying various hand-written implementations of the same kernel.
ATLAS provides a standard way for users to help with multiple implementation. ATLAS is designed such that several kernel routines supply performance for the entire library. The entire Level 3 BLAS may be speeded up by improving the one simple kernel, which we will refer to as gemmK, to distinguish it from a full GEMM routine. The Level 2 routines make similarly be sped up by providing GER and GEMV kernels (there are several of these, as discussed later). ATLAS has standard timers which can call user-programmed versions of these kernels, and automatically use them throughout the library when they are superior to the ATLAS-produced versions.
One thing to consider when getting started is to take the best ATLAS kernel found, and, for instance, add some prefetch instructions, and see if you can get noticeable improvements.
If you produce a substantially better kernel than ATLAS presently has, we hope that you will contribute it to the ATLAS project. Code submissions and discussions of this type should take place on the math-atlas-devel@lists.sourceforge.net mailing list. Anyone can sign up for this list, and we hope that interested parties will correspond using it. In this way someone interested in supplying a particular enhancement to ATLAS can find if someone else is already working on it, find interested collaborators, get coding help from a wider pool than with the standard atlas mailing list, etc.
http://math-atlas.sourceforge.net/faq.html#lists.
ATLAS uses a BSD-style license for its distribution, and thus any code that you wish to contribute must have a compatible license. This effectively rules out GPL and even LGPL licenses, since if the ATLAS group redistributed any routines with these licenses, the conditions of their licenses would effectively change the entire ATLAS library to GPL or LGPL.
Note that you, the author, retain the copyright, and thus you can relicense your code in any way you wish. So, anything released as part of the ATLAS tarfile will need a BSD-compatible license, but you could then issue the code yourself under a different license. Alternatively, you can provide your codes as ``patches'' to standard ATLAS, and thus avoid the license issue altogether, at the cost of not being part of the standard tarfile. Even if you decide not to license your contribution in a way that the ATLAS group can redistribute, submitting it to the mailing list is still recommended. Other people can see your work, and we can point to it if we think it is cool enough.
We have set in place some infrastructure in order to give contributers credit for their work. Firstly, if your code is included in the standard tarfile, your contribution will be noted in ATLAS/doc/AtlasCredits.txt. As the author, you retain the copyright, and thus your name is in the contributed code. We have also modified the installation logger to print the author name for user-provided routines (the exception is the Level 1 kernels, which are simply too numerous for logging). We will make a good faith effort to give due credit for work, but in general, we can't guarantee anything.
As an aside, if fame and fortune are your major motivations, you may want to consider contributing to something else anyway. As the founder of the ATLAS project, I'm still waiting for my first interview with CNN, NBC, etc., or indeed, anyone other then my relatives (and all those interviews go like, "You're in computers, can't you get those wordperfect guys to make it easier to use?").
If you submit an improvement to math-atlas-devel@lists.sourceforge.net, we hope that other ATLAS users will try it out, and give comments. Ultimately, someone in the ATLAS group (currently, that would be me) will decide whether to include it in the standard tarfile or not. Such a decision will be based on how hard it is to get to work with our distribution (this pain should be minimal if the instructions in this note are followed), how much of an improvement it is, etc. Obviously, there are speed improvements that some users may utilize that the ATLAS group can't indulge in (examples include speedups that possibly lose accuracy, such as, for instance, using Strassen's algorithm for matrix multiply). If there appears to be a great demand for such lossy optimizations, we may consider ways of allowing users to select if they want to risk them.
Note that even if we are ungrateful morons who do not release your submission, you, and anyone who wants to use your kernel routines, can use the ATLAS infrastructure to get a complete, fast BLAS implementation.
The ATLAS team encourages:
The performance kernel for the entire Level 3 BLAS is matrix multiply.
Matrix multiply is written in terms of a lower-level building block that
we call gemmK. gemmK is a special matrix multiply where
the input dimensions are fixed at
, where the blocking
factor
is chosen in order to maximize L1 cache reuse, for a loose
enough definition of L1 cache (typically, we use it to mean the first
level of cache accessible by the FPU, which may be the L2 cache on
some systems).
ATLAS actually has two different classes of GEMM kernels: one for copied
matrices (gemmK), and one that operates directly on the user's matrices
without a copy. For matrices of
sufficient size, ATLAS copies the input matrix into block-major storage.
In block-major storage, the
blocks operated on by the
gemmK are actually contiguous. This optimization prevents unnecessary
cache misses, cache conflicts, and TLB problems. However, for sufficiently
small matrices, the cost of this data copy is prohibitively expensive,
and thus ATLAS has kernels that operate on non-copied data. However,
without the copy to simplify the process, there are multiple non-copy
kernels (differing kernels for differing transpose settings, for instance).
Since the non-copy kernels are typically only used for very small problems,
and they are much more complex, ATLAS presently accepts contributed code
only for the copy matmul kernel. For most problems, well over 98% of ATLAS
time is spent in the copy matmul kernel, so this should not be much of
a problem.
This section describes the code necessary to build the BLAS's general matrix-matrix multiply using an L1 cache-contained matmul (hereafter referred to as gemmK).
For our present discussion, it is enough to know that ATLAS has at its disposal highly optimized routines for doing matrix multiplies whose dimensions are chosen such that cache blocking is not required (i.e., the hand-written code discussed in this section deals with cache blocking; the generated/user supplied kernel assumes things fit into cache).
When the user calls GEMM, ATLAS must decide whether the problem
is large enough to tolerate copying the input matrices
and
.
If the matrices are large enough to support this
overhead, ATLAS will
copy
and
into block-major format. ATLAS's block-major format breaks up the input
matrices into contiguous blocks of a fixed size
, where
is chosen
in order to maximize L1 cache
reuse. Once in block-major format, the blocks are contiguous, which
eliminates TLB problems, minimizes cache thrashing and maximizes cache line
use. It also allows ATLAS to apply alpha (if alpha is not already one) to the
smaller of
or
, thus minimizing this cost as well. Finally, the
package can use the copy to transform the problem to a particular transpose
setting, which for load and indexing optimization, is set so
A is copied to transposed form, and B is in normal (non-transposed) form.
This means our L1-cache contained code is of the form
,
,
and
,
where all dimensions, including the non-contiguous stride, are known
to be
. Knowing all of the dimensions of the loops allows for
arbitrary unrollings (i.e., if the instruction cache could support it, ATLAS
could unroll all loops completely, so that the L1 cache-contained multiply
had no loops at all). Further, when the code generator knows leading dimension
of the matrices (i.e., the row stride), all indexing can be done up front,
without the need for expensive integer or pointer computations.
If the matrices are too small, the
data copy cost can actually
dominate the algorithm cost, even though the computation cost is
.
For these matrices, ATLAS will call an gemm kernel which operates on non-copied
matrices (i.e. directly on the user's operands). The non-copy matmul kernels
will generally not be as efficient as the even the generated copy gemmK;
at this problem
size the main drawback is the additional pointer arithmetic required in order
to support the user-supplied leading dimension and its affect on the cost of
the memory load (which varies according to transpose settings, as well as
architectural features).
The choice of when a copy is dictated and when it is prohibitively expensive is an AEOS parameter; it turns out that this crossover point depends strongly both on the particular architecture, and the shape of the operands (matrix shape effectively sets limits on which matrix dimensions can enjoy cache reuse). To handle this problem, ATLAS simply compares the speed of the copy and non-copy matmul kernels for variously shaped matrices, varying the problem size until the copy code provides a speedup (on some platforms, and with some shapes, this point is never reached). These crossover points are determined at install time, and then used to make this decision at runtime. Because it is the dominant case, this paper describes only the copied matmul algorithm in detail.
Figure 1 shows the necessary steps for computing a
section of
using gemmK.
More formally, the following actions are performed in order to
calculate the
block
, where
and
are in
the range
,
:
As this example demonstrates, if a given dimension is not a multiple of
the L1 blocking factor
, partial blocks results. ATLAS has special
routines that handle cases where one or more dimension is less than
;
these routines are referred to as cleanup codes.
First, ATLAS defines the macro ATL_USERMM to the appropriate ATLAS internal kernel name. Second, it defines one of SREAL, DREAL, SCPLX, DCPLX, according to the data type being compiled (single precision real, double precision real, single precision complex, double precision complex, respectively).
Similarly, ATLAS defines a macro indicating the
case being
compiled, BETA1 (
should be assumed to be 1.0), BETA0
(
should be assumed to be 0.0), and BETAX
(
neither zero or one, and should be handled as an input parameter).
Finally, the fixed blocking factors for each dimension are defined
MB, NB, KB. Note that for our gemmK,
MB = NB = KB =
; they are separated out for
support of the cleanup codes, where they can be different. ATLAS also
defines the macros MB2, NB2, KB2, which are simply
two times the appropriate blocking factor.
#if defined(SREAL) || defined(SCPLX)
#elif defined(DREAL) || defined(DCPLX)
#endif
The first line of each file is a comment line, and is ignored. The next
line indicates the number of user-contributed codes to search, and
each subsequent line supplies information about a given user-supplied
gemmK. The form of these lines is:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<author>"
<rout> and <author>" are strings, and the rest of the
parameters are signed integers.
The meaning of these parameters are:
ID: Strictly positive integer which uniquely identifies this
descriptor line. ID must by unique only within a precision.
<flag>: flag indicating special conditions. See table below.
<mb>, <nb>, <kb>:
Used to indicate restriction on the input parameter <mb> (i.e., the blocking factor cannot be
varied using a macro). If the value is positive, the blocking factor
can be varied by setting the appropriate macro
(MB NB, KB, resp.), but the blocking factor must be
a multiple of the value. Therefore, setting <mb> = 4, indicates
that MB must be a multiple of 4, while setting it to 1 indicates
that MB is an arbitrary compile-time constant.
<muladd>:
Set to zero if you are using separate multiply and add instructions, 1
otherwise. If you don't know the answer, put 1.
<lat>:
Set to the latency you use between floating point instructions.
If you don't know the answer, put 1.
<mu>:
Unrolling you are using for the <nu>:
Unrolling you are using for the <ku>:
Unrolling you are using for the <rout>:
The filename of the user-contributed routine, relative to the path
ATLAS/tune/blas/gemm/CASES. Maximum length 64 chars.
<author>:
The name of the author or authors, enclosed in quotes.
Maximum length 64 chars.
Table 1 summarizes the presently defined flag values.
|
Here's an example:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<Contributer>" 3 1 0 0 0 0 1 1 1 1 1 ATL_mm1x1x1.c "R. Clint Whaley" 2 0 1 1 1 1 1 1 1 1 ATL_mm1x1x1b.c "R. Clint Whaley" 3 0 1 1 8 1 1 1 1 4 ATL_mm2.c "R. Clint Whaley"
So, we have 3 user-supplied routines, all written by me. The first loops
over
,
, and
, but the following two routines loop over the cpp
macros MB, NB, KB. The third routine insists that
KB be a multiple of 8. The first two routines don't unroll
any of the loops, while the third unrolls the K loop to a depth of 4.
They all use a combined muladd style of programming, and don't worry
about latency.
Let's say we decide to cover the basics, the classical 3 do loop implementation of matmul would be:
void ATL_USERMM
(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, const double *B, const int ldb,
const double beta, double *C, const int ldc)
{
int i, j, k;
register double c00;
for (j=0; j < N; j++)
{
for (i=0; i < M; i++)
{
#ifdef BETA0
c00 = 0.0;
#elif defined(BETA1)
c00 = C[i+j*ldc];
#else
c00 = C[i+j*ldc] * beta;
#endif
for (k=0; k < K; k++) c00 += A[k+i*lda] * B[k+j*ldb];
C[i+j*ldc] = c00;
}
}
}
We then save this paragon of performance to ATLAS/tune/blas/gemm/CASES/ATL_mm1x1x1.c. From BLDdir/tune/blas/gemm/, we can test that it gets the right answer by:
make mmutstcase pre=d nb=40 mmrout=CASES/ATL_mm1x1x1.c beta=0 make mmutstcase pre=d nb=40 mmrout=CASES/ATL_mm1x1x1.c beta=1 make mmutstcase pre=d nb=40 mmrout=CASES/ATL_mm1x1x1.c beta=7
We pass four arguments to mmutstcase, a precision specifier
(d : double precision real; s : single precision real;
z : double precision complex; c : single precision complex),
the size of the blocking parameter
, the beta value to test
(0, 1, and other), and finally, the filename to test.
If these messages pass the test, we can then see what kind of performance we get by (this is the actual output on my 266Mhz PII):
make ummcase pre=d nb=40 mmrout=CASES/ATL_mm1x1x1.c beta=1 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.820000, mflop=53.731868 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.810000, mflop=54.028729 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
This is the same timing repeated three times (this just tries to ensure timings are repeatable), and the only output of real interest is the MFLOP rate at the end. The values the timer prints (mu, nu, ku, lat) are all defaults because we didn't specify them; specifying them has no effect when the timer is used in this way, so don't worry about them.
Now we can trivially improve the implementation by using the macro constants in order to let the compiler unroll the loops:
void ATL_USERMM
(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, const double *B, const int ldb,
const double beta, double *C, const int ldc)
{
int i, j, k;
register double c00;
for (j=0; j < NB; j++)
{
for (i=0; i < MB; i++)
{
#ifdef BETA0
c00 = 0.0;
#elif defined(BETA1)
c00 = C[i+j*ldc];
#else
c00 = C[i+j*ldc] * beta;
#endif
for (k=0; k < KB; k++) c00 += A[k+i*KB] * B[k+j*KB];
C[i+j*ldc] = c00;
}
}
}
We save this to ATL_mm1x1x1b.c, and then time:
make ummcase pre=d nb=40 mmrout=CASES/ATL_mm1x1x1b.c beta=1 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.670000, mflop=58.558084 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.660000, mflop=58.910843 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.670000, mflop=58.558084
OK, maybe a little explicit loop unrolling will make things work better:
void ATL_USERMM
(const int M, const int N, const int K, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc)
{
int i, j, k;
register double c00, c10, b0;
const double *pA0, *pB=B;
#if ( (KB / 8)*8 != KB ) || (MB / 2)*2 != MB
create syntax error!$@@&
#endif
for (j=0; j < NB; j++, pB += KB)
{
pA0 = A;
for (i=0; i < MB; i += 2, pA0 += KB2)
{
#ifdef BETA0
c00 = c10 = 0.0;
#elif defined(BETA1)
c00 = C[i+j*ldc];
c10 = C[i+1+j*ldc];
#else
c00 = beta*C[i+j*ldc];
c10 = beta*C[i+1+j*ldc];
#endif
for (k=0; k < KB; k += 8)
{
b0 = pB[k];
c00 += pA0[k] * b0;
c10 += pA0[KB+k] * b0;
b0 = pB[k+1];
c00 += pA0[k+1] * b0;
c10 += pA0[KB+k+1] * b0;
b0 = pB[k+2];
c00 += pA0[k+2] * b0;
c10 += pA0[KB+k+2] * b0;
b0 = pB[k+3];
c00 += pA0[k+3] * b0;
c10 += pA0[KB+k+3] * b0;
b0 = pB[k+4];
c00 += pA0[k+4] * b0;
c10 += pA0[KB+k+4] * b0;
b0 = pB[k+5];
c00 += pA0[k+5] * b0;
c10 += pA0[KB+k+5] * b0;
b0 = pB[k+6];
c00 += pA0[k+6] * b0;
c10 += pA0[KB+k+6] * b0;
b0 = pB[k+7];
c00 += pA0[k+7] * b0;
c10 += pA0[KB+k+7] * b0;
}
C[i+j*ldc] = c00;
C[i+1+j*ldc] = c10;
}
}
}
And with this ode to beauty and elegance we get (after checking that it still gets the right answer, of course):
make ummcase pre=d nb=40 mmrout=CASES/ATL_mm2x1x8.c beta=1 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.720000, mflop=135.822222 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.710000, mflop=137.735211 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.710000, mflop=137.735211
Its interesting to see the effects of differing
on the code:
make ummcase pre=d nb=40 mmrout=CASES/ATL_mm2x1x8a.c beta=0 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.700000, mflop=139.702857 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.700000, mflop=139.702857 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.700000, mflop=139.702857 make ummcase pre=d nb=40 mmrout=CASES/ATL_mm2x1x8a.c beta=7 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.720000, mflop=135.822222 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.730000, mflop=133.961644 dNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=0.720000, mflop=135.822222
As well as differing blocking factors:
make ummcase pre=d mmrout=CASES/ATL_mm2x1x8a.c beta=1 nb=16 dNB=16, ldc=16, mu=4, nu=4, ku=1, lat=4: time=0.850000, mflop=115.112056 dNB=16, ldc=16, mu=4, nu=4, ku=1, lat=4: time=0.860000, mflop=113.773544 dNB=16, ldc=16, mu=4, nu=4, ku=1, lat=4: time=0.850000, mflop=115.112056 make ummcase pre=d mmrout=CASES/ATL_mm2x1x8a.c beta=1 nb=32 dNB=32, ldc=32, mu=4, nu=4, ku=1, lat=4: time=0.730000, mflop=134.034586 dNB=32, ldc=32, mu=4, nu=4, ku=1, lat=4: time=0.740000, mflop=132.223308 dNB=32, ldc=32, mu=4, nu=4, ku=1, lat=4: time=0.730000, mflop=134.034586 make ummcase pre=d mmrout=CASES/ATL_mm2x1x8a.c beta=1 nb=48 dNB=48, ldc=48, mu=4, nu=4, ku=1, lat=4: time=0.820000, mflop=119.223571 dNB=48, ldc=48, mu=4, nu=4, ku=1, lat=4: time=0.820000, mflop=119.223571 dNB=48, ldc=48, mu=4, nu=4, ku=1, lat=4: time=0.820000, mflop=119.223571
If we wanted to have ATLAS try these crappy implementations during the ATLAS search, we would have the following ATLAS/tune/blas/gemm/CASES/dcases.dsc:
<ID> <flag> <mb> <nb> <kb> <muladd> <lat> <mu> <nu> <ku> <rout> "<Contributer>" 3 1 0 0 0 0 1 1 1 1 1 ATL_mm1x1x1.c "R. Clint Whaley" 2 0 1 1 1 1 1 1 1 1 ATL_mm1x1x1b.c "R. Clint Whaley" 3 0 2 1 8 1 1 2 1 8 ATL_mm2x1x8a.c "R. Clint Whaley"
speedy. cat res/dMMRES
MULADD LAT NB MU NU KU FFTCH IFTCH NFTCH MFLOP
0 5 40 2 1 40 0 2 1 197.94
16
We generate and time this case by:
make mmcase muladd=0 lat=5 nb=40 mu=2 nu=1 ku=40 beta=1 dNB=40, ldc=40, mu=2, nu=1, ku=40, lat=5: time=0.490000, mflop=199.575510 dNB=40, ldc=40, mu=2, nu=1, ku=40, lat=5: time=0.500000, mflop=195.584000 dNB=40, ldc=40, mu=2, nu=1, ku=40, lat=5: time=0.490000, mflop=199.575510
We test that the generator isn't out of its mind by:
make mmtstcase muladd=0 lat=5 nb=40 mu=2 nu=1 ku=40 beta=1
Note that when timing/testing the generator, varying the parameters such as mu, nu, ku, beta, etc., generates different codes, and thus different performance numbers:
make mmcase muladd=0 lat=4 nb=40 mu=2 nu=1 ku=4 beta=1 dNB=40, ldc=40, mu=2, nu=1, ku=4, lat=4: time=0.760000, mflop=128.673684 dNB=40, ldc=40, mu=2, nu=1, ku=4, lat=4: time=0.770000, mflop=127.002597 dNB=40, ldc=40, mu=2, nu=1, ku=4, lat=4: time=0.770000, mflop=127.002597 make mmcase muladd=0 lat=4 nb=40 mu=2 nu=1 ku=8 beta=1 dNB=40, ldc=40, mu=2, nu=1, ku=8, lat=4: time=0.640000, mflop=152.800000 dNB=40, ldc=40, mu=2, nu=1, ku=8, lat=4: time=0.630000, mflop=155.225397 dNB=40, ldc=40, mu=2, nu=1, ku=8, lat=4: time=0.630000, mflop=155.225397 make mmcase muladd=0 lat=4 nb=40 mu=2 nu=2 ku=40 beta=1 dNB=40, ldc=40, mu=2, nu=2, ku=40, lat=4: time=2.550000, mflop=38.349804 dNB=40, ldc=40, mu=2, nu=2, ku=40, lat=4: time=2.560000, mflop=38.200000 dNB=40, ldc=40, mu=2, nu=2, ku=40, lat=4: time=2.550000, mflop=38.349804
This works fine assuming
is real, otherwise
must be applied
explicitly as a complex scalar, and then set
in the above outline.
Therefore, in order to use this trick, upon copying
and
to block-major
storage, ATLAS also splits the arrays into real and imaginary components.
The only matrix not expressed as two real matrices is then
, and to
fix this problem, ATLAS demands that the complex gemmK stride
by
2. An example will solidify the confusion.
A simple 3-loop implementation of an ATLAS complex gemmK is:
void ATL_USERMM
(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, const double *B, const int ldb,
const double beta, double *C, const int ldc)
{
int i, j, k;
register double c00;
for (j=0; j < N; j++)
{
for (i=0; i < M; i++)
{
#ifdef BETA0
c00 = 0.0;
#else
c00 = C[2*(i+j*ldc)];
#ifdef BETAX
c00 *= beta;
#endif
#endif
for (k=0; k < K; k++) c00 += A[k+i*lda] * B[k+j*ldb];
C[2*(i+j*ldc)] = c00;
}
}
}
First we test that it produces the right answer:
make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=0 make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1 make cmmutstcase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=8
Then we scope the awesome performance:
make cmmucase pre=z nb=40 mmrout=CASES/ATL_cmm1x1x1.c beta=1 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251 zNB=40, ldc=40, mu=4, nu=4, ku=1, lat=4: time=1.830000, mflop=53.438251
Now, since we are clearly gluttons for punishment, we compare our masterwork to ATLAS's generated kernel:
speedy. cat res/zMMRES
MULADD LAT NB MU NU KU FFTCH IFTCH NFTCH MFLOP
0 5 36 2 1 36 0 2 1 186.42
10
speedy. make mmcase muladd=0 lat=5 nb=36 mu=2 nu=1 ku=36 beta=1
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.500000, mflop=195.581952
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420
dNB=36, ldc=36, mu=2, nu=1, ku=36, lat=5: time=0.490000, mflop=199.573420
make ummcase pre=d mmrout=CASES/myassembler.c nb=16 \
DMC=gcc DMCFLAGS="-x assembler-with-cpp"
ATLAS therefore normally generates a variable number of cleanup cases,
with the number of generated codes minimally being
, and the maximum
number being
. The number of generated codes can vary because
the
cleanup routines are special, sometimes requiring
different
codes to handle efficiently, as we will see below.
ATLAS splits the generated cleanup into these categories
So we see that K-cleanup is special in several ways. First, it is the
most general cleanup routine, since it can handle multiple dimensions not
being less than
, whereas the M- and N-cleanup routines can only have
their respective dimensions less then
. The second thing to note is
that we compile only the most general BETA case for K-cleanup; this
is due to the fact that we may need
different routines to handle
K-cleanup efficiently, and multiplying this number of routines by three seems
counterproductive.
The final difference in the K-cleanup is the fact that it potentially requires
different routines to support. This is due to several factors.
Firstly, in ATLAS, the innermost loop in gemm is the K-loop, making it
very important for performance. On systems without good loop handling,
such as the x86, heavy K unrollings are critical.
Secondly, the leading dimensions of the
and
matrices are fixed
to KB due to the data copy, which allows for more efficient indexing
of these matrices. If a routine takes run-time
(rather than compile-time,
as when the dimension is fixed to KB), it must also take run-time
lda and ldb, and this extra indexing is too costly on
many architectures.
The generated code handles all cleanup where more
than one dimension is less than the blocking factor. This simplification
allows ATLAS to avoid having to test
cases when selecting user
cleanup. Once the matrices in question are larger than
, cleanup
with more than one dimension less than
rapidly stops being a
performance factor. Small matrices where this cleanup is a factor are
almost certainly going to be handled by ATLAS's small-case code anyway,
so it seems unlikely that this simplification will hurt performance in
practice. Section 2.7.5 shows this in a more formal way.
Users need to be very careful when supplying cleanup, because if the user
indicates that a dimension must be a compile-time variable, rather than
a runtime variable, ATLAS will generate up to
routines to handle
user cleanup, and since user routines are compiled with all BETA
variants, it is possible to generate
cleanup cases, in addition
to ATLAS's generated cases. It is therefore recommended that the user
supply cleanup that uses run-time arguments whenever possible, and indicate
kernels taking compile-time dimensions as not to be used for cleanup.
1 480 4 4 1 1 1 4 4 2 ATL_mm4x4x2US.c "V. Nguyen & P. Strazdins"
OK, as always, we can read this to see that MB and NB
must be multiples of 4, and that KB can be any value. With
no flag modifiers, if we wanted to use the routine for K cleanup,
we would have to compile it into
different routines, since
loop dimensions are compile-time parameters by default. However,
this routine is modified by a flag value of 480. What does this mean?
Consulting table 1, we see that
,
which means lda and ldb are not restricted to KB (i.e., they are
run-time parameters to the routine), the M-loop is controlled by a run-time
variable, the N-loop is controlled by a run-time variable, and the K-loop
is controlled by a run-time variable. We therefore know that we can
use this routine for all cleanups (M-, N-, and K-cleanup), and we need only
one routine to do so (i.e., we do not have to compile
routines to handle
all cases). However, it can only be used for M- and N- cleanup cases where
the respective dimension is a multiple of 4. Therefore, assuming this
kernel is superior to the generated code, it will be used for all K cleanup
routines. However, for M and N cleanup, there will be something corresponding
to the following pseudocode:
if (M % 4 == 0) call ATL_mm4x4x2US else call generated M cleanup
It is clear that without overloading the flag value to an even more
ludicrous degree, that cleanup will eventually need to have it's own
index file. For instance, it would be nice to be able to insist that
a particular K-cleanup code be used only when
, for instance,
in addition to insisting it be a multiple of a particular value. The fact
that cleanup does not already have such a seperate file simply represents
a design failure on my part; it was not until I had already produced the
system working as it does now that I saw its shortcomings, and then it
was too late to change for the release. Subsequent developer releases
will probably address this shortcoming.
Here is an example of testing an M-cleanup routine, insisting that M is a run-time argument:
make mmutstcase mmrout=CASES/ATL_mm1x1x1.c pre=d M=17 N=40 K=40 \
mb=0 nb=40 kb=40
Here is timing the same routine, but insisting that the M-loop is fixed at compile time:
make ummcase mmrout=CASES/ATL_mm1x1x1.c pre=d M=17 N=40 K=40 \
mb=17 nb=40 kb=40
Here's testing a K-cleanup routine, taking run-time K and leading dimensions:
make mmutstcase mmrout=CASES/ATL_mm1x1x1.c pre=d M=40 N=40 K=27 \
mb=40 nb=40 kb=0 lda=0 ldb=0
The same test taking compile-time K and leading dimensions:
make mmutstcase mmrout=CASES/ATL_mm1x1x1.c pre=d M=40 N=40 K=27 \
mb=40 nb=40 kb=27 lda=27 ldb=27
This is shown more formally below. Define
mod
, let
be the dimensional arguments to the gemmK and/or cleanup,
and remember that matrix multiplication takes
flops, and we see
that the flop count for each catagory is:
Note that the simplified equations to the right of
assume the square case, i.e.
. The above analysis can now
be grouped into the catagories of interest as in:
With this analysis, we can easily see why it is not important for the
user to be able to contribute 2D and 3D cleanup cases: remember that
all of these kernels are for ATLAS's large-case gemm. ATLAS has a
seperate small-case gemm, which is invoked when the problem is so small
that the
copy cost is significant compared to the
computational
costs. So, in the cases where the
2D cleanup or
3D cleanup
costs are prohibitive, this large-case gemm will probably not be used anyway.
If written appropriately, loading all of
with a few rows
of
should theoretically be just as efficient (i.e., the IJK variant
of matmul). However, the variants where
is not the innermost loop
are unlikely to work well in ATLAS, if for no other reason than the
transpose settings we have chosen militate against it.
Note that the
case must not read
, since the memory may
legally be unitialized.
rm res/<pre>u* rm res/<pre>MMRES ./xmmsearch -p <pre> make <pre>install
gunzip -c Core2Duo64SSE3.tgz | tar xvf - rm Core2Duo64SSE3/gemm/gcc/<pre>u* rm Core2Duo64SSE3/gemm/gcc/<pre>MMRES rm Core2Duo64SSE3.tgz tar cvf Core2Duo64SSE3.tar Core2Duo64SSE3 gzip Core2Duo64SSE3.tar mv Core2Duo64SSE3.tar.gz Core2Duo64SSE3.tgz
Now, continue install as normal, and your kernel should be used if it beats what ATLAS is presently using. Note that this assumes you are using gcc as the gemmK compiler, which is the default on most systems. If you are using a different compiler, you would substitute its name instead of gcc in the above lines. If there is no subdirectory with the name of your compiler in the tarfile, ATLAS has no architectural defaults for that compiler, and thus you need to make no changes to the tarfile.
rm -rf Core2Duo64SSE3If such a subdirectory existed.
From your BLDdir directory, then issue:
rm bin/INSTALL_LOG/* rm tune/blas/gemm/res/<pre>u* rm tune/blas/gemm/res/<pre>MMRES make build
Contributing an L1 kernel is the prefered method of user contribution for Level 3 BLAS speedup, but it is not the only one supported by ATLAS. ATLAS also allows a user to contribute a complete system-specific GEMM implementation. This method of contribution is far less desirable than kernel contribution, and thus the standards of acceptance are correspondingly higher.
When only a kernel is contributed, it is only used when timings indicate it is superior to the best ATLAS-supplied routine for a given architecture. Because kernel routines are called in known ways by the ATLAS infrastructure, the timer can be made to accurately reflect typical usage. A full GEMM, which is to all intents called directly by the user, has no ``typical'' usage, and the timer is thus not able to ensure that the user's full GEMM is superior to that supplied by ATLAS in a system-independent way, even if the additional installation time required to choose amoung full GEMM implementations were allowed. Thus, full GEMM implementations will be used only when ATLAS's configuration detects a known architecture where the ATLAS team has certified the full GEMM to be significantly better than ATLAS's native GEMM, across the entire spectrum of problem shapes and sizes (with the exception of those shapes and sizes handled by ATLAS's non-copy code, as explained below).
As explained in Section 2.1, ATLAS has both a small-case matmul, which does not copy the user's input operands, and a large-case code that does. The user contributed GEMM replaces ATLAS's large-case GEMM, and then timings are used as normal to determine the crossover points at which the contributed GEMM outperforms ATLAS's small-case code.
In this directory, there must be a master makefile, called Makefile, which minimally contains the following targets:
For each precision, ATLAS calls the user's GEMM using this API:
int ATL_U<pre>usergemm(const enum ATLAS_TRANS TA, const enum ATLAS_TRANS TB,
const int M, const int N, const int K,
const SCALAR alpha, const TYPE *A, const int lda,
const TYPE *B, const int ldb, const SCALAR beta,
TYPE *C, const int ldc)
where,
<pre> : |
s | d | c | z |
| SCALAR | float | double | float* | float* |
| TYPE | float | double | float | float |
This routine should return 0 upon successful invocation, and -1 if unable to malloc enough memory. Other errors may be signaled by returning a value of 2. On error in this routine, ATLAS will call the no-copy code to get the answer, so a return value of 1 indicates that ATLAS should do this. If a fatal error occurs, or if an error occurs after operands have been modified (i.e., calling the no-copy code will no longer produce the correct answer), then execution should be halted.
ATLAS's interface routines have already done all required error checking, so the user need not check the input arguments in this routine, or any of the lower-level user contributed routines.
Remember that what ATLAS is doing is substituting your GEMM for its own large-case GEMM. However, ATLAS's large-case GEMM is still compiled in the library, it is just not being used. The following code will call ATLAS's large case code for all precisions:
#include "atlas_misc.h"
#include "atlas_lvl3.h"
int Mjoin(Mjoin(ATL_U,PRE),usergemm)
(const enum ATLAS_TRANS TA, const enum ATLAS_TRANS TB,
const int M, const int N, const int K, const SCALAR alpha,
const TYPE *A, const int lda, const TYPE *B, const int ldb,
const SCALAR beta, TYPE *C, const int ldc)
{
int ierr;
if (N >= M)
{
if ( ierr = Mjoin(PATL,mmJIK)(TA, TB, M, N, K, alpha, A, lda, B, ldb,
beta, C, ldc) )
ierr = Mjoin(PATL,mmIJK)(TA, TB, M, N, K, alpha, A, lda, B, ldb,
beta, C, ldc);
}
else
{
if ( ierr = Mjoin(PATL,mmIJK)(TA, TB, M, N, K, alpha, A, lda, B, ldb,
beta, C, ldc) )
ierr = Mjoin(PATL,mmJIK)(TA, TB, M, N, K, alpha, A, lda, B, ldb,
beta, C, ldc);
}
return (ierr);
}
So, you can use the above code to have ATLAS call it's normal routines for those precisions you do not supply, and call your own otherwise. In order to help understand this process, ATLAS includes a ``user-supplied'' GEMM that simply calls ATLAS's own GEMM for all precisions, in ATLAS/src/blas/gemm/UMMEXAMPLE. If, for instance, you have a single precision real GEMM but nothing else, you can take this directory as a starting point, adding your own commands for single precision real in the Makefile, etc, and leaving everything else alone.
In order to do this, you should do the following:
If ATLAS detects you are on a platform where a contributed full GEMM is superior to ATLAS's large-case GEMM, ATLAS will automatically handle the details of making ATLAS call the user-contributed routine. If, however, you wish to force ATLAS to use your GEMM (for instance, you are testing your code before contribution, or just want to utilize ATLAS for complete BLAS coverage with your GEMM), you should take the following steps, after creating the appropriate subdirectory and API as previously described:
All Level 2 BLAS are written in terms of 3 kernel routines:
For complex codes, these kernels must also supply conjugate cases.
The most important file in ATLAS/tune/blas/gemv/CASES
is the primitive description
file, <pre>cases.dsc. Each precision has its own description file (as
indicated by <pre>), and this file describes all of the routines to
time in order to find the best. For instance, for double precision, we see:
speedy. cat CASES/dcases.dsc 9 1 8 0 0 ATL_gemvN_mm.c "R. Clint Whaley" 2 0 1 1 ATL_gemvN_1x1_1.c "R. Clint Whaley" 3 16 32 1 ATL_gemvN_1x1_1a.c "R. Clint Whaley" 4 0 4 2 ATL_gemvN_4x2_0.c "R. Clint Whaley" 5 0 4 4 ATL_gemvN_4x4_1.c "R. Clint Whaley" 6 0 8 4 ATL_gemvN_8x4_1.c "R. Clint Whaley" 7 0 16 2 ATL_gemvN_16x2_1.c "R. Clint Whaley" 8 0 16 4 ATL_gemvN_16x4_1.c "R. Clint Whaley" 9 16 32 4 ATL_gemvN_32x4_1.c "R. Clint Whaley" 6 101 8 0 0 ATL_gemvT_mm.c "R. Clint Whaley" 102 0 2 8 ATL_gemvT_2x8_0.c "R. Clint Whaley" 103 0 4 8 ATL_gemvT_4x8_1.c "R. Clint Whaley" 104 0 4 16 ATL_gemvT_4x16_1.c "R. Clint Whaley" 105 0 2 16 ATL_gemvT_2x16_1.c "R. Clint Whaley" 106 0 1 1 ATL_gemvT_1x1_1.c "R. Clint Whaley"
The first number (in this case 9) is the number of NoTranspose primitives to time. This is followed by that number of primitive lines describing those NoTrans primitives, and then we supply the number of Transpose primitives to time (in this example, 6), followed by that number of primitive lines describing the Transpose primitives.
As you can see, each line supplies four integers and a filename to the search routine. The filename is the filename of the primitive to time. The first integer provides a unique integer ID (must be greater than zero) for each primative line, and the other three supply information necessary in order for the higher level routines to do blocking.
This is the first piece of important information about these primitive routines: no blocking should be done in them. The appropriate blocking is done by higher level ATLAS routines. Most primitives employ some kind of loop unrolling, and when these higher level routines block in order to reuse vectors or matrices, it is important that this blocking does not conflict with the primitives' unrolling factors (for instance, if the primitive unrolls a given dimension by 8, but ATLAS blocks that dimension to 3, ATLAS would always call the cleanup code). So this is the information conveyed by these three integers.
The form of a GEMV primitive line is:
<ID> <flag> <Yunroll> <Xunroll> <filename> "<author(s)>"
As mentioned previously, <filename> is the primitive source file.
<Yunroll>
is the unrolling used for the loop that loops over the
vector, and
<Xunroll> is the unrolling used for the loop that loops over the
vector. <flag> is a less obvious parameter which is used
to tell the search script about special properties of a kernel.
It is assumed that the user has supplied a "inner-product" based GEMV
implementation (i.e., an implementation which basically does <Yunroll>
simultaneous dot products). This default state is expressed to the search
by a <flag> value of 0. However, since the inner product formulation of
NoTranspose GEMV loops across the non-contiguous dimension of the matrix,
some architectures need to employ an "outer-product" based NoTranspose GEMV
(i.e., a GEMV which is performed by doing <Xunroll> simultaneous axpy's).
This is indicated by a <flag> value of 16. Finally, since ATLAS's
GEMM has
a code generator which allows it to achieve very good portable performance,
it is always worth seeing how optimal a GEMV can be obtained by simply
making the appropriate call to GEMM. <flag> of 8 indicates
that this is what the kernel is doing.
In summary:
| FLAG | MEANING |
| 0 | Normal |
| 8 | GEMM-based primitive |
| 16 | Outer-product or AXPY-based primitive (only valid for Notranspose GEMV) |
There are several assumptions that need to hold true for a user-supplied GEMV
primitive. First, the loop ordering must be that implied by the
<flag> setting
the user supplies in the primitive description file, as discussed in
Section 3.1.1.
Each primitive makes assumptions about the arguments it handles,
and these assumptions are reflected in the routine name. The function name of
a GEMV primitive is:
ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1where:
<pre> is replaced by the precision prefix:
s, d, c, or z.
<Trans> is replaced by transpose specifier:
<betanam> is replaced by the beta specifier this kernel
supplies. All GEMV kernels must supply the following beta specifiers and
names:
For a given gemv primitive (either NoTranspose or Transpose), if the cpp macro
Conj_
is defined we want the conjugate form of that transpose setting
(i.e., Nc or C).
Each file is further compiled with differing cpp settings to generate the various beta cases. The beta macro settings and their meanings are:
| CPP MACRO | MEANING |
| BETA0 | Primitive should provide
|
| BETA1 | Primitive should provide
|
| BETAX | Primitive should provide
|
| BETAXI0 | For complex only, primitive should provide
|
| where the imaginary component of beta is zero. |
In terms of the BLAS API, the GEMV kernels additionally assume
incX = 1
incY = 1
Higher level ATLAS routines ensure these assumptions are true before calling the primitive.
Therefore, the routine:
ATL_dgemvN_a1_x1_b0_y1supplies a primitive doing notranspose gemv, on a column-major array with
ATL_cgemvNc_a1_x1_bXi0_y1:supplies a primitive doing notranspose gemv, on a column-major array whose elements should be conjugated before the multiplication, with
For greater understanding of how these CPP macros are used to compile multiple primitives from one file, examine the provided CASES files.
The API of the primitive is:
ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1
(
const int M, /* length of Y vector */
const int N, /* length of X vector */
const SCALAR alpha,/* ignored, assumed to be one */
const TYPE *A, /* pointer to column-major matrix */
const int lda, /* leading dimension of A, or row-stride */
const TYPE *X, /* vector to multiply A by */
const int incX, /* ignored, assumed to be one */
const SCALAR beta, /* value of beta */
TYPE *Y, /* output vector */
const int incY /* ignored, assumed to be one */
);
where,
<pre> : |
s | d | c | z |
| SCALAR | float | double | float* | float* |
| TYPE | float | double | float | float |
Note that the meaning of M and N are slightly different than that used by the Fortran77 API, in that they give the vector lengths, not array dimensions.
Probably the simplest, notranspose GEMV kernel implementation is:
#ifdef BETA0
void ATL_dgemvN_a1_x1_b0_y1
#elif defined (BETA1)
void ATL_dgemvN_a1_x1_b1_y1
#else
void ATL_dgemvN_a1_x1_bX_y1
#endif
(const int M, const int N, const double alpha, const double *A,
const int lda, const double *X, const int incX, const double beta,
double *Y, const int incY)
{
int i, j;
register double y0;
for (i=0; i != M; i++)
{
#ifdef BETA0
y0 = 0.0;
#elif defined(BETA1)
y0 = Y[i];
#elif defined(BETAX)
y0 = Y[i] * beta;
#endif
for (j=0; j != N; j++) y0 += A[i+j*lda] * X[j];
Y[i] = y0;
}
}
Saving this file to
ATLAS/tune/blas/gemv/CASES/ATL_dgemvN_1x1_1.c,
from the
BLDdir/tune/blas/gemv/ directory, we can test and time
the implementation by:
make dmvtstcaseN mvrout=CASES/ATL_dgemvN_1x1_1.c yu=1 xu=1
BEGINNING GEMV PRIMITIVE TESTING:
TEST TA=N, M=997, N=177, lda=1004, beta=0.000000 STARTED
TEST TA=N, M=997, N=177, lda=1004, beta=0.000000 PASSED
TEST TA=N, M=997, N=177, lda=1004, beta=1.000000 STARTED
TEST TA=N, M=997, N=177, lda=1004, beta=1.000000 PASSED
TEST TA=N, M=997, N=177, lda=1004, beta=0.800000 STARTED
TEST TA=N, M=997, N=177, lda=1004, beta=0.800000 PASSED
GEMV PRIMITIVE PASSED ALL TESTS
speedy. make dmvcaseN mvrout=CASES/ATL_dgemvN_1x1_1.c yu=1 xu=1
gemvN_0 : 49.484536 MFLOPS
gemvN_0 : 49.484536 MFLOPS
gemvN_0 : 49.230769 MFLOPS
gemvN_0 : 49.40 MFLOPS
In the above examples, we pass yu, the unrolling along the
vector,
and xu, the unrolling along the
vector, so that when ATLAS
blocks the operation, it knows the correct unrolling to use to avoid
cleanup code.
A similarly sophisticated transpose primitive is:
#ifdef BETA0
void ATL_dgemvT_a1_x1_b0_y1
#elif defined (BETA1)
void ATL_dgemvT_a1_x1_b1_y1
#else
void ATL_dgemvT_a1_x1_bX_y1
#endif
(const int M, const int N, const double alpha, const double *A,
const int lda, const double *X, const int incX, const double beta,
double *Y, const int incY)
{
int i, j;
register double y0;
for (j=0; j != M; j++)
{
#ifdef BETA0
y0 = 0.0;
#elif defined(BETA1)
y0 = Y[j];
#else
y0 = Y[j] * beta;
#endif
for (i=0; i != N; i++) y0 += A[i+j*lda] * X[i];
Y[j] = y0;
}
}
Which we could test and time by:
speedy. make dmvtstcaseT mvrout=CASES/ATL_dgemvT_1x1_1.c xu=1 yu=1
BEGINNING GEMV PRIMITIVE TESTING:
TEST TA=T, M=997, N=177, lda=1004, beta=0.000000 STARTED
TEST TA=T, M=997, N=177, lda=1004, beta=0.000000 PASSED
TEST TA=T, M=997, N=177, lda=1004, beta=1.000000 STARTED
TEST TA=T, M=997, N=177, lda=1004, beta=1.000000 PASSED
TEST TA=T, M=997, N=177, lda=1004, beta=0.800000 STARTED
TEST TA=T, M=997, N=177, lda=1004, beta=0.800000 PASSED
GEMV PRIMITIVE PASSED ALL TESTS
speedy. make dmvcaseT mvrout=CASES/ATL_dgemvT_1x1_1.c xu=1 yu=1
gemvT_0 : 37.647059 MFLOPS
gemvT_0 : 37.647059 MFLOPS
gemvT_0 : 37.647059 MFLOPS
gemvT_0 : 37.65 MFLOPS
An unsophisticated notranspose GEMV implementation for double precision complex would be:
#ifdef BETA0
#ifdef Conj_
void ATL_dgemvNc_a1_x1_b0_y1
#else
void ATL_dgemvN_a1_x1_b0_y1
#endif
#elif defined (BETA1)
#ifdef Conj_
void ATL_dgemvNc_a1_x1_b1_y1
#else
void ATL_dgemvN_a1_x1_b1_y1
#endif
#elif defined (BETAXI0)
#ifdef Conj_
void ATL_dgemvNc_a1_x1_bXi0_y1
#else
void ATL_dgemvN_a1_x1_bXi0_y1
#endif
#else
#ifdef Conj_
void ATL_dgemvNc_a1_x1_bX_y1
#else
void ATL_dgemvN_a1_x1_bX_y1
#endif
#endif
(const int M, const int N, const double *alpha,
const double *A, const int lda, const double *X, const int incX,
const double *beta, double *Y, const int incY)
{
int i, j;
const int M2 = M<<1, N2 = N<<1;
#if defined(BETAX)
const double rbeta = *beta, ibeta = beta[1];
#elif defined(BETAXI0)
const double rbeta = *beta;
#endif
register double ra, ia, rx, ix, ry, iy;
for (i=0; i != M2; i += 2)
{
#ifdef BETA0
ry = iy = 0.0;
#elif defined(BETAX)
rx = rbeta; ix = ibeta;
ra = Y[i]; ia = Y[i+1];
ry = ra * rx - ia * ix;
iy = ra * ix + ia * rx;
#else
ry = Y[i];
iy = Y[i+1];
#ifdef BETAXI0
rx = rbeta;
ry *= rx;
iy *= rx;
#endif
#endif
for(j=0; j != N2; j += 2)
{
ra = A[i+j*lda]; ia = A[i+j*lda+1];
rx = X[j]; ix = X[j+1];
ry += ra * rx;
iy += ra * ix;
#ifdef Conj_
ry += ia * ix;
iy -= ia * rx;
#else
ry -= ia * ix;
iy += ia * rx;
#endif
}
Y[i] = ry;
Y[i+1] = iy;
}
}
Which, when saved to ATLAS/tune/blas/gemv/CASES/ATL_zgemvN_1x1_1.c, we could test and time by:
make zmvtstcaseN mvrout=CASES/ATL_zgemvN_1x1_1.c xu=1 yu=1
BEGINNING GEMV PRIMITIVE TESTING:
TEST TA=N, M=997, N=177, lda=1004, beta=(0.000000,0.000000) STARTED
TEST TA=N, M=997, N=177, lda=1004, beta=(0.000000,0.000000) PASSED
TEST TA=-, M=997, N=177, lda=1004, beta=(0.000000,0.000000) STARTED
TEST TA=-, M=997, N=177, lda=1004, beta=(0.000000,0.000000) PASSED
TEST TA=N, M=997, N=177, lda=1004, beta=(1.000000,0.000000) STARTED
TEST TA=N, M=997, N=177, lda=1004, beta=(1.000000,0.000000) PASSED
TEST TA=-, M=997, N=177, lda=1004, beta=(1.000000,0.000000) STARTED
TEST TA=-, M=997, N=177, lda=1004, beta=(1.000000,0.000000) PASSED
TEST TA=N, M=997, N=177, lda=1004, beta=(0.800000,0.000000) STARTED
TEST TA=N, M=997, N=177, lda=1004, beta=(0.800000,0.000000) PASSED
TEST TA=-, M=997, N=177, lda=1004, beta=(0.800000,0.000000) STARTED
TEST TA=-, M=997, N=177, lda=1004, beta=(0.800000,0.000000) PASSED
TEST TA=N, M=997, N=177, lda=1004, beta=(0.800000,0.300000) STARTED
TEST TA=N, M=997, N=177, lda=1004, beta=(0.800000,0.300000) PASSED
TEST TA=-, M=997, N=177, lda=1004, beta=(0.800000,0.300000) STARTED
TEST TA=-, M=997, N=177, lda=1004, beta=(0.800000,0.300000) PASSED
GEMV PRIMITIVE PASSED ALL TESTS
speedy. make zmvcaseN mvrout=CASES/ATL_zgemvN_1x1_1.c xu=1 yu=1
gemvN_0 : 78.688525 MFLOPS
gemvN_0 : 78.688525 MFLOPS
gemvN_0 : 78.688525 MFLOPS
gemvN_0 : 78.69 MFLOPS
Also, note that after the first invocation of the kernel,
will come
from L1, leaving
the dominant data cost.
At present, SYMV does a different blocking that blocks both
and
(all other routines block only one dimension), so that
is reused
between calls to the Transpose and NoTranspose kernels. This may
eventually change as greater sophistication is achieved (as you might
imagine, you get two very different GEMV kernels if one is expecting
from main memory, and the other expects
to come from L1, as
in this case; this means we may at some time generate a specialized
L1-contained GEMV kernel).
Note that the
case must not read
, since the memory may
legally be unitialized.
ATLAS/tune/blas/ger/CASES.
Most of the discussion of the GEMV primitives applies to the GER primitives
as well, so I assume you have read and are familiar with the concepts
discussed above. As before, the routines to be timed are given
in a kernel description file, <pre>cases.dsc. GER does not have a
transpose case, so this file first lists the number of GER primitives to search,
followed by that many primitive lines describing them.
GER primitive lines are of the form:
<ID> <flag> <Xunroll> <Yunroll> <filename> "<author(s)>"
<ID>: Integer greater than 0 uniquely identifying this entry
<flag>: is an integer flag which is ignored at the moment
<Xunroll>: is the unrolling of the loop over the X vector
(i.e. the M-loop)
<Yunroll>: is the unrolling of the loop over the Y vector
(i.e. the N-loop)
<filename>: is the name of the C source file for the primitive.
<author(s)>: author(s) name(s)
The API for the ger primitive is:
#if defined(SCPLX) || defined(DCPLX)
#ifdef Conj_
ATL_<pre>ger1c_a1_x1_yX
#else
ATL_<pre>ger1u_a1_x1_yX
#endif
#else
ATL_<pre>ger1_a1_x1_yX
#endif
(
const int M, /* length of X vector */
const int N, /* length of Y vector */
const SCALAR alpha,/* ignored, assumed to be one */
const TYPE *X, /* pointer to X vector */
const int incX, /* ignored, assumed to be one */
const TYPE *Y, /* pointer to Y vector */
const int incY /* increment of Y vector; NOTE: NOT IGNORED */
TYPE *A, /* pointer to column-major matrix */
const int lda, /* leading dimension of A, or row-stride */
);
Assumptions:
incX = 1
A simple double precision real implementation of GER would then be:
void ATL_dger1_a1_x1_yX(const int M, const int N, const double alpha,
const double *X, const int incX, const double *Y,
const int incY, double *A, const int lda)
{
int i, j;
register double y0;
for (j=0; j < N; j++)
{
y0 = Y[j*incY];
for (i=0; i < M; i++) A[i+j*lda] += X[i] * y0;
}
}
Which we can test and time with:
speedy. make dr1tstcase r1rout=CASES/ATL_dger1_1x1_1.c xu=1 yu=1
TEST CONJ=0, M=997, N=177, lda=1006, incY=1, STARTED
TEST CONJ=0, M=997, N=177, lda=1006, incY=1, PASSED
TEST CONJ=0, M=997, N=177, lda=1006, incY=3, STARTED
TEST CONJ=0, M=997, N=177, lda=1006, incY=3, PASSED
TEST CONJ=0, M=997, N=177, lda=1006, incY=-3, STARTED
TEST CONJ=0, M=997, N=177, lda=1006, incY=-3, PASSED
speedy. make dr1case r1rout=CASES/ATL_dger1_1x1_1.c xu=1 yu=1
dger_0 : 31.168831 MFLOPS
dger_0 : 31.788079 MFLOPS
dger_0 : 31.578947 MFLOPS
dger_0 : 31.51 MFLOPS
A simple double precision complex implementation would be:
#ifdef Conj_
void ATL_zger1c_a1_x1_yX
#else
void ATL_zger1u_a1_x1_yX
#endif
(const int M, const int N, const double *alpha, const double *X,
const int incX, const double *Y, const int incY, double *A, const int lda)
{
const int M2 = M<<1, N2 = N<<1;
int i, j;
register double ry, iy, ra, ia, rx, ix;
for (j=0; j < N2; j += 2)
{
ry = Y[incY*j];
iy = Y[incY*j+1];
for (i=0; i < M2; i += 2)
{
rx = X[i];
ix = X[i+1];
ra = rx * ry;
ia = ix * ry;
#ifdef Conj_
ra += ix * iy;
ia -= rx * iy;
#else
ra -= ix * iy;
ia += rx * iy;
#endif
A[i+j*lda] += ra;
A[i+j*lda+1] += ia;
}
}
}
Which we test in time by:
speedy. make zr1tstcase r1rout=CASES/ATL_zger1_1x1_1.c xu=1 yu=1
TEST CONJ=0, M=997, N=177, lda=1006, incY=1, STARTED
TEST CONJ=0, M=997, N=177, lda=1006, incY=1, PASSED
TEST CONJ=0, M=997, N=177, lda=1006, incY=3, STARTED
TEST CONJ=0, M=997, N=177, lda=1006, incY=3, PASSED
TEST CONJ=0, M=997, N=177, lda=1006, incY=-3, STARTED
TEST CONJ=0, M=997, N=177, lda=1006, incY=-3, PASSED
TEST CONJ=1, M=997, N=177, lda=1006, incY=1, STARTED
TEST CONJ=1, M=997, N=177, lda=1006, incY=1, PASSED
TEST CONJ=1, M=997, N=177, lda=1006, incY=3, STARTED
TEST CONJ=1, M=997, N=177, lda=1006, incY=3, PASSED
TEST CONJ=1, M=997, N=177, lda=1006, incY=-3, STARTED
TEST CONJ=1, M=997, N=177, lda=1006, incY=-3, PASSED
speedy. make zr1case r1rout=CASES/ATL_zger1_1x1_1.c xu=1 yu=1
zger_0 : 39.024390 MFLOPS
zger_0 : 38.709677 MFLOPS
zger_0 : 39.024390 MFLOPS
zger_0 : 38.92 MFLOPS
As in GEMV, GER blocks
so that it is reused in L1. Since the
dominant direction of the loop is expected to be down the columns of
, incY is not restricted to 1, while incX is. As in GEMV,
all routines except SYR2 block only the the
dimension, while
SYR2 blocks both, so that
can be reused in L1.
A great deal of the performance win to be had on the Level 1 BLAS, particularly for long vectors, comes from using data prefetch. ATLAS now includes a prefetch header file (described in Section 6.1), which makes prefetch instructions for various systems available for C programmers.
For an example of complex-to-real reuse, consider ZNRM2 which, when called with incX = 1, can be simply implemented as a call to DNRM2 with 2*N. An example of a routine simplifying to another would be calling DSCAL with alpha = 0.0, which devolves to a call to the primitive ATL_dset.
Therefore, if you are planning to optimize a particular case, be sure to read the appopriate section below to make sure that the case you want to optimize is not implemented by a call to another routine.
As with other ATLAS optimizations, each routine has its own kernel index file, one for each precision (eg., AXPY/dcases.dsc indexes the various DAXPY implementations that should be tested and timed during the install process). All of these index files follow the format below, though they leave out unneeded parameters (eg., SCAL, which operates on only one vector, will not have an entry for incY or BETA).
<Number of kernels> <ID> <alpha> <beta> <incX> <incY> <source file> <author name> . . <ID> <alpha> <beta> <incX> <incY> <source file> <author name>
Here's an explanation of these values:
For example, here is an example AXPY/dcases.dsc:
3 1 2 0 0 axpy1_x0y0.c "R. Clint Whaley" 2 2 1 1 axpy1_x1y1.c "R. Clint Whaley" 2 -1 1 1 axpy1_an1x1y1.c "R. Clint Whaley" <ID> <ialpha> <incX> <incY> <rout> <author> [\
So, to make this work, all three filenames mentioned above must appear in your AXPY/ subdir. The first kernel routine is the general case: general incX, incY, and alpha. The second specializes both vector increments to unit stride, but takes any alpha. Finally, the last routine specializes even further, with unit strides and requiring alpha be negative one. Note that all index files must contain at least one routine that handles the most general cases, and passes the tester.
If you need a particular compiler and flag combo to compile your kernel
you end the line with a \, and put the compiler on the following line,
and the flags on a line after that. So, if axpy_x1y1.c needed gcc,
with -O1 -fschedule-insns -DGOODPERFORMANCE for flags, the above
file would be changed to:
3 1 2 0 0 axpy1_x0y0.c "R. Clint Whaley" 2 2 1 1 axpy1_x1y1.c "R. Clint Whaley" \ gcc -O1 -fschedule-insns -DGOODPERFORMANCE 2 -1 1 1 axpy1_an1x1y1.c "R. Clint Whaley" <ID> <ialpha> <incX> <incY> <rout> <author> [\
For routines taking more than one vector, this is not so simple. The first thing that needs to be clear is that internally, ATLAS does not use the pointer conventions that the BLAS do when it comes to handling vectors with negative increments. In the BLAS, if you want to operate on a vector with a negative increment, you pass in the end of the negatively incremented vector (i.e, you always pass the part of the vector closest to 0 in memory, regardless of the increment). ATLAS will instead pass the beginning of the vector regardless of sign, so that the following sample code would correctly step through a vector regardless of the sign:
for (i=0; i < N; i++, X += incX, Y += incY) ....
ATLAS also fiddles with the vector increments so that:
There may be additional constraints, on an individual routine basis.
make <pre><blas>test urout=<rout>
On the makefile line, you can also pass N=<N> to test a particular length vector. Each tester has various flags that can be passed to them, and these can be passed to make via the opt="flags" macro.
To find out what flags a tester takes, compile it, and run the resulting executable with -help (or just look in the source, you slacker).
Alright, here's a specific example. Let's say I have just written a new AXPY implementation and put it in AXPY/myaxpy.c. I can then test it with:
make daxpytest urout=myaxpy.c
When I fire this off, it does a bunch of compilation, and then spits out something like:
ITST N alpha incX incY TEST
====== ======== ======== ==== ==== ======
0 777 1.00 1 1 PASSED
1 777 -1.00 1 1 PASSED
2 777 0.90 1 1 PASSED
ALL AXPY SANITY TESTS PASSED.
Now, let's say I've written a complex routine (cmyaxpy), that can handle multiple increments. To test various increments, I do:
speedy. make zaxpytest urout=cmyaxpy.c opt="-X 4 -1 1 -2 3 -Y 4 1 -1 2 -3"
ITST N ralpha ialpha incX incY TEST
====== ======== ======== ======== ==== ==== ======
0 777 1.00 0.00 -1 1 PASSED
1 777 -1.00 0.00 -1 1 PASSED
2 777 1.30 0.00 -1 1 PASSED
3 777 0.90 1.10 -1 1 PASSED
.....
.....
61 777 -1.00 0.00 3 -3 PASSED
62 777 1.30 0.00 3 -3 PASSED
63 777 0.90 1.10 3 -3 PASSED
ALL AXPY SANITY TESTS PASSED.
Timing works a lot like testing. If your BLAS routine is <blas>, and your file in the