http://math-atlas.sourceforge.net/atlas_contrib
So what is the difference between development and kernel contribution? In kernel contribution, you write a kernel to be used by ATLAS, using the provided ATLAS testers and timers to verify it, and when you are satisfied with its performance and reliability, you submit your kernel to the ATLAS team, and they accept it or not, and you are done.
Doing actual core development is quite a bit more complex. Probably the biggest change is that you will need to write your own tester and timer for your new code. No code will be accepted into the ATLAS code base without a tester which can be used to verify it. Since writing a decent tester is usually at least as hard as writing the code it tests, and is always a whole lot less enjoyable, the author must bear the pain of producing it along with the pride of producing the code. As a developer, you will be responsible for testing such new code on several platforms as well.
If you are instead hoping to modify some of the existing code base, remember that for non-kernel operations, portability and robustness must be the primary goal. There are many sections of ATLAS that we know to be second rate on a certain platform, but we also know that they work on the twenty or so architectures that ATLAS is routinely compiled for, so we leave them that way. This means that when a modification is made to a previously existing routine, the modifying author must have good evidence that the new code is as portable as the old. In short, the barrier to replacing tested code is high.
It is possible that users want access to the CVS repository even though they do not plan on doing development, mainly I'd guess so they can get access to the newest stuff without waiting for developer releases. Also, kernel contributors who make subsequent changes to their routines can speed up their adoption by submitting them in a format ready for CVS check-in.
Therefore, new sections are welcome, and probably a FAQ appendix would be a good idea. As people contribute, their names will be added to the author list.
ATLAS was not originally developed in CVS. ATLAS was developed using a programming tool called extract, which means ATLAS is actually maintained in something called basefiles. If you think of regular development being access by value, and CVS using a level of indirection, CVS on basefiles gives you two such levels of indirection. So, if you want to be able to use CVS check-ins, you will need to learn at least the basics of extract. Details on extract can be found at:
http://www.cs.utsa.edu/~whaley/extract/Extract.html
Note that if you just want read-only access, you will need to install extract so that you can get at the files, but you will not need to learn anything about it.
cvs -z3 -d:pserver:anonymous@math-atlas.cvs.sourceforge.net:/cvsroot/math-atlas checkout AtlasBase
@define topd @/home/rwhaley/Base/AtlasBase@
to:
@define topd @/home/rwhaley/work/AtlasBase@
extract -b <your topd>/make.base -o Makefile rout=Make.atldir
make
So, what usually happens is you are messing with something, and you do it in the ATLAS/ directory. When you are confident in your change, you put it into the appropriate basefile in the AtlasBase/ directory (note that examining Make.ext will show you what basefile a given extracted file comes from), and you then re-extract over your working copy with the above command.
Normally, I would say the more the merrier in terms of adding people as SourceForge developers. I'd be happy to see hundreds of people associated with the project. On the other hand, I'd be a little scared with having hundreds of people I've never met have full write access to a project as detailed and delicate as ATLAS. CVS has marvelous rollback abilities, but I'm afraid as of right now anyway, I don't have marvelous CVS abilities, and so I tend to err on the side of timidity.
What all this whining is coming down to is, you have to first show me the money before I'll add you as a developer :). When you first begin hacking ATLAS, the method to use is to submit patches or code to the list, and if your submission is incorporated into ATLAS, we can then get you added as a SourceForge developer.
I will not do CVS check-ins on files that I am not the maintainer for, and I'll keep the main branch in working order.
Status: Not logged in Login via SSL New User via SSL
You want to take the New user via SSL link. This gets you a SourceForge user name, which you need to send to me at rwhaley@users.sourceforge.net. I then have to add you as an ATLAS developer.
You need to change your CVS access from anonymous/read to developer/write. It helps if you set your CVS_RSH and CVSROOT environment variables appropriately. CVS_RSH should be ssh. My CVSROOT is set to
:ext:rwhaley@math-atlas.cvs.sourceforge.net:/cvsroot/math-atlas
Replacing my name (look for rwhaley in the above) with yours should get you what you need.
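As a minimal sketch of the environment setup just described, something like the following could go in a Bourne-type shell startup file. The helper name atlas_cvsroot and the user name jdoe are made up for illustration; tcsh users would use setenv instead of export.

```sh
# Build the developer (read/write) CVSROOT for a given SourceForge user name.
# atlas_cvsroot is a hypothetical helper, not something shipped with ATLAS.
atlas_cvsroot() {
    printf ':ext:%s@math-atlas.cvs.sourceforge.net:/cvsroot/math-atlas\n' "$1"
}

# CVS should tunnel over ssh.
CVS_RSH=ssh
export CVS_RSH

# jdoe is a placeholder: substitute your own SourceForge user name.
CVSROOT=$(atlas_cvsroot jdoe)
export CVSROOT
```

Running echo "$CVSROOT" afterwards is a quick way to confirm that the string has the same shape as the one shown above.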
CVS check in/out will still not work correctly at this date. Now, do:
ssh math-atlas.cvs.sourceforge.net
and enter your user name and password. As soon as you get a prompt, log out. This process creates some needed files.
Finally, if you have an ATLAS tree created by anonymous CVS access, the easiest thing is to delete it and recheck out as yourself. CVS creates some files on the first checkout saying what kind of access you are using, and these will still show as anonymous, despite your CVSROOT (this is true for branching as well, so watch out). If you want to scope the CVS commands, I'm sure you can switch over without deleting, but removing and rechecking out is how I've seen this problem successfully fixed.
I have found the CVS manual at
http://ximbiot.com/cvs/manual/stable
to be helpful in doing CVS stuff.
For SourceForge, everything I know about it came from:
https://sourceforge.net/docman/?group_id=1
The companion piece of this guide is the ATLAS contributor guide, found in ATLAS/doc/atlas_contrib.pdf of the tarfile. You really need to know that before doing much with this guide.
Extract is explained at:
http://www.cs.utsa.edu/~whaley/extract/Extract.html
It is probably not practical that ATLAS will provide a complete LAPACK API (as it does with the BLAS) in the foreseeable future, both due to the algorithmic complexity of some of the operations, and to the sheer number of routines in LAPACK. It must be understood that adding routines adds to the inertia and maintenance costs of the package, and this additional burden must be offset by real advantage for the user.
ATLAS has so far only added LAPACK routines to ATLAS when we can make a performance-enhancing algorithm modification. For instance, we added the LU and Cholesky factorizations because we used the recursive formulations of these routines, which provide better performance on pretty much any cache-based architecture.
However, when we have added such routines, we usually add the correlated routines even when a performance advantage is not supplied. For instance, upon adding GETRF support, we also added GETRS and GESV. As far as column-major routines go, we supply no better algorithm for GETRS or GESV than LAPACK. However, since these routines are very simple, and GETRF is very often used with them, we added them along with GETRF. The idea here is that their maintenance costs are not heavy, and real advantage is given to the user in that we have sped up GETRF, and if the factor and solve are all he needs, ATLAS will supply a complete solution.
The column-major comment points out another reason to add a routine to ATLAS: ATLAS supplies the only performance-aware row-major LAPACK implementation that I am aware of (I'm sure there are some, I just don't know of any that aren't simply using the col-major stuff, and thus performing terribly). It is possible that someone would want to add an LAPACK routine to ATLAS simply because they need a row-major version, and someone being motivated enough to write it would probably be ample justification to add the routine to the ATLAS tarfile.
So far, we have accepted no routines that do not also include a row-major equivalent, both for BLAS and for LAPACK. We hope to continue this. There are as yet only a few users of the row-major LAPACK/BLAS that I am aware of, but I believe that this is a chicken/egg problem.
Some people insist on using row-major arrays in C, but if they have access to a BLAS/LAPACK that supports it, they find the performance is no better than what they get with simple loops, or that it is calling the col-major in a naive way, and cutting the problem size they can solve in half by copying. Therefore, people with row-major bias don't call the stuff 'cause it doesn't help them, and the problem continues.
It is my belief, therefore, that good-quality row-major stuff must be produced before significant demand will appear. If I'm wrong, I guess we'll someday drop support for row-major, but I don't think this will be the case over a long enough time line.
Therefore, despite it being a hassle, having a good quality row-major implementation is critical for getting an LAPACK routine into ATLAS. For many routines, since we have row-major BLAS, the algorithm stays the same, and only some pointer arithmetic need be changed.
Other routines in LAPACK (GETRF is one) have a built in algorithmic bias towards column-major (in GETRF, this is doing row-pivoting), and another algorithm with the same stability and usage characteristics should be employed for row-major (eg., column-pivoting, for GETRF).
The first step in adding a new routine to ATLAS is to create a tester (and timer) which can be used to verify the correctness of your code. More than half of the challenge is getting the tester right; with a good tester/timer, the code usually comes fairly easily.
Your tester will go in ATLAS/bin when extracted; you can examine some of the testers available there to get an idea of what you should do (eg., look at ATLAS/bin/[lu/llt/slv/trtri/uum]tst.c). All of these routines come from the basefile AtlasBase/Clint/atlas-tlp.base, which is what you should submit your patch against, unless you want to create your own, separate basefile.
After your tester is written, its column-major components can be tested against LAPACK by using the make <rout>tstF target in ATLAS/bin/<arch>. You can even test the row-major components by having the F77 interface transpose the matrices on input, and back on output. See ATLAS/bin/uumtst.c for an example of this for square matrices.
As part of your debugging of the tester, be sure that it not only agrees that LAPACK produces the right answer, but truly detects errors as well. For instance, manually overwrite an entry, both in the matrix and in the padding (in separate tests), and make sure it is caught by the tester.
So, now you are in your working directory (say ATLAS/src/lapack), and you type make -f Make.ext, and nothing happens, no new files show up. This is because you need to re-extract your Make.ext file. This can of course be done by removing your whole ATLAS tree and reinstalling, but less brutally you can "just" use something like this: extract -b /home/soender/AtlasBase/make.base -o Make.ext rout=ATLAS/src/lapack -langM. The -langM switch is required for extract to properly handle makefiles, so you cannot skip it.
This is the basic procedure for this sort of stuff. When you need a makefile in a BLDdir subdirectory, the appropriate makefile is copied by Make.top from the ATLAS/makes/ directory. Check Make.ext to see which basefile they come from, and add your routine name among the names of the other routines.
Remember to update the Makefiles for both ATLAS/bin and ATLAS/src/testing, and to get these makefiles into the appropriate subdirs. In order to extract new makefiles, and get them put into the appropriate subdirs, I typically do something like (from the BLDdir):
pushd ~/TEST/ATLAS/makes/ ; make -f Make.ext ; cd .. ; \
make refresh ; popd
(replace the path and arch appropriately, obviously).
You will add your routine in atlas-lp.base with an additional @ROUT keyline, but also do not forget to update the include file atlas_lapack.h at the bottom of the file as well. You will need to add your routine to the prototype part, as well as to the macro renaming part. Examine the basefile for details.
Once it is extracting, use your LAPACK-debugged tester to debug your code.
The F77 interfaces are kept in AtlasBase/Clint/atlas-fint.base. Look at the existing examples and notice how extract generates all four precisions from the same routine, if you use the extract macros. All the code for this interface can be ripped from LAPACK and adapted. Note that you will usually need to examine both complex and real versions of the original LAPACK routine, to find any differences in interface/testing and comments. You will also need to remove unneeded EXTERNAL declarations, etc.
This interface does the parameter checking, converts any FORTRAN string arguments to some predefined integer values, and then calls the ATLf77wrap interface. Scope any of the existing routines for details on this.
The C interfaces are easy to write, since they should just check the input arguments, and then call the ATLAS routine. The codes are stored in atlas-clp.base. Check it out for lots of examples.
That's the theoretical reason why they shouldn't cover all discovered items. However, ATLAS presently times the kernels in order to be able to produce a comprehensive SUMMARY.LOG, and these timings could be skipped, assuming this functionality were added to the atlas install process.
There are some weaknesses of architectural defaults. One of the main ones is how they can go out of date, and cause slowdown. One big way this can happen is with compiler changes. For instance, gcc 3.0 produces completely different (and inferior) x86 code than the 2.x series, and 4.0 was similarly worse than latter-day gcc 3. Almost all architectural defaults in ATLAS 3.8 are compiled with gcc 4.2.
Anytime a different compiler is used, the architectural defaults become suspect. For truly inferior compilers (like gcc 3.0 or 4.0), there is no way to get good performance, but at least some problems can be worked around by having ATLAS adapt itself to the new compiler, and architectural defaults prevent this from happening.
This will copy the search result output files into a directory <OBJdir>/ARCHS/<MACH>/, with appropriate subdirs under that. You can then go into these guys and delete files you don't want to be part of the defaults (eg., atlas_cacheedge.h, etc).
Now, to save these defaults to a transportable format, you can have the makefile create the tarfile for you by:
make tarfile
make check
If you are using threads, you will want to run the same tests for threading via:
make ptcheck
dudley.home.net. make check
...
... bunch of compilation ...
...
DONE BUILDING TESTERS, RUNNING:
SCOPING FOR FAILURES IN BIN TESTS:
fgrep -e fault -e FAULT -e error -e ERROR -e fail -e FAIL \
bin/Linux_ATHLON/sanity.out
8 cases: 8 passed, 0 skipped, 0 failed
4 cases: 4 passed, 0 skipped, 0 failed
8 cases: 8 passed, 0 skipped, 0 failed
4 cases: 4 passed, 0 skipped, 0 failed
8 cases: 8 passed, 0 skipped, 0 failed
4 cases: 4 passed, 0 skipped, 0 failed
8 cases: 8 passed, 0 skipped, 0 failed
4 cases: 4 passed, 0 skipped, 0 failed
DONE
SCOPING FOR FAILURES IN CBLAS TESTS:
fgrep -e fault -e FAULT -e error -e ERROR -e fail -e FAIL \
interfaces/blas/C/testing/Linux_ATHLON/sanity.out | \
fgrep -v PASSED
make[1]: [sanity_test] Error 1 (ignored)
DONE
SCOPING FOR FAILURES IN F77BLAS TESTS:
fgrep -e fault -e FAULT -e error -e ERROR -e fail -e FAIL \
interfaces/blas/F77/testing/Linux_ATHLON/sanity.out | \
fgrep -v PASSED
make[1]: [sanity_test] Error 1 (ignored)
DONE
So, in the LAPACK testers we see no failures (all tests show 0 failed), and we have no output from the BLAS testers, which is what we want. Notice the lines like:
make[1]: [sanity_test] Error 1 (ignored)
This is due to fgrep's behavior, and does not indicate an error. If fgrep does not find any pattern matches, it returns 1; on a match it returns 0. Therefore, since we are grepping for error, getting an "error condition" of 1 is what we hope for.
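The exit-status logic can be turned around into a small check, sketched here under a few assumptions: blas_log_is_clean is a hypothetical helper (not part of ATLAS), and grep -F is used as the portable spelling of fgrep. It mirrors the BLAS scan shown in the output above, including the second filter that drops lines which merely report PASSED. It is not suitable as-is for the bin log, since the "0 failed" summary lines there themselves contain the string fail.

```sh
# blas_log_is_clean LOGFILE
# Hypothetical helper mirroring the scan that "make check" runs on the BLAS
# logs.  Returns 0 if no suspicious line is left, 1 if one is, and 2 if the
# log cannot be read at all.
blas_log_is_clean() {
    [ -r "$1" ] || return 2
    # A pipeline's exit status is that of its last command, so the status
    # tested here comes from the second grep: 0 if at least one line
    # survived the PASSED filter, 1 if none did.
    if grep -F -e fault -e FAULT -e error -e ERROR -e fail -e FAIL "$1" |
        grep -F -v PASSED >/dev/null
    then
        return 1    # something suspicious survived: not clean
    fi
    return 0        # grep selected nothing, which is what we hope for
}
```

From the build directory this might be used as blas_log_is_clean interfaces/blas/F77/testing/<arch>/sanity.out && echo clean, with <arch> replaced by your architecture name.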
bin/sanity.out
interfaces/blas/C/testing/sanity.out
interfaces/blas/F77/testing/sanity.out
The threaded sanity test uses the same filenames with pt prefixed.
The first thing to notice is which of these tests are showing errors. The testers in bin are higher level than those in the interfaces directories, so if you get errors in both, track down and fix the interface errors first, as they may be causing the lapack errors. If both C and F77 BLAS interfaces are showing errors, I always scope and fix the Fortran77 stuff first, since Fortran is simpler (no RowMajor case to handle). Only if an error only shows up in C testing do I scope that output instead of the Fortran77.
The grepped error message probably gives you no idea what actually went wrong (it may show something as simple as:
FAIL
for instance), so you must go look at the sanity.out in question.
For instance, you might need to scope
interfaces/blas/F77/testing/sanity.out. You do a search for
whatever alerted you to the problem (eg., FAIL), and you see by the
surrounding context what tester failed.
x<pre>blat<lvl>
x<pre>cblat<lvl>
for Fortran77 and C, respectively. The Level 1 testers
(x[s,d,c,z]blat1) test certain fixed cases, and thus take no input file.
So if the error is in them, you simply run the executable with no args in
order to reproduce the failure.
The Level 2 and 3 testers allow a user to specify what tests should be run, via an input file. The standard input files that ATLAS runs with are:
<pre>blat<lvl>.dat
c_<pre>blat<lvl>.dat
respectively. The format of these input files is pretty self explanatory, and more explanation can be found at:
www.netlib.org/blas/faq.html
To run the tester with these files, you redirect them into the tester. For instance, to run the double precision Level 2 tester with the default input file, you'd issue:
./xdblat2 < ~/ATLAS/interfaces/blas/F77/testing/dblat2.dat
You should be aware that only the first error report in a run is accurate: one error can cause a cascade of spurious error reports, all of which may go away by fixing the first reported problem. So, it is important to find and fix the errors in sequence.
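Since only the first report can be trusted, it can help to jump straight to it. The following is a minimal sketch; first_failure is a hypothetical helper, and FAIL is only the default search string, so pass whatever text alerted you to the problem.

```sh
# first_failure LOGFILE [PATTERN]
# Hypothetical helper: print the line number and text of the first line
# containing PATTERN (default FAIL) as a fixed string, then stop reading.
# Exit status is 0 if a line was found and 1 otherwise.
first_failure() {
    awk -v pat="${2:-FAIL}" '
        index($0, pat) { print NR ": " $0; found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

The reported line number can then be opened in an editor to read the surrounding context and see which tester produced it.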
I usually copy the input file in question to a new file that I can hack on. For instance, if the error was in the double precision Level 2, I might issue:
cp ~/ATLAS/interfaces/blas/F77/testing/dblat2.dat bad.dat
I then repeatedly run the routine and simplify the input file until I have found the smallest, simplest input that displays the error.
The next step is to rule out tester error. The way I usually do this is to demonstrate that the error goes away by linking to the Fortran77 reference BLAS rather than ATLAS (you can only do this for errors in the F77 interface, obviously). I usually just do it by hand, i.e., for the same example again, I'd do:
f77 -o xtst dblat2.o /home/rwhaley/lib/libfblas.a
If the ATLAS-linked code has the error, and this one does not, it is a strong indication that the error is in ATLAS. If the F77 BLAS are shown to be in error, it is usually a compiler error, and can be fixed by turning down (or off) the optimization used to compile the tester.
Now you should have confirmed the tester is working properly, and that the error is in a specific routine (let us say DNRM2 as an example). As a quick proof that DNRM2 is indeed the problem, you can link explicitly to the F77 version of DNRM2, and to ATLAS for everything else (see Section 8.2 for hints on how to do this). If this still shows the error, you are confident that ATLAS's DNRM2 is indeed causing the problem, and you should either track it down, or report it (depending on your level of expertise).
The sanity tests only run the LAPACK testers in this directory. The LAPACK routines depend on the BLAS, so ignore errors in lapack testers until all the BLAS pass with no error. If you have errors in LAPACK but the BLAS pass all tests, then you have to hunt for the error in the LAPACK routines.
First, rule out a problem in the BLAS that is just not showing up in the BLAS testing. Get yourself a reference BLAS library, as explained in Section 8.2. Then, set your Make.inc's BLASlib macro to point to the created reference BLAS library. Then, you need to compile a tester that uses ATLAS's lapack routines, but the reference BLAS. This can be done by compiling the same executable name with _sys suffixed. For instance, if you were running the LU tester, xdlutst, you would say make xdlutst_sys, and then run this executable with the same input.
If the error goes away, then the error is really in the ATLAS BLAS somewhere. I then usually look at the LAPACK routine and tester in question to find out what its BLAS dependencies are, and manually link in the reference BLAS object files until I find the exact BLAS causing the problem. Usually once you know what routine causes the problem, you can reproduce the error with the BLAS tester (e.g., you need an IDAMAX call with N=12, incX=82).
If the error still persists using ATLAS's LAPACK and the Fortran77 BLAS, the next trick is to do LAPACK just like the BLAS: download and compile the F77 LAPACK from netlib (www.netlib.org/lapack/lapack.tgz). You then set your Make.inc's FLAPACKlib to point to your Fortran77 lapack library. You then suffix the base executable name with F_sys (eg., for LU again, you would do make xdlutstF_sys), and you will get a tester linked against the Fortran77 BLAS and LAPACK. If this also shows to be in error, there is an error in the tester, or in the compiler. Try turning down compiler optimization to rule in or out compiler errors.
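The elimination procedure of the last three paragraphs boils down to a small decision table, sketched below. locate_lapack_error is a hypothetical helper; you supply the outcomes of the _sys and F_sys runs by hand, and it assumes the ordinary ATLAS-linked tester has already failed. The "ATLAS LAPACK" verdict is the inference left implicit above: reference everything passes, ATLAS LAPACK on reference BLAS does not.

```sh
# locate_lapack_error SYS FSYS
#   SYS  = pass|fail for the *_sys build  (ATLAS LAPACK, reference BLAS)
#   FSYS = pass|fail for the *F_sys build (reference LAPACK and BLAS)
# Prints where the error most likely lives.
locate_lapack_error() {
    case "$1,$2" in
        pass,*)    echo "ATLAS BLAS" ;;            # error left with ATLAS BLAS
        fail,pass) echo "ATLAS LAPACK" ;;          # only ATLAS LAPACK differs
        fail,fail) echo "tester or compiler" ;;    # even the reference fails
        *)         echo "usage: locate_lapack_error pass|fail pass|fail" >&2
                   return 2 ;;
    esac
}
```

For example, if make xdlutst_sys still fails but make xdlutstF_sys passes, locate_lapack_error fail pass prints ATLAS LAPACK.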
Before a stable release, we always do as much testing as possible. The 900 pound gorilla of testers is Antoine's tester scripts. This tester can run as long as several days, and does a great number of both fixed and random tests, and if it completes with no errors, you have a pretty good idea that the code is fairly solid. Even the casual user ought to run the sanity testing as a matter of course, and that should always be run and passed first. Also, much of the methodology for understanding output, tracking down problems, etc, is the same for this tester and the sanity test, so read those sections first for tips I will not bother to repeat here.
Now, you create a directory for each architecture you wish to run the tester on, using the configure command. For instance, I could create a subdirectory under my AtlasTest directory with the following commands (following the above untar):
cd AtlasTest
mkdir Core2DuoSSE3
../configure --atldir=/home/whaley/TEST/ATLAS3.7.36.0/obj64/
Where of course --atldir provides the path to the BLDdir that you want to test. From here on out, we will call this directory, which you have configured for a particular platform's test, the TSTdir.
Some of these tests need a reference BLAS library to compare against, so you need to fill in your ATLAS install's BLDdir/Make.inc with a trusted, complete BLASlib. See the following section for details on this.
You are now ready to start the testing, as described in the following sections.
Some of these tests need a reference BLAS library to compare against, so you need to fill in your ATLAS install's BLDdir/Make.inc with a trusted, complete BLASlib. On modern machines, we typically just compare against the Fortran77 reference BLAS from netlib, though this makes the install run longer. On slower machines, you may need to use an optimized/vendor BLAS to do testing, but then when you find errors you will need to debug whether it is ATLAS or the optimized BLAS that are causing the problem.
Get the BLAS reference tarfile from www.netlib.org/blas/blas.tgz. Then do something similar to the following:
mkdir FBLAS
cd FBLAS
gunzip -c ../blas.tgz | tar xvf -
gfortran -O -c *.f
ar r ~/lib/libfblas.a *.o
You may need to substitute for your Fortran77 compiler and flags, and if your system uses ranlib, run that on libfblas.a as well. It is important the Fortran77 compiler and flags used to compile this library match those used by ATLAS!
Now simply set your Make.inc's BLASlib to something like:
BLASlib = /home/rwhaley/lib/libfblas.a
You may want an optimized library if one is available, since the Level 3 tests can go on for much longer if you use only the reference library. However, only a few vendor libraries supply all of the BLAS that ATLAS provides (to be fair, ATLAS provides BLAS above those mandated by the standard; it provides all the routines present in the Fortran77 reference library). So, the easiest way to get a complete library is to also install the reference Fortran77 library from netlib, as described in the previous section.
Now, you can set BLASlib so that the optimized library is linked in first, and the reference BLAS are used for any routines not provided in the optimized library. For instance, here's an old BLASlib for using MKL:
BLASlib = /home/rwhaley/lib/libmkl32_def.a /home/rwhaley/lib/libfblas.a
For many routines, the tester cannot tell the difference between an error in the BLAS given by BLASlib, and an error in ATLAS. A subsequent section will explain how to figure this out, but understand that a lot of optimized BLAS will fail this tester, in which case you need to link against the F77 BLAS instead of the optimized version of that routine. Let us say you find out that there are errors in the optimized DTRSM. In this case, you can simply link in the F77 reference DTRSM object file first to override the one in the optimized lib. So, your BLASlib line would then look something like:
BLASlib = /home/rwhaley/FBLAS/dtrsm.o \
/home/rwhaley/lib/libmkl32_def.a /home/rwhaley/lib/libfblas.a
Obviously, if you have more than a few routines like this, just testing against the f77 reference BLAS and taking the extra runtime is the way to go.
make
As previously mentioned, however, this tester can run as long as several days. So, if you are connected to the machine with an unreliable or short-term connection, you will need to ensure it can continue to run even if you are disconnected. Under most unixes, you can do this by using the nohup command. For example:
nohup make |& tee PPRO.out &
is what I use with the tcsh shell. Bourne shell users will need a different redirect command.
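For Bourne-type shells (sh, bash, ksh), one possibility is sketched below. run_logged is a hypothetical wrapper, and unlike the tee version it writes only to the log file, so tail -f is used to watch progress.

```sh
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: start COMMAND in the background, immune to hangups,
# with stdout and stderr both captured in LOGFILE.
run_logged() {
    log=$1
    shift
    # "> file 2>&1" is the Bourne-shell way to redirect both streams; the
    # order matters, since 2>&1 must come after stdout has been redirected.
    nohup "$@" > "$log" 2>&1 &
}

# Typical use from the TSTdir (PPRO.out is just the log name used above):
#   run_logged PPRO.out make
#   tail -f PPRO.out        # watch progress; interrupting tail is harmless
```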
Once you have the error, you need to repeat it. You can try running the exact case, but sometimes that won't do it (for instance, you have a memory error that requires you to run many cases); you then need to find a small run that does demonstrate the error.
You should then apply the normal tricks (linking to F77 BLAS instead of sys blas, having the tester call the f77 blas twice, etc) to ensure the error really is in ATLAS, before tracking the error to its source.
x<pre>l<level>blastst
The BLAS testers test ATLAS against a known-good implementation, so the first thing to do is make sure the error is in ATLAS, and not the known-good implementation. To do this, compile the reference BLAS from netlib (using conservative compiler flags), as discussed in Section 8.2, and then relink and rerun the test in question. If the error goes away, you have found an error in your known-good library, not ATLAS. If it stays, you have found an error in ATLAS, and you should track it down or report it. See Section 7.5 for information on tracking problems in the LAPACK testers.
For machines with very large L1 caches, often several blocking factors that fit into L1 have roughly the same performance. In such a case, it is very likely that you want to choose the smallest achieving that rough performance, as it will allow more blocks to fit into the L2 blocking to be done later.
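The "smallest among rough ties" rule can be sketched as a small filter over kernel-timer results. Everything here is an assumption made for illustration: pick_nb is a hypothetical helper, the input format (one "NB MFLOPS" pair per line) is invented, the tolerance that defines "roughly the same" is yours to choose, and the timings shown are made up.

```sh
# pick_nb TOL
# Hypothetical helper.  Reads "NB MFLOPS" pairs on stdin and prints the
# smallest NB whose MFLOPS is within fraction TOL of the best one seen.
pick_nb() {
    awk -v tol="$1" '
        BEGIN { n = 0; best = 0 }
        NF >= 2 {
            nb[n] = $1 + 0; mf[n] = $2 + 0      # force numeric values
            if (mf[n] > best) best = mf[n]
            n++
        }
        END {
            pick = -1
            for (i = 0; i < n; i++)
                if (mf[i] >= best * (1 - tol) && (pick < 0 || nb[i] < pick))
                    pick = nb[i]
            if (pick >= 0) print pick
        }'
}

# Made-up timings: NB=80 is nominally best, but NB=40 is within 2% of it.
printf '%s\n' '36 900' '40 1000' '56 1010' '80 1015' | pick_nb 0.02
```

With these numbers the sketch selects 40, trading about 1.5% of kernel speed for a block half the size.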
If a kernel appears to get much better performance with a large NB, the best idea is to build a full GEMM using both the best-performing small NB, and the best performing large NB, and seeing what the gap truly is. Very often, the small kernel will actually be better even asymptotically, and if it is not, it will often be so much better for smaller problems that it makes sense to use it anyway.
Even beyond these explanations, it is sometimes the case that the kernel timer predicts good performance that is not realized when the full GEMM is built. This is usually due to inadequate cache flushing, leading to overprediction of performance because things are retained more in the cache than they are in practice. Therefore, I usually pump up the flushing mechanism (set L2SIZE of your Make.inc to ridiculously large levels). No matter what, actual full GEMM performance is the final arbiter. If it is not as high as predicted by the kernel timer, it may be worthwhile to see if other, smaller NB, cases achieve the same full-gemm performance.
Therefore, if you must choose a large NB in order to get adequate GEMM performance, you must pay an unusual amount of attention to cleanup optimization. However, as the next section will discuss, even if cleanup ran at the same speed as your best kernel, this will yield poor performance for many codes.
To get an idea of this, simply scope the factorizations provided by LAPACK. These applications are statically blocked, so that the column factorizations (eg., DGETF2 for LU) are used until NB is reached. If ILAENV returns a blocking factor smaller than your GEMM, the applications will stay in cleanup even for large problems. Even worse, some applications (eg., QR) require workspace proportional to NB, and since dynamic memory is not used, it is possible that even if you hack ILAENV to use the correct blocking factor, they will be forced to a smaller one.
You should not choose an NB that is a power of 2, as this could occasionally cause nasty cache conflicts. There's often a small advantage to choosing NBs that are a multiple of cache line size; this can sometimes be critical, depending on the arch.
So, the basic idea is to start looking at the NB given by the above two computations, and then try a little smaller and larger using the kernel timer. If you get two that tie for out-of-cache performance, always take the smaller. If best performance is achieved with a very large NB, then always confirm that it yields better GEMM performance than a smaller NB, and that application performance is not severely impacted, particularly for smaller problems.
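As a rough sketch of the two rules of thumb above (avoid powers of 2, prefer multiples of the cache line), the following lists candidate blocking factors worth handing to the kernel timer. The helper names are hypothetical, the cache line length is expressed in matrix elements rather than bytes, and treating the cache-line preference as a hard filter is a simplification made for this example.

```sh
# is_pow2 N: succeed if N is a power of two.
is_pow2() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# nb_candidates LO HI LINE_ELTS
# Hypothetical helper: list every NB in [LO, HI] that is a multiple of
# LINE_ELTS (cache line length in matrix elements) and not a power of two.
nb_candidates() {
    nb=$1
    out=
    while [ "$nb" -le "$2" ]; do
        if [ $(( nb % $3 )) -eq 0 ] && ! is_pow2 "$nb"; then
            out="$out${out:+ }$nb"
        fi
        nb=$(( nb + 1 ))
    done
    printf '%s\n' "$out"
}

# Assuming a 64-byte line, i.e. 8 doubles per line, for double precision:
nb_candidates 28 80 8      # prints: 40 48 56 72 80
```

Note that 32 and 64 are dropped as powers of two; the remaining values would still need to be timed, with ties going to the smaller one.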
The way I usually time application performance is to time ATLAS's LU. This actually gives you a very rosy picture of how a large block factor will affect performance, in that it uses recursion rather than static blocking. This means that ATLAS's LU does not have any unblocked code, and thus doesn't slow down the way LAPACK's LU will for large NB. However, if even this code shows performance loss for smaller sizes, you know your cleanup needs to get a lot better, or you need to reduce NB, even if it results in a slight reduction in GEMM performance.
If you want to get a better idea of how most applications will perform, time one of LAPACK's factorizations instead.
Under no circumstances should you choose a blocking factor much larger than 120. I confine the ATLAS search to a maximal size of 80 for the above reasons, but occasionally go a little higher for machines without effective L1 caches. However, this can absolutely kill application performance. Further, it is never a good idea to completely fill a Level 2 cache with your block. It may look good in GEMM, but it will die in any application, both for the reasons above, and the following: the L2 cache is shared instruction/data. Filling it with data will often lead to an instruction loading/flushing cycle when a larger application is calling. Remember that GEMM is of interest because of all the applications that are built from it, not when used in isolation.
If an NB larger than 60 only gives you a few percent, always choose a smaller one; only go above 80 for significant advantage, and essentially don't go above 120 unless absolutely necessary, and then you can expect slowdown in many applications, even once you have fully optimized all cleanup cases.
For ATLAS 3.7.12, ATLAS's configure routine was completely rewritten for greater modularity. The total amount of code probably increased, but the amount that must be examined at any time should be very much smaller.
In the new system, the topmost unit is ATLAS/configure which is a BFI shell script which allows ATLAS's config.c to be invoked in a way very similar to gnu configure. This shell script gathers some info and fills in a Makefile which is then used to build xconfig from ATLAS/CONFIG/src/config.c. config.c is a driver program that first calls various probes to determine any information not overridden by user flags, and then calls xspew to create a full Make.inc for the target architecture. xspew is built from the file ATLAS/CONFIG/src/SpewMakeInc.c.
The idea is to change ATLAS's install so it consists of the following commands:
Every type of probe has a frontend driver (occasionally, config may directly call the backend driver, if there is only one) which will itself call multiple backend drivers. For instance, the probe to compute the architecture runs on the frontend, and calls different backend drivers depending on the assembly dialect and operating system of the backend. The files for the frontend drivers are located in ATLAS/CONFIG/src, and the backend files are in ATLAS/CONFIG/src/backend, with all include files in ATLAS/CONFIG/include. All frontend probes use the file atlconf_misc.c (prototyped in atlconf_misc.h), which handles things like file I/O, issuing shell commands, etc. The current probes used by config are:
( (1<<ISA0) | (1<<ISA1) | ... | (1<<ISAn) )
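The expression above packs a set of ISA extensions into one integer, one bit per extension. As a sketch of how such a value is built and queried, shell arithmetic uses the same shift and OR operators as C. The helper names and the ISA_FOO, ISA_BAR and ISA_BAZ bit positions are invented for this example and are not taken from the ATLAS headers.

```sh
# isa_mask BIT...: OR together (1 << BIT) for each bit position given.
isa_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '%d\n' "$mask"
}

# isa_has MASK BIT: succeed if BIT is set in MASK.
isa_has() {
    [ $(( $1 & (1 << $2) )) -ne 0 ]
}

# Invented bit positions, for illustration only.
ISA_FOO=0
ISA_BAR=2
ISA_BAZ=3
m=$(isa_mask "$ISA_FOO" "$ISA_BAR" "$ISA_BAZ")    # 1 + 4 + 8 = 13
isa_has "$m" "$ISA_BAR" && echo "BAR supported"
```

A single integer like this is convenient for a probe to print and for the frontend to parse, which is presumably why the encoding was chosen.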
The frontend wrapper script archinfo.c calls these guys according to OS, and tries to get all flags filled in with the union of functionality of archinfo_x86 and archinfo_<OS>.
Deprecated machines (no longer supported in config or arch def):
Still missing HPUX support. Linux and FreeBSD support best tested.
This is complicated as hell. Potentially, each architecture/OS combo has unique compiler and flags for each supported compiler (more below), and the user can override any/all of these. I'm changing the number of supported compilers for greater flexibility. These are:
Here's my present design:
The translation was initiated by R. Clint Whaley on 2007-10-10