Tracking down an error in the BLAS interface testers

The BLAS testers are split by BLAS Level (1, 2 or 3) and precision/type (s,d,c,z). The basic names of the tester executables are
    x<pre>blat<lvl>
    x<pre>cblat<lvl>
for Fortran77 and C, respectively. The Level 1 testers (x[s,d,c,z]blat1) test certain fixed cases, and thus take no input file. So if the error is in one of them, you simply run the executable with no arguments in order to reproduce the failure.
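As a quick illustration of the naming scheme above, here is a minimal shell sketch that builds a tester name from a precision prefix, a level, and an interface. The function name tester_name and the "f77"/"c" interface labels are hypothetical helpers of mine, not part of ATLAS.

```shell
#!/bin/sh
# Print the tester executable name for a precision prefix (s, d, c, z),
# a BLAS level (1, 2, 3), and an interface label ("f77" or "c").
# Follows the x<pre>blat<lvl> and x<pre>cblat<lvl> patterns given above.
# tester_name and the interface labels are hypothetical, for illustration.
tester_name() {
    pre=$1; lvl=$2; iface=$3
    case $iface in
        f77) printf 'x%sblat%s\n' "$pre" "$lvl" ;;
        c)   printf 'x%scblat%s\n' "$pre" "$lvl" ;;
        *)   echo "unknown interface: $iface" >&2; return 1 ;;
    esac
}

# List the Level 2 tester names for both interfaces and all precisions.
for iface in f77 c; do
    for pre in s d c z; do
        tester_name "$pre" 2 "$iface"
    done
done
```

For example, tester_name d 2 f77 gives the xdblat2 name used in the examples in this section.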

The Level 2 and 3 testers allow a user to specify what tests should be run, via an input file. The standard input files that ATLAS runs with are:

   <pre>blat<lvl>.dat
   c_<pre>blat<lvl>.dat
for Fortran77 and C, respectively. The format of these input files is largely self-explanatory, and more explanation can be found at:
   www.netlib.org/blas/faq.html
To run the tester with these files, you redirect them into the tester. For instance, to run the double precision Level 2 tester with the default input file, you'd issue:
   ./xdblat2 < ~/ATLAS/interfaces/blas/F77/testing/dblat2.dat

You should be aware that only the first error report in a run is accurate: one error can cause a cascade of spurious error reports, all of which may go away by fixing the first reported problem. So, it is important to find and fix the errors in sequence.
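Since only the first report matters, it can help to pull just that line out of a saved tester report. The following is a minimal sketch, assuming you have captured the report in a file; the helper name first_failure and the default pattern 'FAIL|ERROR' are my assumptions, so adjust the pattern to whatever wording your tester actually prints.

```shell
#!/bin/sh
# Print the first line of a saved tester report that matches a failure
# pattern, prefixed by its line number. Returns nonzero if nothing matches.
# The default pattern is an assumption about the report wording.
first_failure() {
    report=$1
    pattern=${2:-'FAIL|ERROR'}
    awk -v pat="$pattern" '
        $0 ~ pat { print NR ": " $0; found = 1; exit }
        END      { exit !found }
    ' "$report"
}

# Example usage (report.txt is a hypothetical capture of the tester output):
#   ./xdblat2 < ~/ATLAS/interfaces/blas/F77/testing/dblat2.dat > report.txt 2>&1
#   first_failure report.txt
```

Everything the report says after the line this prints should be treated as suspect until that first problem is fixed.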

I usually copy the input file in question to a new file that I can hack on. For instance, if the error was in the double precision Level 2 tester, I might issue:

   cp ~/ATLAS/interfaces/blas/F77/testing/dblat2.dat bad.dat
I then repeatedly run the tester and simplify the input file until I have found the smallest, simplest input that displays the error.
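One step of this simplify-and-rerun loop can be sketched as follows. This is only an illustration under stated assumptions: the helper names still_fails and accept_if_failing are hypothetical, the failure pattern is a guess at the report wording, and the sketch assumes the tester's report arrives on standard output (if your build writes it to a file instead, search that file).

```shell
#!/bin/sh
# Return success (0) if running tester $1 with input file $2 redirected
# into it produces output matching the failure pattern $3.
# The default pattern is an assumption about the report wording.
still_fails() {
    tester=$1; input=$2; pattern=${3:-'FAIL|ERROR'}
    "$tester" < "$input" 2>&1 | grep -E -q "$pattern"
}

# Accept a simplified candidate input only if it still reproduces the
# failure; otherwise leave the current failing input untouched.
accept_if_failing() {
    tester=$1; candidate=$2; current=$3
    if still_fails "$tester" "$candidate"; then
        cp "$candidate" "$current"
        echo "kept: $candidate still fails"
    else
        echo "rejected: $candidate no longer fails"
        return 1
    fi
}

# Example usage (try.dat is a hypothetical hand-edited copy of bad.dat):
#   accept_if_failing ./xdblat2 try.dat bad.dat
```

The point of the design is that bad.dat always holds the smallest input known to fail, so a simplification that loses the error costs you nothing.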

The next step is to rule out tester error. The way I usually do this is to demonstrate that the error goes away by linking to the Fortran77 reference BLAS rather than ATLAS (you can only do this for errors in the F77 interface, obviously). I usually just do it by hand, i.e., for the same example again, I'd do:

   f77 -o xtst dblat2.o /home/rwhaley/lib/libfblas.a
If the ATLAS-linked code has the error, and this one does not, it is a strong indication that the error is in ATLAS. If the F77 BLAS are shown to be in error, it is usually a compiler error, and can be fixed by turning down (or off) the optimization used to compile the tester.
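The decision just described can be written down as a small shell sketch that compares two saved reports produced from the same input file, one from the ATLAS-linked tester and one from the reference-linked tester. The helper name classify, the report file names, and the 'FAIL|ERROR' pattern are all assumptions of mine for illustration.

```shell
#!/bin/sh
# Classify a failure from two saved reports generated with the same input:
#   $1 = report from the ATLAS-linked tester
#   $2 = report from the tester linked to the Fortran77 reference BLAS
#   $3 = failure pattern (the default is an assumed report wording)
classify() {
    atlas_report=$1; ref_report=$2; pattern=${3:-'FAIL|ERROR'}
    if ! grep -E -q "$pattern" "$atlas_report"; then
        echo "no failure in the ATLAS-linked run"
    elif grep -E -q "$pattern" "$ref_report"; then
        echo "reference BLAS also fails: suspect the tester build"
    else
        echo "only the ATLAS-linked run fails: suspect ATLAS"
    fi
}

# Example usage (atlas.out and ref.out are hypothetical capture files):
#   ./xdblat2 < bad.dat > atlas.out 2>&1
#   ./xtst    < bad.dat > ref.out   2>&1
#   classify atlas.out ref.out
```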

Now you should have confirmed the tester is working properly, and that the error is in a specific routine (let us say DNRM2 as an example). As a quick proof that DNRM2 is indeed the problem, you can link explicitly to the F77 version of DNRM2, and to ATLAS for everything else (see Section [*] for hints on how to do this). If this makes the error go away, you can be confident that ATLAS's DNRM2 is indeed causing the problem, and you should either track it down, or report it (depending on your level of expertise).

Clint Whaley 2012-07-10