The ATLAS time step

In this optional step, ATLAS times certain kernel routines and reports their performance as a percentage of clock rate. Its purpose is to provide a quick way to ensure that your install has resulted in a library that obtains adequate performance. If you are installing using architectural defaults, this step will print a timing comparison against the performance that the ATLAS maintainer got when creating the architectural defaults. To invoke this step, issue the following command in your BLDdir:
   make time

Figure: Normal results for make time on Core2Duo64SSE3
   ....1 119.4 116.9 29.1 26.0 46.8 45.6

Figure: Timing results when architectural defaults are compiled with substandard gcc4.1
   Reference clock rate=2200Mh...
   ...28.8 91.8 65.1 23.7 18.3 46.8 40.3

In the first figure above we see a typical printout of a successful install, in this case run on my 2.4GHz Core2Duo. The Refrenc columns provide the performance achieved by the architectural defaults when they were originally created, while the Present columns provide the results obtained using the new ATLAS install we have just completed. We see that the Present columns win occasionally (e.g. single precision real kSelMM) and lose sometimes (e.g. single precision complex kSelMM), but that the timings are relatively similar across the board. This tells us that the install is OK from a performance angle.

As a general rule, performance for both data types of a particular precision should be roughly comparable, but may vary dramatically between precisions (due mainly to differing vector lengths in SIMD instructions).

The timings are normalized to the clock rate, which is why the clock rates of both the reference and present installs are printed. It is expected that as clock rates rise, performance as a percentage of clock rate may fall slightly (since memory bus speeds do not usually rise in exact lockstep). Therefore, if I installed on a 3.2GHz Core2Duo, I would not be surprised if the Present install lost by a few percentage points in most cases.
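As a rough illustration of this normalization, the following Python sketch converts a raw rate into a percentage of clock rate. It assumes the percentage is simply MFLOPS divided by the clock rate in MHz, times 100; the function name and the sample MFLOPS value are hypothetical and chosen only to make the arithmetic easy to follow, not taken from ATLAS itself.

```python
def percent_of_clock(mflops, clock_mhz):
    """Return performance as a percentage of clock rate.

    Assumption: the normalization is MFLOPS / MHz * 100, so a value
    above 100 means more than one floating point operation per cycle.
    """
    if clock_mhz <= 0:
        raise ValueError("clock rate must be positive")
    return 100.0 * mflops / clock_mhz


# Hypothetical measurement: 2865.6 MFLOPS on a 2400 MHz machine.
pct = percent_of_clock(2865.6, 2400.0)
print(f"{pct:.1f}% of clock rate")
```

Because both installs are expressed this way, a Refrenc value and a Present value can be compared directly even when the two machines run at different clock rates.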

True problems typically show a significant loss that occurs in a pattern. The most common cause is installing with a poor compiler, which lowers the performance of most compiled kernels without affecting the speed of assembly kernels. The second figure above shows such an example, where gcc 4.1 (a terrible compiler for floating point arithmetic on x86 machines) has been used to install ATLAS on an Opteron, rather than gcc 4.7.0, the compiler that was used to create the architectural defaults. Here, we see that the present machine is actually clocked slower than the machine that was used to create the defaults, so if anything, we expect it to achieve a greater percentage of clock rate. Indeed, this is more or less true of the first line, kSelMM. On this platform, kSelMM is written entirely in assembly, and BIG_MM calls these kernels, so the Present results are good for these rows. All the other rows show kernels that are written in C, and so we see that the use of a bad compiler has markedly depressed performance across the board. Any time you see a pattern such as this, the first thing you should check is whether you are using a recommended compiler, and if not, install and use that compiler.

On the other hand, if only your BIG_MM row is depressed, it is likely you have a bad setting for the CacheEdge or for the complex-to-real crossover point (the latter if the performance is depressed only for the two complex types).
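The two patterns described above can be sketched as a small triage routine. This is only an illustration under stated assumptions: the function name, the 0.85 tolerance, the row names c_kernel_1 and c_kernel_2, and all of the numbers are hypothetical, and the set of assembly-backed rows (kSelMM and BIG_MM) reflects the Opteron example in the text rather than every platform.

```python
def diagnose(rows, asm_backed=("kSelMM", "BIG_MM"), tolerance=0.85):
    """Classify a make time comparison for one data type.

    rows maps a row name to a (refrenc, present) pair, both given as
    percentages of clock rate. A row counts as depressed when present
    falls below tolerance * refrenc (the tolerance is an assumption).
    """
    depressed = {name for name, (ref, present) in rows.items()
                 if present < tolerance * ref}
    compiled = set(rows) - set(asm_backed)

    if not depressed:
        return "ok"
    if depressed == {"BIG_MM"}:
        return "check CacheEdge / complex-to-real crossover"
    # Only compiled rows are slow, and most of them are: suspect the compiler.
    if depressed <= compiled and 2 * len(depressed) > len(compiled):
        return "check compiler"
    return "inspect manually"


# Hypothetical numbers: assembly-backed rows fine, C kernels depressed.
bad_compiler = {
    "kSelMM": (100.0, 101.0),
    "BIG_MM": (95.0, 96.0),
    "c_kernel_1": (30.0, 20.0),
    "c_kernel_2": (50.0, 35.0),
}
print(diagnose(bad_compiler))
```

A verdict from such a routine is only a hint about where to look first; the printed table itself remains the thing to read.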

R. Clint Whaley 2016-07-28