Timing a kernel

IMPORTANT NOTE: At present, all ATLAS Level 1 timing is inaccurate for short vectors. Both the Level 1 timers in ATLAS/bin and the kernel timers described here have this problem. Our portable cache-flushing mechanisms are not thorough enough to get the operands completely out of cache, so short vectors appear to get better performance than long vectors, which would be impossible if all caches were correctly flushed. The only "solution" we have at the moment is to time vectors that themselves overflow the cache (e.g., N=1000000).

Timing works a lot like testing. If your BLAS routine is <blas>, and your file in the /<blas> subdir is <rout>, then you compile and time it by:

   make <pre><blas>case urout=<rout>

So, to time the previously mentioned myaxpy.c with a vector of length one million, you'd do the following:

speedy. make daxpycase urout=myaxpy.c N=1000000
      N=1000000, tim=5.000000e-02
      N=1000000, tim=5.000000e-02
      N=1000000, tim=5.000000e-02
   N=1000000, time=5.000000e-02, mflop=40.000000
N=1000000, incX=1, incY=1, mflop = 40.000000
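
The mflop figure in this output is consistent with counting two floating point operations (one multiply, one add) per vector element and dividing by the averaged time. The following is a minimal sketch of that arithmetic, not the timer's actual source; the function names axpy_flops and axpy_mflop are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical helper: axpy (y = alpha*x + y) does one multiply and
 * one add per element, so N elements cost 2*N flops. */
double axpy_flops(long n)
{
    return 2.0 * (double)n;
}

/* Hypothetical helper: convert a flop count and an elapsed time in
 * seconds into millions of flops per second. */
double axpy_mflop(long n, double seconds)
{
    assert(seconds > 0.0);   /* a zero or negative time is meaningless */
    return axpy_flops(n) / seconds / 1.0e6;
}
```

Calling axpy_mflop(1000000, 5.0e-2) gives 2000000 flops in 0.05 seconds, which is the 40 Mflop reported above.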

So, we got 40 Mflop out of that implementation. All the timers take the flag -C <cache flush size in bytes>, which you can use to get the L1 (or L2) contained performance. For example, the timer usually does its best to flush the caches, but suppose I want to see what performance I get when the operands are in the L1 cache. What I do is:

make daxpycase urout=myaxpy.c N=500 opt="-C 512"
      N=500, tim=4.296302e-06
      N=500, tim=3.938276e-06
      N=500, tim=4.296302e-06
   N=500, time=4.176960e-06, mflop=239.408571
N=500, incX=1, incY=1, mflop = 239.408571

Flushing 512 bytes is not going to do anything, and N=500 will not overflow the cache, so we see that L1-contained operations come in at about 240 Mflop.
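
To judge whether a given N is cache-contained, it helps to estimate how many bytes the operation touches. The sketch below assumes unit-stride double precision vectors and takes the cache size as a parameter you supply yourself; the function names are hypothetical, and the 32KB L1 size mentioned afterwards is an assumption for illustration, not a property of any particular machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes touched by a unit-stride double precision
 * axpy, which reads N elements of X and reads and writes N elements of Y. */
size_t daxpy_working_set_bytes(size_t n)
{
    /* guard against overflow of the byte count */
    assert(n <= (size_t)-1 / (2 * sizeof(double)));
    return 2 * n * sizeof(double);
}

/* Hypothetical helper: nonzero if both vectors fit together in a cache
 * of cache_bytes bytes (the caller supplies the cache size). */
int fits_in_cache(size_t n, size_t cache_bytes)
{
    return daxpy_working_set_bytes(n) <= cache_bytes;
}
```

With 8-byte doubles, N=500 touches 8000 bytes, which fits in an assumed 32KB L1 cache, while N=1000000 touches 16000000 bytes and does not.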

When it comes to timing codes with varying increments, the timer is not as flexible as the tester. It can time only one given increment value at a time. So, to time axpy with incX=3 and incY=4, we'd do:

   make daxpycase urout=myaxpy.c N=500 opt="-X 3 -Y 4"
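
For readers unfamiliar with increments, the following sketch shows the access pattern that incX=3 and incY=4 imply: successive elements of X are three array slots apart and successive elements of Y are four apart. This is a plain reference loop under the assumption of positive increments only, not ATLAS's kernel interface; the names ref_daxpy and span_elements are hypothetical.

```c
#include <assert.h>

/* Hypothetical reference axpy: y = alpha*x + y over n logical elements,
 * stepping incx slots through x and incy slots through y.
 * Only positive increments are handled in this sketch. */
void ref_daxpy(int n, double alpha, const double *x, int incx,
               double *y, int incy)
{
    int i;
    assert(incx > 0 && incy > 0);
    for (i = 0; i < n; i++)
        y[(long)i * incy] += alpha * x[(long)i * incx];
}

/* Hypothetical helper: number of array slots spanned by n logical
 * elements at a positive increment, i.e. 1 + (n-1)*inc for n > 0. */
long span_elements(int n, int inc)
{
    assert(inc > 0);
    return n > 0 ? 1 + (long)(n - 1) * inc : 0;
}
```

With N=500, X spans 1498 array slots and Y spans 1997, so a strided run covers more memory than the unit-stride case with the same N.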

Clint Whaley 2012-07-10