Timing works a lot like testing. If your BLAS routine is <blas>, and your file in the /<blas> subdir is <rout>, then you compile and time it by:
make <pre><blas>case urout=<rout>
So, to time the previously mentioned myaxpy.c with a million-length vector, you'd have the following:
speedy. make daxpycase urout=myaxpy.c N=1000000
N=1000000, tim=5.000000e-02
N=1000000, tim=5.000000e-02
N=1000000, tim=5.000000e-02
N=1000000, time=5.000000e-02, mflop=40.000000
N=1000000, incX=1, incY=1, mflop = 40.000000
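The reported numbers are consistent with the timer averaging the three trial times and charging axpy 2N flops (one multiply and one add per element). That is an inference from the output, not something taken from the timer's source, and the function names are made up for illustration. A minimal sketch of the arithmetic under that assumption:

```c
#include <assert.h>

/* Assumed flop count for axpy (y <- alpha*x + y):
   one multiply and one add per element. */
double axpy_flops(int n)
{
    return 2.0 * (double)n;
}

/* Average the per-trial times (in seconds), then convert to Mflop,
   i.e. millions of floating point operations per second. */
double axpy_mflop(int n, const double *times, int ntrials)
{
    double sum = 0.0;
    int i;

    assert(ntrials > 0);
    for (i = 0; i < ntrials; i++)
        sum += times[i];
    return axpy_flops(n) / (sum / ntrials) / 1.0e6;
}
```

With N=1000000 and three trials of 5.0e-02 seconds each, this gives 2,000,000 flops / 0.05 seconds = 40 Mflop, matching the last line of the output.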
So, we got 40Mflop out of that implementation. All the timers take the flag -C <cache flush size in bytes>, which you can use to get the L1 (or L2) contained performance. E.g., the timer is usually doing its best to flush the caches, but I want to see what performance I get when in the L1 cache. What I do is:
make daxpycase urout=myaxpy.c N=500 opt="-C 512"
N=500, tim=4.296302e-06
N=500, tim=3.938276e-06
N=500, tim=4.296302e-06
N=500, time=4.176960e-06, mflop=239.408571
N=500, incX=1, incY=1, mflop = 239.408571

Flushing 512 bytes is not going to do anything, and N=500 will not overflow cache, so we see that L1-contained operations come in at 240Mflop . . .
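To see why N=500 stays in cache while N=1000000 does not, it helps to count the bytes axpy touches. The sketch below assumes unit increments and uses hypothetical helper names. The cache size is a parameter you supply for your own machine, not something the timer reports.

```c
#include <stddef.h>

/* Bytes touched by a unit-stride axpy: n elements of X plus n elements of Y. */
size_t axpy_working_set_bytes(size_t n, size_t elem_size)
{
    return 2 * n * elem_size;
}

/* Returns 1 if both vectors together fit in a cache of cache_bytes, else 0. */
int axpy_fits_in_cache(size_t n, size_t elem_size, size_t cache_bytes)
{
    return axpy_working_set_bytes(n, elem_size) <= cache_bytes;
}
```

For double precision (8-byte elements), N=500 touches 8000 bytes, which fits in, say, a 32KB L1 data cache. N=1000000 touches 16,000,000 bytes, which does not.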
When it comes to timing codes with varying increments, the timer is not as flexible as the tester. It can time only one given increment value at a time. So, to time axpy with incX=3 and incY=4, we'd do:
make daxpycase urout=myaxpy.c N=500 opt="-X 3 -Y 4"
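For reference, here is what non-unit increments mean for the operation being timed. This is a plain reference loop with a hypothetical name, not the ATLAS kernel interface, and it assumes both increments are positive:

```c
/* Reference strided axpy: y[i*incy] += alpha * x[i*incx] for i = 0..n-1.
   Assumes incx > 0 and incy > 0. */
void ref_daxpy(int n, double alpha, const double *x, int incx,
               double *y, int incy)
{
    int i;

    for (i = 0; i < n; i++)
        y[i * incy] += alpha * x[i * incx];
}
```

With N=500, incX=3 and incY=4, the loop reads every third element of X and updates every fourth element of Y, so X must span at least 1498 elements and Y at least 1997, even though only 500 elements of each are used.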
Clint Whaley 2012-07-10