ATLAS Pentium 4 Timings

This graph compares the performance of ATLAS 3.5.6 and 3.4.1 using double precision matrix multiply and LU factorization.

DGEMM and LU Factorization on a 1.7GHz P4 (256K L2 Cache)

OK, the keen of eye may have noticed how great the N = 1800 performance is, and dismissed it as a timing artifact. Not so, however. This performance bump is due to the fact that 1800 is a multiple of the gemm kernel's blocking factor, 72. Unlike on the Opteron, I have not written all the cleanup cases (instead, I rely on Camm and Peter's earlier work). To see the effect of not getting the cleanup to work as efficiently as the primary kernel, I reproduce the above timings for 3.5.6, but include problem sizes that are multiples of 72:

DGEMM and LU Factorization on a 1.7GHz P4 (256K L2 Cache)
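
To make the full-block versus cleanup split concrete, here is a minimal Python sketch. It is only an illustration, not ATLAS code: the blocking factor of 72 comes from the text above, while the function names and the "fraction of flops outside full blocks" estimate are my own simplifying assumptions (the estimate just treats any work that touches a partial block in some dimension as cleanup).

```python
NB = 72  # blocking factor of the gemm kernel, as stated in the text


def split_dimension(n, nb=NB):
    """Return (full_blocks, remainder) for a dimension of size n."""
    return divmod(n, nb)


def cleanup_fraction(n, nb=NB):
    """Rough share of an n x n x n multiply that falls outside full blocks.

    ASSUMPTION: a deliberately crude model, not how ATLAS accounts for
    cleanup. Work is counted as "primary kernel" only when all three
    dimensions lie inside the largest multiple of nb that fits in n.
    """
    full = (n // nb) * nb
    return 1.0 - (full / n) ** 3


if __name__ == "__main__":
    for n in (1700, 1800, 1900):
        blocks, rem = split_dimension(n)
        print(f"N={n}: {blocks} full blocks, remainder {rem}, "
              f"cleanup share ~{cleanup_fraction(n):.3f}")
```

For N = 1800 the remainder is zero, so every flop goes through the primary kernel; for neighboring sizes some share of the work lands in the slower cleanup code, which is the source of the bumps.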

The reason I have not written the cleanup cases is that I do not believe this kernel is done. The present DGEMM is coming in at around 76% of peak, and I'd really like to push that up to around 80% before calling it good enough. So, for now, we are staying with this bumpy ride.
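
For readers who want to translate those percentages into MFLOPS, here is a small sketch. The 1.7GHz clock and the 76% and 80% figures come from this page; the peak of 2 double precision flops per cycle is an assumption on my part about the P4's SSE2 unit, and the function names are purely illustrative. The 2N^3 flop count is the usual convention for timing matrix multiply.

```python
CLOCK_MHZ = 1700      # 1.7GHz P4, from the graph titles
FLOPS_PER_CYCLE = 2   # ASSUMPTION: double precision peak per cycle on this chip


def peak_mflops(clock_mhz=CLOCK_MHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in MFLOPS under the assumed flops-per-cycle rate."""
    return clock_mhz * flops_per_cycle


def mflops_at(fraction_of_peak):
    """MFLOPS corresponding to a given fraction of the assumed peak."""
    return fraction_of_peak * peak_mflops()


def dgemm_mflops(n, seconds):
    """MFLOPS for an n x n x n DGEMM, using the conventional 2*n^3 flop count."""
    return 2.0 * n ** 3 / seconds / 1e6


if __name__ == "__main__":
    print(f"assumed peak:  {peak_mflops()} MFLOPS")
    print(f"76% of peak:   {mflops_at(0.76):.0f} MFLOPS")
    print(f"80% of peak:   {mflops_at(0.80):.0f} MFLOPS")
```

If the assumed peak is wrong for your machine, change `FLOPS_PER_CYCLE`; the percentages in the text are the author's, the absolute numbers here are only as good as that assumption.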


Here's the standard graph for single precision real:

SGEMM and LU Factorization on a 1.7GHz P4 (256K L2 Cache)

3.5.6 actually uses the same SGEMM kernel as 3.4.1, so the only performance difference you see for single precision is that 3.5.6's CacheEdge is not as good.

I no longer have access to MKL to compare against. Last time I did was here.

