# ATLAS Pentium 4 Timings

This graph compares the performance of ATLAS 3.5.6 and 3.4.1 using
double precision matrix multiply and LU factorization.

## DGEMM and LU Factorization on a 1.7GHz P4 (256K L2 Cache)

OK, the keen of eye may have noticed how great the `N = 1800`
performance is, and dismissed it as a timing artifact. Not so, however.
This performance bump is due to the fact that `1800` is a multiple
of the gemm kernel's blocking factor, `72`. Unlike on the
Opteron, I have not written all the cleanup cases (instead, I rely on
Camm & Peter's earlier work). To show the effect of the cleanup not
running as efficiently as the primary kernel,
I reproduce the above timings for 3.5.6, but include problem sizes that
are multiples of `72`:
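To make the blocking argument concrete, here is a minimal sketch that splits a problem dimension into full `72`-wide blocks plus a remainder. The helper names and the "fraction of work touching cleanup" estimate are illustrative assumptions, not anything taken from ATLAS itself; the estimate treats a square `N x N x N` multiply and counts any work outside the fully blocked region as cleanup.

```python
NB = 72  # gemm kernel blocking factor quoted in the text


def split_dimension(n, nb=NB):
    """Return (full_blocks, remainder) for a dimension of size n."""
    return divmod(n, nb)


def cleanup_fraction(n, nb=NB):
    """Rough share of an n x n x n multiply that involves a partial block.

    Assumption (hypothetical model): only the (full x full x full) region
    runs in the primary kernel; everything else goes through cleanup code.
    """
    full = (n // nb) * nb
    return 1.0 - (full / n) ** 3


if __name__ == "__main__":
    for n in (1700, 1728, 1800, 1900):
        blocks, rem = split_dimension(n)
        print(f"N={n}: {blocks} full blocks, remainder {rem}, "
              f"cleanup share ~{cleanup_fraction(n):.1%}")
```

Sizes such as `1728` and `1800` leave a remainder of zero, so under this model they never leave the primary kernel, while neighbouring sizes push a few percent of the work into the slower cleanup paths.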

## DGEMM and LU Factorization on a 1.7GHz P4 (256K L2 Cache)

The reason I have not written the cleanup cases is that I do not believe
this kernel is done. The present DGEMM is coming in around `76%`
of peak, and I'd really like to push that up to around `80%` before
calling it good enough. So, for now, we are staying with this bumpy
ride.
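For readers who want to translate the percent-of-peak figures into rates, the following is a small sketch. It assumes a theoretical peak of 2 double precision flops per cycle on this P4 (an assumption on my part, not stated above) and uses the usual `2*N^3` operation count for matrix multiply; the function names are hypothetical.

```python
def gemm_flops(n):
    """Standard operation count for an n x n x n matrix multiply."""
    return 2.0 * n ** 3


def peak_mflops(clock_mhz=1700.0, flops_per_cycle=2.0):
    """Theoretical peak; flops_per_cycle=2.0 is an assumed value."""
    return clock_mhz * flops_per_cycle


def mflops_at_percent(percent, clock_mhz=1700.0, flops_per_cycle=2.0):
    """MFLOPS rate corresponding to a given percent of peak."""
    return percent / 100.0 * peak_mflops(clock_mhz, flops_per_cycle)


def percent_of_peak(n, seconds, clock_mhz=1700.0, flops_per_cycle=2.0):
    """Percent of peak achieved by a timed n x n x n multiply."""
    mflops = gemm_flops(n) / seconds / 1.0e6
    return 100.0 * mflops / peak_mflops(clock_mhz, flops_per_cycle)


if __name__ == "__main__":
    for pct in (76, 80):
        print(f"{pct}% of peak ~ {mflops_at_percent(pct):.0f} MFLOPS")
```

Under that assumed peak, moving from `76%` to `80%` is a gain of roughly 136 MFLOPS on a 1.7GHz part.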

Here's the standard graph for single precision real:

## SGEMM and LU Factorization on a 1.7GHz P4 (256K L2 Cache)

3.5.6 actually uses the same SGEMM kernel as 3.4.1, so the only reason
for the performance difference you see in single precision is that
3.5.6's CacheEdge is not as good.

I no longer have access to MKL to compare against. Last time I did was
here.
