ATLAS 3.7.3 Pentium 4 Timings

This release was mainly about SSE3 and Pentium 4E (Prescott) support. ATLAS 3.6.0 does not distinguish between a Pentium 4 and a Pentium 4E, but using P4 architectural defaults on a P4E will significantly depress performance. When installed with good defaults, a P4E is faster than a P4 of the same clock speed.

The P4E is a redesign of the core, and it has two significant advantages as far as ATLAS is concerned: it possesses support for the new SSE3 instructions, and it has larger Level 1 and Level 2 caches.

In order to determine how much of a role SSE3 plays in the performance advantage enjoyed by the P4E over a P4, I have made three different timing runs. In addition, Dean Gaudet has contributed a fourth. In the chart below, 2.8P4E_SSE3 indicates ATLAS 3.7.3 on a 2.8GHz Pentium 4E using SSE3. 2.8P4E_SSE2 is the performance without using SSE3, and is what you will get if you install ATLAS 3.6.0 and don't take the P4 arch defaults! I do not, unfortunately, have access to an actual 2.8GHz P4, so the chart provides two estimates instead. 1.4P4SSE2 labels the performance curve of a 1.4GHz chip, where all results have been multiplied by 2 for comparison. 2.4P4SSE2 labels 2.4GHz P4/Xeon results sent in by Dean Gaudet, scaled by 2.8/2.4.

In general, doubling the clock speed does not result in doubling of performance, so these estimates are probably generous.
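The scaling used for the two P4 estimates is plain linear clock scaling, which can be sketched as below. The function name and the sample MFLOPS value are hypothetical, chosen only for illustration; they are not figures from the chart.

```python
def scale_to_clock(mflops, measured_ghz, target_ghz):
    """Linearly rescale a measured rate to a different clock speed.

    This assumes performance grows in direct proportion to clock speed,
    which is an optimistic assumption for real hardware.
    """
    return mflops * (target_ghz / measured_ghz)


# Hypothetical measurement, not a number taken from the chart.
sample_mflops = 1000.0

# 1.4GHz result scaled to 2.8GHz: a factor of 2.
est_from_1_4 = scale_to_clock(sample_mflops, 1.4, 2.8)

# 2.4GHz result scaled to 2.8GHz: a factor of 2.8/2.4.
est_from_2_4 = scale_to_clock(sample_mflops, 2.4, 2.8)
```

Because the 1.4GHz curve is extrapolated across a much larger clock gap than the 2.4GHz one, any overestimate from this linear assumption is likely to be larger for it.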

DGEMM and DLU on various Pentium 4 architectures

The general order is clearly 2.8P4ESSE3, 2.8P4ESSE2, 2.4P4SSE2, 1.4P4SSE2. The oldest chip does the worst, probably because it is a model-0 P4, with only a 256K L2. The Xeon (model 2/Northwood) P4 does almost as well as the P4ESSE2 for gemm, but not nearly as well for LU. This is probably due to the P4E using a smaller block size (36 vs 72), which is an advantage for the factorizations.
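To give a feel for the difference between the two block sizes, the sketch below computes the memory taken by a single square block of doubles at each size. It is only arithmetic on the block dimensions: the helper name is hypothetical, and it does not model the caches or how ATLAS actually lays out its operands.

```python
BYTES_PER_DOUBLE = 8  # DGEMM and DLU operate on double precision data


def block_bytes(nb):
    """Bytes occupied by one nb x nb block of doubles."""
    return nb * nb * BYTES_PER_DOUBLE


for nb in (36, 72):
    print(nb, block_bytes(nb))
```

Doubling the block dimension quadruples the storage per block, so the 72 block is four times the size of the 36 block.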

Finally, SSE3 gives a modest improvement. Note that this is because the stuff SSE3 helps with is a low-order term for the Level 3 BLAS; Level 1 and 2 BLAS should get a more impressive speedup from SSE3. Also, the SSE3 presently in ATLAS is not as complete as it could be (in particular, I have not put SSE3 instructions in the cleanup code).

