The P4E is a redesign of the P4 core, and it has two significant advantages as far as ATLAS is concerned: it supports the new SSE3 instructions, and it has larger Level 1 and Level 2 caches.
In order to determine how much of a role SSE3 plays in the performance advantage the P4E enjoys over the P4, I have made three different timing runs, and Dean Gaudet has contributed a fourth. In the chart below, 2.8P4E_SSE3 indicates ATLAS 3.7.3 on a 2.8GHz Pentium 4E using SSE3. 2.8P4E_SSE2 is the performance without using SSE3, and is what you will get if you install ATLAS 3.6.0 and don't take the P4 arch defaults! Unfortunately, I do not have access to an actual 2.8GHz P4, so the chart provides two estimates instead. 1.4P4SSE2 labels the performance curve of a 1.4GHz chip, with all results multiplied by 2 for comparison. 2.4P4SSE2 labels 2.4GHz P4/Xeon results sent in by Dean Gaudet, scaled by 2.8/2.4.
In general, performance does not scale linearly with clock speed (doubling the clock does not double performance), so these estimates are probably generous to the older chips.
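The scaling used for the two estimates is plain linear clock scaling, which can be sketched as follows. This is only an illustration of the arithmetic: the function name and the sample MFLOPS value are hypothetical and are not taken from the chart or from ATLAS.

```python
def scale_to_clock(mflops, measured_ghz, target_ghz):
    """Linearly scale a measured rate to a different clock speed.

    This assumes performance is proportional to clock speed, which
    tends to overestimate what a real chip at the target clock would do.
    """
    return mflops * (target_ghz / measured_ghz)


# Hypothetical measured rate, for illustration only (not from the chart).
sample_mflops = 1000.0

# 1.4GHz results are multiplied by 2.8/1.4 = 2.
est_from_1_4ghz = scale_to_clock(sample_mflops, 1.4, 2.8)

# 2.4GHz results are multiplied by 2.8/2.4 (roughly 1.17).
est_from_2_4ghz = scale_to_clock(sample_mflops, 2.4, 2.8)

print(est_from_1_4ghz, est_from_2_4ghz)
```

Note that the 1.4GHz curve is stretched by a much larger factor than the 2.4GHz curve, so any error from the linear assumption weighs more heavily on the 1.4GHz estimate.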
The general order, from best to worst, is clearly 2.8P4ESSE3, 2.8P4ESSE2, 2.4P4SSE2, 1.4P4SSE2. The oldest chip does the worst, probably because it is a model-0 P4 with only a 256K L2 cache. The Xeon (model 2/Northwood) P4 does almost as well as the P4ESSE2 for gemm, but not nearly as well for LU. This is probably due to the P4E using a smaller block size (36 vs. 72), which is an advantage for the factorizations.
Finally, SSE3 gives a modest improvement. It is modest because the operations SSE3 helps with are a low-order term for the Level 3 BLAS; the Level 1 and 2 BLAS should get a more impressive speedup from SSE3. Also, the SSE3 support presently in ATLAS is not as complete as it could be (in particular, I have not put SSE3 instructions in the cleanup code).