The timings are given as a percentage of the old code's performance (i.e.
100% means the same speed as the old threading provided). First, here is the
performance improvement for LU, QR and Cholesky factorizations on
an 8-processor, 2.5 GHz Core2 server running Linux (ATLAS3.9.5):
So, we see that the new threading probably needs a little work on crossover points, as we lose to the old code on LU for small problems. However, at best LU is twice as fast, and asymptotically we see a >60% improvement!
Also new in this release, ATLAS can use native Windows threads in addition
to using pthreads. Here is the speedup we got on a 2.4 GHz Core2Quad running
WinXP using the new native-threads port (with the above threading techniques):
Again, we see performance loss for small problems, and dominating wins for large problems. The Y axis is now just speedup, so at best, the new code ran roughly 2.6 times faster than ATLAS's original threaded LU on a quad-core Windows box. The above timings were on ATLAS 3.9.5.
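Since the Linux numbers are reported as a percentage of the old code's speed and the Windows numbers as plain speedup, a small sketch of how the two scales relate may help when comparing them. It assumes both are derived from wall-clock times for the same problem as old time divided by new time; the function names and the sample times are hypothetical and are not taken from ATLAS or from the measurements above.

```python
def speedup(old_seconds: float, new_seconds: float) -> float:
    """Speedup of the new code over the old one (1.0 means same speed).

    Assumes both times were measured on the same problem size.
    """
    if old_seconds <= 0 or new_seconds <= 0:
        raise ValueError("times must be positive")
    return old_seconds / new_seconds


def percent_of_old(old_seconds: float, new_seconds: float) -> float:
    """New code's speed as a percentage of the old code's (100.0 means same speed)."""
    return 100.0 * speedup(old_seconds, new_seconds)


if __name__ == "__main__":
    # Hypothetical times, chosen only to illustrate the conversion.
    old_t, new_t = 5.2, 2.0
    print(f"speedup:        {speedup(old_t, new_t):.2f}x")
    print(f"percent of old: {percent_of_old(old_t, new_t):.0f}%")
```

On this reading, a value below 100% (or a speedup below 1.0) marks a case where the old threading was faster, as in the small-problem results.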
Finally, here are some timings from ATLAS3.9.9 on a 6-processor
700 MHz (MIPS) SiCortex SC072 node: