Old vs. New Threaded Timings

ATLAS 3.9.5 featured a complete rewrite of ATLAS's threading system, in order to apply some of the techniques we discuss in the IPDPS2009 paper (Castaldo & Whaley, "Minimizing Startup Costs for Performance-Critical Threading").

The timings are given as a percentage of the old code's speed (i.e. 100% means the same speed as the old threading provided). First, here is the performance improvement for LU, QR and Cholesky factorizations on an 8-processor, 2.5GHz Core2 server running Linux (ATLAS 3.9.5):

So, we see the new threading probably needs a little work on crossover points, as we lose on LU for small problems to the old code. However, at best LU is twice as fast, and asymptotically we see a >60% improvement!

Also new in this release, ATLAS can use native Windows threads in addition to pthreads. Here is the speedup we got on a 2.4GHz Core2Quad running WinXP using the new native-threads port (with the above threading techniques):

Again, we see performance loss for small problems, and dominating wins for large problems. The Y axis is now just speedup, so at best, the new code ran roughly 2.6 times faster than ATLAS's original threaded LU on a quad-core Windows box. The above timings were on ATLAS 3.9.5.
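The two Y-axis conventions used above differ only by a factor of 100. As a minimal sketch, assuming both versions are timed on the same problem and compared by wall time (the function names and the sample times below are hypothetical, not taken from ATLAS or from these runs), the two metrics could be computed like this:

```python
def speedup(old_time, new_time):
    """Plain speedup: old wall time divided by new wall time.

    1.0 means the new code runs at the same speed as the old code;
    values below 1.0 mean the new code is slower.
    """
    if old_time <= 0 or new_time <= 0:
        raise ValueError("times must be positive")
    return old_time / new_time


def percent_of_old(old_time, new_time):
    """Same ratio expressed as a percentage: 100.0 means same speed."""
    return 100.0 * speedup(old_time, new_time)


# Hypothetical wall times in seconds, chosen only for illustration.
old_t, new_t = 2.0, 1.0
print(speedup(old_t, new_t))         # 2.0
print(percent_of_old(old_t, new_t))  # 200.0
```

On this scale, a point below 100% (or below 1.0 on the speedup axis) marks a problem size where the old threading is still faster.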

Finally, here are some timings from ATLAS 3.9.9 on a 6-processor 700MHz (MIPS) SiCortex SC072 node:

% Improvement from new threads on a SiCortex 6-processor 700MHz MIPS node