ATLAS 3.5.7 SYRK Timings
So, we've known for a while now that ATLAS's SYRK/HERK perform sub-optimally.
SYRK should run at roughly GEMM speed, but it has instead been one of our
slowest BLAS. This is bad news, because SYRK is the asymptotic limit to
Cholesky performance, just as GEMM is for LU. So, even though Cholesky
should run at least as fast as LU, ATLAS's Cholesky has always
significantly underperformed.
The solution is to write SYRK in terms of the GEMM kernel, rather than
the full GEMM. In 3.5.7 this has been done only for the real precisions,
and here are the results.
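To make the idea concrete, here is a minimal sketch of a SYRK ('Lower', 'NoTrans') built directly on a block kernel. It is not ATLAS's code: the names gemm_kernel_nt and syrk_lower_notrans, the block size nb, and the list-of-lists storage are all assumptions made for illustration. The point it tries to show is that only blocks on or below the diagonal are ever passed to the kernel, so roughly half of the work a full GEMM would do is skipped.

```python
def gemm_kernel_nt(A, i0, j0, nb_i, nb_j, K):
    """Hypothetical block kernel: return A[i0:i0+nb_i, :] * A[j0:j0+nb_j, :]^T."""
    return [[sum(A[i0 + i][k] * A[j0 + j][k] for k in range(K))
             for j in range(nb_j)]
            for i in range(nb_i)]


def syrk_lower_notrans(A, C, alpha=1.0, beta=1.0, nb=2):
    """C := alpha*A*A^T + beta*C, updating only the lower triangle of C in place."""
    N = len(A)
    K = len(A[0]) if N else 0
    for i0 in range(0, N, nb):
        nb_i = min(nb, N - i0)
        # Visit only the blocks on or below the diagonal.
        for j0 in range(0, i0 + 1, nb):
            nb_j = min(nb, N - j0)
            blk = gemm_kernel_nt(A, i0, j0, nb_i, nb_j, K)
            for i in range(nb_i):
                # In a diagonal block, write only entries with column <= row.
                jmax = i + 1 if i0 == j0 else nb_j
                for j in range(jmax):
                    C[i0 + i][j0 + j] = (alpha * blk[i][j]
                                         + beta * C[i0 + i][j0 + j])
    return C
```

The strict upper triangle of C is never read or written, which is what the BLAS SYRK interface expects for the 'Lower' case.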
The first graph compares all DSYRK versions vs DGEMM on a 600 MHz Athlon
classic.
DGEMM & DSYRK: ATLAS 3.5.7 vs. 3.5.6 on a 600 MHz Athlon
The good news is that SYRK and GEMM speed are now essentially the same, and
we see that 3.5.7's SYRK is clearly superior to 3.5.6's. However, the best
performing case is 'Upper','NoTrans', which is not used by Cholesky. Since
all the cases are relatively close, this is not that big a deal.
Now that our Cholesky performance kernel, DSYRK, is rolling along, what
about Cholesky itself?
Factorization results: ATLAS 3.5.7 vs. 3.5.6 on a 600 MHz Athlon
Again, things are more like what we hoped. Cholesky is still slightly slower
than LU for small problems (I believe our LU low-order terms have been better
optimized), but catches up around N=600. After this size, Cholesky
runs around the same speed as LU, and possibly a little faster.
I include SYRK 'Upper' 'Trans' on this chart, because this is essentially the
upper limit on Cholesky performance. Just as with LU, we see that the low-order
terms are keeping us from reaching our Cholesky peak.
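To see where SYRK 'Upper' 'Trans' sits inside an Upper Cholesky, here is a minimal sketch of a textbook left-looking blocked factorization A = U^T U. This is an assumed illustration, not ATLAS's actual code: the function name cholesky_upper_blocked, the block size nb, and the list-of-lists storage are all hypothetical. The steps are labelled with the BLAS operation they correspond to; the small unblocked factorization of each diagonal block is the kind of low-order work that keeps the whole thing below the SYRK rate.

```python
import math


def cholesky_upper_blocked(A, nb=2):
    """Overwrite the upper triangle of symmetric positive definite A with U,
    where A = U^T U. The strict lower triangle is left untouched."""
    n = len(A)
    for j0 in range(0, n, nb):
        j1 = min(j0 + nb, n)
        # SYRK 'Upper','Trans': A[j0:j1, j0:j1] -= U[0:j0, j0:j1]^T U[0:j0, j0:j1]
        for i in range(j0, j1):
            for j in range(i, j1):
                A[i][j] -= sum(A[k][i] * A[k][j] for k in range(j0))
        # Unblocked Cholesky of the diagonal block (low-order work).
        for i in range(j0, j1):
            for k in range(j0, i):
                for j in range(i, j1):
                    A[i][j] -= A[k][i] * A[k][j]
            A[i][i] = math.sqrt(A[i][i])
            for j in range(i + 1, j1):
                A[i][j] /= A[i][i]
        # GEMM: A[j0:j1, j1:n] -= U[0:j0, j0:j1]^T U[0:j0, j1:n]
        for i in range(j0, j1):
            for j in range(j1, n):
                A[i][j] -= sum(A[k][i] * A[k][j] for k in range(j0))
        # TRSM: solve U[j0:j1, j0:j1]^T X = A[j0:j1, j1:n] by forward substitution.
        for i in range(j0, j1):
            for j in range(j1, n):
                A[i][j] -= sum(A[k][i] * A[k][j] for k in range(j0, i))
                A[i][j] /= A[i][i]
    return A
```

As N grows relative to nb, the SYRK and GEMM steps account for nearly all of the floating point work, which is why the SYRK curve acts as a ceiling for the factorization.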
OK, our final graph is a little busy, but it compares the differing SYRK
implementations of ATLAS 3.5.7 and 3.5.6, and the resulting Cholesky
performance, on a 1.6 GHz Opteron.
ATLAS 3.5.7 vs. 3.5.6 on 1 Processor of 1.6 GHz Opteron
There are several interesting things about this chart. We see that
for the small to medium-sized problems, 3.5.7's gemm-kernel-based SYRK
kicks butt, but that the recursive gemm-based SYRK used in 3.5.6
catches up asymptotically. 3.5.7's SYRK does not run as fast as GEMM,
though actually some cases are closer than this. The case I timed was
the one used in Lower Cholesky. If you take the 'Lower' 'Trans'
setting instead, the gap between SYRK and GEMM is roughly halved.
3.5.6's Cholesky factorizations never catch up with 3.5.7's, however.
It is almost impossible to see on this busy chart, but 3.5.7's Cholesky
roughly catches up with LU around N=1000, and they then stay within
clock resolution of each other.