OK, this is generally what we would like to see. All precisions top out at around 88%. Double precision real and complex are within clock resolution of each other, as we would hope. Single precision complex is not quite as good as single precision real: this is due to the extra shuffling required at the end of the K-loop. I think it could be made to run a little faster, but frankly I got tired of messing with it. With the GEMMs running at the right speed, how about LU:
Looks good. What if you've got symmetric matrices:
Well, for large problems complex appears to be a little slower than real. I have a feeling I may have messed up CacheEdge for complex SYRK, so that it is not using the cache as effectively. I need to investigate this. At any rate, the gap is not too large. Last up is Cholesky: