OK, the keen of eye may have noticed how great the N = 1800 performance is, and dismissed it as a timing artifact. Not so, however. This performance bump is due to the fact that 1800 is a multiple of the gemm kernel's blocking factor, 72. Unlike on the Opteron, I have not written all the cleanup cases (instead, I rely on Camm & Peter's earlier work). To see the effect of the cleanup not running as efficiently as the primary kernel, I reproduce the above timings for 3.5.6, but include problem sizes that are multiples of $72$:
The reason I have not written the cleanup cases is that I do not believe this kernel is done. The present DGEMM is coming in at around 76% of peak, and I'd really like to push that up to around 80% before calling it good enough. So, for now, we are staying with this bumpy ride.
3.5.6 actually uses the same SGEMM kernel as 3.4.1, so the performance difference you see for single precision comes only from 3.5.6's CacheEdge not being as good.
I no longer have access to MKL to compare against. Last time I did was here.