For instance, on my AMD Dozer system,
there are 8 integer cores but only 4 FPUs, so for best performance we
would like to use 4 threads rather than 8, and make sure that no two of
them run on integer cores that share an FPU. A little testing showed that
on my system, core IDs 0, 1, 3, and 6 are all independent of each other,
so I can tell ATLAS to use only these four cores in threaded operations by
adding this flag to configure:
--force-tids="4 0 1 3 6"
On my system, this actually slightly reduces parallel GEMM performance, but
noticeably improves factorization performance.
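
As a rough illustration of how such a core list might be assembled, here is a
minimal Python sketch. It is not part of ATLAS. It assumes a Linux system
whose sysfs files (cpuN/topology/thread_siblings_list) group together the
cores that share hardware, which may or may not reflect FPU sharing on a
given machine, so timing tests like the one described above remain the real
check. It also assumes the flag value is the thread count followed by the
core IDs, as the example above suggests. The function names and the sample
sibling data are hypothetical.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Parse a Linux-style CPU list such as "0-1" or "0,4" into sorted ints."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            ids.update(range(int(lo), int(hi) + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read each core's sibling list from sysfs (Linux only).

    Returns a dict mapping core ID -> raw sibling list string.
    Whether siblings correspond to shared FPUs is an assumption.
    """
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        match = re.search(r"cpu(\d+)", path)
        if match:
            with open(path) as handle:
                result[int(match.group(1))] = handle.read().strip()
    return result


def pick_independent_cores(sibling_lists):
    """Keep the lowest core ID from each distinct sibling group."""
    groups = {tuple(parse_cpu_list(s)) for s in sibling_lists.values()}
    return sorted(group[0] for group in groups if group)


def force_tids_flag(core_ids):
    """Format core IDs as a configure flag: count first, then the IDs."""
    ids = " ".join(str(c) for c in core_ids)
    return '--force-tids="{} {}"'.format(len(core_ids), ids)


if __name__ == "__main__":
    # Hypothetical sibling data for illustration; not the system described above.
    sample = {0: "0-1", 1: "0-1", 2: "2-3", 3: "2-3"}
    print(force_tids_flag(pick_independent_cores(sample)))
```

On a real machine, passing the result of read_sibling_lists() to
pick_independent_cores() gives a candidate list, which should then be
confirmed by benchmarking before it is given to configure.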
R. Clint Whaley
2016-07-28