Handling modules on AMD systems

For instance, my AMD Dozer system has 8 integer cores but only 4 FPUs, because each pair of integer cores shares one FPU. For best performance we would therefore like to use 4 threads rather than 8, and be sure never to use two integer cores that share the same FPU. A little testing showed that on my system, core IDs 0, 1, 3, and 6 are all independent of each other, so I can tell ATLAS to use only these four cores in threaded operations by adding this flag to configure (the first number is the thread count, followed by the core IDs):
    --force-tids="4 0 1 3 6"
On my system, this actually reduces parallel GEMM performance slightly, but noticeably improves factorization performance.
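
The core IDs above were found by timing tests. As a rough starting point on Linux, the sketch below guesses a set of independent cores from the kernel's sysfs topology files and prints the matching flag. It rests on an assumption that may not hold on every kernel or CPU: that cores sharing an FPU are reported together in thread_siblings_list. The helper names (parse_cpu_list, pick_independent_cores, force_tids_flag, read_sysfs_siblings) are hypothetical and are not part of ATLAS. Treat the output as a candidate to confirm by benchmarking, not as a replacement for testing.

```python
import glob
import re


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0-1' or '0,4' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pick_independent_cores(siblings_by_cpu):
    """Keep the lowest-numbered core of each distinct sibling group.

    ASSUMPTION: cores listed as siblings share an FPU.
    """
    groups = {tuple(parse_cpu_list(s)) for s in siblings_by_cpu.values()}
    return sorted(g[0] for g in groups if g)


def force_tids_flag(core_ids):
    """Format the configure flag: thread count first, then the core IDs."""
    ids = " ".join(str(c) for c in core_ids)
    return f'--force-tids="{len(core_ids)} {ids}"'


def read_sysfs_siblings():
    """Read each CPU's sibling list from Linux sysfs (empty dict if absent)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    result = {}
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


if __name__ == "__main__":
    siblings = read_sysfs_siblings()
    if siblings:
        print(force_tids_flag(pick_independent_cores(siblings)))
    else:
        print("No sysfs topology information found; fall back to timing tests.")
```

If the kernel's grouping does not reflect FPU sharing on a given machine, the core list can be passed to force_tids_flag by hand once testing has identified it.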



R. Clint Whaley 2016-07-28