Handling SMT on IBM POWER

Similarly, an IBM POWER7 I have access to has 8 physical cores but exposes 64 SMT units (8 hardware threads per core). If you install with the default flags, the parallel speedup for moderate-sized DGEMMs is around 4.75. On the other hand, if you add the following to configure:
    --force-tids="8 0 8 16 24 32 40 48 56"
then the parallel DGEMM speedup for moderate-sized problems is more like 6.5.

I also have access to a POWER8 machine with four physical cores, each again shared 8-way, so configure needs the additional flag:

    --force-tids="4 0 8 16 24"
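
Both flags above follow the same pattern: a count, followed by that many thread IDs, one per physical core, spaced by the SMT width. The Python sketch below generates such a string. It is only an illustration: the helper name force_tids_flag is hypothetical (it is not part of ATLAS), and it assumes that the logical CPU IDs of each core are numbered contiguously, so that core k owns IDs k*8 through k*8+7 on an 8-way machine. Check your system's actual CPU numbering before relying on it.

```python
def force_tids_flag(physical_cores: int, smt_ways: int) -> str:
    """Build a --force-tids argument that selects one logical CPU per core.

    Hypothetical helper, not part of ATLAS. Assumes the SMT threads of a
    core have contiguous IDs: core k owns k*smt_ways .. k*smt_ways+smt_ways-1.
    """
    if physical_cores < 1 or smt_ways < 1:
        raise ValueError("physical_cores and smt_ways must be positive")

    # First logical CPU of every physical core: 0, smt_ways, 2*smt_ways, ...
    ids = [core * smt_ways for core in range(physical_cores)]

    # The flag value is the number of IDs followed by the IDs themselves.
    fields = [len(ids)] + ids
    return '--force-tids="' + " ".join(str(f) for f in fields) + '"'


if __name__ == "__main__":
    print(force_tids_flag(8, 8))  # POWER7 case: 8 cores, 8-way SMT
    print(force_tids_flag(4, 8))  # POWER8 case: 4 cores, 8-way SMT
```

Under those assumptions, force_tids_flag(8, 8) reproduces the POWER7 flag and force_tids_flag(4, 8) the POWER8 flag shown above.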



R. Clint Whaley 2016-07-28