--force-tids="8 0 8 16 24 32 40 48 56"
Then the parallel DGEMM speedup for moderate-sized problems is more like 6.5.
I also have access to a POWER8 machine with four physical cores, each again shared 8-way, so configure needs the following flag:
--force-tids="4 0 8 16 24"
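
Both flags above follow the same pattern: a count of thread IDs, followed by one ID per physical core, spaced by the sharing factor. As a rough sketch, the string could be generated as below. This assumes the logical CPU IDs of one core's hardware threads are numbered consecutively (so the first thread of core k is k times the sharing factor), which is what the two examples suggest but may not hold on every system. The helper name force_tids_flag is hypothetical and not part of ATLAS.

```python
def force_tids_flag(physical_cores: int, smt_ways: int) -> str:
    """Build a --force-tids flag selecting one hardware thread per core.

    Assumption (not confirmed for every machine): the hardware threads of
    core k occupy logical IDs k*smt_ways .. k*smt_ways + smt_ways - 1.
    """
    # First logical ID of each physical core: 0, smt_ways, 2*smt_ways, ...
    ids = [core * smt_ways for core in range(physical_cores)]
    # The flag's value is the number of IDs followed by the IDs themselves.
    fields = [str(len(ids))] + [str(i) for i in ids]
    return '--force-tids="' + " ".join(fields) + '"'


if __name__ == "__main__":
    print(force_tids_flag(8, 8))  # eight cores, shared 8-way
    print(force_tids_flag(4, 8))  # four cores, shared 8-way
```

Check the actual core-to-ID mapping on your machine before relying on the generated list, since the numbering scheme differs between systems.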