On Fri, 24 Jan 2020, Sven Schreiber wrote:
> I was finally able to get a speed advantage for 4 threads when I
> increased the problem size in the script to T=1000, N=40. So of course
> you're right that single-thread is not the universal solution. But it
> does seem that openblas tries multithreading much too aggressively.

There's a compile-time option for openblas that is apparently more
functional in 0.3.7 than it used to be. Quoting Makefile.rule:
"If any gemm argument m, n or k is less or equal to
[GEMM_MULTITHREAD_THRESHOLD], gemm will be execute with single thread.
(Actually in recent versions this is a factor proportional to the
number of floating point operations necessary for the given problem
size, no longer an individual dimension). You can use this setting to
avoid the overhead of multi-threading in small matrix sizes. The
default value is 4, but values as high as 50 have been reported to be
optimal for certain workloads (50 is the recommended value for
Julia)."

It might be worth trying the Julia value of 50. But since this is a
compile-time setting, it applies only to our builds for Windows and to
people who build the openblas library themselves (e.g. on Linux). I
haven't experimented yet, but I plan to.
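
To make the two rules in the Makefile.rule comment concrete, here is a
rough Python sketch. It is not taken from the openblas source: the
function names and the SCALE constant are hypothetical, chosen only to
show how a per-dimension test differs from a test on the total amount
of work (taken here as m*n*k).

```python
# Illustrative sketch only. SCALE is a hypothetical constant that turns
# the threshold into an operation count; it is an assumption made for
# this example, not a value confirmed from the openblas source.
SCALE = 65536


def single_thread_by_dimension(m, n, k, threshold=4):
    """Older rule as quoted: single thread if any of m, n, k is at or
    below the threshold."""
    return min(m, n, k) <= threshold


def single_thread_by_work(m, n, k, threshold=4, scale=SCALE):
    """Newer rule as described: single thread if the work, taken here
    as m*n*k, is at or below threshold * scale."""
    return m * n * k <= threshold * scale


if __name__ == "__main__":
    # Compare the default threshold of 4 with the Julia value of 50
    # for a (1000 x 40) times (40 x 40) product.
    for thr in (4, 50):
        print(thr,
              single_thread_by_dimension(1000, 40, 40, thr),
              single_thread_by_work(1000, 40, 40, thr))
```

Under the dimension rule, raising the threshold from 4 to 50 keeps any
product with a dimension of 50 or less on one thread. Under the work
rule the outcome depends on the assumed SCALE, so the printed values
only illustrate the mechanism and say nothing about real timings.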
Allin