On Tue, 17 Jun 2014, Allin Cottrell wrote:
Thanks to all of you who have run the matrix_perf tests. This will be
helpful in setting gretl's (internal, default) parameters for using the
system BLAS versus OpenMP (where available), versus our own
single-threaded matrix multiplication code.
Sorry I'm late. Here are a few more interesting results, from two machines
running the same operating system (64-bit Debian). One is a low-end dual
core; the other is a modern machine with AVX and two physical processors,
each with 8 hyperthreaded cores.
The results follow, but I believe the moral of the story (also having
seen the results others posted earlier) is quite evident: the "right"
software setup depends heavily on your particular hardware/software
combination.
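For readers who want to relate the Gflops figures below to wall-clock time, here is a minimal sketch in Python. It assumes the conventional operation count for dgemm, 2*m*n*k (one multiply and one add per inner-product step); the output does not say which count matrix_perf uses, so treat that as an assumption. The function names are hypothetical, chosen only for illustration.

```python
def dgemm_flops(m, n, k):
    """Operation count for an (m x k) times (k x n) product.

    Assumption: the conventional 2*m*n*k count, i.e. one multiply and
    one add for each of the m*n*k inner-product steps.
    """
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Speed in Gflops for one product that took `seconds` to compute."""
    return dgemm_flops(m, n, k) / seconds / 1e9


def seconds_per_product(m, n, k, speed_gflops):
    """Invert the above: time implied by a reported Gflops figure."""
    return dgemm_flops(m, n, k) / (speed_gflops * 1e9)


if __name__ == "__main__":
    # 18.831 Gflops is the netlib figure reported for Machine #1 at
    # m = n = k = 1024 (experiment 1, variant 3).
    t = seconds_per_product(1024, 1024, 1024, 18.831)
    print(f"implied time per product: {t:.4f} s")
```

Under that assumption, a 1024 x 1024 x 1024 product at 18.831 Gflops takes roughly 0.114 seconds.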
Machine #1:
? matrix_perf(1234)
dgemm experiment 1, variant 1, speed in Gflops
m n k vanilla openmp netlib
128 128 128 1.1414 3.8052 4.4472
128 128 256 1.6901 3.9216 9.6005
128 128 512 1.7049 4.0992 12.939
128 128 1024 1.7066 4.0703 14.629
128 128 2048 1.6609 3.0855 14.598
result: netlib dominates
dgemm experiment 1, variant 2, speed in Gflops
m n k vanilla openmp netlib
128 128 128 1.6039 2.8291 12.559
256 256 128 1.6551 3.7319 13.157
512 512 128 1.5689 3.1081 12.157
1024 1024 128 1.7065 3.2241 13.810
2048 2048 128 1.4343 3.1975 12.901
result: netlib dominates
dgemm experiment 1, variant 3, speed in Gflops
m n k vanilla openmp netlib
128 128 128 1.6506 3.6948 9.4904
256 256 256 1.6570 3.6680 12.913
512 512 512 1.4917 3.4395 16.028
1024 1024 1024 0.70373 1.4884 18.831
2048 2048 2048 0.78776 1.5937 17.032
result: netlib dominates
dgemm experiment 2, variant 1, speed in Gflops
m n k vanilla openmp netlib
8 8 8 0.46703 0.37920 0.32284
16 8 8 0.63081 0.60029 0.54729
32 8 8 0.73601 0.89405 0.92477
64 8 8 0.90807 1.1178 1.3149
128 8 8 0.99528 1.4653 1.6911
256 8 8 1.0726 1.5460 1.5603
512 8 8 1.0186 1.8071 2.1102
1024 8 8 1.1191 1.6092 2.1806
2048 8 8 1.0810 1.6515 2.2472
4096 8 8 1.1118 1.6283 2.2618
result: netlib dominates for mnk >= 2048
vanilla dominates for mnk < 2048
dgemm experiment 2, variant 2, speed in Gflops
m n k vanilla openmp netlib
10 2 1000 1.4020 1.0541 2.9305
20 2 1000 1.2824 1.3566 3.2125
40 2 1000 1.4981 1.9113 2.3364
80 2 1000 1.6409 3.3097 3.0373
160 2 1000 1.6510 3.4486 2.8229
320 2 1000 1.2217 2.5190 2.2135
640 2 1000 0.80308 1.6429 1.9382
1280 2 1000 0.78715 1.5664 1.8491
2560 2 1000 0.71995 1.5641 1.8383
5120 2 1000 0.80175 1.5103 1.7516
result: netlib dominates for mnk >= 1280000
dgemm experiment 2, variant 3, speed in Gflops
m n k vanilla openmp netlib
10 10 1000 1.3822 2.9129 7.9036
20 10 1000 1.2999 3.3758 9.7947
40 10 1000 1.5159 3.2116 10.159
80 10 1000 1.6433 3.8072 11.100
160 10 1000 1.4395 4.1740 11.270
320 10 1000 1.1835 2.6255 9.5855
result: netlib dominates
Operating system: Linux (64-bit)
BLAS library: Netlib
Number of processors: 2
OpenMP enabled: yes
Performance summary:
vanilla -
dominates outright in 0 out of 6 tests
dominates in 1 test(s) for mnk < 2048
openmp -
dominates outright in 0 out of 6 tests
netlib -
dominates outright in 4 out of 6 tests
dominates in 2 test(s) for mnk >= (2048, 1280000)
Machine #2:
? matrix_perf(1234)
dgemm experiment 1, variant 1, speed in Gflops
m n k vanilla openmp netlib
128 128 128 0.90944 1.6489 3.6727
128 128 256 1.0361 12.066 4.0342
128 128 512 1.0363 13.360 3.2194
128 128 1024 2.1998 14.647 4.3699
128 128 2048 2.2040 15.157 4.4580
result: openmp dominates for mnk >= 4194304
netlib dominates for mnk < 4194304
dgemm experiment 1, variant 2, speed in Gflops
m n k vanilla openmp netlib
128 128 128 0.99176 7.2687 3.3558
256 256 128 1.0369 10.092 4.5736
512 512 128 1.0251 10.393 5.5707
1024 1024 128 1.0528 12.044 5.8882
2048 2048 128 1.5099 12.170 5.7919
result: openmp dominates
dgemm experiment 1, variant 3, speed in Gflops
m n k vanilla openmp netlib
128 128 128 1.0068 8.0279 3.1235
256 256 256 1.0678 12.979 5.2844
512 512 512 1.2698 14.865 6.8351
1024 1024 1024 2.3056 15.893 7.5533
2048 2048 2048 1.8804 19.930 10.979
result: openmp dominates
dgemm experiment 2, variant 1, speed in Gflops
m n k vanilla openmp netlib
8 8 8 0.31635 0.060822 0.40654
16 8 8 0.37316 0.13059 0.64134
32 8 8 0.48452 0.25061 0.89951
64 8 8 1.0088 0.35585 1.1348
128 8 8 1.2879 0.56582 1.3032
256 8 8 1.3405 0.83658 1.3995
512 8 8 1.3665 1.0493 0.61674
1024 8 8 1.3656 1.2329 0.87132
2048 8 8 1.3579 1.3256 1.0189
4096 8 8 1.3372 1.3590 0.59412
result: openmp dominates for mnk >= 262144
dgemm experiment 2, variant 2, speed in Gflops
m n k vanilla openmp netlib
10 2 1000 0.93204 0.27760 0.39253
20 2 1000 1.8316 0.51884 0.71238
40 2 1000 1.6916 0.60586 0.85907
80 2 1000 1.9516 1.0208 1.2188
160 2 1000 2.1520 1.1731 1.4970
320 2 1000 2.2201 1.4638 1.6964
640 2 1000 2.3148 1.8720 1.9165
1280 2 1000 2.2985 1.9854 2.3177
2560 2 1000 2.1321 2.4638 2.2796
5120 2 1000 1.8681 1.6833 1.5077
result: vanilla dominates for mnk >= 10240000
dgemm experiment 2, variant 3, speed in Gflops
m n k vanilla openmp netlib
10 10 1000 0.75487 0.64920 2.2521
20 10 1000 1.4513 0.99118 3.2809
40 10 1000 1.7493 1.5317 4.3326
80 10 1000 2.0249 2.1957 4.8481
160 10 1000 2.1919 3.3981 5.6694
320 10 1000 2.2979 5.4260 6.2356
result: netlib dominates
Operating system: Linux (64-bit)
BLAS library: Netlib
Number of processors: 32
OpenMP enabled: yes
Performance summary:
vanilla -
dominates outright in 0 out of 6 tests
dominates in 1 test(s) for mnk >= 10240000
openmp -
dominates outright in 2 out of 6 tests
dominates in 2 test(s) for mnk >= (4194304, 262144)
netlib -
dominates outright in 1 out of 6 tests
dominates in 1 test(s) for mnk < 4194304
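The "dominates for mnk >= ..." lines can be read as size thresholds for picking a multiplication method. The following Python sketch shows one way such a threshold could be derived from a table and then used to choose a method. It is not gretl's code: the rule used here (the method that is fastest on the last row, traced back to the smallest size from which it wins every row) is an assumption that happens to reproduce the result reported for Machine #1, experiment 2, variant 1. The function names are hypothetical.

```python
# Rows copied from Machine #1, dgemm experiment 2, variant 1:
# (m, n, k, vanilla, openmp, netlib), speeds in Gflops.
ROWS = [
    (8, 8, 8, 0.46703, 0.37920, 0.32284),
    (16, 8, 8, 0.63081, 0.60029, 0.54729),
    (32, 8, 8, 0.73601, 0.89405, 0.92477),
    (64, 8, 8, 0.90807, 1.1178, 1.3149),
    (128, 8, 8, 0.99528, 1.4653, 1.6911),
    (256, 8, 8, 1.0726, 1.5460, 1.5603),
    (512, 8, 8, 1.0186, 1.8071, 2.1102),
    (1024, 8, 8, 1.1191, 1.6092, 2.1806),
    (2048, 8, 8, 1.0810, 1.6515, 2.2472),
    (4096, 8, 8, 1.1118, 1.6283, 2.2618),
]
NAMES = ("vanilla", "openmp", "netlib")


def row_winner(row):
    """Name of the fastest method on one table row."""
    speeds = row[3:]
    return NAMES[speeds.index(max(speeds))]


def trailing_winner(rows):
    """Return (name, mnk): the method that wins the last row, and the
    smallest m*n*k from which it wins every row up to the end.

    Assumes the rows are sorted by increasing m*n*k.
    """
    winners = [row_winner(r) for r in rows]
    last = winners[-1]
    i = len(rows) - 1
    while i > 0 and winners[i - 1] == last:
        i -= 1
    m, n, k = rows[i][:3]
    return last, m * n * k


def choose_method(m, n, k, threshold, large, small):
    """Pick `large` at or above the mnk threshold, `small` below it."""
    return large if m * n * k >= threshold else small


if __name__ == "__main__":
    name, mnk = trailing_winner(ROWS)
    print(name, mnk)
    print(choose_method(16, 8, 8, mnk, name, "vanilla"))
```

A threshold derived this way is only valid for the machine that produced the table: on Machine #2 the same experiment gives a different method and a different cut-off.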
-------------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Scienze Economiche e Sociali (DiSES)
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------