On Sat, 12 Oct 2019, Marcin Błażejowski wrote:
Some time ago I found an interesting discussion on code optimisation in
gcc with the '-avx2' flag in the case of copying blocks of matrices. So, I
wrote a simple script and I got the following results [...]
I'm attaching a modified version of your script which may clarify
things. The relative execution times of your variants are mostly a
function of how much excess indexation arithmetic you're doing. Do as
little arithmetic as possible in the inner loop in particular. Your
first variant does 160000 additions/subtractions where 2 will do just
fine.
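To illustrate the point about index arithmetic, here is a minimal C sketch (not taken from the script or from libgretl; the function names and the block-copy setup are hypothetical). It copies an n x n block out of a larger column-major matrix in two ways and returns the number of explicit offset additions each one performs. Loop-counter increments and the final address calculation are deliberately left out of the count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: copy the n x n block of src whose top-left corner
   is (r0, c0) into dst. Both matrices are column-major; src has src_rows
   rows, dst has n rows. Returns the number of offset additions done. */
long copy_block_naive(const double *src, size_t src_rows,
                      size_t r0, size_t c0, double *dst, size_t n)
{
    long ops = 0;

    assert(src != NULL && dst != NULL);
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            size_t r = r0 + i;  /* recomputed for every element */
            size_t c = c0 + j;  /* recomputed for every element */
            ops += 2;
            dst[j * n + i] = src[c * src_rows + r];
        }
    }
    return ops;
}

/* Same copy, but the offsets are added once, before the loops start. */
long copy_block_hoisted(const double *src, size_t src_rows,
                        size_t r0, size_t c0, double *dst, size_t n)
{
    const size_t r1 = r0 + n;  /* one addition */
    const size_t c1 = c0 + n;  /* one addition */
    long ops = 2;

    assert(src != NULL && dst != NULL);
    for (size_t c = c0, j = 0; c < c1; c++, j++) {
        for (size_t r = r0, i = 0; r < r1; r++, i++) {
            dst[j * n + i] = src[c * src_rows + r];
        }
    }
    return ops;
}
```

Under this way of counting, a 200 x 200 block costs 80000 offset additions in the first function and 2 in the second, while both produce the same copy.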
That said, copying element-by-element by row -- as in all your
variants -- is very inefficient for two reasons.
First, gretl matrices are in column-major order: column elements are
adjacent in memory, row elements are separated by the number of rows
in the matrix. So go by columns whenever possible.
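A small C sketch of what column-major storage means for access patterns (an illustration only; the helper names are made up and this is not libgretl's own code). With zero-based indices, element (i, j) of a matrix with `rows` rows sits at offset j * rows + i, so walking down a column moves one element at a time, while walking along a row jumps `rows` elements at each step.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j), zero-based, in a column-major matrix. */
size_t cm_offset(size_t rows, size_t i, size_t j)
{
    assert(i < rows);
    return j * rows + i;
}

/* Sum column j: consecutive reads are adjacent in memory (stride 1). */
double sum_column(const double *m, size_t rows, size_t j)
{
    double s = 0.0;

    for (size_t i = 0; i < rows; i++) {
        s += m[cm_offset(rows, i, j)];
    }
    return s;
}

/* Sum row i: consecutive reads are `rows` elements apart. */
double sum_row(const double *m, size_t rows, size_t cols, size_t i)
{
    double s = 0.0;

    for (size_t j = 0; j < cols; j++) {
        s += m[cm_offset(rows, i, j)];
    }
    return s;
}
```

For a 3 x 2 matrix stored as {1, 2, 3, 4, 5, 6}, column 0 holds 1, 2, 3 and row 0 holds 1, 4.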
Second, one should use ranges rather than single-element indices
whenever possible. If the data in the given range are contiguous in
memory, libgretl will use the C library's memcpy() to copy a chunk of
data in one call.
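As a rough C sketch of the difference (hypothetical helper names, assuming a plain column-major array of doubles; this is not the actual libgretl implementation), compare copying a column one element at a time with copying the same contiguous column in a single memcpy() call:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy column j one element at a time: one assignment per element. */
void copy_column_elementwise(double *dst, const double *src,
                             size_t rows, size_t j)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < rows; i++) {
        dst[j * rows + i] = src[j * rows + i];
    }
}

/* Copy column j as one chunk. This is valid because the elements of a
   column are contiguous in column-major storage. dst and src must not
   overlap, as memcpy() requires. */
void copy_column_chunk(double *dst, const double *src,
                       size_t rows, size_t j)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst + j * rows, src + j * rows, rows * sizeof *src);
}
```

Both functions leave dst in the same state; the second just hands the whole column to the C library in one call instead of looping over it.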
Here are my timings for the 6 variants in my version of your script
(on i7, Arch Linux):
loop 1: 5.3523 (add/sub = 160000)
loop 2: 4.6882 (add/sub = 80000)
loop 3: 4.9248 (add/sub = 80000)
loop 4: 4.5084 (add/sub = 40001)
loop 5: 3.3973 (add/sub = 2)
loop 6: 0.0109 (add/sub = 2; column chunks)
Note the huge speed-up when copying columns as chunks.
Allin