On 12.10.2019 22:01, Allin Cottrell wrote:
On Sat, 12 Oct 2019, Marcin Błażejowski wrote:
I'm attaching a modified version of your script which may clarify
things. The relative execution times of your variants are mostly a
function of how much excess indexation arithmetic you're doing. Do as
little arithmetic as possible in the inner loop in particular. Your
first variant does 160000 additions/subtractions where 2 will do just
Ok. And that is something I would expect.
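To make the point about inner-loop arithmetic concrete, here is a minimal Python sketch. It is not the hansl script under discussion, which is not reproduced in this message. The function names, the 1-based offsets r0 and c0, and the block size n are all assumptions made for illustration. The counter tracks only the index additions and subtractions written explicitly in the code.

```python
def copy_naive(src, dst, r0, c0, n):
    """Copy an n x n block, recomputing the shifted indices per element."""
    ops = 0
    for i in range(n):
        for j in range(n):
            # Two additions and two subtractions for every element copied.
            dst[i][j] = src[i + r0 - 1][j + c0 - 1]
            ops += 4
    return ops


def copy_hoisted(src, dst, r0, c0, n):
    """Same copy, with the offset arithmetic done once, outside the loops."""
    dr = r0 - 1  # one subtraction
    dc = c0 - 1  # one subtraction
    ops = 2
    # zip() stops after len(dst) == n rows and after n columns.
    for dst_row, src_row in zip(dst, src[dr:]):
        for j, value in zip(range(n), src_row[dc:]):
            dst_row[j] = value
    return ops


n = 200  # assumed size, chosen only for illustration
src = [[float(i * n + j) for j in range(n)] for i in range(n)]
dst_a = [[0.0] * n for _ in range(n)]
dst_b = [[0.0] * n for _ in range(n)]

print(copy_naive(src, dst_a, 1, 1, n), copy_hoisted(src, dst_b, 1, 1, n))
# prints: 160000 2
```

Both functions produce the same result; only the amount of explicit index arithmetic differs (4 * n * n versus 2).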
That said, copying element-by-element by row -- as in all your
variants -- is very inefficient for two reasons.
First, gretl matrices are in column-major order: column elements are
adjacent in memory, row elements are separated by the number of rows
in the matrix. So go by columns whenever possible.
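The layout described above can be sketched in a few lines of Python. The helper name col_major_offset and the 3 x 4 dimensions are assumptions for illustration only; this shows the general column-major rule, not libgretl's actual code.

```python
def col_major_offset(i, j, rows):
    """Flat offset of element (i, j), 0-based, in column-major storage."""
    return j * rows + i


rows, cols = 3, 4  # assumed dimensions for the example

# Distance in memory between vertical neighbours (same column).
col_step = col_major_offset(1, 0, rows) - col_major_offset(0, 0, rows)
# Distance in memory between horizontal neighbours (same row).
row_step = col_major_offset(0, 1, rows) - col_major_offset(0, 0, rows)

print(col_step, row_step)  # prints: 1 3
```

Moving down a column advances one element at a time, while moving along a row jumps by the number of rows.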
OK. Does that mean that if I have a matrix which I expand by adding
new "tuples", I should append these tuples as new columns instead of
new rows?
Second, one should use ranges rather than single-element indices
whenever possible. If the data in the given range are contiguous in
memory, libgretl will use the C library's memcpy() to copy a chunk of
data in one call.
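As a rough analogy, the difference between single-element indices and contiguous ranges can be sketched in Python on a flat column-major buffer. The function names and dimensions are assumptions; slice assignment on array.array stands in here for the single memcpy() call and is not libgretl's implementation.

```python
from array import array


def copy_by_elements(src, dst, rows, cols):
    """Copy one element at a time, walking along rows (strided access)."""
    for i in range(rows):
        for j in range(cols):
            k = j * rows + i  # column-major offset of element (i, j)
            dst[k] = src[k]


def copy_by_columns(src, dst, rows, cols):
    """Copy each column as one contiguous range."""
    for j in range(cols):
        start = j * rows
        stop = start + rows
        dst[start:stop] = src[start:stop]  # one chunk per column


rows, cols = 3, 4  # assumed dimensions for the example
src = array("d", range(rows * cols))
a = array("d", [0.0] * (rows * cols))
b = array("d", [0.0] * (rows * cols))

copy_by_elements(src, a, rows, cols)
copy_by_columns(src, b, rows, cols)
print(a == b == src)  # prints: True
```

The results are identical; the second version simply issues one range copy per column instead of one assignment per element.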
I know. I just wanted to replicate an example I mentioned at the
beginning of my email, because people discovered that gcc was able to
'vectorize' only "a[i,j] = b[i,j]" copying (i.e. load chunks of the
matrices into YMM/XMM registers and apply AVX instructions to them).
Here are my timings for the 6 variants in my version of your script
(on i7, Arch Linux):
loop 1: 5.3523 (add/sub = 160000)
loop 2: 4.6882 (add/sub = 80000)
loop 3: 4.9248 (add/sub = 80000)
loop 4: 4.5084 (add/sub = 40001)
loop 5: 3.3973 (add/sub = 2)
loop 6: 0.0109 (add/sub = 2; column chunks)
Note the huge speed-up when copying columns as chunks.
My results (Inspiron laptop with Debian, i7-8550U CPU @ 1.80GHz Skylake):
loop 1: 11,1638 (add/sub = 160000)
loop 2: 9,6939 (add/sub = 80000)
loop 3: 9,8530 (add/sub = 80000)
loop 4: 9,5306 (add/sub = 40001)
loop 4a: 8,7002 (add/sub = 2001)
loop 5: 8,5831 (add/sub = 2)
loop 6: 0,0163 (add/sub = 2; column chunks)