On Sat, 12 Oct 2019, Marcin Błażejowski wrote:
some time ago I found an interesting disscusion on code optimisiation
in
gcc with '-avx2 ' flag in case of copying blocks of matrixes.
Marcin, here's another observation, which chimes with what I think
may have been your original intent.
As I mentioned in my previous reply, one really wants to copy by
column when possible, and take advantage of libgretl's use of
memcpy() as opposed to copying element-by-element. But... the
question arises: is there such a thing as being "too greedy" in use
of memcpy? Might it help to divide the data to be copied into
smaller blocks? And the answer is Yes, if the matrix is big enough.
(I guess this has to do with the available cache.)
I'm appending an example script below. We have a big matrix (5000 x
500) and we'd like to copy its entire content. We try copying by
chunks of columns, starting at 1 column per chunk and going up to
the full 500 in a single chunk. At first the copy time declines, but
in this example the "too greedy" point arrives when copying 50
columns at a time. And if we try to copy all 500 columns in one go,
that's actually worse than going by individual columns.
Here are my timings:
1 columns per chunk: 2.4016s
2 columns per chunk: 1.1832s
5 columns per chunk: 0.3720s
10 columns per chunk: 0.1427s
25 columns per chunk: 0.0708s
50 columns per chunk: 0.1604s
100 columns per chunk: 0.5870s
125 columns per chunk: 0.8519s
500 columns per chunk: 2.8247s
And here's the script:
<hansl>
set verbose off
clear
scalar ROW = 5000
scalar COL = 500
scalar LOOP = 600
matrix A = mnormal(ROW, COL)
matrix B = zeros(ROW, COL)
matrix chunkcols = {1, 2, 5, 10, 25, 50, 100, 125, 500}
loop k=1..nelem(chunkcols) --quiet
cols = chunkcols[k]
n = COL / cols
set stopwatch
loop LOOP --quiet
loop for (j=1; j<=n; j+=cols) --quiet
B[,j:j+cols-1] = A[,j:j+cols-1]
endloop
endloop
printf "%3d columns per chunk: %.4fs\n", cols, $stopwatch
endloop
</hansl>
Quite interesting, and maybe we can make use of this in libgretl's
internals.
Allin