On Mon, 14 Oct 2019, Allin Cottrell wrote:

> So all the data seems to (pretty much) agree: the point at which
> reduction in copy-time turns into increase, as we crank up the
> number of columns to copy at once, is in the neighbourhood of the
> L2 cache size, which is typically 1 MB (2^20) these days.
Oof, sorry people! I'm afraid the matrix-copy timings based on the
script I posted are mostly artifacts of a bug in that script --
revealed when I finally checked that B == A after the copy. The limit
@n for the inner loop across columns was wrong, with the result that
not all columns were getting copied. Here are my current timings,
which are relatively flat with respect to the number/size of chunks:
matrix size = 2500000, (20000000 bytes)
1 columns per chunk: 2.4073s
2 columns per chunk: 2.3623s
5 columns per chunk: 2.4675s
10 columns per chunk: 2.4984s
25 columns per chunk: 2.5216s
50 columns per chunk: 2.5999s
100 columns per chunk: 3.2440s
125 columns per chunk: 3.5543s
500 columns per chunk: 2.8366s
And here's the corrected script:
<hansl>
set verbose off
clear
scalar ROW = 5000
scalar COL = 500
scalar LOOP = 600
matrix chunkcols = {1, 2, 5, 10, 25, 50, 100, 125, 500}
matrix A = mnormal(ROW, COL)
matrix B = zeros(ROW, COL)
printf "matrix size = %d, (%d bytes)\n", ROW*COL, ROW*COL*8
loop k=1..nelem(chunkcols) --quiet
cols = chunkcols[k]
# n = COL / cols # WRONG !!
n = COL - cols + 1
B .= 0
set stopwatch
loop LOOP --quiet
loop for (j=1; j<=n; j+=cols) --quiet
# printf "copy cols %d to %d (n=%d)\n", j, j+cols-1, n
B[,j:j+cols-1] = A[,j:j+cols-1]
endloop
endloop
printf "%3d columns per chunk: %.4fs\n", cols, $stopwatch
# printf "max(abs(A-B)) = %g\n", max(abs(A-B))
endloop
</hansl>
Some evidence remains that smaller chunks are better, but nothing
like as striking as before.
Allin