On 14.10.19 at 20:02, Allin Cottrell wrote:
On Sun, 13 Oct 2019, Marcin Błażejowski wrote:
> On 13.10.2019 02:15, Allin Cottrell wrote:
>> Thanks, Jack. So, given the options I posited, our machines agree on a
>> best chunk size of (25 * 5000 * 8) bytes (=~ 1 MB) for use with
>> memcpy. (5000 being the number of rows in the matrix to be copied, and
>> 8 the number of bytes to represent a double-precision floating point
>> value.)
>>
>> Now, to optimize libgretl's copying of contiguous data, we just have
>> to figure out how that relates to the size of L1 or L2 cache, or
>> whatever is truly the relevant hardware parameter here!
>
> But Allin, isn't it something that could/should(?) be done by AVX
> extensions? But I have to admit that I have no idea how to use such
> low-level optimisation for a scripting language.

The AVX extensions are facilities that a compiler may or may not end
up using as part of its optimization efforts. In the matrix-copying
context the relevant question would be how the compiler and C-library
jointly interpret and implement calls to memcpy (e.g. does memcpy
translate to some AVX-optimized variant?).

From the timings we're looking at (thanks for sending yours, Marcin)
it's clear that whatever optimizations are employed by gcc and the
relevant C-libraries, they are not automatically dividing a big call
to memcpy into smaller chunks whenever that would speed up the total
copy. Hence the idea that we may want to try some dividing up at the
libgretl level. I suspect that things are slowing when we try to copy
more data than fits into L2 cache in one go.
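
A minimal sketch of what "dividing up at the libgretl level" could look
like, assuming the matrix is a contiguous column-major array of doubles.
The name chunked_copy and its parameters are hypothetical, not existing
libgretl API, and the best value of cols_per_chunk is exactly what the
timings are meant to find out:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a libgretl function): copy a contiguous
   rows x cols matrix of doubles using several memcpy calls of at most
   cols_per_chunk columns each, instead of one big call.
   If cols_per_chunk is 0, fall back to a single memcpy. */
static void chunked_copy(double *dst, const double *src,
                         size_t rows, size_t cols,
                         size_t cols_per_chunk)
{
    size_t total = rows * cols;            /* elements to copy */
    size_t chunk = rows * cols_per_chunk;  /* elements per memcpy call */
    size_t done = 0;

    assert(dst != NULL && src != NULL);

    if (chunk == 0) {
        chunk = total;
    }

    while (done < total) {
        size_t left = total - done;
        size_t n = left < chunk ? left : chunk;  /* last chunk may be short */

        memcpy(dst + done, src + done, n * sizeof *src);
        done += n;
    }
}
```

With rows = 5000 and cols_per_chunk = 25, each memcpy call moves
25 * 5000 * 8 = 1,000,000 bytes, which is the chunk size discussed above.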

By the way, these are the results for an Intel(R) Core(TM) i5-6600 CPU
@ 3.30GHz, L1=256 KB, L2=1 MB, L3=6 MB:
<output: ROW=5000, COL=500, LOOP=600>
1 columns per chunk: 1.5847s
2 columns per chunk: 0.7398s
5 columns per chunk: 0.2226s
10 columns per chunk: 0.0837s
20 columns per chunk: 0.0663s
21 columns per chunk: 0.0700s
22 columns per chunk: 0.0356s
23 columns per chunk: 0.0368s
24 columns per chunk: 0.0389s
25 columns per chunk: 0.0406s
35 columns per chunk: 0.0617s
45 columns per chunk: 0.0880s
50 columns per chunk: 0.1195s
100 columns per chunk: 0.3994s
125 columns per chunk: 0.3895s
500 columns per chunk: 1.7884s
</>
<output: ROW=10000, COL=500, LOOP=600>
1 columns per chunk: 3.2237s
2 columns per chunk: 1.6661s
5 columns per chunk: 0.6233s
10 columns per chunk: 0.2529s
20 columns per chunk: 0.1990s
21 columns per chunk: 0.2190s
22 columns per chunk: 0.0884s
23 columns per chunk: 0.0938s
24 columns per chunk: 0.1050s
25 columns per chunk: 0.1095s
35 columns per chunk: 0.2152s
45 columns per chunk: 0.3409s
50 columns per chunk: 0.3980s
100 columns per chunk: 0.6861s
125 columns per chunk: 0.8611s
500 columns per chunk: 11.8921s
</>
Artur