On Mon, 14 Oct 2019, Artur Tarassow wrote:
On 14.10.19 at 20:02, Allin Cottrell wrote:
> On Sun, 13 Oct 2019, Marcin Błażejowski wrote:
>
>> On 13.10.2019 02:15, Allin Cottrell wrote:
>>> Thanks, Jack. So, given the options I posited, our machines agree on a
>>> best chunk size of (25 * 5000 * 8) bytes (=~ 1 MB) for use with
>>> memcpy. (5000 being the number of rows in the matrix to be copied, and
>>> 8 the number of bytes to represent a double-precision floating point
>>> value.)
>>>
>>> Now, to optimize libgretl's copying of contiguous data, we just have
>>> to figure out how that relates to the size of L1 or L2 cache, or
>>> whatever is truly the relevant hardware parameter here!
>>
>> But Allin, isn't it something that could/should(?) be done by AVX
>> extensions? But I have to admit that I have no idea how to use such
>> low-level optimisation for a scripting language.
>
> The AVX extensions are facilities that a compiler may or may not end
> up using as part of its optimization efforts. In the matrix-copying
> context the relevant question would be how the compiler and C-library
> jointly interpret and implement calls to memcpy (e.g. does memcpy
> translate to some AVX-optimized variant?).
>
> From the timings we're looking at (thanks for sending yours, Marcin)
> it's clear that whatever optimizations are employed by gcc and the
> relevant C-libraries, they are not automatically dividing a big call
> to memcpy into smaller chunks whenever that would speed up the total
> copy. Hence the idea that we may want to try some dividing up at the
> libgretl level. I suspect that things slow down when we try to copy
> more data than fits into L2 cache in one go.
By the way, these are results for an Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz,
L1 = 256 KB, L2 = 1 MB, L3 = 6 MB:
<output: ROW=5000, COL=500, LOOP=600>
1 columns per chunk: 1.5847s
2 columns per chunk: 0.7398s
5 columns per chunk: 0.2226s
10 columns per chunk: 0.0837s
20 columns per chunk: 0.0663s
21 columns per chunk: 0.0700s
22 columns per chunk: 0.0356s
23 columns per chunk: 0.0368s
24 columns per chunk: 0.0389s
25 columns per chunk: 0.0406s
35 columns per chunk: 0.0617s
45 columns per chunk: 0.0880s
50 columns per chunk: 0.1195s
100 columns per chunk: 0.3994s
125 columns per chunk: 0.3895s
500 columns per chunk: 1.7884s
</>
<output: ROW=10000, COL=500, LOOP=600>
1 columns per chunk: 3.2237s
2 columns per chunk: 1.6661s
5 columns per chunk: 0.6233s
10 columns per chunk: 0.2529s
20 columns per chunk: 0.1990s
21 columns per chunk: 0.2190s
22 columns per chunk: 0.0884s
23 columns per chunk: 0.0938s
24 columns per chunk: 0.1050s
25 columns per chunk: 0.1095s
35 columns per chunk: 0.2152s
45 columns per chunk: 0.3409s
50 columns per chunk: 0.3980s
100 columns per chunk: 0.6861s
125 columns per chunk: 0.8611s
500 columns per chunk: 11.8921s
</>
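For reference, the dividing-up being timed here amounts to something like
the following (just a sketch, assuming a plain column-major array of
doubles and a made-up function name, not the actual gretl_matrix code):

#include <string.h>

/* copy a @rows x @cols column-major matrix of doubles from @src to
   @dest, at most @chunk_cols columns per call to memcpy */

static void copy_by_chunks (double *dest, const double *src,
                            int rows, int cols, int chunk_cols)
{
    int j;

    for (j = 0; j < cols; j += chunk_cols) {
        int c = cols - j < chunk_cols ? cols - j : chunk_cols;

        memcpy(dest + (size_t) j * rows, src + (size_t) j * rows,
               (size_t) rows * c * sizeof *dest);
    }
}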
So all the data seems to (pretty much) agree: the point at which
reduction in copy-time turns into increase, as we crank up the
number of columns to copy at once, is in the neighbourhood of the L2
cache size, which is typically 1 MB (2^20 bytes) these days.
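If that's right, the chunk size could be picked from the row dimension so
that one chunk stays within (nominal) L2, along these lines (again a
sketch, with the 1 MB figure hard-wired rather than detected at run time):

#define L2_CACHE_BYTES (1 << 20) /* assumed, not detected */

static int chunk_cols_for (int rows)
{
    int c = L2_CACHE_BYTES / (rows * (int) sizeof(double));

    return c > 0 ? c : 1;
}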
We'll see what we can do with this information.
Allin