On Sun, 13 Oct 2019, Marcin Błażejowski wrote:
On 13.10.2019 02:15, Allin Cottrell wrote:
> Thanks, Jack. So, given the options I posited, our machines agree on a
> best chunk size of (25 * 5000 * 8) bytes (=~ 1 MB) for use with
> memcpy. (5000 being the number of rows in the matrix to be copied, and
> 8 the number of bytes to represent a double-precision floating-point
> value.)
> Now, to optimize libgretl's copying of contiguous data, we just have
> to figure out how that relates to the size of L1 or L2 cache, or
> whatever is truly the relevant hardware parameter here!
But Allin, isn't that something that could/should(?) be done by the AVX
extensions? Though I have to admit I have no idea how to use such
low-level optimisation from a scripting language.
The AVX extensions are facilities that a compiler may or may not end
up using as part of its optimization efforts. In the matrix-copying
context the relevant question would be how the compiler and C-library
jointly interpret and implement calls to memcpy (e.g. does memcpy
translate to some AVX-optimized variant?).
From the timings we're looking at (thanks for sending yours, Marcin)
it's clear that whatever optimizations are employed by gcc and the
relevant C-libraries, they are not automatically dividing a big call
to memcpy into smaller chunks whenever that would speed up the total
copy. Hence the idea that we may want to try some dividing up at the
libgretl level. I suspect that things slow down when we try to copy
more data than fits into the L2 cache in one go.
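For concreteness, the sort of dividing-up being discussed could be
sketched as below. This is a hypothetical helper, not actual libgretl
code; the chunk size is an assumption (about 1 MB, per the timings
above) and would presumably want to be tuned empirically per machine.

```c
#include <string.h>

/* Sketch only: copy n bytes in fixed-size chunks, on the hypothesis
   that keeping each memcpy's working set within the L2 cache is
   faster than one big call. CHUNK is an assumed tuning parameter. */
#define CHUNK ((size_t) 1024 * 1024)  /* ~1 MB, as in the timings */

static void chunked_copy (void *dest, const void *src, size_t n)
{
    char *d = dest;
    const char *s = src;

    while (n > CHUNK) {
        memcpy(d, s, CHUNK);
        d += CHUNK;
        s += CHUNK;
        n -= CHUNK;
    }
    if (n > 0) {
        memcpy(d, s, n);  /* remainder */
    }
}
```

Whether this actually beats a single memcpy call of course depends on
the cache hierarchy and on what the C library's memcpy already does
internally, which is exactly what the timings are meant to establish.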