[Gretl-devel] Re: Speed of matrix "block" operations

Monday, 14 October 2019

On 14.10.19 at 20:02, Allin Cottrell wrote:
...
 On Sun, 13 Oct 2019, Marcin Błażejowski wrote:

> On 13.10.2019 02:15, Allin Cottrell wrote:
>> Thanks, Jack. So, given the options I posited, our machines agree on a
>> best chunk size of (25 * 5000 * 8) bytes (=~ 1 MB) for use with
>> memcpy. (5000 being the number of rows in the matrix to be copied, and
>> 8 the number of bytes to represent a double-precision floating point
>> value.)
>>
>> Now, to optimize libgretl's copying of contiguous data, we just have
>> to figure out how that relates to the size of L1 or L2 cache, or
>> whatever is truly the relevant hardware parameter here!
>
> But Allin, isn't it something that could/should(?) be done by AVX
> extensions? But I have to admit that I have no idea how to use such
> low-level optimisation for a scripting language.

 The AVX extensions are facilities that a compiler may or may not end
 up using as part of its optimization efforts. In the matrix-copying
 context the relevant question would be how the compiler and C-library
 jointly interpret and implement calls to memcpy (e.g. does memcpy
 translate to some AVX-optimized variant?).

 From the timings we're looking at (thanks for sending yours, Marcin)
 it's clear that whatever optimizations are employed by gcc and the
 relevant C-libraries, they are not automatically dividing a big call
 to memcpy into smaller chunks whenever that would speed up the total
 copy. Hence the idea that we may want to try some dividing up at the
 libgretl level. I suspect that things are slowing when we try to copy
 more data than fits into L2 cache in one go. 
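
The idea of dividing up a big copy at the library level can be sketched as follows. This is only an illustration, not libgretl code: the name copy_in_chunks and its parameters are hypothetical, and it assumes column-major storage of doubles, as the "columns per chunk" wording in the timings suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libgretl: copy a column-major
 * rows x cols matrix of doubles from src to dst, issuing one memcpy
 * per block of at most cols_per_chunk columns. */
static void copy_in_chunks(double *dst, const double *src,
                           size_t rows, size_t cols,
                           size_t cols_per_chunk)
{
    size_t done = 0; /* number of columns copied so far */

    assert(cols_per_chunk > 0);

    while (done < cols) {
        size_t n = cols - done; /* columns still to copy */

        if (n > cols_per_chunk) {
            n = cols_per_chunk;
        }
        /* adjacent columns are contiguous in column-major storage,
         * so a block of n columns is a single run of n * rows doubles */
        memcpy(dst + done * rows, src + done * rows,
               n * rows * sizeof(double));
        done += n;
    }
}
```

With rows = 5000, cols = 500 and cols_per_chunk = 25, each memcpy call moves the 25 * 5000 * 8 bytes mentioned earlier in the thread. Whether splitting the copy this way helps on a given machine is exactly what the timings are meant to show.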

By the way, these are results for an Intel(R) Core(TM) i5-6600 CPU
3.30GHz, L1=256KB, L2=1MB, L3=6MB

<output: ROW=5000, COL=500, LOOP=600>
   1 columns per chunk: 1.5847s
   2 columns per chunk: 0.7398s
   5 columns per chunk: 0.2226s
  10 columns per chunk: 0.0837s
  20 columns per chunk: 0.0663s
  21 columns per chunk: 0.0700s
  22 columns per chunk: 0.0356s
  23 columns per chunk: 0.0368s
  24 columns per chunk: 0.0389s
  25 columns per chunk: 0.0406s
  35 columns per chunk: 0.0617s
  45 columns per chunk: 0.0880s
  50 columns per chunk: 0.1195s
 100 columns per chunk: 0.3994s
 125 columns per chunk: 0.3895s
 500 columns per chunk: 1.7884s
</>

<output: ROW=10000, COL=500, LOOP=600>
   1 columns per chunk: 3.2237s
   2 columns per chunk: 1.6661s
   5 columns per chunk: 0.6233s
  10 columns per chunk: 0.2529s
  20 columns per chunk: 0.1990s
  21 columns per chunk: 0.2190s
  22 columns per chunk: 0.0884s
  23 columns per chunk: 0.0938s
  24 columns per chunk: 0.1050s
  25 columns per chunk: 0.1095s
  35 columns per chunk: 0.2152s
  45 columns per chunk: 0.3409s
  50 columns per chunk: 0.3980s
 100 columns per chunk: 0.6861s
 125 columns per chunk: 0.8611s
 500 columns per chunk: 11.8921s
</>
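
To relate these column counts to the cache sizes, it may help to convert them to bytes per memcpy call. The helper below is a hypothetical illustration (chunk_bytes is not a libgretl function) and assumes, as stated earlier in the thread, that a double occupies 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes moved by one memcpy call when
 * a chunk holds cols_per_chunk columns of `rows` doubles each. */
static size_t chunk_bytes(size_t rows, size_t cols_per_chunk)
{
    assert(rows > 0 && cols_per_chunk > 0);
    return rows * cols_per_chunk * sizeof(double);
}
```

For the fastest setting in both runs, 22 columns per chunk, this gives chunk_bytes(5000, 22) = 880000 bytes and chunk_bytes(10000, 22) = 1760000 bytes. The best column count is therefore the same in the two runs even though the chunk size in bytes doubles.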

Artur
