Hi everyone,

Thanks for the replies. Sorry for the late response.

Jack: Yes, I'm tracking the number of observations. I'm not using the gretl GUI at all; I'm calling the library directly. Essentially, I timed how long it takes to execute:

    MahalDist *distance = get_mahal_distances(gretlParameters, gretlData, OPT_NONE, NULL, &error);

where gretlData is a DATASET with n observations (n is the quantity I increased from 1 to 250). FYI: my system monitor indicates that the statement above executes on a single thread only.

I ran your script in a Linux Mint virtual machine (4 cores, 4 GB RAM) and got different results from Helio's. I ran the script a couple of times (see attachments), and although the runs differ, they show similar characteristics. I'm not sure how Helio got his first result.
Looking at the script output, I don't think this is the best way to benchmark the execution time in this case.

I've used 8 different datasets with 30-40 million samples each. Every window over every dataset showed exactly the same time jump between 199 and 200 observations.
What I've done is start a timer just before calling get_mahal_distances and stop it right after the call returns. I've done this about 300 million times in total, and the graphs in my original post show the average over all these runs, so the estimate should be quite accurate.

I'm using these results for my thesis, and I somehow have to explain why this happens (even if it's just a performance improvement, as Allin suggested). So if anyone knows why, please let me know.

In any case, thanks for the help.

Chris


On 2014/04/15 02:01 PM, Allin Cottrell wrote:
On Tue, 15 Apr 2014, Riccardo (Jack) Lucchetti wrote:

On Tue, 15 Apr 2014, Allin Cottrell wrote:

On Tue, 15 Apr 2014, GOO Creations wrote:

I'm benchmarking the Mahalanobis distance to see how the accuracy and
execution time change with an increasing sample size. As far as I
understand the algorithm, the execution time should grow linearly as the
sample size increases. The weird thing is that the time grows linearly up
to (and including) 199 samples, but then suddenly drops at 200
samples. I've attached a graph to illustrate this.

What implementation of lapack/blas are you using?

The most demanding task in computing Mahalanobis distance is the inversion
of the covariance matrix of the selected series, which is performed via
the lapack Cholesky functions dpotrf and dpotri. Depending on the
implementation, these functions may switch algorithm based on the size of
the input data (e.g. invoking parallelization when a certain threshold
size is exceeded).

That's what I had thought too, initially. However, the size of the covariance
matrix doesn't depend on the number of observations, which is the variable
our friend is tracking (unless I misunderstood his message).

Duh! You're right. Then I can't explain this either.

Allin
_______________________________________________
Gretl-users mailing list
Gretl-users@lists.wfu.edu
http://lists.wfu.edu/mailman/listinfo/gretl-users