Dear connoisseurs of gretl,
I am currently stuck with an issue that can be described as “premature
optimisation” (the root of all evil, as we know).
I am trying to evaluate the distribution of the Dickey-Fuller $T(\hat
\alpha-1)$ statistic in the model $x_t = \alpha x_{t-1} + \varepsilon_t$
with a unit root ($\alpha=1$) to the highest possible precision. My goal is
to run several million regressions and save the resulting column of
statistics to a separate file, to be processed in other software (the same
will be done later for the Durbin-Watson statistic, other Dickey-Fuller
distributions, etc.). Since I am aiming at more than 10 million iterations,
every second counts. At first, I wrote the following code (10,000
iterations here for the sake of illustration):
set stopwatch
nulldata 10000
scalar iterations = 10000
loop for (i=0; i<iterations; i+=1) --progressive --quiet
    smpl --full
    series eps = normal()            # i.i.d. N(0,1) innovations
    series x = 0                     # initialise the walk at zero
    series x = x(-1) + eps           # recursive generation of the random walk
    series xlag = x(-1)
    smpl 3 10000                     # drop the degenerate initial observations
    ols x xlag                       # AR(1) regression without a constant
    scalar ahat = $coeff(xlag)
    scalar DFT = $T * (ahat - 1)     # the Dickey-Fuller statistic
    store df.csv DFT --no-header
endloop
printf "Time taken: %f seconds\n", $stopwatch
Note: the sample is restricted because x[1] = 0 and xlag[2] = 0; we do not
need those meaningless values in the estimation.
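(As an aside, I suspect the recursive definition of x could equally well be
written with gretl's cum() function; a minimal variant of the generation
step, which I have not timed yet:

series eps = normal()
series x = cum(eps)      # random walk as the cumulative sum of the innovations
series xlag = x(-1)

Perhaps that is faster than the lag-based recursion.)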
The average time for the first script on my PC was 27.6 seconds. However, I
suspected that the *ols* command triggers all sorts of side calculations
(residuals, t-ratios, R-squared, information criteria, etc.), so I decided
to bypass that unnecessary overhead and obtain $\hat\alpha$ by hand:
set stopwatch
nulldata 10000
scalar iterations = 10000
loop for (i=0; i<iterations; i+=1) --progressive --quiet
    smpl --full
    series eps = normal()
    series x = 0
    series x = x(-1) + eps
    series xlag = x(-1)
    smpl 3 10000
    # slope of x on xlag as covariance over variance;
    # 9998 = $T, the number of observations in the restricted sample
    scalar DFT = 9998 * (cov(x,xlag)/var(xlag) - 1)
    store df.csv DFT --no-header
endloop
printf "Time taken: %f seconds\n", $stopwatch
Indeed there was an improvement: 25.2 seconds on average. Nevertheless, the
happiness was alloyed by the seeming impurity of the result: by definition,
the true OLS coefficient is a ratio of two sums, not simply the covariance
divided by the variance (the two differ in finite samples whenever the
sample means are non-zero).
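Explicitly, with the sums running over the restricted sample and $\bar x$,
$\bar x_{-1}$ denoting the sample means of $x_t$ and $x_{t-1}$,
$$\hat\alpha = \frac{\sum_t x_t x_{t-1}}{\sum_t x_{t-1}^2}
\;\neq\;
\frac{\sum_t (x_t-\bar x)(x_{t-1}-\bar x_{-1})}{\sum_t (x_{t-1}-\bar x_{-1})^2}
= \frac{\widehat{\mathrm{cov}}(x_t,x_{t-1})}{\widehat{\mathrm{var}}(x_{t-1})}.$$
So I took the liberty of evaluating the sums manually: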
set stopwatch
nulldata 10000
scalar iterations = 10000
loop for (i=0; i<iterations; i+=1) --progressive --quiet
    smpl --full
    series eps = normal()
    series x = 0
    series x = x(-1) + eps
    series xlag = x(-1)
    series xxlag = x * xlag          # cross-products for the numerator
    series xlag_sq = xlag^2         # squares for the denominator
    smpl 3 10000
    scalar DFT = 9998 * (sum(xxlag)/sum(xlag_sq) - 1)
    store df.csv DFT --no-header
endloop
printf "Time taken: %f seconds\n", $stopwatch
Much to my regret, the result was very disappointing: 33.0 seconds on
average. Why does the optimisation turn out to be harmful in this case? Is
it because the two extra full-length series created in every iteration cost
more than the demeaning they avoid? What else can be done to reduce the
running time without loss of precision or consistency?
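For instance, would a purely matrix-based variant along these lines be
faster? A minimal sketch of what I have in mind (assuming I read the
documentation of mnormal(), cum() and mwrite() correctly; note that it uses
all 9999 usable pairs of observations and writes the results once, at the
end, instead of calling store in every iteration):

set stopwatch
scalar N = 10000                      # length of each simulated walk
scalar iterations = 10000
matrix DFT = zeros(iterations, 1)
loop i=1..iterations --quiet
    matrix eps = mnormal(N, 1)        # i.i.d. N(0,1) innovations
    matrix x = cum(eps)               # random walk as a cumulative sum
    matrix xlag = x[1:N-1]            # x_{t-1}
    matrix xcur = x[2:N]              # x_t
    scalar num = xlag' * xcur         # sum of cross-products
    scalar den = xlag' * xlag         # sum of squares
    DFT[i] = (N-1) * (num/den - 1)    # Dickey-Fuller statistic
endloop
scalar err = mwrite(DFT, "df.mat")    # a single write in gretl's matrix format
printf "Time taken: %f seconds\n", $stopwatch

Would that avoid the per-iteration overhead of series creation and storage,
or am I missing something?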
In addition, could you make a slight correction to the manual, please? The
*cov* and *corr* functions expect a comma between their arguments, but this
is not mentioned in the command reference; one only discovers it through
the error message “Expected ',' but found ...” in the output. Thank you in
advance!
Yours faithfully,
Andreï V. Kostyrka
Department of Mathematical Economics and Econometrics
Higher School of Economics
Moscow, Russia