Performance issues with sums and covariances
by Андрей Викторович | Andreï
Dear connoisseurs of gretl,
I am currently stuck with an issue that can be described as “premature
optimisation” (the root of all evil, as we know).
I am trying to evaluate the distribution of the Dickey-Fuller statistic
$T(\hat\alpha-1)$ in the model $x_t = \alpha x_{t-1} + \varepsilon_t$ with
a unit root, $\alpha=1$, to the highest possible precision. My goal is to
run several million regressions and save the resulting statistics as a
column in a separate file, to be processed in other software (the same will
be done later for Durbin-Watson, the other Dickey-Fuller distributions,
etc.). Since my goal is more than 10 million iterations, every second is
crucial for me. At first, I wrote the following code (shown here with 10k
iterations, which is enough for illustration):
set stopwatch
nulldata 10000
scalar iterations=10000
loop for (i=0; i<iterations; i+=1) --progressive --quiet
    smpl --full
    # simulate a random walk and its first lag
    series eps=normal()
    series x=0
    series x=x(-1)+eps
    series xlag=x(-1)
    # drop the first two observations, then regress without a constant
    smpl 3 10000
    ols x xlag
    scalar ahat=$coeff(xlag)
    scalar DFT=$T*(ahat-1)
    store df.csv DFT --no-header
endloop
printf "Time taken: %f seconds\n", $stopwatch
Note: the sample is restricted because x[1]=0 and xlag[2]=0; we do not need
those meaningless numbers in the estimation.
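For reference, the quantity computed inside the loop can be written out as
follows. This is only a sketch of the algebra, assuming the regression has no
intercept (as in "ols x xlag") and that $T$ is the number of observations left
after the sample restriction, i.e. $10000-2=9998$:

```latex
\[
\hat\alpha = \frac{\sum_{t=3}^{10000} x_t x_{t-1}}{\sum_{t=3}^{10000} x_{t-1}^2},
\qquad
\mathrm{DFT} = T(\hat\alpha - 1), \quad T = 9998 .
\]
```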
The average time on my PC was 27.6 seconds. However, I thought that the
*ols* command invoked all sorts of side calculations (residuals, t-ratios,
R-squared, information criteria, etc.); thus I decided to bypass these
possibly unnecessary steps, obtaining $\hat\alpha$ by hand:
set stopwatch
nulldata 10000
scalar iterations=10000
loop for (i=0; i<iterations; i+=1) --progressive --quiet
    smpl --full
    # simulate a random walk and its first lag
    series eps=normal()
    series x=0
    series x=x(-1)+eps
    series xlag=x(-1)
    smpl 3 10000
    # 9998 is the number of observations in the restricted sample
    scalar DFT=9998*(cov(x,xlag)/var(xlag)-1)
    store df.csv DFT --no-header
endloop
printf "Time taken: %f seconds\n", $stopwatch
Indeed, there was an improvement in time (now 25.2 seconds on average).
Nevertheless, the happiness was alloyed by the seeming impurity of the
result: by definition, the true coefficient in a regression without a
constant is the ratio of two sums, $\sum x_t x_{t-1} / \sum x_{t-1}^2$, not
the covariance divided by the variance (which subtracts the sample means,
and these are generally non-zero in finite samples). So I took the liberty
of evaluating the sums manually:
set stopwatch
nulldata 10000
scalar iterations=10000
loop for (i=0; i<iterations; i+=1) --progressive --quiet
    smpl --full
    # simulate a random walk and its first lag
    series eps=normal()
    series x=0
    series x=x(-1)+eps
    series xlag=x(-1)
    # cross products and squares for the ratio of sums
    series xxlag=x*xlag
    series xlag_sq=xlag^2
    smpl 3 10000
    scalar DFT=9998*(sum(xxlag)/sum(xlag_sq)-1)
    store df.csv DFT --no-header
endloop
printf "Time taken: %f seconds\n", $stopwatch
Much to my regret, the result was very disappointing: 33.0 seconds on
average. Why does the optimisation turn out to be harmful in this case?
What else can be done to reduce the running time without loss of
precision or consistency?
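To make the consistency concern concrete, here is a small check. It is a
sketch only and is not timed; the seed value and the scalar names a_ols,
a_sums, a_const and a_cov are my own choices for illustration. It compares the
no-constant OLS coefficient with the ratio of sums, and cov/var with the slope
from a regression that includes a constant:

```hansl
nulldata 10000
set seed 20240101    # arbitrary seed, only for reproducibility
series eps = normal()
series x = 0
series x = x(-1) + eps
series xlag = x(-1)
smpl 3 10000

# regression without a constant, as in the first script
ols x xlag --quiet
scalar a_ols = $coeff(xlag)
# ratio of sums, as in the third script
scalar a_sums = sum(x*xlag) / sum(xlag^2)

# regression with a constant, for comparison
ols x const xlag --quiet
scalar a_const = $coeff(xlag)
# covariance over variance, as in the second script
scalar a_cov = cov(x, xlag) / var(xlag)

printf "no constant:   ols = %.12f, sums    = %.12f\n", a_ols, a_sums
printf "with constant: ols = %.12f, cov/var = %.12f\n", a_const, a_cov
```

I would expect each pair to agree up to rounding and the two pairs to differ
from each other, since cov/var demeans both series before taking the ratio.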
In addition, could you make a slight correction to the manual, please? The
commands *cov* and *corr* expect a comma between their arguments, but this
is not mentioned in the command reference, and the awakening comes only
through the error message “Expected ',' but found ...” in the output. Thank
you in advance!
Yours faithfully,
Andreï V. Kostyrka
Department of Mathematical Economics and Econometrics
Higher School of Economics
Moscow, Russia