On Wed, 27 Apr 2016, Mikael Postila wrote:
> We've been experiencing a problem where exactly the same data yields
> different regression results depending on A) which computer is used
> and B) when the regression is being run on the same computer.
Mikael kindly sent me a subset of his data and a sample model
specification, so I've now been able to look into this. Since the
question may be of some general interest I'm including the list in my
reply.
Mikael is estimating a linear model via OLS. In the sample he gave me
there are 5021 observations and 46 regressors, and among the
regressors one, "dLAT1km", has a coefficient which was found to differ
at the 6th digit on different runs of the script (either on different
machines or on different dates): 2.68356 versus 2.68357.
By using gretl's "mpols" command (multiple-precision OLS) I determined
that the correct coefficient for this variable is, to 15 significant
digits, 2.68356548257653. So it's on the rounding cusp between the two
values that Mikael observed, although by the standard "round half away
from zero" convention we'd say that 2.68357 is the "right" 6-digit result.
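To see why the printed value lands where it does, here's a quick check of
that rounding (a small Python sketch; the 15-digit figure is the mpols
value quoted above):

```python
from decimal import Decimal, ROUND_HALF_UP

# 15-significant-digit coefficient obtained via gretl's "mpols"
coef = Decimal("2.68356548257653")

# Round to 6 significant digits; Decimal's ROUND_HALF_UP implements
# "round half away from zero"
print(coef.quantize(Decimal("0.00001"), rounding=ROUND_HALF_UP))  # 2.68357
```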
We usually expect that on linear problems gretl's results will be
correct (and stable) to at least the 6 printed digits (in fact,
several more than that in most cases). So what's going on here?
It turns out the X'X matrix for this regression is very ill-conditioned,
and in a particular way. Invoking the "vif" command after
OLS we find that the Variance Inflation Factor for dLAT1km is
58985220 ("Values > 10.0 may indicate a collinearity problem"),
several orders of magnitude greater than for any other regressor
(there are 6 others with VIFs greater than 100). Running an OLS
regression of dLAT1km on all the other regressors, gretl gets an
R-squared of 0.999999983.
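The VIF and that auxiliary R-squared are two views of the same fact, since
VIF_j = 1/(1 - R^2_j). A back-of-envelope check in Python (using the
rounded R-squared quoted above, so it matches the reported VIF only in
order of magnitude):

```python
# VIF_j = 1 / (1 - R^2_j), where R^2_j is the R-squared from
# regressing regressor j on all the other regressors.
r2 = 0.999999983          # auxiliary R-squared for dLAT1km (rounded)
vif = 1.0 / (1.0 - r2)
print(f"{vif:.3g}")       # same order of magnitude as the reported 58985220
```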
So Mikael's regression is close to the limits of gretl's default OLS
procedure (using Cholesky decomposition) with regard to collinearity.
Close enough that the outcome is likely to depend on the precise
details of the physical computation (which intermediate results get
held in 80-bit registers, with how much "excess precision", and
whether or not they get "spilled" to 64-bit memory locations).
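The underlying numerical issue is that forming X'X squares the condition
number of X, so a normal-equations solve (Cholesky included) can lose many
digits that a QR- or SVD-based solver retains. A toy illustration in
Python/NumPy (synthetic data, not Mikael's; numpy.linalg.solve uses LU
rather than Cholesky, but the normal-equations conditioning problem is
the same):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 1e-7 * rng.normal(size=n)        # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Normal equations: cond(X'X) = cond(X)^2, here around 1e14
b_ne = np.linalg.solve(X.T @ X, X.T @ y)
# SVD-based least squares works on X directly, cond(X) around 1e7
b_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# The individual x1/x2 coefficients disagree in their low digits,
# while the well-conditioned combination b1 + b2 agrees closely.
print(abs(b_ne[1] - b_ls[1]))
print(abs((b_ne[1] + b_ne[2]) - (b_ls[1] + b_ls[2])))
```

The error from the normal-equations route concentrates along the nearly
collinear direction, which is exactly where Mikael's dLAT1km coefficient
lives.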
I'm therefore not surprised that you could get differences at the 6th
digit across different machines (or across different compilers, or
different compiler settings).
It's more surprising that you could get differences on the same
machine (with the same OS and the same pre-built gretl version), on
different days. However, I've read that on MS Windows certain
software, such as DirectX, may reset the precision of floating-point
units. So it may be that -- when the results are so terribly sensitive
to precision -- what gets printed by gretl depends on what else you've
been running lately.
Is this a real problem? I'd say No. As a related experiment I tried
probing the effect of a tiny change to the data. One of the regressors
is "size", which I believe measures the size of properties in square
meters and which seems to have a minimum increment of 0.5. I tried
changing the size of just one of the 5000+ properties by 0.5 and
rerunning Mikael's model. Result: most of the coefficients changed at
the 6th digit or higher (several changed at the 4th or 5th digit).
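That experiment is easy to mimic on synthetic data (a sketch with made-up
data and variable names, not Mikael's file, and an SVD least-squares fit
standing in for gretl's OLS):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# fake "size" variable on a 0.5 square-meter grid
size = np.round(rng.uniform(40.0, 120.0, n) * 2.0) / 2.0
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), size, x])
y = 10.0 + 0.5 * size + 2.0 * x + rng.normal(size=n)

b_before, *_ = np.linalg.lstsq(X, y, rcond=None)

# Change one property's size by the minimum increment, 0.5
X2 = X.copy()
X2[0, 1] += 0.5
b_after, *_ = np.linalg.lstsq(X2, y, rcond=None)

# Even this one-cell tweak shifts the coefficients in their low digits
print(np.abs(b_after - b_before))
```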
Moral: unless the data are considered _perfectly_ accurate you really
don't want to be paying attention to the 6th digit of regression
coefficients. However, you have "mpols" if you think you need it.
Allin Cottrell