On Sun, 14 Dec 2014, Sven Schreiber wrote:
there was an open-ended thread initiated by Paulo Grahl
(http://lists.wfu.edu/pipermail/gretl-users/2013-December/009475.html)
about gretl's 'pca' command. I checked again and I think there is still
a bug, even with a very recent snapshot, although it is slightly different
from Paulo's experience. Here's an example script:
<hansl>
open denmark
list vars = IBO IDE
# compare 'pca' and 'princomp()' in the full sample
matrix P1 = princomp({vars}, 1)
pca vars --save=1 # turns out they coincide; good
# now compare them in the reduced sample
smpl 1980:1 1985:1
matrix P2 = princomp({vars}, 1)
pca vars --save=1 # matrix and series differ; bad
# check if the PCs are different in the overlapping range
if sum(abs(PC1 - PC11)) > 0.01 # PC naming is fragile...
print "ok"
else
print "PCs are the same although they should differ" # I get this
endif
smpl --full
</hansl>
Summary: The 'princomp()' function seems to work fine, but 'pca'
apparently uses the full sample for calculating the pca, even if a
reduced sample is specified.
OK, I think I've finally worked out what the issue is here. It's
complicated by the fact that the eigenvectors of a 2 x 2 correlation
matrix are invariant with respect to the (single) correlation
coefficient. For a while I thought we were in error in reporting the
same eigenvectors for the full dataset and the sub-sample, but
that's expected. The real problem was that when we standardized the
series for computing the saved PC series we used the means and
standard deviations for the full dataset, regardless of the current
sample. That's now fixed in CVS.
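To spell out the 2 x 2 point: this is a standard linear-algebra result, written here with r for the correlation between the two series. It is a sketch of the argument, not a description of gretl's internals.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
R \begin{pmatrix} 1 \\ 1 \end{pmatrix} = (1 + r) \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
R \begin{pmatrix} 1 \\ -1 \end{pmatrix} = (1 - r) \begin{pmatrix} 1 \\ -1 \end{pmatrix}
```

So for any r != 0 the normalized eigenvectors are (1, 1)'/sqrt(2) and (1, -1)'/sqrt(2), up to sign. Only the eigenvalues 1 + r and 1 - r change with the sample, and which vector belongs to the first component depends on the sign of r. The saved component series can still differ across samples, because they are built from series standardized with the means and standard deviations of the sample in use.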
- Accessor for the loadings, as suggested by Henrique
(http://lists.wfu.edu/pipermail/gretl-users/2012-March/007346.html) and
in terms of the princomp() function by myself. Allin answered that it's
easy to get them as the eigenvectors of the correlation matrix. This is
of course correct, but first it's a convenience issue, and secondly if
you perhaps want to do some simulations it seems like an avoidable
inefficiency to compute the eigenvectors twice (first implicitly in
princomp, and then explicitly by hand).
Would be nice but IMO not a high priority. The efficiency point
doesn't strike me as very important unless perhaps one were
computing PCs for a huge array of series.
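For reference, the by-hand route could look something like the following. This is only a sketch: it assumes the correlation-matrix variant of PCA, reuses the denmark example from above, and the names X, V, evals and loadings are illustrative, not anything gretl defines.

```hansl
open denmark
list vars = IBO IDE
matrix X = {vars}
matrix V = {}
# eigen-decomposition of the correlation matrix; eigensym() is documented
# as returning the eigenvalues in ascending order, with the matching
# eigenvectors written into the columns of V
matrix evals = eigensym(mcorr(X), &V)
# under that ordering, the loadings for the first PC are in the last column
matrix loadings = V[,cols(V)]
print evals loadings
```

The sign of each eigenvector is arbitrary, so the loadings may come out with the opposite sign to those implied by 'pca' or princomp().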
- Automatic printing of the workfile variables: When using 'pca' to save
some PCs to the workfile, gretl automatically prints all the variables
in the workfile. IMHO this contaminates the script output for no good
reason (I currently have thousands of variables in there, and it really
is a long list in the output...). So could this be switched off?
You can switch it off with "set messages off". However, in CVS I've
now made it quieter even without setting messages off.
Allin