msortby() stumbles over NA / nan
by Sven Schreiber
Hi,
it seems that msortby() only sorts within "blocks" surrounded by
occurrences of NA (showing up as nan in matrices). The rest of the
sorting-like functions seem to work ok. Example:
<hansl>
matrix in = {2; 1; NA; 0; -5}
print in
matrix check = msortby(in, 1) # not ok
print check
matrix check = sort(in) # ok
print check
matrix check = dsort(in) # ok
print check
matrix check = values(in) # ok
print check
matrix check = uniq(in) # ok
print check
</hansl>
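In case a stop-gap is useful, here is a workaround sketch (just my reading of ok() and selifr() from the function reference): sort the rows whose key is not missing, then reattach the NA rows at the end.
<hansl>
matrix m = {2; 1; NA; 0; -5}
matrix mask = ok(m[,1])                    # 1 where the sort key is not NA
matrix srt = msortby(selifr(m, mask), 1)   # no NAs left, so msortby behaves
matrix out = srt | selifr(m, 1 - mask)     # append the rows with NA keys
print out
</hansl>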
Thanks,
sven
pca bug and issues reloaded
by Sven Schreiber
Hi,
there was an open-ended thread initiated by Paulo Grahl
(http://lists.wfu.edu/pipermail/gretl-users/2013-December/009475.html)
about gretl's 'pca' command. I checked again and I think there is still
a bug -- with a very recent snapshot -- although it shows up slightly
differently from Paulo's experience. Here's an example script:
<hansl>
open denmark
list vars = IBO IDE
# compare 'pca' and 'princomp()' in the full sample
matrix P1 = princomp({vars}, 1)
pca vars --save=1 # turns out they coincide; good
# now compare them in the reduced sample
smpl 1980:1 1985:1
matrix P2 = princomp({vars}, 1)
pca vars --save=1 # matrix and series differ; bad
# check if the PCs are different in the overlapping range
if sum(abs(PC1 - PC11)) > 0.01 # PC naming is fragile...
print "ok"
else
print "PCs are the same although they should differ" # I get this
endif
smpl --full
</hansl>
Summary: The 'princomp()' function seems to work fine, but 'pca'
apparently uses the full sample for computing the principal components,
even if a reduced sample is specified. What's different from Paulo's
report is that the PCs are saved only over the reduced sample range (but
the values are still wrong).
I would also like to (re-)raise some other issues with pca:
- An accessor for the loadings, as suggested by Henrique
(http://lists.wfu.edu/pipermail/gretl-users/2012-March/007346.html) and,
in terms of the princomp() function, by myself. Allin answered that it's
easy to get them as the eigenvectors of the correlation matrix. That is
of course correct, but first it's a convenience issue, and second, if you
want to do simulations, say, it seems like an avoidable inefficiency to
compute the eigenvectors twice (first implicitly in princomp, and then
explicitly by hand; see the sketch below this list).
- Automatic printing of the workfile variables: when using 'pca' to save
some PCs to the workfile, gretl automatically prints a listing of all the
variables in the workfile. IMHO this clutters the script output for no
good reason (I currently have thousands of variables in there, and it
really makes for a long list in the output...). Could this be switched off?
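Coming back to the loadings point, here is the by-hand route I mean, so it's clear what would get computed twice (a minimal sketch; mcorr() and eigensym() used as I understand them from the function reference):
<hansl>
open denmark
list vars = IBO IDE
matrix C = mcorr({vars})       # correlation matrix of the data
matrix V                       # will receive the eigenvectors
matrix lam = eigensym(C, &V)   # eigenvalues (ascending, if I read the docs right)
print lam V                    # last column of V: loadings of the first PC
</hansl>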
Thanks,
sven
"join" news
by Allin Cottrell
Some news regarding gretl's "join" command (importation of data with
lots of options). These points are in the current documentation for
"join" in the User's Guide, but I thought it would be worth
explicitly drawing them to people's attention.
1) I've mentioned this before but only in passing: besides "CSV"
(delimited text) files you can now join from gretl-native gdt or
gdtb (binary) files.
2) More recently: you can now pull multiple series from the source
file in one command.
I'll expand on the second point. When we first wrote "join" we were
wrestling with a lot of complexity (key-matching, filtering,
aggregation) and we simplified matters by stipulating that only a
single series could be operated on at a time. Now that the join code
has stabilized, we've found it feasible to support "batch"
importation of series. This is subject to two limitations:
1) When importing multiple series, the --data option (which permits
renaming of a single series on import) is not available. You have to
accept the names of series as they appear in the source data file
(or as "fixed up" by gretl, if need be).
2) You only get one set of key-matching, filtering and aggregation
options; these options are applied uniformly to all series
specified in a single command. So if you want to import several
series but with different keys, filters or aggregation methods,
you still need separate instances of the "join" command.
How do you ask for multiple series? You just replace the second
(series-name) argument to "join" with either (a) several series
names, separated by spaces, or (b) the name of an array-of-strings
variable that holds the names of the series you want.
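For instance (a sketch only; the file name, series names and key below are made-up placeholders):
<hansl>
# (a) several series names, separated by spaces
join src.csv x1 x2 x3 --ikey=id --okey=id
# (b) an array of strings holding the names
strings wanted = defarray("x1", "x2", "x3")
join src.csv wanted --ikey=id --okey=id
</hansl>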
My motivation for setting this up is that this semester I've been
helping some students construct datasets from the PUMS (Public Use
Microdata Sample) made available by the US Census Bureau. These are
BIG files (e.g. the person datafile for California alone is >
300MB). So if you want data from all 50 US states plus DC, and
especially if you want household-level data too, we're talking quite
a major data processing exercise. I've found that with multiple
imports in "join" it doesn't take much longer to import 6 or 7
series at a time than it does to import a single series, meaning
that we get a very noticeable speed-up of the process.
Allin
silent failure of sprintf
by Sven Schreiber
Hi,
I stumbled again over something for which my own mistake was the
ultimate cause, but still I think gretl should have complained:
<hansl>
scalar r = 5
sprintf r "%d", 10   # fails silently: r already exists as a scalar
print r              # prints the number 5
print "@r"           # prints the literal @r
</hansl>
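For comparison, the same thing with a string target behaves as I'd expect (a minimal sketch, assuming the command form of sprintf simply creates or overwrites a named string):
<hansl>
string s = ""
sprintf s "%d", 10
print s      # prints 10
print "@s"   # substitution works: prints 10
</hansl>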
So I guess it's expected that gretl doesn't want to change the type of r
from scalar to string (BTW, I don't mind this, but is this static typing
actually an intended property of hansl?). But then shouldn't this at
least produce an error, or a warning?
thanks,
sven
syntax inconsistency for coeff vector
by Sven Schreiber
Hi,
this is nothing new, but I stumbled over it again, and now I can use it
as an excuse for why I never manage to remember the correct syntax:
When we access the coeff vector after estimation, we have '$coeff'; when
we give the variable index we use square brackets ($coeff[2]), but when
we give the name then it's round brackets ($coeff(myvar)).
So far so good. But when we formulate restrictions, the coeff vector is
now 'b', and apparently we always have to use square brackets (b[2] as
well as b[myvar]). So, from the user's point of view, two pretty obvious
questions arise (a short illustration follows the questions below):
1) Why have a separate symbol for the coeff vector in restrict blocks at
all? (Backward compatibility issues aside for now.)
2) The bracket situation seems arbitrary and confuses me every time. Could
it be changed? (In the medium term, I mean -- no need to rush.)
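For concreteness, a minimal illustration of the contrast (the regression itself is arbitrary, just something runnable on the shipped denmark data):
<hansl>
open denmark
ols IDE const IBO
scalar c2a = $coeff[2]     # by index: square brackets
scalar c2b = $coeff(IBO)   # by name: round brackets
restrict
  b[IBO] = 0               # in a restrict block: 'b' and square brackets only
end restrict
</hansl>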
Thanks,
sven
slight hiccup with Umlauts
by Sven Schreiber
Hi,
I don't even know if it's supposed to work, but if I use German special
characters in matrix row names, the printed output is slightly misaligned.
<hansl>
string ex = "ÄÜß hi"     # two row names: "ÄÜß" and "hi"
matrix in = {1, 3; 5, 6}
err = rownames(in, ex)   # attach the two names to the rows of 'in'
print in
</hansl>
Apart from that, gretl has suddenly started acting very sluggish again;
I'll investigate whether it has to do with the names of matrix columns or
rows.
thanks,
sven
is there an inarray function?
by Logan Kelly
Hello,
I need a function that tests whether a variable is in an array (sorry for the poor wording). What I mean is a function that compares each element of an array to a given variable and returns 0 if no element of the array is equal to it, and otherwise the position of the first match. So here are my questions:
1. Does such a function exist? (I haven't found one, but I thought I should ask.)
2. If not, coding it up is no problem, but I need a way to check the data type of a variable. Is there such a command?
3. Is there a way, other than using a bundle, to pass a variable of unknown data type to a function?
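In case it clarifies what I'm after, here is the kind of helper I have in mind for the array-of-strings case (a sketch; the name and test values are just illustrative):
<hansl>
# return the 1-based position of the first match, or 0 if there is none
function scalar inarray (const strings A, string s)
    loop i = 1..nelem(A)
        if A[i] == s
            return i
        endif
    endloop
    return 0
end function

strings S = defarray("foo", "bar", "baz")
scalar pos = inarray(S, "bar")   # gives 2
print pos
</hansl>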
Thanks,
Logan
SIGSEGV
by Marcin Błażejowski
Hi,
I get the following error under gdb (with 1.9.92 and current CVS):
------------
Program received signal SIGSEGV, Segmentation fault.
__strcmp_sse2_unaligned () at
../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:29
29 ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S: No such
file or directory.
------------
The problem occurs in one of my old packages, but I don't know where,
since the only string function I use is strlen().
Best Regards,
Marcin
--
Marcin Błażejowski
GG: 203127