Some news regarding gretl's "join" command (importation of data with
lots of options). These points are in the current documentation for
"join" in the User's Guide, but I thought it would be worth
explicitly drawing them to people's attention.

1) I've mentioned this before but only in passing: besides "CSV"
   (delimited text) files you can now join from gretl-native gdt or
   gdtb (binary) files.

2) More recent: you can now pull multiple series from the source
   file in one command.

I'll expand on the second point. When we first wrote "join" we were
wrestling with a lot of complexity (key-matching, filtering,
aggregation) and we simplified matters by stipulating that only a
single series could be operated on at a time. Now that the join code
has stabilized, we've found it feasible to support "batch"
importation of series. This is subject to two limitations:

1) When importing multiple series, the --data option (which permits
   renaming of a single series on import) is not available. You have to
   accept the names of series as they appear in the source data file
   (or as "fixed up" by gretl, if need be).

2) You only get one set of key-matching, filtering and aggregation
   options; these options are applied uniformly to all series
   specified in a single command. So if you want to import several
   series but with different keys, filters or aggregation methods,
   you still need separate instances of the "join" command.

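
To illustrate the two limitations, here is a minimal sketch. The file
names (hholds.csv, people.csv) and the series names (hid, xp, xp_mean,
n_people) are hypothetical, made up for illustration; only the "join"
options themselves are taken from the documentation. Renaming via
--data and a differing aggregation method each go in their own
single-series command:

```hansl
# Hypothetical files and series names, for illustration only.
# hholds.csv: one row per household, with household ID "hid"
# people.csv: one row per person, with "hid" and a series "xp"
open hholds.csv

# Renaming on import requires a single-series join:
# the source column "xp" is averaged per household and stored as "xp_mean"
join people.csv xp_mean --ikey=hid --data=xp --aggr=avg

# A different aggregation method needs its own join command:
# count the matching person records for each household
join people.csv n_people --ikey=hid --aggr=count
```
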
How do you ask for multiple series? You just replace the second
(series-name) argument to "join" with either (a) several series
names, separated by spaces, or (b) the name of an array-of-strings
variable that holds the names of the series you want.
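
The two forms might look as follows. This is just a sketch: the file
names and the series names (hid, age, income, educ) are hypothetical,
and the key and aggregation options are assumed only to make the
example complete. In practice you would use one form or the other,
not both.

```hansl
# Hypothetical files and series names, for illustration only.
open hholds.csv

# (a) several series names, separated by spaces; the same key and
# aggregation options apply to all three series
join people.csv age income educ --ikey=hid --aggr=avg

# (b) alternatively, put the names in an array of strings
strings S = defarray("age", "income", "educ")
join people.csv S --ikey=hid --aggr=avg
```

In both cases the series arrive under the names they have in the
source file, since --data is not available for multiple imports.
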
My motivation for setting this up is that this semester I've been
helping some students construct datasets from the PUMS (Public Use
Microdata Sample) made available by the US Census Bureau. These are
BIG files (e.g. the person datafile for California alone is over
300 MB). So if you want data from all 50 US states plus DC, and
especially if you want household-level data too, we're talking about
quite a major data-processing exercise. I've found that with multiple
imports in "join" it doesn't take much longer to import 6 or 7
series at a time than it does to import a single series, so
batching the imports gives a very noticeable speed-up of the process.

Allin