[reply probably more suitable for the devel list, thus switching]
On 08/09/2012 10:43 PM, Allin Cottrell wrote:
Last month in Ancona, Jack Lucchetti, Claudia Pigini and I spent an
intensive week cooking up a new command for gretl. It's called
"join", its job is to pull together data from two or more sources
with the help of keys and/or filters, and -- casting modesty aside
-- we think it's a killer! Stata has a deservedly good reputation
for this sort of thing but we think that in some respects "join" may
put gretl ahead.
It's in CVS and snapshots and we invite you to try it out and give
us your comments. Full documentation with examples of use is
available at:
http://ricardo.ecn.wfu.edu/~cottrell/tmp/join.pdf (US letter)
http://ricardo.ecn.wfu.edu/~cottrell/tmp/join-a4.pdf (A4)
Yes, this looks like a "great leap forward"! Allow me some more or less
ad hoc reactions from browsing the documentation:
* Terminology: in relational database theory, there are "inner joins"
and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have
a different meaning. Maybe this can be separated somehow. ("Incumbent"
and "incoming" perhaps? Or simply "first" and "second"?) This would
probably also affect the naming of --ikey and --okey.
* To push this argument a little further, since gretl's join seems to
work on single series only (which is fine!), the whole thing seems
rather different from a database/SQL join, and the name could therefore
be misleading. Maybe call it "importseries" or somesuch instead?
* Datafile format: you note the connection to large datasets. Yet so far
only text format files are supported. At the risk of stating the obvious
("breaking into open doors" as we say in German), for large datasets
some binary format is probably wanted -- or do you include gzipped text
files when saying text files?
* You don't seem to mention the decimal separator issue, what is allowed
in this context?
* I find the '--data' option naming unintuitive or too generic; why not
call it '--name' if it's about renaming?
* String filtering: maybe there is a case for providing some option to
handle surrounding whitespace? What I mean is, in a CSV file the name
of a variable "nkids" could also appear as " nkids".
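
To make the terminology point above concrete, here is a minimal sketch of what "inner" and "outer" mean in the relational database sense. It is plain Python, not gretl code, and the two toy tables and the function names are made up purely for illustration:

```python
# Hypothetical toy tables keyed by a person id; all values are made up.
people = {1: "anna", 2: "ben", 3: "carl"}
incomes = {2: 30000, 3: 45000, 4: 52000}


def inner_join(left, right):
    """Relational inner join: keep only keys present in both tables."""
    return {k: (left[k], right[k]) for k in left if k in right}


def left_outer_join(left, right):
    """Relational left outer join: keep every key of the left table,
    filling in None where the right table has no match."""
    return {k: (left[k], right.get(k)) for k in left}


print(inner_join(people, incomes))
print(left_outer_join(people, incomes))
```

In this sense "inner" and "outer" describe which rows survive the match, not which of the two datasets a key belongs to, which is why the shared vocabulary could confuse readers coming from SQL.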
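
The decimal separator and whitespace questions above can be illustrated with a small sketch as well. Again this is plain Python and says nothing about how "join" actually behaves; the CSV text, the semicolon delimiter and the helper name read_clean are assumptions chosen for the example:

```python
import csv
import io

# Made-up CSV text: header names padded with spaces, semicolon-delimited,
# and a decimal comma in the numeric column.
raw = "hid; nkids; income\n1; 2; 1234,5\n2; 0; 987,25\n"


def read_clean(text, delimiter=";", decimal=","):
    """Read delimited text, strip whitespace around names and values,
    and convert numbers written with the given decimal separator."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header = [name.strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        values = [float(cell.strip().replace(decimal, ".")) for cell in row]
        records.append(dict(zip(header, values)))
    return header, records


header, records = read_clean(raw)
print(header)
print(records)
```

Without the strip() step the second column would be named " nkids", so a filter or key referring to "nkids" would silently fail to match.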
And a general inquiry about gretl version numbers: is this the final
feature step towards 2.0, with the 1.9.x series in this sense still
serving as the beta for 2.0? If not, adding such a feature would seem to
warrant a version 1.10.0 instead of just increasing the 3rd-level digit.
There's also a follow-up on the way, namely an account of how "join"
can be used to handle "real-time" data (time series data indexed not
only by the date to which the data refer, but also by the date on
which they were produced/revised). Expect something on this in the
next few weeks.
Very cool -- can't wait!
cheers,
sven