On Mon, 13 Aug 2012, Sven Schreiber wrote:
[reply probably more suitable for the devel list, thus switching]
On 08/09/2012 10:43 PM, Allin Cottrell wrote:
>
> Last month in Ancona, Jack Lucchetti, Claudia Pigini and I spent an
> intensive week cooking up a new command for gretl. It's called
> "join", its job is to pull together data from two or more sources
> with the help of keys and/or filters, and -- casting modesty aside
> -- we think it's a killer! Stata has a deservedly good reputation
> for this sort of thing but we think that in some respects "join" may
> put gretl ahead.
>
> It's in CVS and snapshots and we invite you to try it out and give
> us your comments. Full documentation with examples of use is
> available at:
>
> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join.pdf (US letter)
> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join-a4.pdf (A4)
Yes, this looks like a "great leap forward"! Allow me some more or less
ad-hoc reactions while browsing the documentation:
* Terminology: in relational database theory, there are "inner joins"
and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have
a different meaning. Maybe this can be separated somehow. ("Incumbent"
and "incoming" perhaps? Or simply "first" and "second"?) This would
probably also affect the naming of --ikey and --okey.
* To push this argument a little further, since gretl's join seems to
work on single series only (which is fine!), the whole thing seems
rather different from a database/SQL join, and the name could therefore
be misleading. Maybe call it "importseries" or somesuch instead?
Maybe. I guess the force of this comment depends on how wedded
potential users of this command are to database/SQL
terminology. I'll await Jack's reaction when he gets back
online.
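
To make the terminology point concrete, here is a minimal Python sketch contrasting the relational inner join and left outer join with a single-series, key-matched import. It is only an illustration of the concepts and not how gretl implements "join"; the toy tables and the function names (inner_join, left_outer_join, import_series) are made up, and the keys on the right-hand side are assumed to be unique.

```python
# Toy tables: each row is a dict. All names and values are made up.
persons = [
    {"hid": 1, "name": "Ann"},
    {"hid": 2, "name": "Bob"},
    {"hid": 3, "name": "Cy"},
]
households = [
    {"hid": 1, "income": 50},
    {"hid": 2, "income": 70},
    {"hid": 4, "income": 90},
]

def inner_join(left, right, key):
    """Relational inner join: keep only rows whose key occurs on both sides."""
    index = {r[key]: r for r in right}  # assumes unique keys in `right`
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def left_outer_join(left, right, key, fill=None):
    """Relational left outer join: keep every left row, fill gaps with `fill`."""
    index = {r[key]: r for r in right}
    extra = [c for c in right[0] if c != key]
    out = []
    for l in left:
        match = index.get(l[key])
        row = dict(l)
        for c in extra:
            row[c] = match[c] if match else fill
        out.append(row)
    return out

def import_series(left, right, key, column, fill=None):
    """Pull ONE column from `right`, aligned to the rows of `left` by `key`."""
    index = {r[key]: r[column] for r in right}
    return [index.get(l[key], fill) for l in left]

inner = inner_join(persons, households, "hid")
outer = left_outer_join(persons, households, "hid")
income = import_series(persons, households, "hid", "income")
```

In this sketch the single-series import behaves like a left outer join projected onto one column: the existing dataset keeps all its rows, and unmatched rows get a missing value.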
* Datafile format: you note the connection to large datasets. Yet so far
only text format files are supported. At the risk of stating the obvious
("breaking into open doors" as we say in German), for large datasets
some binary format is probably wanted -- or do you include gzipped text
files when saying text files?
We could read gzipped CSV without too much difficulty, though
we don't at present. We could also apply the "join" apparatus
to native gretl binary databases. However, our focus so far
has been on processing big "third party" data sources, and
these mostly seem to be in delimited text format.
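
As a rough illustration of why gzipped delimited text is not much harder to read than plain text, here is a small Python sketch using the standard gzip and csv modules. It says nothing about how gretl would do it; the helper name read_delimited and the sample data are invented, and the file is built in memory only to keep the example self-contained.

```python
import csv
import gzip
import io

def read_delimited(source, delimiter=","):
    """Yield rows (as dicts) from an open text stream of delimited data."""
    yield from csv.DictReader(source, delimiter=delimiter)

# Build a tiny gzipped CSV in memory (stand-in for a file on disk).
raw = "hid,income\n1,50\n2,70\n"
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8", newline="") as f:
    f.write(raw)
buf.seek(0)

# Reading: the same parser works once the stream is opened via gzip.
with gzip.open(buf, "rt", encoding="utf-8", newline="") as f:
    rows = list(read_delimited(f))
```

The parsing code is unchanged; only the way the stream is opened differs.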
* You don't seem to mention the decimal separator issue, what is allowed
in this context?
Yes, that should be mentioned in the doc. In fact, the
handling of the decimal separator is exactly the same as for
regular CSV reading via "open" (i.e. the decimal comma is
supported).
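
For readers unfamiliar with the issue, a minimal Python sketch of reading decimal-comma data follows. It is an assumption-laden illustration, not gretl's logic: the helper parse_number is hypothetical, and the semicolon field delimiter is simply assumed because it commonly accompanies the decimal comma.

```python
import csv
import io

def parse_number(text, decimal=","):
    """Convert a numeric field to float, treating `decimal` as the decimal mark."""
    return float(text.strip().replace(decimal, "."))

# Semicolon-delimited sample with decimal commas (made-up values).
raw = "hid;income\n1;50,5\n2;70,25\n"
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
incomes = [parse_number(row["income"]) for row in reader]
```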
* I find the '--data' option naming unintuitive or too
generic; why not call it '--name' if it's about renaming?
Jack originally suggested that this option should be called
"payload". Maybe that's better than "data".
* String filtering: Maybe there is a case to provide some option for
surrounding whitespace handling? What I mean is, in a CSV file the name
of a variable "nkids" could also appear as " nkids".
I believe that's handled already, in the sense that we always
strip leading and trailing white space (on the assumption that
it's just junk).
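
The effect of that stripping can be sketched in a few lines of Python. This only mimics the behavior described above under the assumption that every field is trimmed at both ends; the header values and the helper name clean are made up.

```python
def clean(field):
    """Drop leading and trailing white space from a CSV field."""
    return field.strip()

# Header as it might appear in a sloppy CSV file (made-up example).
header = [" nkids", "income ", "hid"]
names = [clean(h) for h in header]

# After cleaning, a lookup by the intended name succeeds.
position = names.index("nkids")
```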
And a general inquiry about gretl version numbers: Is this the final
feature step towards 2.0 and the 1.9.x series in this sense is still the
beta for 2.0? If not, adding such a feature would seem to warrant a
version 1.10.0 instead of just increasing the 3rd-level digit.
I think probably the former (1.9.x the beta for 2.0).
Allin