On Mon, 13 Aug 2012, Allin Cottrell wrote:
On Mon, 13 Aug 2012, Sven Schreiber wrote:
> [reply probably more suitable for the devel list, thus switching]
>
> On 08/09/2012 10:43 PM, Allin Cottrell wrote:
>>
>> Last month in Ancona, Jack Lucchetti, Claudia Pigini and I spent an
>> intensive week cooking up a new command for gretl. It's called
>> "join", its job is to pull together data from two or more sources
>> with the help of keys and/or filters, and -- casting modesty aside
>> -- we think it's a killer! Stata has a deservedly good reputation
>> for this sort of thing but we think that in some respects "join" may
>> put gretl ahead.
>>
>> It's in CVS and snapshots and we invite you to try it out and give
>> us your comments. Full documentation with examples of use is
>> available at:
>>
>> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join.pdf (US letter)
>> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join-a4.pdf (A4)
>
> Yes this looks like a "great leap forward"! Allow me some more or less
> ad-hoc reactions while browsing the documentation:
>
> * Terminology: in relational database theory, there are "inner joins"
> and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have
> a different meaning. Maybe this can be separated somehow. ("Incumbent"
> and "incoming" perhaps? Or simply "first" and "second"?) This would
> probably also affect the naming of --ikey and --okey.
>
> * To push this argument a little further, since gretl's join seems to
> work on single series only (which is fine!), the whole thing seems
> rather different from a database/SQL join, and the name could therefore
> be misleading. Maybe call it "importseries" or somesuch instead?
Maybe. I guess the force of this comment depends on how wedded
potential users of this command are to database/SQL
terminology. I'll await Jack's reaction when he gets back
online.
The idea of using the word "join" was primarily inspired by the "join"
unix command, rather than the SQL JOIN statement. I admit SQL users
may find the terms "inner" and "outer" confusing at first (but then,
the same goes for "left" and "right"); but how many gretl users are so
adept at SQL syntax as to find the terminological clash problematic?
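To make the distinction concrete, here is a minimal sketch in Python (not
gretl syntax, and not how gretl implements the command) of a key-based
import of a single series. The function name import_series and the sample
data are hypothetical; the parameter names ikey, okey and data merely echo
the option names discussed in this thread. Taking the first outer row for
a repeated key is an assumption made only for this illustration.

```python
def import_series(inner_rows, outer_rows, ikey, okey, data):
    """Return one new column for inner_rows, matched by key from outer_rows.

    Both arguments are lists of dicts. The inner dataset keeps its rows
    and their order; an inner row with no match in the outer data gets None.
    """
    lookup = {}
    for row in outer_rows:
        # Assumption for this sketch: keep the first outer row per key value.
        lookup.setdefault(row[okey], row[data])
    return [lookup.get(row[ikey]) for row in inner_rows]


inner = [{"id": 1}, {"id": 2}, {"id": 3}]
outer = [
    {"pid": 3, "wage": 21.5},
    {"pid": 1, "wage": 17.0},
    {"pid": 9, "wage": 30.0},
]

wage = import_series(inner, outer, ikey="id", okey="pid", data="wage")
print(wage)  # [17.0, None, 21.5]
```

Unlike an SQL JOIN, nothing here produces a new table: the inner dataset
keeps exactly its own observations, and the outer row with key 9 is simply
ignored.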
> * Datafile format: you note the connection to large datasets. Yet so far
> only text format files are supported. At the risk of stating the obvious
> ("breaking into open doors" as we say in German), for large datasets
> some binary format is probably wanted -- or do you include gzipped text
> files when saying text files?
We could read gzipped CSV without too much difficulty, though
we don't at present. We could also apply the "join" apparatus
to native gretl binary databases. However, our focus so far
has been on processing big "third party" data sources, and
these mostly seem to be in delimited text format.
Or perhaps, fixed-format, though I haven't seen one in years.
> * You don't seem to mention the decimal separator issue, what is allowed
> in this context?
Yes, that should be mentioned in the doc. In fact, the
handling of the decimal separator is exactly the same as for
regular CSV reading via "open" (i.e. the decimal comma is
supported).
I'll be a good boy and I'll pretend I never read this, ok? ;-)
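For anyone unsure what the decimal comma implies for a delimited text
file, here is a rough sketch in Python (again, not gretl code). It assumes
the fields are separated by semicolons, since the comma is taken by the
decimal mark; the function name parse_decimal_comma and the sample text
are made up for the illustration.

```python
import csv
import io


def parse_decimal_comma(text, delimiter=";"):
    """Read delimited text whose numeric fields use a decimal comma."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)  # first line holds the variable names
    rows = []
    for fields in reader:
        # Swap the decimal comma for a point before converting to float.
        rows.append([float(f.replace(",", ".")) for f in fields])
    return header, rows


sample = "x;y\n1,5;2,25\n3,0;4,75\n"
header, rows = parse_decimal_comma(sample)
print(header)  # ['x', 'y']
print(rows)    # [[1.5, 2.25], [3.0, 4.75]]
```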
> * I find the '--data' option naming unintuitive or too
> generic; why not call it '--name' if it's about renaming?
Jack originally suggested that this option should be called
"payload". Maybe that's better than "data".
Well, IMO "name" is just as generic as "data". I don't mind
either.
I originally found "payload" mildly amusing. Anybody else out there with
a strong preference?
--------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
--------------------------------------------------