Re: [Gretl-devel] [Gretl-users] new command, "join"

Monday, 13 August 2012

On Mon, 13 Aug 2012, Sven Schreiber wrote:

...
 [reply probably more suitable for the devel list, thus switching]

 On 08/09/2012 10:43 PM, Allin Cottrell wrote:
>
> Last month in Ancona, Jack Lucchetti, Claudia Pigini and I spent an
> intensive week cooking up a new command for gretl. It's called
> "join", its job is to pull together data from two or more sources
> with the help of keys and/or filters, and -- casting modesty aside
> -- we think it's a killer! Stata has a deservedly good reputation
> for this sort of thing but we think that in some respects "join" may
> put gretl ahead.
>
> It's in CVS and snapshots and we invite you to try it out and give
> us your comments. Full documentation with examples of use is
> available at:
>
> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join.pdf (US letter)
> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join-a4.pdf (A4)

 Yes this looks like a "great leap forward"! Allow me some more or less
 ad-hoc reactions while browsing the documentation:

 * Terminology: in relational database theory, there are "inner joins"
 and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have
 a different meaning. Maybe this can be separated somehow. ("Incumbent"
 and "incoming" perhaps? Or simply "first" and "second"?) This would
 probably also affect the naming of --ikey and --okey.

 * To push this argument a little further, since gretl's join seems to
 work on single series only (which is fine!), the whole thing seems
 rather different from a database/SQL join, and the name could therefore
 be misleading. Maybe call it "importseries" or somesuch instead?

Maybe. I guess the force of this comment depends on how wedded
potential users of this command are to database/SQL
terminology. I'll await Jack's reaction when he gets back
online.
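
To make the distinction concrete, here is a minimal sketch in Python
(not gretl code, and not how gretl implements "join"): a single series
is pulled from a second data source by matching on a key, and
observations without a match are left missing. The function name, the
field names and the data are all hypothetical, and taking the first
match when a key is repeated is an assumption made for the sketch.

```python
def import_series(inner_keys, outer_rows, key_field, payload_field):
    """Return one payload value per inner observation, matched on a key.

    Inner observations with no match get None (a missing value).
    If a key occurs more than once in the outer data, the first
    occurrence is used (an assumption for this sketch).
    """
    lookup = {}
    for row in outer_rows:
        # setdefault keeps the first value seen for each key
        lookup.setdefault(row[key_field], row[payload_field])
    return [lookup.get(k) for k in inner_keys]


# Hypothetical data: the "inner" dataset has four observations,
# the "outer" source has three rows, one of them for an unknown id.
inner_ids = [1, 2, 3, 4]
outer = [
    {"id": 2, "income": 30.5},
    {"id": 4, "income": 41.0},
    {"id": 5, "income": 12.0},
]
income = import_series(inner_ids, outer, "id", "income")
print(income)  # [None, 30.5, None, 41.0]
```

Unlike an SQL join, the result here is not a new table: the length and
order of the existing dataset are unchanged and only one new series is
added.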

...
 * Datafile format: you note the connection to large datasets. Yet so far
 only text format files are supported. At the risk of stating the obvious
 ("breaking into open doors" as we say in German), for large datasets
 some binary format is probably wanted -- or do you include gzipped text
 files when saying text files?

We could read gzipped CSV without too much difficulty, though
we don't at present. We could also apply the "join" apparatus 
to native gretl binary databases. However, our focus so far 
has been on processing big "third party" data sources, and 
these mostly seem to be in delimited text format.
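
For illustration only, a rough Python sketch of what reading gzipped
delimited text could involve. This is not gretl code, and gretl does not
do this at present; the helper name and the ".gz" suffix test are
assumptions made for the sketch.

```python
import csv
import gzip


def read_delimited(path, delimiter=","):
    """Read a delimited text file, transparently handling gzip.

    Returns a list of dicts keyed by the header row. Whether a file is
    compressed is decided by its ".gz" suffix (an assumption).
    """
    opener = gzip.open if path.endswith(".gz") else open
    # "rt" gives text mode for both the plain and the gzipped case
    with opener(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=delimiter))
```

The point is only that the compressed and uncompressed cases can share
one reading path, since decompression happens before any parsing.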

...
 * You don't seem to mention the decimal separator issue, what is allowed
 in this context?

Yes, that should be mentioned in the doc. In fact, the
handling of the decimal separator is exactly the same as for 
regular CSV reading via "open" (i.e. the decimal comma is 
supported).
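
As a small illustration of what supporting the decimal comma means for a
single field, here is a sketch in Python. It is not gretl's code; the
function name is hypothetical, and how the separator in use gets
determined is left out and simply passed in as an argument.

```python
def parse_number(text, decimal_sep="."):
    """Convert one text field to a float.

    With decimal_sep="," a value such as "3,14" is read as 3.14.
    The separator is supplied by the caller; detecting it is out of
    scope for this sketch.
    """
    s = text.strip()
    if decimal_sep == ",":
        s = s.replace(",", ".")
    return float(s)


print(parse_number("3.14"))       # 3.14
print(parse_number("3,14", ","))  # 3.14
```

Note that a file using the decimal comma cannot also use the comma as
its field delimiter, so such files typically use a semicolon or a tab
between fields.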

...
 * I find the '--data' option naming unintuitive or too
 generic; why not call it '--name' if it's about renaming?

Jack originally suggested that this option should be called
"payload". Maybe that's better than "data".

...
 * string filtering: Maybe there is a case to provide some option for
 surrounding whitespace handling? What I mean is, in a CSV file the name
 of a variable "nkids" could also appear as "  nkids".

I believe that's handled already, in the sense that we always
strip leading and trailing white space (on the assumption that 
it's just junk).
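
In Python terms, the behaviour described amounts to something like the
following sketch (an illustration with hypothetical names, not gretl's
code): leading and trailing white space is dropped from a field before
it is compared with the string being filtered on.

```python
def field_matches(field, target):
    """Compare a raw CSV field with a target string, ignoring
    leading and trailing white space in the field."""
    return field.strip() == target


print(field_matches("  nkids", "nkids"))  # True
print(field_matches("nkids ", "nkids"))   # True
print(field_matches("n kids", "nkids"))   # False
```

Only surrounding white space is treated as junk; white space inside a
value is kept, so "n kids" remains distinct from "nkids".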

...
 And a general inquiry about gretl version numbers: Is this the final
 feature step towards 2.0 and the 1.9.x series in this sense is still the
 beta for 2.0? If not, adding such a feature would seem to warrant a
 version 1.10.0 instead of just increasing the 3rd-level digit.

I think probably the former (1.9.x the beta for 2.0).

Allin
