Re: [Gretl-devel] [Gretl-users] new command, "join"

Sven Schreiber

Monday, 13 August 2012

10:32 a.m.

[reply probably more suitable for the devel list, thus switching]

On 08/09/2012 10:43 PM, Allin Cottrell wrote:

...

Last month in Ancona, Jack Lucchetti, Claudia Pigini and I spent an intensive week cooking up a new command for gretl. It's called "join", its job is to pull together data from two or more sources with the help of keys and/or filters, and -- casting modesty aside -- we think it's a killer! Stata has a deservedly good reputation for this sort of thing but we think that in some respects "join" may put gretl ahead.

It's in CVS and snapshots and we invite you to try it out and give us your comments. Full documentation with examples of use is available at:

http://ricardo.ecn.wfu.edu/~cottrell/tmp/join.pdf (US letter)
http://ricardo.ecn.wfu.edu/~cottrell/tmp/join-a4.pdf (A4)

Yes this looks like a "great leap forward"! Allow me some more or less ad-hoc reactions while browsing the documentation:

* Terminology: in relational database theory, there are "inner joins" and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have a different meaning. Maybe this can be separated somehow. ("Incumbent" and "incoming" perhaps? Or simply "first" and "second"?) This would probably also affect the naming of --ikey and --okey.

* To push this argument a little further, since gretl's join seems to work on single series only (which is fine!), the whole thing seems rather different from a database/SQL join, and the name could therefore be misleading. Maybe call it "importseries" or somesuch instead?

* Datafile format: you note the connection to large datasets. Yet so far only text format files are supported. At the risk of stating the obvious ("breaking into open doors" as we say in German), for large datasets some binary format is probably wanted -- or do you include gzipped text files when saying text files?

* You don't seem to mention the decimal separator issue, what is allowed in this context?

* I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?

* string filtering: Maybe there is a case to provide some option for surrounding whitespace handling? What I mean is, in a CSV file the name of a variable "nkids" could also appear as " nkids".

And a general inquiry about gretl version numbers: Is this the final feature step towards 2.0 and the 1.9.x series in this sense is still the beta for 2.0? If not, adding such a feature would seem to warrant a version 1.10.0 instead of just increasing the 3rd-level digit.

...

There's also a follow-up on the way, namely an account of how "join" can be used to handle "real-time" data (time series data indexed not only by the date to which the data refer, but also by the date on which they were produced/revised). Expect something on this in the next few weeks.

Very cool -- can't wait!

cheers,
sven

Allin Cottrell

Monday, 13 August

4:28 p.m.

New subject: [Gretl-users] new command, "join"

On Mon, 13 Aug 2012, Sven Schreiber wrote:

...

[reply probably more suitable for the devel list, thus switching]

On 08/09/2012 10:43 PM, Allin Cottrell wrote:

> Last month in Ancona, Jack Lucchetti, Claudia Pigini and I spent an intensive week cooking up a new command for gretl. It's called "join", its job is to pull together data from two or more sources with the help of keys and/or filters, and -- casting modesty aside -- we think it's a killer! Stata has a deservedly good reputation for this sort of thing but we think that in some respects "join" may put gretl ahead.
>
> It's in CVS and snapshots and we invite you to try it out and give us your comments. Full documentation with examples of use is available at:
>
> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join.pdf (US letter)
> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join-a4.pdf (A4)

Yes this looks like a "great leap forward"! Allow me some more or less ad-hoc reactions while browsing the documentation:

* Terminology: in relational database theory, there are "inner joins" and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have a different meaning. Maybe this can be separated somehow. ("Incumbent" and "incoming" perhaps? Or simply "first" and "second"?) This would probably also affect the naming of --ikey and --okey.

* To push this argument a little further, since gretl's join seems to work on single series only (which is fine!), the whole thing seems rather different from a database/SQL join, and the name could therefore be misleading. Maybe call it "importseries" or somesuch instead?

Maybe. I guess the force of this comment depends on how wedded are potential users of this command to database/SQL terminology. I'll await Jack's reaction when he gets back online.

...

* Datafile format: you note the connection to large datasets. Yet so far only text format files are supported. At the risk of stating the obvious ("breaking into open doors" as we say in German), for large datasets some binary format is probably wanted -- or do you include gzipped text files when saying text files?

We could read gzipped CSV without too much difficulty, though we don't at present. We could also apply the "join" apparatus to native gretl binary databases. However, our focus so far has been on processing big "third party" data sources, and these mostly seem to be in delimited text format.

...

* You don't seem to mention the decimal separator issue, what is allowed in this context?

Yes, that should be mentioned in the doc. In fact, the handling of the decimal separator is exactly the same as for regular CSV reading via "open" (i.e. the decimal comma is supported).
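
As a rough illustration of what decimal-comma support in a CSV reader can amount to, here is a minimal Python sketch. It is not gretl's implementation; `parse_number` and its `decimal` parameter are hypothetical names chosen for this example, and the sketch assumes fields carry no thousands separators.

```python
def parse_number(text, decimal=","):
    """Convert a CSV field to float, accepting a configurable decimal separator."""
    cleaned = text.strip()
    if decimal != ".":
        # Assumption: the field contains no thousands separators.
        cleaned = cleaned.replace(decimal, ".")
    return float(cleaned)
```

A file that uses the decimal comma has to use some other character, such as ";", as its field delimiter, otherwise the two roles of the comma cannot be told apart.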

...

* I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?

Jack originally suggested that this option should be called "payload". Maybe that's better than "data".

...

* string filtering: Maybe there is a case to provide some option for surrounding whitespace handling? What I mean is, in a CSV file the name of a variable "nkids" could also appear as " nkids".

I believe that's handled already, in the sense that we always strip leading and trailing white space (on the assumption that it's just junk).
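
The stripping described here can be pictured with a short Python sketch. This is only an illustration of the idea, not gretl's code; `find_column` is a hypothetical helper name.

```python
def find_column(header, name):
    """Return the index of `name` in a CSV header row, ignoring surrounding whitespace."""
    stripped = [field.strip() for field in header]
    return stripped.index(name)
```

With a header row such as `["id", " nkids", "income "]`, looking up "nkids" finds the second column even though the file spells it with a leading blank.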

...

And a general inquiry about gretl version numbers: Is this the final feature step towards 2.0 and the 1.9.x series in this sense is still the beta for 2.0? If not, adding such a feature would seem to warrant a version 1.10.0 instead of just increasing the 3rd-level digit.

I think probably the former (1.9.x the beta for 2.0).

Allin

Riccardo (Jack) Lucchetti

Monday, 20 August

4:46 p.m.

On Mon, 13 Aug 2012, Allin Cottrell wrote:

...

On Mon, 13 Aug 2012, Sven Schreiber wrote:

> [reply probably more suitable for the devel list, thus switching]
>
> On 08/09/2012 10:43 PM, Allin Cottrell wrote:
>
>> Last month in Ancona, Jack Lucchetti, Claudia Pigini and I spent an intensive week cooking up a new command for gretl. It's called "join", its job is to pull together data from two or more sources with the help of keys and/or filters, and -- casting modesty aside -- we think it's a killer! Stata has a deservedly good reputation for this sort of thing but we think that in some respects "join" may put gretl ahead.
>>
>> It's in CVS and snapshots and we invite you to try it out and give us your comments. Full documentation with examples of use is available at:
>>
>> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join.pdf (US letter)
>> http://ricardo.ecn.wfu.edu/~cottrell/tmp/join-a4.pdf (A4)
>
> Yes this looks like a "great leap forward"! Allow me some more or less ad-hoc reactions while browsing the documentation:
>
> * Terminology: in relational database theory, there are "inner joins" and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have a different meaning. Maybe this can be separated somehow. ("Incumbent" and "incoming" perhaps? Or simply "first" and "second"?) This would probably also affect the naming of --ikey and --okey.
>
> * To push this argument a little further, since gretl's join seems to work on single series only (which is fine!), the whole thing seems rather different from a database/SQL join, and the name could therefore be misleading. Maybe call it "importseries" or somesuch instead?

Maybe. I guess the force of this comment depends on how wedded are potential users of this command to database/SQL terminology. I'll await Jack's reaction when he gets back online.

The idea of using the word "join" was primarily inspired by the "join" Unix command, rather than the SQL JOIN statement. I admit SQL users may find the terms "inner" and "outer" confusing at first (but then, the same goes for "left" and "right"); but how many gretl users are so adept at SQL syntax as to find the terminological short-circuit problematic?

...

> * Datafile format: you note the connection to large datasets. Yet so far only text format files are supported. At the risk of stating the obvious ("breaking into open doors" as we say in German), for large datasets some binary format is probably wanted -- or do you include gzipped text files when saying text files?

We could read gzipped CSV without too much difficulty, though we don't at present. We could also apply the "join" apparatus to native gretl binary databases. However, our focus so far has been on processing big "third party" data sources, and these mostly seem to be in delimited text format.

Or perhaps, fixed-format, though I haven't seen one in years.

...

> * You don't seem to mention the decimal separator issue, what is allowed in this context?

Yes, that should be mentioned in the doc. In fact, the handling of the decimal separator is exactly the same as for regular CSV reading via "open" (i.e. the decimal comma is supported).

I'll be a good boy and I'll pretend I never read this, ok? ;-)

...

> * I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?

Jack originally suggested that this option should be called "payload". Maybe that's better than "data".

Well, IMO "name" is just as generic as "data". I don't mind either. I originally found "payload" mildly amusing. Anybody else out there with a strong preference?

--------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
--------------------------------------------------

Sven Schreiber

5:04 p.m.

On 08/20/2012 04:46 PM, Riccardo (Jack) Lucchetti wrote:

...

On Mon, 13 Aug 2012, Allin Cottrell wrote:

> On Mon, 13 Aug 2012, Sven Schreiber wrote:

...

>> * Terminology: in relational database theory, there are "inner joins" and "outer joins" AFAIR. In your docs, "inner" and "outer" seem to have a different meaning. Maybe this can be separated somehow. ("Incumbent" and "incoming" perhaps? Or simply "first" and "second"?) This would probably also affect the naming of --ikey and --okey.
>>
>> * To push this argument a little further, since gretl's join seems to work on single series only (which is fine!), the whole thing seems rather different from a database/SQL join, and the name could therefore be misleading. Maybe call it "importseries" or somesuch instead?
>
> Maybe. I guess the force of this comment depends on how wedded are potential users of this command to database/SQL terminology. I'll await Jack's reaction when he gets back online.

The idea of using the word "join" was primarily inspired by the "join" Unix command, rather than the SQL JOIN statement. I admit SQL users may find the terms "inner" and "outer" confusing at first (but then, the same goes for "left" and "right"); but how many gretl users are so adept at SQL syntax as to find the terminological short-circuit problematic?

Ok, maybe I misunderstood the intended (non-) relation to the functionality of relational databases.

...

>> * Datafile format: you note the connection to large datasets. Yet so far only text format files are supported. At the risk of stating the obvious ("breaking into open doors" as we say in German), for large datasets some binary format is probably wanted -- or do you include gzipped text files when saying text files?
>
> We could read gzipped CSV without too much difficulty, though we don't at present. We could also apply the "join" apparatus to native gretl binary databases. However, our focus so far has been on processing big "third party" data sources, and these mostly seem to be in delimited text format.

Or perhaps, fixed-format, though I haven't seen one in years.

I'm not sure, but isn't Stata's .dta format a binary (non-text) file format? If so, then I guess many big micro datasets are distributed as binary. (I'm thinking of the German SOEP for example.)

...

>> * You don't seem to mention the decimal separator issue, what is allowed in this context?
>
> Yes, that should be mentioned in the doc. In fact, the handling of the decimal separator is exactly the same as for regular CSV reading via "open" (i.e. the decimal comma is supported).

I'll be a good boy and I'll pretend I never read this, ok? ;-)

Well the consensus rule for gretl was to enforce decimal points only in hansl scripts, wasn't it?

...

>> * I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?
>
> Jack originally suggested that this option should be called "payload". Maybe that's better than "data".

Well, IMO "name" is just as generic as "data". I don't mind either. I originally found "payload" mildly amusing. Anybody else out there with a strong preference?

Well I'm not anybody else in this discussion's context, but I don't get the pun with payload, I must confess.

cheers,
sven

Riccardo (Jack) Lucchetti

Tuesday, 21 August

1:14 a.m.

On Mon, 20 Aug 2012, Sven Schreiber wrote:

...

>>> * Datafile format: you note the connection to large datasets. Yet so far only text format files are supported. At the risk of stating the obvious ("breaking into open doors" as we say in German), for large datasets some binary format is probably wanted -- or do you include gzipped text files when saying text files?
>>
>> We could read gzipped CSV without too much difficulty, though we don't at present. We could also apply the "join" apparatus to native gretl binary databases. However, our focus so far has been on processing big "third party" data sources, and these mostly seem to be in delimited text format.
>
> Or perhaps, fixed-format, though I haven't seen one in years.

I'm not sure, but isn't Stata's .dta format a binary (non-text) file format? If so, then I guess many big micro datasets are distributed as binary. (I'm thinking of the German SOEP for example.)

True. I'm currently working on the SOEP database myself and, if I had to start from scratch now that we have "join" in gretl, I think I'd use Stata just to turn the whole thing into csv. Instead we had to use this diabolical Stata add-on called "PanelWhiz". Brrrr.

...

>>> * I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?
>>
>> Jack originally suggested that this option should be called "payload". Maybe that's better than "data".
>
> Well, IMO "name" is just as generic as "data". I don't mind either. I originally found "payload" mildly amusing. Anybody else out there with a strong preference?

Well I'm not anybody else in this discussion's context, but I don't get the pun with payload, I must confess.

There's no pun. I just enjoyed the idea of likening the join command to the space shuttle or something like that, skillfully carrying something precious across. Besides, the "payload" is a well-established term in the computer virus jargon, too.

Sven Schreiber

10:52 a.m.

On 21.08.2012 01:14, Riccardo (Jack) Lucchetti wrote:

...

On Mon, 20 Aug 2012, Sven Schreiber wrote:

...

> I'm not sure, but isn't Stata's .dta format a binary (non-text) file format? If so, then I guess many big micro datasets are distributed as binary. (I'm thinking of the German SOEP for example.)

True. I'm currently working on the SOEP database myself and, if I had to start from scratch now that we have "join" in gretl, I think I'd use Stata just to turn the whole thing into csv. Instead we had to use this diabolical Stata add-on called "PanelWhiz". Brrrr.

But I think that's my point -- it would be good if 'join' worked on some binary format, gretl's own formats being the obvious premier choice. Being able to process Stata's .dta would also be nice of course, but that's probably a luxury.

Specifically I believe that there may be a good chance to get the SOEP team to distribute their data also in some gretl format in the medium term, once the equivalent functionality of Stata's merge exists (and is tested). I'm not an insider there in any way, but that's my educated guess.

...

>>>> * I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?
>>>
>>> Jack originally suggested that this option should be called "payload". Maybe that's better than "data".
>>
>> Well, IMO "name" is just as generic as "data". I don't mind either. I originally found "payload" mildly amusing. Anybody else out there with a strong preference?
>
> Well I'm not anybody else in this discussion's context, but I don't get the pun with payload, I must confess.

There's no pun. I just enjoyed the idea of likening the join command to the space shuttle or something like that, skillfully carrying something precious across. Besides, the "payload" is a well-established term in the computer virus jargon, too.

I'm lost here. Maybe I didn't understand what --data actually does.

cheers,
sven

Riccardo (Jack) Lucchetti

1:04 p.m.

On Tue, 21 Aug 2012, Sven Schreiber wrote:

...

On 21.08.2012 01:14, Riccardo (Jack) Lucchetti wrote:

> On Mon, 20 Aug 2012, Sven Schreiber wrote:
>
>> I'm not sure, but isn't Stata's .dta format a binary (non-text) file format? If so, then I guess many big micro datasets are distributed as binary. (I'm thinking of the German SOEP for example.)
>
> True. I'm currently working on the SOEP database myself and, if I had to start from scratch now that we have "join" in gretl, I think I'd use Stata just to turn the whole thing into csv. Instead we had to use this diabolical Stata add-on called "PanelWhiz". Brrrr.

But I think that's my point -- it would be good if 'join' worked on some binary format, gretl's own formats being the obvious premier choice. Being able to process Stata's .dta would also be nice of course, but that's probably a luxury.

Specifically I believe that there may be a good chance to get the SOEP team to distribute their data also in some gretl format in the medium term, once the equivalent functionality of Stata's merge exists (and is tested). I'm not an insider there in any way, but that's my educated guess.

I have the feeling that anything different from csv may be quite a technical challenge, in that I can't see a way to extract a given column from a dta file (or a gdt file, for that matter) without reading it into memory in its entirety, but I'm no authority on this.

...

>>>>> * I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?
>>>>
>>>> Jack originally suggested that this option should be called "payload". Maybe that's better than "data".
>>>
>>> Well, IMO "name" is just as generic as "data". I don't mind either. I originally found "payload" mildly amusing. Anybody else out there with a strong preference?
>>
>> Well I'm not anybody else in this discussion's context, but I don't get the pun with payload, I must confess.
>
> There's no pun. I just enjoyed the idea of likening the join command to the space shuttle or something like that, skillfully carrying something precious across. Besides, the "payload" is a well-established term in the computer virus jargon, too.

I'm lost here. Maybe I didn't understand what --data actually does.

Well, it just tells join what data from the right-hand file you want to bring into the left-hand dataset.

Allin Cottrell

4:59 p.m.

On Tue, 21 Aug 2012, Riccardo (Jack) Lucchetti wrote:

...

On Tue, 21 Aug 2012, Sven Schreiber wrote:

> On 21.08.2012 01:14, Riccardo (Jack) Lucchetti wrote:
>
>> On Mon, 20 Aug 2012, Sven Schreiber wrote:
>>
>>> I'm not sure, but isn't Stata's .dta format a binary (non-text) file format? If so, then I guess many big micro datasets are distributed as binary. (I'm thinking of the German SOEP for example.)
>>
>> True. I'm currently working on the SOEP database myself and, if I had to start from scratch now that we have "join" in gretl, I think I'd use Stata just to turn the whole thing into csv. Instead we had to use this diabolical Stata add-on called "PanelWhiz". Brrrr.
>
> But I think that's my point -- it would be good if 'join' worked on some binary format, gretl's own formats being the obvious premier choice. Being able to process Stata's .dta would also be nice of course, but that's probably a luxury.
>
> Specifically I believe that there may be a good chance to get the SOEP team to distribute their data also in some gretl format in the medium term, once the equivalent functionality of Stata's merge exists (and is tested). I'm not an insider there in any way, but that's my educated guess.

I have the feeling that anything different from csv may be quite a technical challenge, in that I can't see a way to extract a given column from a dta file (or a gdt file, for that matter) without reading it into memory in its entirety, but I'm no authority on this.

At present we're set up to read selected columns from a delimited text data file in a way that we're not set up for any other format. My feeling about gretl's gdt format is that it's designed to represent a working dataset -- something that can be pulled into memory in one go -- not a huge database from which one might draw series. On the other hand, gretl's binary database format (bin/idx) may be a reasonable candidate for getting the "join" treatment, since it was designed from the start to support fast, easy extraction of selected series. As for Stata's dta format, I'd have to think about that.
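
To make the point about delimited text concrete: selected columns can be pulled out of a CSV stream one row at a time, so the whole file never has to sit in memory. The following Python sketch illustrates that general idea only and says nothing about how gretl does it internally; `read_columns` and the sample column names are hypothetical.

```python
import csv
import io

def read_columns(stream, wanted):
    """Yield one dict per data row, holding only the `wanted` columns.

    Rows are handled one at a time, so memory use does not grow with file size.
    """
    reader = csv.reader(stream)
    header = [h.strip() for h in next(reader)]
    positions = {name: header.index(name) for name in wanted}
    for row in reader:
        yield {name: row[pos] for name, pos in positions.items()}

# Small in-memory stand-in for a large third-party CSV file.
sample = io.StringIO("id,age,income\n1,34,50000\n2,51,62000\n")
rows = list(read_columns(sample, ["id", "income"]))
```

The same loop works unchanged on an open file handle, which is what makes row-oriented text convenient for this kind of selective extraction.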

...

>>>>>> * I find the '--data' option naming unintuitive or too generic; why not call it '--name' if it's about renaming?
>>>>>
>>>>> Jack originally suggested that this option should be called "payload". Maybe that's better than "data".
>>>>
>>>> Well, IMO "name" is just as generic as "data". I don't mind either. I originally found "payload" mildly amusing. Anybody else out there with a strong preference?
>>>
>>> Well I'm not anybody else in this discussion's context, but I don't get the pun with payload, I must confess.
>>
>> There's no pun. I just enjoyed the idea of likening the join command to the space shuttle or something like that, skillfully carrying something precious across. Besides, the "payload" is a well-established term in the computer virus jargon, too.
>
> I'm lost here. Maybe I didn't understand what --data actually does.

Well, it just tells join what data from the right-hand file you want to bring into the left-hand dataset.

Yes. You could think of it like this. The way we've done "join", the required series-name argument is the name by which the new data will be known on the left. This may be the same name that is used on the right, but if not you use "--data" to tell gretl what actual data you want extracted from the file (what the relevant column heading is).

We _could_ have done it the other way, with the required argument being the right-hand column name, in which case we might have had a --rename-as option to let you set a name for the imported series on the left. But in fact that would not work so well, because with the --aggr=count method there's no "data column" as such on the right; the series that appears on the left is just a count of key-matches. A more explicit version of the option in question would be --data-column-heading=<whatever> but we try to keep option strings reasonably short, and hopefully if the documentation is good enough users will get the idea.

Allin
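
The design argument above can be sketched in a few lines of Python. This is a toy model of the semantics as described in this message, not gretl's implementation: `join_series`, its parameters, and the sample data are all hypothetical, and keeping the first match for a repeated key is an assumption made only to keep the example short.

```python
from collections import Counter

def join_series(left_keys, right_rows, key_col, data_col=None, aggr=None):
    """Build a left-hand series from right-hand rows matched on a key.

    data_col names the right-hand column to import. With aggr="count" no data
    column is needed: the result is simply the number of key matches.
    """
    if aggr == "count":
        matches = Counter(row[key_col] for row in right_rows)
        return [matches[key] for key in left_keys]
    if data_col is None:
        raise ValueError("data_col is required unless aggr='count'")
    lookup = {}
    for row in right_rows:
        # Assumption: keep the first match when a key occurs more than once.
        lookup.setdefault(row[key_col], row[data_col])
    return [lookup.get(key) for key in left_keys]

# Right-hand "file": one row per person, keyed by household id.
people = [
    {"hid": 1, "xp": 10},
    {"hid": 1, "xp": 20},
    {"hid": 3, "xp": 30},
]
household_ids = [1, 2, 3]  # left-hand keys
n_members = join_series(household_ids, people, "hid", aggr="count")
first_xp = join_series(household_ids, people, "hid", data_col="xp")
```

In the count case there is nothing on the right to rename, which is why naming the left-hand result is the one argument that always makes sense.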

gretl-devel@gretlml.univpm.it
