Re: [Gretl-devel] [Gretl-users] new command, "join"

Tuesday, 21 August 2012

On Tue, 21 Aug 2012, Riccardo (Jack) Lucchetti wrote:

...
 On Tue, 21 Aug 2012, Sven Schreiber wrote:

> Am 21.08.2012 01:14, schrieb Riccardo (Jack) Lucchetti:
>> On Mon, 20 Aug 2012, Sven Schreiber wrote:
>>> 
>>> I'm not sure, but isn't Stata's .dta format a binary (non-text) file
>>> format? If so, then I guess many big micro datasets are distributed as
>>> binary. (I'm thinking of the German SOEP for example.)
>> 
>> True. I'm currently working on the SOEP database myself and, if I had to
>> start from scratch now that we have "join" in gretl, I think I'd use
>> Stata just to turn the whole thing into csv. Instead we had to use this
>> diabolical Stata add-on called "PanelWhiz". Brrrr.
> 
> But I think that's my point -- it would be good if 'join' worked on some
> binary format, gretl's own formats being the obvious premier choice.
> Being able to process Stata's .dta would also be nice of course, but
> that's probably a luxury.
> 
> Specifically I believe that there may be a good chance to get the SOEP
> team to distribute their data also in some gretl format in the medium
> term, once the equivalent functionality of Stata's merge exists (and is
> tested). I'm not an insider there in any way, but that's my educated guess.

 I have the feeling that anything different from csv may be quite a technical 
 challenge, in that I can't see a way to extract a given column from a dta 
 file (or a gdt file, for that matter) without reading it into memory in its 
 entirety, but I'm no authority on this.

At present we're set up to read selected columns from a 
delimited text data file in a way that we're not set up for 
any other format. My feeling about gretl's gdt format is that 
it's designed to represent a working dataset -- something that 
can be pulled into memory in one go -- not a huge database 
from which one might draw series. On the other hand, gretl's 
binary database format (bin/idx) may be a reasonable candidate 
for getting the "join" treatment, since it was designed from 
the start to support fast, easy extraction of selected series. 
As for Stata's dta format, I'd have to think about that.
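
To illustrate the point about delimited text, here is a minimal Python
sketch (not gretl code, and not a description of how gretl does it
internally) of pulling selected columns out of a CSV stream row by row,
so that the unwanted columns are never stored. The function name
read_columns and the sample column headings are hypothetical.

```python
import csv
import io

def read_columns(handle, wanted):
    """Stream a delimited text file, keeping only the named columns."""
    reader = csv.reader(handle)
    header = next(reader)
    # Find the position of each wanted column heading in the header row.
    positions = [header.index(name) for name in wanted]
    columns = {name: [] for name in wanted}
    for row in reader:
        # Only the selected fields of each row are retained.
        for name, pos in zip(wanted, positions):
            columns[name].append(row[pos])
    return columns

# Hypothetical data: three columns, of which we want two.
sample = io.StringIO("pid,hid,income\n1,10,2500\n2,10,1800\n3,11,3100\n")
selected = read_columns(sample, ["pid", "income"])
print(selected)
```

Nothing comparable is possible for a format that has to be parsed as a
whole before any single column can be located.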

...
>>>>>> * I find the '--data' option naming unintuitive or too
>>>>>> generic; why not call it '--name' if it's about renaming?
>>>>> 
>>>>> Jack originally suggested that this option should be called
>>>>> "payload". Maybe that's better than "data".
>>>> 
>>>> Well, IMO "name" is just as generic as "data". I don't mind either. I
>>>> originally found "payload" mildly amusing. Anybody else out there with
>>>> a strong preference?
>>>> 
>>> 
>>> Well I'm not anybody else in this discussion's context, but I don't get
>>> the pun with payload, I must confess.
>> 
>> There's no pun. I just enjoyed the idea of likening the join command to
>> the space shuttle or something like that, skillfully carrying something
>> precious across. Besides, the "payload" is a well-established term in
>> the computer virus jargon, too.
> 
> I'm lost here. Maybe I didn't understand what --data actually does.

 Well, it just tells join what data from the right-hand file you want to bring 
 into the left-hand dataset.

Yes. You could think of it like this. The way we've done 
"join", the required series-name argument is the name by which 
the new data will be known on the left. This may be the same 
name that is used on the right, but if not you use "--data" to 
tell gretl what actual data you want extracted from the file (what 
the relevant column heading is).

We _could_ have done it the other way, with the required 
argument being the right-hand column name, in which case we 
might have had a --rename-as option to let you set a name for 
the imported series on the left. But in fact that would not 
work so well, because with the --aggr=count method there's no 
"data column" as such on the right; the series that appears on 
the left is just a count of key-matches.
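
A rough Python sketch of these semantics may help. It is an
illustration only, not gretl syntax and not gretl's implementation: the
function join, its parameters and the sample data are all hypothetical,
and taking the first match when several rows share a key is a
simplification made for the example. The point is that the name on the
left is always required, while the right-hand column heading ("data")
is optional and is not needed at all when counting matches.

```python
def join(left_keys, right_rows, key, data=None, aggr=None):
    """Build one new left-hand series from right-hand rows matched on key.

    data: right-hand column heading to import (the role of --data).
    aggr: "count" to return the number of key matches instead of data.
    """
    result = []
    for k in left_keys:
        matches = [row for row in right_rows if row[key] == k]
        if aggr == "count":
            # No data column is involved: the series is a count of matches.
            result.append(len(matches))
        elif matches:
            # Simplification: take the first matching row's value.
            result.append(matches[0][data])
        else:
            # No match on the right: missing value on the left.
            result.append(None)
    return result

# Hypothetical left-hand dataset (households) and right-hand rows (persons).
left = {"hid": [10, 11, 12]}
right_rows = [
    {"hid": 10, "x": 2500},
    {"hid": 10, "x": 1800},
    {"hid": 11, "x": 3100},
]

# Left-hand name "income", taken from right-hand column "x".
left["income"] = join(left["hid"], right_rows, key="hid", data="x")
# Left-hand name "n_members": a count, so no right-hand column is named.
left["n_members"] = join(left["hid"], right_rows, key="hid", aggr="count")
print(left)
```

In the second call there is simply nothing on the right that a
"rename" option could refer to, which is why the left-hand name is the
required argument.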

A more explicit version of the option in question would be

--data-column-heading=<whatever>

but we try to keep option strings reasonably short, and 
hopefully if the documentation is good enough users will get 
the idea.

Allin
