On Tue, 21 Aug 2012, Riccardo (Jack) Lucchetti wrote:
On Tue, 21 Aug 2012, Sven Schreiber wrote:
> On 21.08.2012 01:14, Riccardo (Jack) Lucchetti wrote:
>> On Mon, 20 Aug 2012, Sven Schreiber wrote:
>>>
>>> I'm not sure, but isn't Stata's .dta format a binary (non-text) file
>>> format? If so, then I guess many big micro datasets are distributed as
>>> binary. (I'm thinking of the German SOEP for example.)
>>
>> True. I'm currently working on the SOEP database myself and, if I had to
>> start from scratch now that we have "join" in gretl, I think I'd use
>> Stata just to turn the whole thing into csv. Instead we had to use this
>> diabolical Stata add-on called "PanelWhiz". Brrrr.
>
> But I think that's my point -- it would be good if 'join' worked on some
> binary format, gretl's own formats being the obvious premier choice.
> Being able to process Stata's .dta would also be nice of course, but
> that's probably a luxury.
>
> Specifically I believe that there may be a good chance to get the SOEP
> team to distribute their data also in some gretl format in the medium
> term, once the equivalent functionality of Stata's merge exists (and is
> tested). I'm not an insider there in any way, but that's my educated guess.
I have the feeling that anything different from csv may be quite a technical
challenge, in that I can't see a way to extract a given column from a dta
file (or a gdt file, for that matter) without reading it into memory in its
entirety, but I'm no authority on this.
At present we're set up to read selected columns from a
delimited text data file in a way that we're not set up for
any other format. My feeling about gretl's gdt format is that
it's designed to represent a working dataset -- something that
can be pulled into memory in one go -- not a huge database
from which one might draw series. On the other hand, gretl's
binary database format (bin/idx) may be a reasonable candidate
for getting the "join" treatment, since it was designed from
the start to support fast, easy extraction of selected series.
As for Stata's dta format, I'd have to think about that.
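
To illustrate why delimited text lends itself to column selection, here
is a minimal Python sketch (not gretl's actual implementation) that
streams a CSV file row by row and keeps only the requested columns. The
function name read_columns and the sample column names are hypothetical,
chosen only for this example.

```python
import csv
import io

def read_columns(stream, wanted, delimiter=","):
    """Stream a delimited text source, keeping only the named columns."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader)
    # Map each wanted column name to its position in the header row.
    positions = [header.index(name) for name in wanted]
    columns = {name: [] for name in wanted}
    # Rows are consumed one at a time; unwanted fields are never stored.
    for row in reader:
        for name, pos in zip(wanted, positions):
            columns[name].append(row[pos])
    return columns

# Hypothetical sample data standing in for a large file on disk.
sample = io.StringIO("id,income,age\n1,2500,34\n2,3100,51\n")
selected = read_columns(sample, ["id", "age"])
```

Only the selected columns are ever held in memory, which is the property
that a format meant to be loaded in one go does not offer as directly.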
>>>>>> * I find the '--data' option naming unintuitive or too
>>>>>> generic; why not call it '--name' if it's about renaming?
>>>>>
>>>>> Jack originally suggested that this option should be called
>>>>> "payload". Maybe that's better than "data".
>>>>
>>>> Well, IMO "name" is just as generic as "data". I don't mind either. I
>>>> originally found "payload" mildly amusing. Anybody else out there with
>>>> a strong preference?
>>>>
>>>
>>> Well I'm not anybody else in this discussion's context, but I don't get
>>> the pun with payload, I must confess.
>>
>> There's no pun. I just enjoyed the idea of likening the join command to
>> the space shuttle or something like that, skillfully carrying something
>> precious across. Besides, the "payload" is a well-established term in
>> the computer virus jargon, too.
>
> I'm lost here. Maybe I didn't understand what --data actually does.
Well, it just tells join what data from the right-hand file you want to bring
into the left-hand dataset.
Yes. You could think of it like this. The way we've done
"join", the required series-name argument is the name by which
the new data will be known on the left. This may be the same
name that is used on the right, but if not you use "--data" to
tell gretl what actual data you want extracted from the file (what
the relevant column heading is).
We _could_ have done it the other way, with the required
argument being the right-hand column name, in which case we
might have had a --rename-as option to let you set a name for
the imported series on the left. But in fact that would not
work so well, because with the --aggr=count method there's no
"data column" as such on the right; the series that appears on
the left is just a count of key-matches.
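
The naming logic described above can be sketched in Python. This is a
toy model, not gretl code: the function join_series, its parameters and
the sample data are all hypothetical, and taking the first match when
there are several is a simplification made only for this example. The
required argument is the left-hand name; "data" optionally names the
right-hand column; with aggr="count" no right-hand column is consulted.

```python
def join_series(left_keys, right_rows, key, name, data=None, aggr=None):
    """Return (left-hand name, values aligned with left_keys)."""
    if aggr == "count":
        # No data column is involved: just count the key matches.
        counts = {}
        for row in right_rows:
            counts[row[key]] = counts.get(row[key], 0) + 1
        return name, [counts.get(k, 0) for k in left_keys]
    # Without an explicit data column, the left-hand name is assumed
    # to be the column heading on the right as well.
    column = data if data is not None else name
    lookup = {}
    for row in right_rows:
        # Simplification: keep the first match for each key.
        lookup.setdefault(row[key], row[column])
    # Keys with no match on the right get None (a missing value).
    return name, [lookup.get(k) for k in left_keys]

# Hypothetical right-hand "file" and left-hand keys.
right = [{"hid": 1, "inc": 100}, {"hid": 1, "inc": 150}, {"hid": 3, "inc": 90}]
left_keys = [1, 2, 3]

renamed = join_series(left_keys, right, "hid", "income", data="inc")
counted = join_series(left_keys, right, "hid", "n_members", aggr="count")
```

In the count case there is nothing on the right that a "rename" option
could refer to, which is why the left-hand name makes the more natural
required argument.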
A more explicit version of the option in question would be
--data-column-heading=<whatever>
but we try to keep option strings reasonably short, and
hopefully if the documentation is good enough users will get
the idea.
Allin