On Mon, 19 Aug 2013, Allin Cottrell wrote:
On Mon, 19 Aug 2013, Sven Schreiber wrote:
>
> I'm wondering whether gretl isn't perhaps doing "too much too early"
> with the input file contents in the context of the join command. After
> all, the values that are going to go into the gretl workfile are only in
> the 'varname' column. The other columns are "only" used for filtering
> and matching, so maybe they should be left as is? (With the further
> exception of the column specified by --tkey, I guess)
I think that in most cases we have to try to recognize missing values in the
outer data file. But this can be tricky with string variables; after all, "."
could be intended as a valid value for a string variable (as could "NA" for
that matter).
Thinking about this some more, it seems we really should not apply
the same policy regarding CSV NAs to string variables as we apply to
numeric ones. With numeric variables we want to be fairly "liberal",
interpreting as NA any string that the author of the CSV file could
plausibly have intended as a code for missing. But there's no
unambiguous "missing" indicator for string values, other than a
blank cell or an empty string.
So here's a suggestion: when we determine that a certain column of a
CSV file represents a string-valued variable, by default we treat
all non-blank values as string literals. But we provide a "set"
variable ("missing_string" or some such) so that the user can
specify a missing-code for string-valued input. E.g. when reading
from Alfred's tab-separated files one could say
    set missing_string "."
(This would not just be for "join", but for any delimited-text
read.)
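To make the proposed policy concrete, here is a rough sketch in Python. It is not gretl code: the function names, the `missing_string` parameter, and the particular set of numeric NA codes are all assumptions made for illustration. It only shows the intended asymmetry between numeric and string columns.

```python
# Illustrative sketch only; not gretl's implementation.
# Assumed set of strings read as "missing" in numeric columns.
NUMERIC_NA_CODES = {"", ".", "NA", "N/A", "NaN", "#N/A"}


def parse_numeric_cell(cell):
    """Liberal policy: any plausible missing code becomes None."""
    text = cell.strip()
    if text in NUMERIC_NA_CODES:
        return None
    return float(text)


def parse_string_cell(cell, missing_string=None):
    """Conservative policy: only a blank cell is missing, unless the
    user supplies an explicit code (hypothetical 'missing_string')."""
    text = cell.strip()
    if text == "":
        return None
    if missing_string is not None and text == missing_string:
        return None
    return text  # every other non-blank value is a string literal
```

With these definitions, "." and "NA" survive as literals in a string column by default, and "." is dropped only when the caller passes missing_string=".", which mirrors the "set" example above.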
I'll be mostly offline for a week or so, so I won't be able to
implement this right away. But I'll be interested to hear what
people think.
Allin