On Sat, 18 Jul 2020, Artur Tarassow wrote:
On 17.07.20 at 16:32, Allin Cottrell wrote:
> On Fri, 17 Jul 2020, Artur Tarassow wrote:
>
>> On 16.07.20 at 14:53, Allin Cottrell wrote:
>>> On Wed, 15 Jul 2020, Allin Cottrell wrote:
>>>
>>>> On Wed, 15 Jul 2020, Artur Tarassow wrote:
>>>>
>>>>> But what about the case when adding the "--permanent" flag?
>>>>
>>>> I can see a case for shrinking the strings array when the --permanent
>>>> option is given, though it's not totally clear-cut.
>>>
>>> Here's a follow-up. You could think of this as a prototype of what we
>>> might do internally with a string-valued series on permanent
>>> sub-sampling.
>>
>> Sorry for the late reply, Allin. Yes that looks good to me.
>>
>> But would "permanent sub-sampling" mean that this only applies when
>> executing the smpl command with the --permanent flag? Or would it also
>> apply when storing a sub-sampled data set?
>
> I favour doing this only when the --permanent option is given. It's an
> information-destroying move, and I can imagine cases where one wants to
> save a sub-sample and yet not lose the information in question. But if you
> _want_ to lose it, without using --permanent, then just store in a format
> other than gdt or gdtb.
I understand your point, Allin. And getting it to work when using the
--permanent option would be very useful.
But let me think out loud about some of my use cases -- maybe some others
have to deal with similar ones...
[cases where using the --permanent option with "smpl" would clearly
not be convenient]

OK, here's what's now in git (not yet in snapshots; I'd prefer to
see some testing first):
(1) Imposing a sample restriction with the --permanent option
results in "trimming" of string-valued series: only string values
that appear within the sub-sample are preserved, and the numeric
coding for such series is adjusted accordingly. Note that this means
any given observation will have the same string value as it had
in the full dataset, but may not have the same numeric code.
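
To make the recoding concrete, here is a rough sketch of the trimming logic in Python. It is only an illustration, not gretl's actual code: the function name trim_strvals, the 1-based codes, and the rule that surviving strings keep their relative order are all assumptions made for the example.

```python
def trim_strvals(codes, strvals):
    """Keep only the strings referenced by `codes` and renumber the codes.

    codes   : 1-based integer codes for the sub-sample (None = missing)
    strvals : full table of strings; strvals[k - 1] belongs to code k
    (Hypothetical layout, chosen for illustration only.)
    """
    # Codes that actually occur in the sub-sample, in ascending order
    used = sorted({c for c in codes if c is not None})
    # Map each surviving old code to a new consecutive code 1, 2, ...
    remap = {old: new for new, old in enumerate(used, start=1)}
    new_strvals = [strvals[old - 1] for old in used]
    new_codes = [remap[c] if c is not None else None for c in codes]
    return new_codes, new_strvals


strvals = ["alpha", "beta", "gamma", "delta"]
full_codes = [1, 2, 3, 4, 2, 3]

# A sub-sample that no longer contains any "alpha" observation
sub_codes = full_codes[2:5]
new_codes, new_strvals = trim_strvals(sub_codes, strvals)

# String value per observation, before and after trimming
before = [strvals[c - 1] for c in sub_codes]
after = [new_strvals[c - 1] for c in new_codes]
```

Here before and after are identical lists of strings, while the codes themselves differ (sub_codes versus new_codes), which is the point of the note above.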
(2) When using the "store" command with a native target (gdt or
gdtb) there's a new option --trim-strvals which has a similar
effect. We achieve this as follows:
* Any string-valued series are first backed up (copied in RAM).
* Before we actually write the data file we "trim" as described
above.
* Once the write is finished we restore the full form of the
string-valued series.
So you can sub-sample, store the data in trimmed form, then restore
the full dataset without loss of information -- or at least that's
the idea! This has worked OK in my limited testing today, but more
testing is wanted.
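
The three steps above can be sketched as follows, again in Python and purely as an illustration. The dictionary layout, the store_trimmed name, and the use of JSON as a stand-in for the gdt/gdtb writer are all assumptions for the example, not what gretl does internally.

```python
import copy
import json
import os
import tempfile


def trim_strvals(codes, strvals):
    # Same hypothetical trimming rule as before: drop unused strings, renumber
    used = sorted({c for c in codes if c is not None})
    remap = {old: new for new, old in enumerate(used, start=1)}
    new_codes = [remap[c] if c is not None else None for c in codes]
    return new_codes, [strvals[old - 1] for old in used]


def store_trimmed(dataset, path):
    # Step 1: back up the string-valued series (a copy held in RAM)
    backup = copy.deepcopy(dataset["string_series"])
    try:
        # Step 2: trim each string-valued series, then write the file
        for s in dataset["string_series"].values():
            s["codes"], s["strvals"] = trim_strvals(s["codes"], s["strvals"])
        with open(path, "w") as f:
            json.dump(dataset, f)  # JSON stands in for the native format
    finally:
        # Step 3: restore the full form, even if the write failed
        dataset["string_series"] = backup


dataset = {
    "string_series": {
        "firm": {
            "codes": [3, 4, 2],
            "strvals": ["alpha", "beta", "gamma", "delta"],
        }
    }
}
path = os.path.join(tempfile.mkdtemp(), "sub.json")
store_trimmed(dataset, path)

with open(path) as f:
    on_disk = json.load(f)
```

After the call, on_disk holds the trimmed table of strings while dataset in memory still carries all four, so nothing is lost in the session.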
One further remark: "store --trim-strvals" will work even when
there's no sub-sample in place, in case the dataset contains any
redundant string values. I hadn't noticed before, but gretl's
grunfeld.gdt contains a redundant 11th firmname, "American Steel"
(there are only 10 firms in our dataset). You can remove that by
using store with the new option.
Allin