On 17 Jul 2020 at 16:32, Allin Cottrell wrote:
On Fri, 17 Jul 2020, Artur Tarassow wrote:
> On 16 Jul 2020 at 14:53, Allin Cottrell wrote:
>> On Wed, 15 Jul 2020, Allin Cottrell wrote:
>>
>>> On Wed, 15 Jul 2020, Artur Tarassow wrote:
>>>
>>>> But what about the case when adding the " --permanent" flag?
>>>
>>> I can see a case for shrinking the strings array when the
>>> --permanent option is given, though it's not totally clear-cut.
>>
>> Here's a follow-up. You could think of this as a prototype of what we
>> might do internally with a string-valued series on permanent
>> sub-sampling.
>
> Sorry for the late reply, Allin. Yes that looks good to me.
>
> But would "permanent sub-sampling" mean that this only applies when
> executing the smpl command with the --permanent flag? Or would it also
> apply when storing a sub-sampled data set?
I favour doing this only when the --permanent option is given. It's an
information-destroying move, and I can imagine cases where one wants to
save a sub-sample and yet not lose the information in question. But if
you _want_ to lose it, without using --permanent, then just store in a
format other than gdt or gdtb.
I understand your point, Allin. And getting this to work when the
--permanent option is used would be very useful.
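Just to make sure I read you correctly, here is a minimal sketch of the
workflow you describe (the data file and the series name "region" are
only placeholders of mine):

<pseudo-hansl>
open mydata.gdt
smpl region == 3 --restrict   # sub-sample, but not --permanent
store subsample.csv           # csv carries no gretl string table
smpl full                     # the in-memory data set is intact again
</pseudo-hansl>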
But let me think out loud about some of my use cases -- maybe others
have to deal with similar ones.
From time to time I have to deal with large panel data sets (about
800 MB or even larger) originally stored as csv. Reading such csv data
sets is _very_ costly, as is joining additional series to them. So it's
natural to store such data as gdt or, even better, in the binary format
(gdtb).
Let's say I would like to store several sub-sampled data sets derived
from this large one. I would approach it like this:
<pseudo-hansl>
open large_dataset.csv
matrix assortments = values(assortment)
scalar n = rows(assortments)
loop i=1..n
    if i > 1
        # re-open the full data set; --preserve keeps matrices in memory
        open large_dataset.csv --preserve --quiet
    endif
    # keep only the observations for the i-th assortment value
    smpl assortment == assortments[i] --restrict --permanent
    string fname = sprintf("data_for_assortment_%d.gdtb", i)
    store @fname
endloop
</pseudo-hansl>
In this case one would have to re-open "large_dataset.csv" multiple
times (OK, fair enough, I could have stored "large_dataset.csv" in gdtb
format before the loop and loaded that instead -- but I would still be
left with a copy of a 'large' data set).
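For completeness, that one-time conversion would be something like the
following (file names are placeholders again); the loop above would then
open the gdtb copy instead of the csv:

<pseudo-hansl>
# read the csv once, keep a binary copy for later runs
open large_dataset.csv
store large_dataset.gdtb
</pseudo-hansl>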
I would just like to note that, given the tendency for larger and larger
data sets to become available in academia as well, it may be worth going
a step further. Don't get me wrong: getting the hansl example above to
work would be cool, but if it's difficult to implement "on the fly" this
can wait. I could create a feature-request ticket for this if it's worth it.
Artur