On 17 Jul 2020 at 16:32, Allin Cottrell wrote:
On Fri, 17 Jul 2020, Artur Tarassow wrote:
> On 16 Jul 2020 at 14:53, Allin Cottrell wrote:
>> On Wed, 15 Jul 2020, Allin Cottrell wrote:
>>
>>> On Wed, 15 Jul 2020, Artur Tarassow wrote:
>>>
>>>> But what about the case when adding the " --permanent" flag?
>>>
>>> I can see a case for shrinking the strings array when the
>>> --permanent option is given, though it's not totally clear-cut.
>>
>> Here's a follow-up. You could think of this as a prototype of what we
>> might do internally with a string-valued series on permanent
>> sub-sampling.
>
> Sorry for the late reply, Allin. Yes that looks good to me.
>
> But would "permanent sub-sampling" mean that this only applies when
> executing the smpl command with the --permanent flag? Or would it also
> apply when storing a sub-sampled data set?
I favour doing this only when the --permanent option is given. It's an
information-destroying move, and I can imagine cases where one wants to
save a sub-sample and yet not lose the information in question. But if
you _want_ to lose it, without using --permanent, then just store in a
format other than gdt or gdtb.
I understand your point, Allin. And getting this to work when the
--permanent option is used would be very useful.
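Just to make sure I read you correctly, here is a minimal sketch of the
workflow you describe (the data file and the series name "region" are
only placeholders of mine):

<pseudo-hansl>
open mydata.gdt
smpl region == 3 --restrict   # sub-sample, but not --permanent
store subsample.csv           # csv carries no gretl string table
smpl full                     # the in-memory data set is intact again
</pseudo-hansl>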
But let me think out loud about some of my use cases -- maybe others
have to deal with similar ones.
From time to time I have to deal with large panel data sets (about
800 MB or even larger) originally stored as csv. Reading such csv data
sets is _very_ costly, as is joining additional series to them. So it's
natural to store such data as gdt or, even better, in the binary format
(gdtb).
Let's say I would like to store several sub-sampled data sets derived
from this large one. I would approach it like this:
<pseudo-hansl>
open large_dataset.csv
matrix assortments = values(assortment)
scalar n = rows(assortments)
loop i=1..n
    if i > 1
        # re-open the full data set; --preserve keeps matrices in memory
        open large_dataset.csv --preserve --quiet
    endif
    # keep only the observations for the i-th assortment value
    smpl assortment == assortments[i] --restrict --permanent
    string fname = sprintf("data_for_assortment_%d.gdtb", i)
    store @fname
endloop
</pseudo-hansl>
In this case one would have to re-open "large_dataset.csv" multiple
times (OK, fair enough, I could have stored "large_dataset.csv" in gdtb
format before the loop and loaded that instead -- but I would still be
left with a copy of a 'large' data set).
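For completeness, that one-time conversion would be something like the
following (file names are placeholders again); the loop above would then
open the gdtb copy instead of the csv:

<pseudo-hansl>
# read the csv once, keep a binary copy for later runs
open large_dataset.csv
store large_dataset.gdtb
</pseudo-hansl>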
I would just like to note that, given the tendency for larger and larger
data sets to become available in academia as well, it may be worth going
a step further. Don't get me wrong: getting the hansl example above to
work would be cool, but if it's difficult to implement "on the fly" this
can wait. I could create a feature-request ticket for this if it's worth it.
Artur