On Tue, 19 Jun 2012, Riccardo (Jack) Lucchetti wrote:
> On Mon, 18 Jun 2012, Allin Cottrell wrote:
>> A few changes in recent gretl CVS address the issue of
>> handling very large datasets -- datasets that will not fit
>> into RAM in their entirety. [...]
>
> First of all, let me thank Allin for all the work he's done in this
> direction: I ran a few tests of the new CVS features and I can
> confirm that every test I tried works splendidly.
>
> That said, I have the feeling that, in order to make effective use of
> the datasets Allin is referring to, we need an extra ingredient
> (which, IMHO, is THE feature that made Stata the killer package in
> some quarters of the econometrics profession): the ability to extract
> data sensibly by performing those operations that, in database
> parlance, are called JOINs.
[...]
Thanks for the clear explanation, and I think you're right; handling
JOINs is something we should work towards.
I'm also thinking that it would be nice to offer a shortcut for the
kind of procedure I outlined, so that you could do something like
open huge.txt --cols=whatever --rowmask="gender==1"
In the background gretl would do what I described before: find and
use the full-length gender series to construct a rowmask, then read
the selected columns using the mask. This would not only be more
user-friendly, it would also be more efficient: libgretl could use a
simple byte array for the rowmask, and it wouldn't necessarily have
to read the whole gender series into memory, just scan it row by
row.
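To make the two-pass idea concrete, here is a minimal sketch in Python
(libgretl itself is C, and would do this inside its CSV reader; the
column names and values below are invented for illustration):

```python
import csv
import io

# Stand-in for a huge comma-separated datafile (contents hypothetical).
data = io.StringIO(
    "id,gender,income\n"
    "1,1,30000\n"
    "2,0,45000\n"
    "3,1,28000\n"
)

# Pass 1: scan the 'gender' column row by row, building a simple
# byte-array rowmask; the full series never needs to sit in memory.
reader = csv.DictReader(data)
rowmask = bytearray(int(row["gender"]) == 1 for row in reader)

# Pass 2: re-read the file, keeping only the selected column for
# rows where the mask is set.
data.seek(0)
reader = csv.DictReader(data)
selected = [row["income"] for keep, row in zip(rowmask, reader) if keep]
print(selected)  # ['30000', '28000']
```

The point of the byte array is that it costs one byte per observation
regardless of how many columns the file has, so the mask stays cheap
even for very wide datafiles.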
A question for users of big data (somewhat relevant to implementing
the above suggestion): do monster-size text datafiles typically come
in fixed format? That's my sense, but I don't have a lot of
experience in this area. (If that's right it makes sense
computationally: it's much quicker to read specific variables out of
a big file if you know in advance exactly where to find them.)
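The computational point can be illustrated with a toy fixed-format
file (the record layout here is invented): since every record has the
same byte length, the offset of any field is a single multiplication,
with no need to parse the intervening rows.

```python
import io

# Hypothetical fixed-format file: each record is exactly 13 bytes --
# id (4 chars) + gender (2) + income (6) + newline (1).
RECLEN, INCOME_OFF, INCOME_W = 13, 6, 6
raw = io.BytesIO(
    b"0001 1 30000\n"
    b"0002 0 45000\n"
    b"0003 1 28000\n"
)

def read_field(f, recno, offset, width):
    # Jump straight to the field; nothing before it is read or parsed.
    f.seek(recno * RECLEN + offset)
    return f.read(width)

# Income of the third record (0-based index 2):
print(int(read_field(raw, 2, INCOME_OFF, INCOME_W).strip()))  # 28000
```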
Allin