Re: [Gretl-devel] really big data

Tuesday, 19 June 2012

On Tue, 19 Jun 2012, Allin Cottrell wrote:

...
 I'm also thinking that it would be nice to offer a shortcut for the kind of
 procedure I outlined, so that you could do something like

 open huge.txt --cols=whatever --rowmask="gender==1"

 In the background gretl would do what I described before: find and use the 
 full-length gender series to construct a rowmask, then read the selected 
 columns using the mask. This would not only be more user-friendly, it would 
 also be more efficient: libgretl could use a simple byte array for the 
 rowmask, and it wouldn't necessarily have to read the whole gender series 
 into memory, just scan it row by row. 
Yes, that would be very nice. One more thing we have to keep in mind:
there are some cases in which the "identifier" field may be non-numeric.
Take the World Development Indicators, for example. This is by no means a
database as huge as the ones we're potentially dealing with here (although
respectable in size: the zipped set of CSV files is a hefty 35.7 MB; see
http://data.worldbank.org/data-catalog/world-development-indicators/),
but it makes for an interesting test case: all the relevant items you may
want to select rows on are strings. That is, we may find ourselves needing
something like --rowmask="FOO==\"bar\""
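
To make the string case concrete, here is a rough sketch of the two-pass idea in Python rather than gretl. It is not how libgretl does (or would) implement it; the function names, the column names and the tiny in-memory "file" are all made up for illustration. The first pass scans only the key column row by row and fills a byte array, and the second pass reads the wanted columns for the flagged rows.

```python
import csv
import io

def build_rowmask(f, key, value):
    """First pass: scan one column row by row and return a byte-array mask."""
    reader = csv.reader(f)
    header = next(reader)
    k = header.index(key)
    mask = bytearray()
    for row in reader:
        # String comparison, so a non-numeric key field is handled as-is
        mask.append(1 if row[k] == value else 0)
    return mask

def read_masked(f, cols, mask):
    """Second pass: keep only the wanted columns of the rows flagged in mask."""
    reader = csv.reader(f)
    header = next(reader)
    idx = [header.index(c) for c in cols]
    return [[row[i] for i in idx] for row, keep in zip(reader, mask) if keep]

# Tiny in-memory stand-in for a huge file (hypothetical column names)
sample = "country,year,gdp\nbar,2010,1.5\nbaz,2010,2.5\nbar,2011,1.7\n"
mask = build_rowmask(io.StringIO(sample), "country", "bar")
rows = read_masked(io.StringIO(sample), ["year", "gdp"], mask)
```

Only the mask (one byte per row) has to be held in memory between the two passes; the key column itself is never stored.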

Moreover, we should also be prepared to find CSV or fixed-format files in
which the variable names are not valid gretl identifiers, because they are
too long, contain spaces, etc.
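
One possible way of coping with such names, sketched in Python purely for illustration: replace anything that is not a letter, digit or underscore, force a leading letter, truncate, and disambiguate clashes with a numeric suffix. The function name and the default length limit of 31 are assumptions made for the example, not taken from the gretl sources.

```python
import re

def make_identifier(name, taken, maxlen=31):
    """Turn an arbitrary column label into a short, unique identifier.

    maxlen is an assumed limit; taken is the set of names already in use.
    """
    # Replace every character that is not an ASCII letter, digit or underscore
    s = re.sub(r"[^A-Za-z0-9_]", "_", name.strip())
    # Require a leading letter
    if not s or not s[0].isalpha():
        s = "v" + s
    s = s[:maxlen]
    # Resolve clashes caused by truncation or substitution
    base, n = s, 1
    while s in taken:
        suffix = "_" + str(n)
        s = base[:maxlen - len(suffix)] + suffix
        n += 1
    taken.add(s)
    return s

# Hypothetical labels of the kind found in CSV headers
taken = set()
names = [make_identifier(label, taken) for label in
         ["GDP per capita (current US$)", "2010 value"]]
```

The original labels could be kept alongside the generated names, so that a row or column selection can still refer to them.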

...
 A question for users of big data (somewhat relevant to implementing the above
 suggestion): do monster-size text datafiles typically come in fixed format?
 That's my sense, but I don't have a lot of experience in this area. (If 
 that's right it makes sense computationally: it's much quicker to read 
 specific variables out of a big file if you know in advance exactly where to 
 find them.) 
In my experience, fixed format was the standard in the past. Nowadays
it's getting less common, and I think it is being supplanted by CSV. But I'm
not in a position to say anything authoritative.
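
The computational point about fixed format can be illustrated with a small Python sketch: when every record has the same length, a field can be reached by seeking to a computed byte offset, with no need to parse the rest of the line. The record layout, the function name and the sample data below are invented for the example.

```python
import io

def read_fixed_field(f, n_rows, reclen, start, width):
    """Read one field from every record by seeking straight to it.

    reclen is the record length in bytes, including the line terminator;
    start and width locate the field within a record.
    """
    values = []
    for i in range(n_rows):
        f.seek(i * reclen + start)
        values.append(f.read(width).decode("ascii").strip())
    return values

# Three 10-byte records: 9 data bytes plus a newline (made-up layout)
data = b"bar 2010 \nbaz 2010 \nbar 2011 \n"
years = read_fixed_field(io.BytesIO(data), 3, 10, 4, 4)
```

With CSV the position of a field differs from row to row, so each line has to be scanned for delimiters before the field can be found.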

--------------------------------------------------
  Riccardo (Jack) Lucchetti
  Dipartimento di Economia

  Università Politecnica delle Marche
  (formerly known as Università di Ancona)

  r.lucchetti(a)univpm.it
  http://www2.econ.univpm.it/servizi/hpp/lucchetti
--------------------------------------------------
