"join" news

Monday, 8 December 2014

Some news regarding gretl's "join" command (importation of data with 
lots of options). These points are in the current documentation for 
"join" in the User's Guide, but I thought it would be worth 
explicitly drawing them to people's attention.

1) I've mentioned this before but only in passing: besides "CSV" 
(delimited text) files you can now join from gretl-native gdt or 
gdtb (binary) files.

2) More recent: you can now pull multiple series from the source
file in one command.

I'll expand on the second point. When we first wrote "join" we were 
wrestling with a lot of complexity (key-matching, filtering, 
aggregation) and we simplified matters by stipulating that only a 
single series could be operated on at a time. Now that the join code 
has stabilized, we've found it feasible to support "batch" 
importation of series. This is subject to two limitations:

1) When importing multiple series, the --data option (which permits 
renaming of a single series on import) is not available. You have to 
accept the names of series as they appear in the source data file 
(or as "fixed up" by gretl, if need be).

2) You only get one set of key-matching, filtering and aggregation 
options; these options are applied uniformly to all series 
specified in a single command. So if you want to import several 
series but with different keys, filters or aggregation methods, 
you still need separate instances of the "join" command.
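
To make the two limitations concrete, here is a rough sketch in hansl. The file name people.csv, the keys pid and hid, and all the series names are invented for illustration, and the sketch assumes the currently open dataset already contains pid and hid under those same names.

```hansl
# Hypothetical source file, keys and column names throughout.

# Single series: --data lets you rename on import
# (the source column "PINCP" arrives as "income").
join people.csv income --data=PINCP --ikey=pid

# Multiple series: no --data, so the names come straight from the
# source file; the one key and the one filter apply to both series.
join people.csv age educ --ikey=pid --filter="age>=18"

# A different key and different aggregation methods:
# these need separate "join" commands.
join people.csv n_members --ikey=hid --aggr=count
join people.csv hh_income --data=PINCP --ikey=hid --aggr=sum
```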

How do you ask for multiple series? You just replace the second 
(series-name) argument to "join" with either (a) several series 
names, separated by spaces, or (b) the name of an array-of-strings 
variable that holds the names of the series you want.
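
The two forms might look as follows. This is only a sketch: people.csv, the key pid and the three series names are placeholders, and pid is assumed to exist under that name both in the open dataset and in the source file.

```hansl
# Hypothetical file, key and series names.

# (a) several series names, separated by spaces
join people.csv age income educ --ikey=pid

# (b) alternatively, the same request via an array of strings
strings wanted = defarray("age", "income", "educ")
join people.csv wanted --ikey=pid
```

Form (b) is probably the handier one when the list of names is built up programmatically or reused across several source files.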

My motivation for setting this up is that this semester I've been 
helping some students construct datasets from the PUMS (Public Use 
Microdata Sample) made available by the US Census Bureau. These are 
BIG files (e.g. the person datafile for California alone is over 
300 MB). So if you want data from all 50 US states plus DC, and 
especially if you want household-level data too, we're talking quite 
a major data processing exercise. I've found that with multiple 
imports in "join" it doesn't take much longer to import 6 or 7 
series at a time than it does to import a single series, meaning 
that we get a very noticeable speed-up of the process.

Allin
