[Gretl-users] the aggregate() function

Wednesday, 27 February 2013

There was some interest expressed here lately in the business of 
"aggregating" some variable of interest "by" the distinct values of 
some (discrete) variable -- e.g. finding mean income by gender or 
race.

In response to this, we added the aggregate() function (it's in CVS 
and the snapshots for Windows and OS X). Now I've extended the 
functionality of aggregate(); the new stuff is not yet documented in 
the Function Reference but I'd like to explain what we've got and 
invite comments.

First, in the current doc, the "x" and "byvar" parameters must be 
(single) series, but now they can be either single series or named 
lists of series. If you give "x" as a series you get extra columns 
to the right, holding the aggregated values of each of the members 
of x. If you give "y" as a series you get a multi-level "by". The 
number of rows in the output matrix is then the number of 
combinations of the distinct values of the "byvar" variables.

Second, we've added a column to the output matrix (between the 
"byvar" values and the aggregated "x" columns) showing the count of 
observations associated with each (combination of) "byvar" values.
It seems to me this would be useful, but what do you think?

Third, when we allow more than one "by" variable a new policy 
question comes up: what do we do about combinations of by-values 
that are not actually found in the dataset (count = 0)? Either we 
skip such combinations, or we explicitly show a count of zero, and 
fill out the aggregated-x columns with NaN (not-a-number). I'm 
inclined to think it might be helpful to show the zeros explicitly, 
and so that's what we do right now, but again I'd like to hear what 
people think.

Example script using the new functionality:

<hansl>
open pizza4.gdt
series ages = age<=20 ? 0 : (age>20 && age <=40) ? 1 : 2
list Y = female college ages
list X = pizza income
matrix m = aggregate(X, Y, mean)
print m
</hansl>

(BTW you can find pizza4.gdt via google)

Allin Cottrell

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

[Gretl-users] the aggregate() function