There was some interest expressed here lately in the business of
"aggregating" some variable of interest "by" the distinct values of
some (discrete) variable -- e.g. finding mean income by gender or
race.
In response to this, we added the aggregate() function (it's in CVS
and the snapshots for Windows and OS X). Now I've extended the
functionality of aggregate(); the new stuff is not yet documented in
the Function Reference but I'd like to explain what we've got and
invite comments.
First, in the current doc, the "x" and "byvar" parameters must be
(single) series, but now they can be either single series or named
lists of series. If you give "x" as a series you get extra columns
to the right, holding the aggregated values of each of the members
of x. If you give "y" as a series you get a multi-level "by". The
number of rows in the output matrix is then the number of
combinations of the distinct values of the "byvar" variables.
Second, we've added a column to the output matrix (between the
"byvar" values and the aggregated "x" columns) showing the count of
observations associated with each (combination of) "byvar" values.
It seems to me this would be useful, but what do you think?
Third, when we allow more than one "by" variable a new policy
question comes up: what do we do about combinations of by-values
that are not actually found in the dataset (count = 0)? Either we
skip such combinations, or we explicitly show a count of zero, and
fill out the aggregated-x columns with NaN (not-a-number). I'm
inclined to think it might be helpful to show the zeros explicitly,
and so that's what we do right now, but again I'd like to hear what
people think.
Example script using the new functionality:
<hansl>
open pizza4.gdt
series ages = age<=20 ? 0 : (age>20 && age <=40) ? 1 : 2
list Y = female college ages
list X = pizza income
matrix m = aggregate(X, Y, mean)
print m
</hansl>
(BTW you can find pizza4.gdt via google)
Allin Cottrell