On Thu, 10 Sep 2009, Irwin, James R wrote:
Hi. Wondering if anyone can point me toward how to get Gretl to
do the equivalent of STATA's collapse command. For example, I
have a data set with about 1,000 observations with YEAR X2 and
X3 (where YEAR is an integer with values from 1760 to 1880).
I want to get a data set that is the count and average of X2 by
YEAR. In STATA I write
collapse (count) num2=X2 (mean) avg2=X2, by(YEAR)
and I get a data set with that is YEAR and counts and means of
the variable X2 for each year.
From what I've seen of Gretl it seems this should be a trivial
exercise but I seem to be stumped.
Thanks for your consideration. -- jim irwin (economic historian,
trying to migrate from STATA).
Welcome to the gretl list, and I hope we can help you to migrate
without too much pain!
I have to admit that Stata's "collapse" command seems oddly
specific to me -- I mean, I wouldn't have thought that such an
apparently specialized operation would merit a command to itself.
But maybe that just shows lack of imagination on my part!
Anyway, yes, gretl can do this sort of thing but you have to roll
your own "collapse". My approach below is to create a matrix
containing the "collapsed" values, then substitute this matrix for
the current dataset.
For illustration I'm using a stylized version of the dataset you
describe, namely this plain text file
<collapse.txt>
obs YEAR X2
1 1950 1
2 1950 2
3 1950 3
4 1950 4
5 1951 2
6 1951 NA
7 1951 4
8 1951 5
9 1952 3
10 1952 4
11 1952 5
12 1952 6
13 1952 7
</collapse.txt>
Note that I've thrown in a missing observation ("NA") just for
fun.
Now here's the gretl script to do the job (without the
explanations this could be a lot more compact):
<script>
open collapse.txt
# make a matrix containing the distinct values of YEAR
matrix yrvals = values(YEAR)
# and find out how many values there are
scalar nvals = rows(yrvals)
# then allocate an appropriately sized zero matrix
matrix newdata = zeros(nvals, 3)
# now fill the 'newdata' matrix
loop i=1..nvals
# restrict the sample to the given YEAR
smpl (YEAR = yrvals[i]) --restrict
# record the YEAR we're looking at
newdata[i,1] = yrvals[i]
# count the (non-missing) observations on X2
newdata[i,2] = sum(ok(X2))
# and record the mean of X2
newdata[i,3] = mean(X2)
# re-establish the full data range for next iteration
smpl --full
endloop
# create a new dataset with the right number of observations,
# preserving current matrices
nulldata nvals --preserve
# and dump out the columns of 'newdata' into series
series YEAR = newdata[,1]
series X2count = newdata[,2]
series X2avg = newdata[,3]
# then see what we got
print YEAR X2count X2avg --byobs
</script>
Allin Cottrell