Hi,

I really appreciate the new functionalities of aggregate()!
At first some comments:
    a) including the cases is very useful
    b) in terms of the NaN rows in a matrix, well I'm not that sure, in my function I skip them at the moment, but perhaps sometimes
        it's exactly those combinations one likes to know. Is there already a way that generates a copy of a matrix without lines including missings?
    c) what about the mcov() function as 'aggregator function'?

Second, I'm of the opinion that there is a quite important and user friendly option missing:
It's about combinatorics.
Although one has e.g. 5 discrete variables one wants e.g. ONLY ALL BIVARIATE combinations.
This should be done by GRETL and not by hand.
I found a solution that looks like this (including the nice pack and unpack functions!).
In the GRETL function this could look like: aggregate(X, Y, mean,2) where the last argument is 'nelem(Y)' as standard.

<hansl>
open pizza4.gdt
set echo off
series ages = age<=20 ? 0 : (age>20 && age <=40) ? 1 : 2

list X = pizza income
list Y = female college ages

# by hand bivariate interaction, rather easy with three variables...
list Y1 = female college
list Y2 = female ages
list Y3 = college ages

matrix m2a = aggregate(X, Y1, mean)
matrix m2b = aggregate(X, Y2, mean)
matrix m2c = aggregate(X, Y3, mean)

# all three
matrix m3 = aggregate(X, Y, mean)

# comparison to my solution

bundle MVS = null
MVS["LIDs"] = getIDs(Y)
# for automatic bivariate interaction
scalar inter = 3
matrix codes ={}
loop for i=2..inter -q
    Com$i =COMBIS(MVS[LIDs],i,&MVS)
    loop for c=1..cols(Com$i) -q
        string listName = "L$i_$c"
        list @listName = Com$i[,c]
        agg(X,@listName,&codes)
        # or if u want to store the matrices in a bundle
        agg(X,@listName,&codes,1,&MVS)
    endloop
endloop


# FUNCTIONS ************************************************************************

#********************* function 1 *****************************************************

function scalar choose(scalar n, scalar k)
    scalar binC = gammafun(n+1)/(gammafun(k+1)*gammafun(n-k+1))
return binC
end function


#********************* function 2 *****************************************************

function matrix COMBIS (matrix x, scalar m, bundle *bun[null])

  scalar n = rows(x)

  if (n < m)
       print "Not possible!"
  endif
  scalar e = 0
  scalar h = m
  matrix a = seq(1,m)

  matrix r = x[a]
  scalar lenr = rows(r)
  scalar count = choose(n, m)  # choose is 'n over k'
  matrix dimuse = {m, count}
  # gerneral structur of output matrix
  matrix out = mshape(r, lenr, count)

  if (m > 0)
    scalar i = 2
    scalar nmmp1 = n - m + 1
    loop while (a[1] != nmmp1) -q
      if (e < n - h)
         scalar h = 1
         scalar e = a[m]
         matrix j = 1
     else
          scalar e = a[m - h]
          scalar h += 1
          matrix j = seq(1,h)
      endif 
      a[m - h + j] = e + j
      r = x[a]
      out[, i] = r
      i+= 1
      endloop
  endif
 
  if  !isnull(bun)
      sprintf num "%g", m
      bun["combis_@num"]=out
  endif
  return out 
end function

#********************* function 3 *****************************************************

function series pack(list X, matrix *pwrof2)
    /*
    a check should be necessary that all series in X
    are discrete; for the moment, let's just skip it
    */

    series ret = 0
    pwrtot = 1
    k = nelem(X)
    # first build the nvalues vector
    # starting values are
    pwrof2 = {}
    n = 1
    k = 1
         
    loop foreach i X --quiet     
        ret += k*X.$i
        # some fun to generate a unique number
        n = 2^ceil(log2(1+max(X.$i)))
        k *= n
        # save the key for unpacking
        pwrof2 |= n
    end loop

    return ret
end function

#********************* function 4 *****************************************************

function matrix unpack(scalar x, matrix pwr)

    # this function retrurnes the conditions for the factor steps
    # number of codes which has to be decrypted
    # for each factor one code (not for each factor step!)
    k = rows(pwr)

    matrix ret = zeros(k,1)

    loop i=1..k --quiet
        h = pwr[i]
        a = x % h
        ret[i] = a
        # prepare for next factor
        x = (x-a)/h
    end loop

    return ret
end function

#********************* function 5 *****************************************************

function void agg(list endos, list groups, matrix *codes, scalar func[1], bundle *bun[null], scalar MLE[0], series *Indices[null])
   
    matrix calc = {}
    scalar m = nelem(endos) 

    scalar COLsneeded = func*m
    u = pack(groups, &codes)
    n_u = values(u)
    matrix calc = zeros(rows(n_u),1+nelem(groups)+COLsneeded)
    matrix calc_start = zeros(rows(n_u),1)
   
    sprintf lName "%s", strsub(varname(groups),",","_")
   
    loop i = 1..rows(n_u) -q
        r = n_u[i]
        smpl u==r --res --rep
        scalar Ns = $nobs
        calc[i,1] = Ns
        matrix c = unpack(r, codes)
        step1 = nelem(groups) +1
        calc[i,2:step1] = c'
        step2 = step1 +1
        step3 = step1 + m
        if func==1
            calc[i,step2:step3] = meanc({endos})
            step3++
        elif func==2
            calc[i,step2:step3] = meanc({endos})
            step4 = step3 + m
            step3++           
            calc[i,step3:step4] = sd({endos})
            steps = step4+1
        endif
       if  !isnull(Indices)
           calc_start[i,1]=r
       endif      
    end loop  
   
   
    if  !isnull(Indices)
        string NAMEcolMVS = "ID cases "
    else
        string NAMEcolMVS = "cases "
    endif
    NAMEcolMVS += strsub(varname(groups),","," ")
    string NAMEcolENDOS = strsub(varname(endos),","," ")
    colnames(calc_start,"ID")
 
        matrix take1 = seq(1,nelem(groups)+1)
        if  !isnull(Indices)
            matrix adding = calc_start~calc[,take1]
            colnames(adding,NAMEcolMVS)
            if  !isnull(bun)
                bun["MVS_@lName"]=adding
            else
                print "Infos"
                eval adding
            endif
        else
            matrix adding = calc[,take1]
            colnames(adding,NAMEcolMVS)
            if  !isnull(bun)
                bun["MVS_@lName"]=adding
            else
                print "Infos"
                eval adding
            endif
        endif
        matrix take2 = seq(1,nelem(endos))      
        loop for f=1..func -q
            if f==1
                take2 = take2.+cols(take1)
                matrix adding2 = calc[,take2]
                colnames(adding2,NAMEcolENDOS)
                if  !isnull(bun)
                    bun["means_@lName"]=adding2
                else
                     print "means"
                     eval adding2
                endif
            elif f==2
                take2 = take2 .+ m
                matrix adding2 = calc[,take2]
                colnames(adding2,NAMEcolENDOS)
                if  !isnull(bun)
                    bun["sds_@lName"]=adding2
                else
                     print "sds"
                     eval adding2
                endif
            endif          
         endloop

    smpl full
    if  !isnull(Indices)
        Indices = u
    endif    
   
end function

#********************* function 6 *****************************************************

function matrix getIDs(list G)
    matrix IDs = {}
    loop foreach i G -q
        IDs |= varnum(G.$i)
    endloop
    return IDs
end function

# END FUNCTIONS *************************************************************



<hansl>


27.02.2013 23:54, Allin Cottrell:
There was some interest expressed here lately in the business of 
"aggregating" some variable of interest "by" the distinct values of 
some (discrete) variable -- e.g. finding mean income by gender or 
race.

In response to this, we added the aggregate() function (it's in CVS 
and the snapshots for Windows and OS X). Now I've extended the 
functionality of aggregate(); the new stuff is not yet documented in 
the Function Reference but I'd like to explain what we've got and 
invite comments.

First, in the current doc, the "x" and "byvar" parameters must be 
(single) series, but now they can be either single series or named 
lists of series. If you give "x" as a series you get extra columns 
to the right, holding the aggregated values of each of the members 
of x. If you give "y" as a series you get a multi-level "by". The 
number of rows in the output matrix is then the number of 
combinations of the distinct values of the "byvar" variables.

Second, we've added a column to the output matrix (between the 
"byvar" values and the aggregated "x" columns) showing the count of 
observations associated with each (combination of) "byvar" values.
It seems to me this would be useful, but what do you think?

Third, when we allow more than one "by" variable a new policy 
question comes up: what do we do about combinations of by-values 
that are not actually found in the dataset (count = 0)? Either we 
skip such combinations, or we explicitly show a count of zero, and 
fill out the aggregated-x columns with NaN (not-a-number). I'm 
inclined to think it might be helpful to show the zeros explicitly, 
and so that's what we do right now, but again I'd like to hear what 
people think.

Example script using the new functionality:

<hansl>
open pizza4.gdt
series ages = age<=20 ? 0 : (age>20 && age <=40) ? 1 : 2
list Y = female college ages
list X = pizza income
matrix m = aggregate(X, Y, mean)
print m
</hansl>

(BTW you can find pizza4.gdt via google)

Allin Cottrell
_______________________________________________
Gretl-users mailing list
Gretl-users@lists.wfu.edu
http://lists.wfu.edu/mailman/listinfo/gretl-users