On Wed, 24 Sep 2008, Gordon Hughes wrote:
> In August I raised the possibility of extending the dummify
> function to accommodate syntax such as
> list dlist = dummify(x, n)
> where n <= 0 means that no category is dropped, while n > 0
> means that the n-th category is dropped. For this to work, it
> would be necessary to require that x is a series, whereas in its
> current version dummify(X) will work with a list X...
I'm playing with this at present. If I remember right, there
seemed to be a consensus last time round that we don't really lose
anything by confining the dummify() function to a single series
argument (not a list), since it's likely to be confusing to run
dummify on a list anyway (and if you really want to do that you
can use a "foreach" loop).
Suggestion: allow the syntax
list L = dummify(x)
for series x, in which case all the dummies are generated; and
also support
list L = dummify(x, val)
which treats 'val' as the omitted category. (That is, the second
argument to dummify() is optional.)
That leaves a question: is it easier/more intuitive to read 'val'
as denoting the val'th category when the distinct values of x are
sorted in ascending order, or as the condition x == val? I tend to
think the latter is better. Example: for the variable Y in
greene22_2, we have
? matrix v = values(Y)
Generated matrix v
? v
v (6 x 1)
0
1
2
3
7
12
To generate dummies for all values of Y other than 7, do we do
list DL = dummify(Y, 5) # or 0-based, (Y, 4)??
or
list DL = dummify(Y, 7) # what I tend to favor
On the latter approach, you could do
DL = dummify(x, min(x))
DL = dummify(x, max(x))
to skip the first or last categories without counting them.
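
To make the two readings concrete, here is a rough sketch in Python (not gretl syntax, and not the actual implementation). The function names dummify_by_value and dummify_by_position are made up for illustration, and the data are invented observations that merely share the six distinct values of Y shown above.

```python
def dummify_by_value(x, omit=None):
    """One 0/1 indicator per distinct value of x, skipping the value `omit`.

    With omit=None no category is dropped.
    """
    cats = sorted(set(x))
    return {v: [1 if xi == v else 0 for xi in x] for v in cats if v != omit}


def dummify_by_position(x, n=0):
    """Same, but skip the n-th (1-based) distinct value in sorted order.

    n <= 0 (or n beyond the number of categories) drops nothing.
    """
    cats = sorted(set(x))
    omit = cats[n - 1] if 1 <= n <= len(cats) else None
    return dummify_by_value(x, omit)


# Invented observations with the same distinct values as Y: 0, 1, 2, 3, 7, 12
y = [0, 1, 2, 3, 7, 12, 7, 0, 3]

by_value = dummify_by_value(y, 7)        # reading: omit where y == 7
by_position = dummify_by_position(y, 5)  # reading: omit the 5th category

print(sorted(by_value))         # [0, 1, 2, 3, 12]
print(by_value == by_position)  # True
```

Both calls give the same set of dummies, but the positional version requires knowing that 7 is the fifth distinct value, whereas the value-based version only requires knowing the value itself.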
Allin.