Hello all,
I'm working on the Heckit estimator, and I need some feedback on the
treatment of missing observations.
Some background first: suppose we want to estimate a model
y = X \beta + u (1)
but we only observe y if another variable d equals one. Assume that
P(d=1) = \Phi(Z \gamma) (2)
As you all know, there are two ways to estimate \beta:
a) Estimate a probit model first for d, compute the Mills ratios and stick
them into (1), which you estimate by OLS. With an appropriate correction
of the OLS std. errors, what you get is the "two-step" estimator.
b) Estimate \beta, \gamma and the correlation parameter together by
maximum likelihood. This is arguably preferable.
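For reference, here is a sketch of why the Mills ratio enters (1) in
method a). It assumes the usual textbook setup, which is not spelled out
above: d is generated by a latent index with error v, and (u, v) are
jointly normal. The symbols v, \sigma, \rho and \lambda are notation
introduced just for this sketch.

```latex
% Assumed selection rule and error distribution
d = 1[Z\gamma + v > 0], \qquad
\begin{pmatrix} u \\ v \end{pmatrix} \sim
N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix} \right)

% Conditional mean of y on the selected sample
E[y \mid X, Z, d = 1] = X\beta + \rho\sigma \, \lambda(Z\gamma),
\qquad \lambda(a) = \frac{\phi(a)}{\Phi(a)}
```

So OLS of y on X and \lambda(Z \hat\gamma) over the rows with d = 1
estimates \beta, with \rho\sigma as the coefficient on the Mills ratio.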
Now suppose you have some missing observations in X, in Z or both (far
from unusual in large micro datasets). Obviously, for the ML estimator you
can only use the observations that have no missing values for any of the
variables.
With the two-step estimator, however, you may have different samples for
the two equations (1) and (2): if there are missing data in X only,
nothing forbids you from estimating (2) on the full sample and then (1) on
the subset for which you actually have data.
Would this be good or bad? The answer I have given myself so far is that,
if you use all the data for (2), you end up with better estimates of
\gamma, which in turn gives you better estimates of the Mills ratios and
hence of \beta. This, of course, assumes that the probabilistic mechanism
which dictates which rows of X are missing is independent of everything
else; otherwise, this could be a VERY bad idea.
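To make the two sample choices concrete, here is a rough simulation
sketch in Python (NumPy only). Everything in it is an assumption made for
illustration: the data-generating process, the parameter values, the 30%
missingness rate, and the helper names probit and heckit_step2. In
particular, the rows of X are dropped completely at random, which is
exactly the favourable case described above. No std. error correction is
attempted.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def norm_pdf(a):
    return np.exp(-0.5 * a**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(a):
    # Phi(a) = 0.5 * erfc(-a / sqrt(2)), accurate in the lower tail
    return 0.5 * _erfc(-np.asarray(a, dtype=float) / math.sqrt(2.0))

def probit(Z, d, iters=50):
    """Probit ML via Fisher scoring; returns the estimate of gamma."""
    g = np.zeros(Z.shape[1])
    for _ in range(iters):
        a = Z @ g
        P = np.clip(norm_cdf(a), 1e-10, 1 - 1e-10)
        p = norm_pdf(a)
        score = Z.T @ (p * (d - P) / (P * (1 - P)))
        info = (Z * (p**2 / (P * (1 - P)))[:, None]).T @ Z
        step = np.linalg.solve(info, score)
        g = g + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return g

def heckit_step2(y, X, Z, gamma, rows):
    """OLS of y on [X, Mills ratio] over `rows`; last coefficient is the Mills term."""
    a = Z[rows] @ gamma
    mills = norm_pdf(a) / norm_cdf(a)
    R = np.column_stack([X[rows], mills])
    coef, *_ = np.linalg.lstsq(R, y[rows], rcond=None)
    return coef

# --- hypothetical data-generating process ---
rng = np.random.default_rng(0)
n = 5000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
x1 = 0.5 * z1 + rng.normal(size=n)          # regressor in X that is not in Z
Z = np.column_stack([np.ones(n), z1, z2])
X = np.column_stack([np.ones(n), x1])
gamma_true = np.array([0.3, 1.0, 1.0])
beta_true = np.array([1.0, 2.0])
rho = 0.6                                    # corr(u, v), with var(u) = 1

v = rng.normal(size=n)
u = rho * v + math.sqrt(1 - rho**2) * rng.normal(size=n)
d = (Z @ gamma_true + v > 0).astype(float)
y = X @ beta_true + u
y[d == 0] = np.nan                           # y is only observed when d = 1

# Missing values in X only, completely at random (assumed)
X[rng.random(n) < 0.3, 1] = np.nan
complete = ~np.isnan(X).any(axis=1)
rows = complete & (d == 1)                   # usable rows for equation (1)

# Matching samples: probit on the complete rows only
gamma_match = probit(Z[complete], d[complete])
# Different samples: probit on every row, since Z and d are fully observed
gamma_full = probit(Z, d)

coef_match = heckit_step2(y, X, Z, gamma_match, rows)
coef_full = heckit_step2(y, X, Z, gamma_full, rows)

print("probit obs (matching, full):", int(complete.sum()), n)
print("gamma, matching samples:", gamma_match)
print("gamma, full sample:     ", gamma_full)
print("beta + Mills coef, matching:", coef_match)
print("beta + Mills coef, full:    ", coef_full)
```

A single draw like this proves nothing about which choice is better; to
compare them you would wrap the whole thing in a Monte Carlo loop and
look at the dispersion of the \beta estimates. Making the missingness
depend on v or u in the same script would show the bad case.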
For the two-step estimator, Stata uses matching samples. What should WE
do?
Comments welcome. And, oh, before you say "let the user choose", let me
just say that yes, this is a possibility, but then, what should the
default behaviour be?
Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche
r.lucchetti(a)univpm.it
http://www.econ.univpm.it/lucchetti