On Tue, 19 Jul 2011, Riccardo (Jack) Lucchetti wrote:
On Mon, 18 Jul 2011, Allin Cottrell wrote:
> I'm thinking it might be good to revise the functions
> firstobs() and lastobs() so they restrict their checks for
> non-missing values to the current sample range (right now they
> scan the entire dataset).
>
> This would be backward incompatible, but I'd be surprised if
> it would cause much trouble. Although it's so stated in the
> manual, I think it's unintuitive for function writers that
> firstobs() and lastobs() can give you observation indices that
> are outside of the sample passed by the caller, and therefore
> inaccessible.
The more I think about this, the more I get confused.
Let me start from the easy part: Allin's proposal makes a lot of sense to me,
and I am 100% in favour. [...]
This said, I've begun to think about the usefulness of having those two
functions at all. Clearly, they're no use for cross-section data.
Not quite true, see below.
How about panel datasets? Either they return something entirely
different (vectors, maybe?) or they're no use in this case,
either. Moreover: I would guess that what most people would use
firstobs() and lastobs() for is some sort of loop-based algorithm
which deals with time-series data. In that case, either you're
absolutely sure you never get NAs between firstobs() and lastobs()
or you need to put some sort of check in place. But if you do,
then what's the use of firstobs() and lastobs()?
First of all, these functions are rather old and were arguably not
very well thought out in the first place.
They give you the first or last valid observation in the dataset,
but they don't guarantee that the range thus defined contains no
NAs. However, they can serve as the basis for a check, as in
<hansl>
t1 = firstobs(y)
t2 = lastobs(y)
smpl t1 t2
if sum(missing(y)) > 0
    # error condition
endif
</hansl>
Note that a check such as this is probably most common for time
series data, but it's also relevant if you want to apply an
estimator that can't handle "internal" NAs, on any sort of
data.
But we now have better ways of doing this check, for example:
<hansl>
catch smpl y --contiguous
if $error
    funcerr "found internal NAs"
endif
</hansl>
At this point I'm inclined toward Berend's view: if we're going
to keep firstobs() and lastobs() at all, maybe we should leave
their semantics unchanged, because that way they do something
that smpl --contiguous can't do, and that might be useful in
some cases.
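To make the distinction concrete, here is a minimal sketch, assuming
the full-dataset semantics described earlier in this thread. The
artificial dataset, the series y and the chosen sample range are
invented purely for illustration:
<hansl>
# hypothetical data: 20 observations, with NAs at both ends
nulldata 20
setobs 1 1 --time-series
series y = (index < 3 || index > 18) ? NA : normal()

# restrict the sample to a range inside the valid data
smpl 5 15

# assuming firstobs() and lastobs() scan the entire dataset,
# these indices can lie outside the current sample range
scalar t1 = firstobs(y)
scalar t2 = lastobs(y)
printf "sample range: %d to %d\n", $t1, $t2
printf "valid data:   %d to %d\n", t1, t2
</hansl>
By contrast, smpl y --contiguous changes the sample rather than
reporting observation indices, so it gives no way of learning
that valid data exist beyond the range the caller passed.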
Allin