Hi all,
I've begun to explore the issue of the numerical performance of OLS
regression, where you want to condition on a qualitative variable with
many different values; that is, you want to run something like
ols y X dummify(fac)
where "fac" is a discrete variable with a high number of possible valid
values (call it h).
Normally, you don't really care about all the parameters; you just want
the OLS subvector for X (call it beta). Of course, a special case of the
above is fixed-effect estimation in panel data, but the problem is in
fact a little bit more general than that.
If nelem(X) = k, that would lead to regressing y on a list with k+h-1
elements. If the sample size n and h are both large, that takes a lot of
RAM, and it's very inefficient, since (as is well known) you can compute
beta much more cleverly via the Frisch-Waugh theorem: partialling the
dummies out of y and X amounts to demeaning them within the groups
defined by fac.
The attached script does just that[*], and compares execution time for
both approaches, so you can play with it.
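For readers who don't want to open the attached script, here is a rough sketch of the idea in Python/numpy (the attached script itself is in hansl; the function name `fols_beta` and the toy data are mine, purely for illustration). Frisch-Waugh says the coefficients on X from the full dummy regression equal the coefficients from regressing within-group-demeaned y on within-group-demeaned X, so the h-column dummy block never needs to be built:

```python
import numpy as np

def fols_beta(y, X, fac):
    """Sketch of "factorised OLS": beta on X from regressing y on
    [X, dummies(fac)], computed via Frisch-Waugh, i.e. by subtracting
    group means instead of materialising the n-by-h dummy matrix.
    ('fols_beta' is a hypothetical name, not an existing function.)"""
    idx = np.unique(np.asarray(fac), return_inverse=True)[1]

    def demean(v):
        # group sums via scatter-add, then subtract each group's mean
        sums = np.zeros((idx.max() + 1,) + v.shape[1:])
        np.add.at(sums, idx, v)
        counts = np.bincount(idx).reshape(-1, *([1] * (v.ndim - 1)))
        return v - (sums / counts)[idx]

    beta, *_ = np.linalg.lstsq(demean(X), demean(y), rcond=None)
    return beta

# brute-force check against the naive regression on [X, dummies]
rng = np.random.default_rng(1)
n, k, h = 200, 3, 10
fac = rng.integers(0, h, n)
X = rng.standard_normal((n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + fac.astype(float) + rng.standard_normal(n)
D = np.eye(h)[fac]                       # the full dummy matrix
full, *_ = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)
print(np.allclose(fols_beta(y, X, fac), full[:k]))   # → True
```

The point of the sketch is the memory footprint: the naive approach stores an n-by-(k+h) regressor matrix, while the demeaning route only ever touches n-by-k arrays plus a length-h vector of group sums.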
My question to the community is: would it be worthwhile to implement the
"specialised" algorithm natively? Something like
fols y X fac
where "fols" stands for "factorised OLS"? Or maybe as an option to the
ols command? Or maybe as a function? Having such a command (or function)
would of course only pay off when both n and h are large.
Is this worth the effort?
[*] The attached function just computes beta, not all the auxiliary
quantities. But those are easy to add.
-------------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Scienze Economiche e Sociali (DiSES)
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti(a)univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------