Hi all,
I've begun to explore the numerical performance of OLS regression in the case where you want to condition on a qualitative variable with many distinct values, that is, you want to run something like
ols y X dummify(fac)
where "fac" is a discrete variable with a high number of possible valid values (call it h).
Normally, you don't really care about all the parameters; you just want the OLS subvector for X (call it beta). Of course, a special case of the above is fixed-effect estimation in panel data, but the problem is in fact a little more general than that.
If nelem(X) = k, that would lead to regressing y on a list with k+h-1 elements. If the sample size n and h are both large, that takes a lot of RAM and is very inefficient, since (as is well known) beta can be computed much more cleverly via the Frisch-Waugh theorem.
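For concreteness, here is a rough numpy sketch of the Frisch-Waugh idea (this is not the attached script, and the function name fols_beta is just mine): projecting out the dummies amounts to removing group means, so beta can be obtained by demeaning y and X within the levels of fac and then running a plain k-variable OLS, never building the n-by-h dummy matrix.

```python
import numpy as np

def fols_beta(y, X, fac):
    """beta from OLS of y on [X, dummies(fac)], computed via
    Frisch-Waugh: within-group demeaning plus a k-variable OLS."""
    fac = np.asarray(fac)
    # map the factor levels to 0..h-1
    _, idx = np.unique(fac, return_inverse=True)
    counts = np.bincount(idx).astype(float)
    # subtract group means from y (bincount with weights = group sums)
    y_dm = y - (np.bincount(idx, weights=y) / counts)[idx]
    # same for each column of X
    means = np.vstack([np.bincount(idx, weights=X[:, j]) / counts
                       for j in range(X.shape[1])]).T  # shape (h, k)
    X_dm = X - means[idx]
    beta, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
    return beta
```

The demeaned regression involves only k regressors, so memory use is O(nk) rather than O(n(k+h)), which is where the payoff comes from when both n and h are large.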
The attached script does just that[*], and compares execution time for both approaches, so you can play with it.
My question to the community is: would it be worthwhile to implement the "specialised" algorithm natively? Something like
fols y X fac
where "fols" stands for "factorised OLS"? Or maybe as an option to the ols command? Or maybe as a function? Having such a command (or function) would of course only pay off when both n and h are large. Is this worth the effort?
[*] The attached function just computes beta, not all the auxiliary quantities. But those are easy to add.
-------------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Scienze Economiche e Sociali (DiSES)
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti@univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------