Hi all,
I've begun to explore the numerical performance of OLS regression in the case where you want to condition on a qualitative variable with many distinct values, that is, you want to run something like
ols y X dummify(fac)
where "fac" is a discrete variable with a high number of possible valid values (call it h).
Normally, you don't really care about all the parameters; you just want the OLS subvector for X (call it beta). Of course, a special case of the above is fixed-effect estimation in panel data, but the problem is in fact a little more general than that.
If nelem(X) = k, that would lead to regressing y on a list with k+h-1 elements. If the sample size n and h are both large, that takes a lot of RAM and is very inefficient, since (as is well known) beta can be computed much more cleverly via the Frisch-Waugh theorem.
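For concreteness, here is a rough numpy sketch of the Frisch-Waugh idea (this is not the attached script, and the function name fols_beta is just mine): projecting out the dummies amounts to removing group means, so beta can be obtained by demeaning y and X within the levels of fac and then running a plain k-variable OLS, never building the n-by-h dummy matrix.

```python
import numpy as np

def fols_beta(y, X, fac):
    """beta from OLS of y on [X, dummies(fac)], computed via
    Frisch-Waugh: within-group demeaning plus a k-variable OLS."""
    fac = np.asarray(fac)
    # map the factor levels to 0..h-1
    _, idx = np.unique(fac, return_inverse=True)
    counts = np.bincount(idx).astype(float)
    # subtract group means from y (bincount with weights = group sums)
    y_dm = y - (np.bincount(idx, weights=y) / counts)[idx]
    # same for each column of X
    means = np.vstack([np.bincount(idx, weights=X[:, j]) / counts
                       for j in range(X.shape[1])]).T  # shape (h, k)
    X_dm = X - means[idx]
    beta, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
    return beta
```

The demeaned regression involves only k regressors, so memory use is O(nk) rather than O(n(k+h)), which is where the payoff comes from when both n and h are large.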
The attached script does just that[*], and compares execution time for both approaches, so you can play with it.
My question to the community is: would it be worthwhile to implement the "specialised" algorithm natively? Something like
fols y X fac
where "fols" stands for "factorised OLS"? Or maybe as an option to the ols command? Or maybe as a function? Having such a command (or function) would of course only pay off when both n and h are large. Is this worth the effort?
[*] The attached function just computes beta, not all the auxiliary quantities. But those are easy to add.
-------------------------------------------------------
Riccardo (Jack) Lucchetti
Dipartimento di Scienze Economiche e Sociali (DiSES)
Università Politecnica delle Marche
(formerly known as Università di Ancona)
r.lucchetti@univpm.it
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------