On Fri, 24 Mar 2017, Allin Cottrell wrote:
 On Fri, 24 Mar 2017, Sven Schreiber wrote:
> On 23.03.2017 at 19:05, Allin Cottrell wrote:
> 
>> 1) The convention when calculating BIC from a model estimated by least
>> squares is to set "k" to the number of regression coefficients (leaving
>> aside the error variance), while the convention under MLE is to include
>> the variance estimator in k. (Or at least I think that's a fair
>> statement of the case.)
> 
> One more follow-up here: Can you give a source for the convention? I guess 
> in principle one can make the case that also the error variance could be 
> fixed a priori and not estimated, and so k should change accordingly. But 
> right now I don't see why that argument wouldn't apply to OLS as well.
> (Or are there some block-diagonal and/or asymptotic independence arguments 
> that would apply to one estimator here and not the other?)
 I don't know of any canonical source of the convention, and in fact it's not 
 universal. Some writers argue for including the variance parameter in the "k" 
 count for least squares, but it seems that most software doesn't do that 
 (Stata, SAS, SPSS at least). R does include the extra term, however. William 
 Greene doesn't include it, in his account of info criteria in Econometric 
 Analysis, but he doesn't comment on the matter.
 I guess using k = (number of regressors) in the least squares case is 
 motivated by the fact that k in that sense is the standard measure of
 loss of degrees of freedom in estimation. 
My guess is that, for the purpose information criteria are designed for, 
including the variance or not is quite irrelevant, so either version is 
legitimate.
The main virtue of an IC is being consistent, that is, picking the right 
model with probability approaching 1 as the sample size grows. In an OLS 
model, you don't really get a choice as to whether to estimate the 
variance or not (you have to), so "picking 
the right model" essentially means "making the right choice on the 
regressors". So in that case I suppose (but I don't have a proof, it's 
just my gut feeling) that using ln(n)*k or ln(n)*(k+1) as a penalty term 
doesn't make a difference asymptotically. Of course, as usual, in finite 
samples asymptotically equivalent choices may not be equivalent at all 
(see e.g. Wald vs LM vs LR tests).
As the saying goes, "in theory, there's no difference between theory and 
practice; in practice, there is".
But again, this is just my intuition, and I could be very wrong.
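
For what it's worth, here is a minimal sketch of the penalty arithmetic 
in Python. The SSR values, sample size and function names are made up 
for illustration (they are not gretl code or output), and the 
log-likelihood is the usual concentrated Gaussian one for a linear 
model. It only checks that the two conventions shift every model's BIC 
by the same ln(n), so the BIC gap between two OLS models fitted on the 
same n observations is the same either way.

```python
import math


def gaussian_loglik(ssr, n):
    """Concentrated Gaussian log-likelihood of a linear model with residual sum of squares ssr."""
    return -0.5 * n * (1.0 + math.log(2.0 * math.pi) + math.log(ssr / n))


def bic(ssr, n, k, count_variance=False):
    """BIC = -2*loglik + ln(n)*p, with p = k (coefficients only) or p = k + 1."""
    p = k + 1 if count_variance else k
    return -2.0 * gaussian_loglik(ssr, n) + math.log(n) * p


# Hypothetical numbers: two nested OLS models fitted on the same sample.
n = 50
small = {"ssr": 120.0, "k": 3}
large = {"ssr": 112.0, "k": 5}

# BIC gap between the two models under each convention.
gap_k = bic(n=n, **large) - bic(n=n, **small)
gap_k1 = bic(n=n, count_variance=True, **large) - bic(n=n, count_variance=True, **small)

# Shift in a single model's BIC when the variance is counted in k.
shift = bic(n=n, count_variance=True, **small) - bic(n=n, **small)

print(f"gap with k:     {gap_k:.6f}")
print(f"gap with k + 1: {gap_k1:.6f}")
print(f"shift per model: {shift:.6f}  (ln n = {math.log(n):.6f})")
```

This says nothing about criteria whose penalty is not linear in k, nor 
about comparing models estimated on different sample sizes; it is just 
the plain BIC case discussed above.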
-------------------------------------------------------
   Riccardo (Jack) Lucchetti
   Dipartimento di Scienze Economiche e Sociali (DiSES)
   Università Politecnica delle Marche
   (formerly known as Università di Ancona)
   r.lucchetti(a)univpm.it
   
http://www2.econ.univpm.it/servizi/hpp/lucchetti
-------------------------------------------------------