On Wed, Apr 16, 2025 at 11:33 AM Sven Schreiber
<sven.schreiber(a)fu-berlin.de> wrote:
I've been puzzled by real-world applications with lots of potential
regressors and how the optimization of the lambda sequence turns out,
especially with the BIC. Gretl quite often seems to tell me that the
optimal model is one where R^2 attains more or less 100%, which I
find implausible.
I'm attaching an artificial example based on the shipped fat.inp script.
I changed the single-lambda setup to one with nlambda=50. The output
seems to suggest using the very liberal model with 81 non-zero
coefficients (for n=80). It seems incredible that this can really be
true, especially given the true model in the background. Is there some
problem with the BIC calculation in these "fat" cases?
Yes, it seems clear that BIC is not an effective criterion in the fat
case. The trouble is that as a saturated model is approached the SSR
continues to decline somewhat, increasing the log-likelihood, while
the penalty term 'k' in the BIC formula (which in regls lasso is the
number of non-zero coefficients) tends to stabilize at around the
number of observations. The net effect is that the BIC continues to
"improve" as lambda shrinks. Cross validation, on the other hand,
produces sensible results.
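To see the mechanism numerically, here is a small sketch (the numbers are
made up for illustration, not taken from fat.inp) using the Gaussian
concentrated log-likelihood form BIC = n*log(SSR/n) + k*log(n). With n=80
the penalty is capped at 80*log(80) once k saturates, while the first term
falls without bound as SSR approaches zero near the saturated fit:

```python
import numpy as np

n = 80  # number of observations, as in the fat.inp example

def bic(ssr, k):
    # BIC from the concentrated Gaussian log-likelihood:
    # n*log(SSR/n) + k*log(n), with k = number of non-zero coefficients
    return n * np.log(ssr / n) + k * np.log(n)

# Illustrative (SSR, k) pairs along a shrinking-lambda path: SSR keeps
# falling while k saturates at n = 80, so BIC keeps "improving".
for ssr, k in [(40.0, 10), (5.0, 40), (0.5, 80), (1e-4, 80)]:
    print(f"SSR={ssr:8.4f}  k={k:2d}  BIC={bic(ssr, k):10.2f}")
```

The printed BIC values decrease monotonically even after k has hit its
ceiling of 80, which is exactly the degenerate behavior described above.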
I notice that glmnet does two relevant things in the fat case: (1) when
the user specifies just a number of lambda values, it sets by default a
relatively large value (0.01) for the smallest lambda as a fraction of
the largest, and (2) it automatically terminates exploration of small
lambda values when R^2 reaches 0.991 or so. We might do something
similar. Glmnet doesn't produce BIC values (though I see that
scikit-learn does); if we continue to show them we should probably
issue a warning in the fat case.
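A glmnet-style lambda grid along the lines of point (1) can be sketched as
follows. The helper name lambda_grid and the value lambda_max = 1.0 are
assumptions for illustration; the min-ratio defaults (0.01 when nobs <
nvars, 1e-4 otherwise) follow glmnet's documentation:

```python
import numpy as np

def lambda_grid(lambda_max, nlambda=50, nobs=80, nvars=200):
    # Log-spaced sequence from lambda_max down to lmr * lambda_max.
    # glmnet's default lambda.min.ratio: 0.01 if nobs < nvars ("fat"
    # case), otherwise 1e-4.
    lmr = 0.01 if nobs < nvars else 1e-4
    return np.exp(np.linspace(np.log(lambda_max),
                              np.log(lmr * lambda_max), nlambda))

grid = lambda_grid(1.0)
print(grid[0], grid[-1])  # from lambda_max down to 0.01 * lambda_max
```

The larger floor in the fat case keeps the sequence away from the
near-saturated region where the BIC degeneracy kicks in.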
Allin