Linear Model Selection and Regularization


Why Consider Alternatives to Least Squares?

  • Prediction Accuracy:

    • especially when \(p > n\), to control the variance
  • Model Interpretability:

    • By removing irrelevant features — that is, by setting the corresponding coefficient estimates to zero — we can obtain a model that is more easily interpreted

    • We will present some approaches for automatically performing feature selection


Three Classes of Methods


Subset Selection

  • Best Subset Selection Procedures

    • Let \(M_0\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation

    • For \(k = 1, 2, \dots, p\):

      • Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors

      • Pick the best among these \(\binom{p}{k}\) models and call it \(M_k\)

        • Where best is defined as having the smallest RSS or equivalently largest \(R^2\)
    • Select a single best model from among \(M_0,...,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\)

  • Example: Credit Data Set


  • For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and \(R^2\) are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and \(R^2\). Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables
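The procedure above can be sketched directly in Python. This is a minimal illustration, not an optimized implementation: it assumes a numeric design matrix `X` and response `y`, and the helper name `best_subset` is hypothetical.

```python
# Sketch of best subset selection via ordinary least squares.
# For each size k, fit all C(p, k) models and keep the one with the
# smallest RSS (equivalently, the largest R^2); this yields M_0, ..., M_p.
from itertools import combinations
import numpy as np

def best_subset(X, y):
    n, p = X.shape
    intercept = np.ones((n, 1))
    # M_0: the null model predicts the sample mean for every observation
    models = {0: ((), np.sum((y - y.mean()) ** 2))}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for vars_ in combinations(range(p), k):   # all C(p, k) subsets
            Xk = np.hstack([intercept, X[:, vars_]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, vars_
        models[k] = (best_vars, best_rss)         # M_k
    # The final choice among M_0, ..., M_p should use CV error,
    # C_p, BIC, or adjusted R^2 -- not RSS, which always favors larger k.
    return models
```

Note that the inner loop fits \(2^p\) models in total, which is why this approach becomes infeasible for even moderate \(p\).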

Extensions to Other Models

  • Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression

  • The deviance, defined as negative two times the maximized log-likelihood, plays the role of RSS for a broader class of models


Stepwise Selection


Forward Stepwise Selection

  • Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model

  • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model

  • The computational advantage over best subset selection is clear: forward stepwise fits only \(1 + p(p+1)/2\) models, rather than all \(2^p\)

  • It is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors

  • Forward Stepwise Selection

    • Let \(M_0\) denote the null model, which contains no predictors

    • For \(k = 0,...,p-1\):

      • Consider all \(p-k\) models that augment the predictors in \(M_k\) with one additional predictor

      • Choose the best among these \(p-k\) models, and call it \(M_{k+1}\). Here best is defined as having the smallest RSS or highest \(R^2\)

    • Select a single best model from among \(M_0,...,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\)

  • Example: Credit Data


  • The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical, but the fourth models differ
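The greedy procedure above can be sketched as follows. This is a minimal illustration assuming a numeric design matrix `X` and response `y`; the helper names `rss` and `forward_stepwise` are hypothetical.

```python
# Sketch of forward stepwise selection: starting from the null model,
# greedily add the single predictor that most reduces RSS at each step.
import numpy as np

def rss(X, y, cols):
    # RSS of the least-squares fit on the intercept plus the given columns
    Xc = np.hstack([np.ones((len(y), 1))] + [X[:, [c]] for c in cols])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return np.sum((y - Xc @ beta) ** 2)

def forward_stepwise(X, y):
    p = X.shape[1]
    selected, path = [], [[]]        # M_0: no predictors
    for k in range(p):               # k = 0, ..., p-1
        remaining = [j for j in range(p) if j not in selected]
        # among the p-k candidate additions, pick the best (smallest RSS)
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        path.append(list(selected))  # M_{k+1}
    # Choose among M_0, ..., M_p by CV error, C_p, BIC, or adjusted R^2
    return path
```

Because each step only considers additions to the current model, the search fits \(1 + p(p+1)/2\) models rather than \(2^p\), at the cost of possibly missing the best overall subset.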

Backward Stepwise Selection


Choosing the Optimal Model


Estimating Test Error: Two Approaches


\(C_p\), AIC, BIC, and Adjusted \(R^2\)

  • These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables

  • The figure below displays \(C_p\), BIC, and adjusted \(R^2\) for the best model of each size produced by the best subset selection on the Credit data set
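For reference, the least-squares versions of these adjustments can be written down directly. A minimal sketch, assuming \(n\) observations, \(d\) predictors in the model, RSS and TSS from the fit, and an estimate \(\hat{\sigma}^2\) of the error variance:

```python
# Least-squares forms of the model-size adjustments:
#   C_p          = (RSS + 2 d sigma^2) / n
#   BIC          = (RSS + log(n) d sigma^2) / n
#   adjusted R^2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
# Smaller C_p / BIC is better; larger adjusted R^2 is better.
import numpy as np

def c_p(rss, n, d, sigma2):
    return (rss + 2 * d * sigma2) / n

def bic(rss, n, d, sigma2):
    # log(n) > 2 for n > 7, so BIC penalizes model size more heavily
    # than C_p and tends to select smaller models
    return (rss + np.log(n) * d * sigma2) / n

def adjusted_r2(rss, tss, n, d):
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```

Unlike RSS and \(R^2\), each of these can get worse as variables are added, so they can be compared across models of different sizes.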