Recall the linear model:

- \(Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon\)

We will consider some approaches for extending the linear model framework. In Chapter 7, we will examine generalizations of the linear model in order to accommodate non-linear, but still additive, relationships

In Chapter 8 we consider even more general non-linear models

Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance

Hence we discuss some ways in which the simple linear model can be improved, by replacing ordinary least squares fitting with some alternative fitting procedures

Prediction Accuracy:

- Alternative fitting procedures can yield better prediction accuracy than least squares, especially when \(p > n\), by controlling the variance

Model Interpretability:

By removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted

We will present some approaches for automatically performing feature selection

Subset Selection:

We identify a subset of the \(p\) predictors that we believe to be related to the response

We then fit a model using least squares on the reduced set of variables

Shrinkage:

We fit a model involving all \(p\) predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates

This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection
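As a concrete illustration of shrinkage, here is a minimal NumPy sketch of ridge regression (an \(\ell_2\) penalty) via its closed form; the function name is illustrative. Note that ridge shrinks coefficients toward zero but does not set them exactly to zero; the variable-selection effect mentioned above comes from the lasso's \(\ell_1\) penalty.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression sketch: solve (X'X + lam*I) beta = X'y.
    Predictors and response are centered so the intercept is not
    penalized. Larger lam shrinks the coefficients toward zero."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
```

As `lam` grows, the norm of the returned coefficient vector decreases relative to the least squares (`lam = 0`) fit, which is the variance-reduction effect described above.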

Dimension Reduction:

We project the \(p\) predictors into an \(M\)-dimensional subspace, where \(M < p\)

This is achieved by computing \(M\) different linear combinations, or projections, of the variables

Then these \(M\) projections are used as predictors to fit a linear regression model by least squares
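The steps above can be sketched with principal components as the linear combinations (as in principal components regression); the function name and SVD-based construction are illustrative, not from the text:

```python
import numpy as np

def pcr_fit(X, y, M):
    """Dimension reduction sketch: form M linear combinations of the
    p predictors (here, the top M principal directions from an SVD),
    then fit least squares on the M projections plus an intercept."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                       # n x M matrix of projections
    Zi = np.hstack([np.ones((len(y), 1)), Z])
    theta, *_ = np.linalg.lstsq(Zi, y, rcond=None)
    return theta                            # M + 1 coefficients
```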

Best Subset Selection Procedures

Let \(M_0\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation

For \(k = 1, 2, ..., p\):

Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors

Pick the best among these \(\binom{p}{k}\) models and call it \(M_k\)

- Where best is defined as having the smallest RSS or equivalently largest \(R^2\)

Select a single best model from among \(M_0,...,M_p\) using cross-validation prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\)
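The procedure above can be sketched in NumPy; the function name is illustrative, and for brevity "best" is tracked by RSS only (equivalently, largest \(R^2\) for a fixed size):

```python
import itertools
import numpy as np

def best_subset(X, y):
    """Best subset selection: for each size k, fit all C(p, k) least
    squares models and keep the one with the smallest RSS."""
    n, p = X.shape
    ones = np.ones((n, 1))
    # M_0: the null model predicts the sample mean for every observation
    best = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        for cols in itertools.combinations(range(p), k):
            Xk = np.hstack([ones, X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if k not in best or rss < best[k][1]:
                best[k] = (cols, rss)
    return best  # M_0, ..., M_p; pick one via CV error, Cp, BIC, ...
```

The final choice among \(M_0, ..., M_p\) must use a criterion that estimates test error, since RSS always favors the largest model.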

Example: Credit Data Set

- For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and \(R^2\) are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and \(R^2\). Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables

Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression

The deviance, negative two times the maximized log-likelihood, plays the role of RSS for a broader class of models

For computational reasons, best subset selection cannot be applied with very large \(p\)

Best subset selection may also suffer from statistical problems when \(p\) is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they may have no predictive power on future data

Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates

For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection

Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model

In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model

The computational advantage over best subset selection is clear

It is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors

Forward Stepwise Selection

Let \(M_0\) denote the null model, which contains no predictors

For \(k = 0,...,p-1\):

Consider all \(p-k\) models that augment the predictors in \(M_k\) with one additional predictor

Choose the best among these \(p-k\) models, and call it \(M_{k+1}\). Here best is defined as having the smallest RSS or highest \(R^2\)

Select a single best model from among \(M_0,...,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\)
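The forward stepwise steps above can be sketched as follows (function name illustrative; only RSS is tracked, since \(R^2\) gives the same ranking at a fixed model size):

```python
import numpy as np

def forward_stepwise(X, y):
    """Forward stepwise selection: start from the null model M_0 and at
    each step add the single predictor that most reduces the RSS."""
    n, p = X.shape
    ones = np.ones((n, 1))

    def rss(cols):
        Xk = np.hstack([ones] + [X[:, [j]] for j in cols])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        return float(np.sum((y - Xk @ beta) ** 2))

    selected = []
    path = [((), rss([]))]                  # M_0: the null model
    for k in range(p):
        remaining = [j for j in range(p) if j not in selected]
        j_best = min(remaining, key=lambda j: rss(selected + [j]))
        selected.append(j_best)
        path.append((tuple(selected), rss(selected)))
    return path  # M_0, ..., M_p; pick one via CV error, Cp, BIC, ...
```

Because each \(M_{k+1}\) is forced to contain all of \(M_k\)'s predictors, only \(p-k\) candidate models are fit at step \(k\), which is the source of the computational savings noted above.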

Example: Credit Data

- The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical, but the fourth models differ

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection

However, unlike forward stepwise selection, it begins with the full least squares model containing all \(p\) predictors, and then iteratively removes the least useful predictor, one-at-a-time

Like forward stepwise selection, the backward selection approach searches through only \(1 + p(p + 1)/2\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection
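For a sense of scale, a small arithmetic check of the two model counts at \(p = 20\):

```python
# Stepwise methods fit 1 + p(p+1)/2 models (one null/full model plus
# p-k candidates at each of the p steps), versus all 2^p subsets for
# best subset selection. For p = 20:
p = 20
stepwise_models = 1 + p * (p + 1) // 2   # 1 + 210 = 211
subset_models = 2 ** p                   # 1,048,576
print(stepwise_models, subset_models)
```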

Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors

Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n < p\), and so is the only viable subset method when \(p\) is very large

Backward Stepwise Selection

Let \(M_p\) denote the full model, which contains all \(p\) predictors

For \(k=p, p-1,...,1\):

Consider all \(k\) models that contain all but one of the predictors in \(M_k\) for a total of \(k-1\) predictors

Choose the best among these \(k\) models, and call it \(M_{k-1}\). Here best is defined as having the smallest RSS or highest \(R^2\)

Select a single best model from among \(M_0,...,M_p\) using cross-validation prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\)
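Mirroring the forward sketch, the backward procedure can be written as follows (function name illustrative; requires \(n > p\) so that the full model can be fit):

```python
import numpy as np

def backward_stepwise(X, y):
    """Backward stepwise selection: start from the full model M_p and
    at each step remove the predictor whose deletion increases the RSS
    the least. Assumes n > p so the full least squares fit exists."""
    n, p = X.shape
    ones = np.ones((n, 1))

    def rss(cols):
        Xk = np.hstack([ones] + [X[:, [j]] for j in cols])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        return float(np.sum((y - Xk @ beta) ** 2))

    selected = list(range(p))
    path = [(tuple(selected), rss(selected))]   # M_p: the full model
    while selected:
        j_drop = min(selected,
                     key=lambda j: rss([c for c in selected if c != j]))
        selected.remove(j_drop)
        path.append((tuple(selected), rss(selected)))
    return path  # M_p, M_{p-1}, ..., M_0; pick one via CV, Cp, BIC, ...
```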

The model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities are related to the training error

We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error

Therefore, RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors

We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting

We can directly estimate the test error, using either a validation set approach or a cross-validation approach

These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables
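As a hedged sketch of such adjustments, the following computes one common form of \(C_p\), BIC, and adjusted \(R^2\) from the training RSS; here \(d\) is the number of predictors in the model and \(\hat\sigma^2\) an estimate of the error variance (the exact scaling conventions vary across texts, so treat these as illustrative):

```python
import numpy as np

def adjusted_criteria(rss, tss, n, d, sigma2):
    """Size-adjusted training-error criteria. rss/tss are the model's
    residual and total sums of squares, n the sample size, d the number
    of predictors, sigma2 an estimate of the error variance."""
    cp = (rss + 2 * d * sigma2) / n              # penalizes each extra predictor
    bic = (rss + np.log(n) * d * sigma2) / n     # heavier penalty when n > 7
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2
```

Unlike RSS and \(R^2\), these quantities can get worse as predictors are added, so they can be compared across models of different sizes.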

The figure below displays \(C_p\), BIC, and adjusted \(R^2\) for the best model of each size produced by best subset selection on the `Credit` data set