Resampling Methods

Cross-Validation and the Bootstrap

Training Error versus Test Error

Validation-Set Approach


#set seed for reproducibility (seed value not shown in the original)
set.seed(1)

library(rsample)

Auto <- ISLR2::Auto

#Split the data into training and validation sets
split_auto  <- initial_split(Auto, prop = 0.5)
train_auto  <- training(split_auto)
test_auto   <- testing(split_auto)

#Estimate a linear model on the training half (model name is illustrative)
lm_auto <- lm(mpg ~ horsepower, data = train_auto)

#Compute the validation-set MSE from predictions on the held-out half
pred_auto <- predict(lm_auto, newdata = test_auto)
mean((test_auto$mpg - pred_auto)^2)

#Therefore, the estimated test MSE for the linear regression fit is 23.94 (the exact value depends on the random split)

K-Fold Cross-Validation

  • Widely used approach for estimating test error

  • Estimates can be used to select best model, and to give an idea of the test error of the final chosen model

  • Idea is to randomly divide the data into \(K\) equal-sized parts. We leave out part \(k\), fit the model to the other \(K − 1\) parts (combined), and then obtain predictions for the left-out \(k\)th part

  • This is done in turn for each part \(k = 1, 2, ..., K\) and then the results are combined

  • Example

    • Divide data into \(K\) roughly equal-sized parts (\(K = 5\) for this example)

    • Let the \(K\) parts be \(C_1, C_2, ..., C_K\), where \(C_k\) denotes the indices of the observations in part \(k\)

    • There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)

    • Compute \(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} MSE_k\)

      • where \(MSE_k = \sum_{i \in C_k} (y_i - \hat{y_i})^2 / n_k\) and \(\hat{y_i}\) is the fit for observation \(i\), obtained from the data with part \(k\) removed

      • Setting \(K = n\) yields \(n\)-fold or leave-one-out cross-validation (LOOCV)

  • With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:

    • \(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} (\frac{y_i-\hat{y_i}}{1-h_i})^2\)

    • Where \(\hat{y_i}\) is the \(i\)th fitted value from the original least-squares fit and \(h_i\) is the leverage

      • This is like the ordinary MSE, except the \(i\)th residual is divided by \(1 - h_i\)

    • LOOCV is sometimes useful, but typically doesn’t shake up the data enough. The estimates from each fold are highly correlated, and hence their average can have high variance

    • A better choice is \(K = 5\) or 10
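The shortcut can be checked numerically. The sketch below uses the built-in cars data (dataset choice and variable names are illustrative, not from the original notes) to compare the leverage-based formula with an explicit leave-one-out loop:

```r
#Check the LOOCV shortcut against explicit refitting
fit <- lm(dist ~ speed, data = cars)

#Shortcut: CV_(n) = (1/n) * sum( ((y_i - yhat_i) / (1 - h_i))^2 )
h <- lm.influence(fit)$hat                      #leverages h_i
cv_shortcut <- mean((residuals(fit) / (1 - h))^2)

#Explicit LOOCV: refit n times, leaving out one observation each time
n <- nrow(cars)
errs <- sapply(seq_len(n), function(i) {
  fit_i <- lm(dist ~ speed, data = cars[-i, ])
  (cars$dist[i] - predict(fit_i, newdata = cars[i, ]))^2
})
cv_explicit <- mean(errs)

all.equal(cv_shortcut, cv_explicit)             #the two values agree
```

Both computations yield the same number, but the shortcut requires only one fit instead of \(n\).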

library(caret)

#set seed for reproducibility (seed value not shown in the original)
set.seed(1)

Auto <- ISLR2::Auto

#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 10)

#fit a regression model and use k-fold CV to evaluate performance (model name is illustrative)
cv_model <- train(mpg ~ horsepower, data = Auto, method = "lm", trControl = ctrl)

#view summary of k-fold CV
print(cv_model)

#view final model
cv_model$finalModel

#view results for each fold
cv_model$resample
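caret reports the fold-level results automatically; the weighted average \(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} MSE_k\) can also be computed by hand. A base-R sketch, using the built-in mtcars data so it runs without ISLR2 (fold assignment and names are illustrative; the same pattern applies to Auto):

```r
set.seed(1)                          #illustrative seed
dat <- mtcars                        #stand-in for Auto
n   <- nrow(dat)
K   <- 5

#Randomly assign each observation to one of K roughly equal-sized folds
fold <- sample(rep(1:K, length.out = n))

#MSE_k: fit with fold k removed, then evaluate on fold k
mse_k <- sapply(1:K, function(k) {
  fit_k <- lm(mpg ~ hp, data = dat[fold != k, ])
  pred  <- predict(fit_k, newdata = dat[fold == k, ])
  mean((dat$mpg[fold == k] - pred)^2)
})

n_k  <- tabulate(fold, nbins = K)    #n_k = observations in fold k
cv_K <- sum(n_k / n * mse_k)         #CV_(K) = sum_k (n_k/n) * MSE_k
cv_K
```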

Other Issues with Cross-Validation

Cross-Validation for Classification Problems

  • We divide the data into \(K\) roughly equal-sized parts \(C_1, C_2, ..., C_K\)

    • Where \(C_k\) denotes the indices of the observations in part \(k\)

    • There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)

    • Compute

      • \(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} Err_k\)

        • Where \(Err_k = \sum_{i \in C_k} I(y_i \neq \hat{y_i}) / n_k\)

      • The estimated standard deviation of \(CV_K\) is

        • \(\hat{SE}(CV_K) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \frac{(Err_k-\bar{Err_k})^2}{K-1}}\)

        • This is a useful estimate, but strictly speaking not quite valid, since the \(Err_k\) values are not independent (their training sets overlap)
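The same machinery carries over to classification by replacing squared error with the misclassification indicator. A sketch using logistic regression on the built-in mtcars data (predicting the binary am variable; dataset, predictors, and names are illustrative):

```r
set.seed(1)                          #illustrative seed
dat  <- mtcars
n    <- nrow(dat)
K    <- 5
fold <- sample(rep(1:K, length.out = n))

#Err_k: misclassification rate on fold k, model fit without fold k
err_k <- sapply(1:K, function(k) {
  fit_k <- glm(am ~ hp + wt, data = dat[fold != k, ], family = binomial)
  p     <- predict(fit_k, newdata = dat[fold == k, ], type = "response")
  mean((p > 0.5) != dat$am[fold == k])
})

n_k   <- tabulate(fold, nbins = K)
cv_K  <- sum(n_k / n * err_k)        #CV_(K) = sum_k (n_k/n) * Err_k
se_cv <- sd(err_k) / sqrt(K)         #estimated SE of CV_K, per the formula above
c(cv_error = cv_K, se = se_cv)
```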

Cross-Validation: Right and Wrong