Resampling methods are an indispensable tool in modern statistics

- They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model

We will cover both cross-validation and the bootstrap

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility

The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method

Model assessment

- The process of evaluating a model’s performance

Model selection

- The process of selecting the proper level of flexibility for a model

We are going to discuss two resampling methods:

Cross-validation

Bootstrap

These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model

For example, they provide estimates of test-set prediction error, and the standard deviation and bias of our parameter estimates

Recall the distinction between the *test error* and the *training error*:

*Test error*

- The average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method

*Training error*

- Can be easily calculated by applying the statistical learning method to the observations used in its training

But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter

Training versus test performance
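As a quick illustration (simulated data, not from the text; all variable names are hypothetical), the sketch below fits polynomials of increasing degree and compares training MSE with test MSE on fresh observations:

```r
set.seed(1)
n <- 100
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)              # true relationship is nonlinear
x_new <- runif(n, -2, 2)
y_new <- sin(x_new) + rnorm(n, sd = 0.3)      # fresh observations, never used in fitting

errs <- sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train = mean(residuals(fit)^2),
    test  = mean((y_new - predict(fit, newdata = data.frame(x = x_new)))^2))
})
round(errs, 3)  # training MSE falls monotonically with d; test MSE typically rises for large d
```

Because the models are nested least-squares fits, the training MSE can only decrease as the degree grows, while the test MSE reflects the true generalization error.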

Here we randomly divide the available set of samples into two parts: a *training set* and a *validation set* (or *hold-out set*)

The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set

The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and misclassification rate in the case of a qualitative (discrete) response

**Example**: Automobile data

We want to compare a simple linear fit against higher-order polynomial terms in a linear regression

We randomly split the 392 observations into two sets, a training set containing 196 of the data points, and a validation set containing the remaining 196 observations

(Figure: the left panel shows the validation MSE for a single random split; the right panel shows multiple random splits.)

Drawbacks of the validation set approach?

The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set

In the validation approach, only a subset of the observations — those that are included in the training set rather than in the validation set — are used to fit the model

This suggests that the validation set error may tend to *overestimate* the test error for the model fit on the entire data set
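A small simulated sketch of this variability (hypothetical data, not the Auto set): the same model, refit on repeated random 50/50 splits, gives a different validation MSE each time:

```r
set.seed(42)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n)

val_mse <- replicate(25, {
  idx <- sample(n, n / 2)                     # a fresh random training half
  fit <- lm(y ~ x, data = dat[idx, ])
  held <- dat[-idx, ]
  mean((held$y - predict(fit, newdata = held))^2)
})
round(range(val_mse), 3)                      # the estimate moves around across splits
sd(val_mse)
```

The spread of `val_mse` across the 25 splits is exactly the split-to-split variability described above.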

An example of cross-validation

```
library(tidyverse)
library(ISLR2)
library(rsample)

# set seed for reproducibility
set.seed(1234)

# data
Auto <- ISLR2::Auto

# split the data into training and validation halves
split_auto <- initial_split(Auto, prop = 0.5)
train_auto <- training(split_auto)
test_auto <- testing(split_auto)

# estimate a linear model on the training half
lm.fit <- lm(mpg ~ horsepower, data = train_auto)
summary(lm.fit)

# validation-set MSE
pred.fit <- predict(lm.fit, newdata = test_auto)
mean((test_auto$mpg - pred.fit)^2)
# therefore, the estimated test MSE for the linear regression fit is 23.94
```

*Widely used approach* for estimating test error

Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model

Idea is to randomly divide the data into \(K\) equal-sized parts. We leave out part \(k\), fit the model to the other \(K - 1\) parts (combined), and then obtain predictions for the left-out \(k\)th part

This is done in turn for each part \(k = 1, 2, ..., K\), and then the results are combined

**Example**: Divide data into \(K\) roughly equal-sized parts (\(K = 5\) for this example)

Let the \(K\) parts be \(C_1, C_2, ..., C_K\), where \(C_k\) denotes the indices of the observations in part \(k\)

There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)

Compute \(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} MSE_k\)

- where \(MSE_k = \sum_{i \in C_k} (y_i - \hat{y_i})^2 / n_k\) and \(\hat{y_i}\) is the fit for observation \(i\), obtained from the data with part \(k\) removed
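A from-scratch sketch of this computation on simulated data (the variable names are illustrative, not from the text):

```r
set.seed(7)
n <- 103                                      # deliberately not a multiple of K
K <- 5
dat <- data.frame(x = runif(n))
dat$y <- 1 + 3 * dat$x + rnorm(n)
folds <- sample(rep(1:K, length.out = n))     # assign each observation to a part C_k

mse_k <- sapply(1:K, function(k) {
  fit <- lm(y ~ x, data = dat[folds != k, ])  # fit on the other K - 1 parts
  held <- dat[folds == k, ]
  mean((held$y - predict(fit, newdata = held))^2)  # MSE_k for the left-out part
})
n_k <- as.vector(table(folds))                # n_k observations in part k
cv_K <- sum(n_k / n * mse_k)                  # CV_(K): weighted average of the MSE_k
cv_K
```

Note the weights \(n_k/n\): when \(n\) is not a multiple of \(K\), the folds differ slightly in size and a plain average of the \(MSE_k\) would be slightly off.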

- Setting \(K = n\) yields \(n\)-fold or *leave-one-out cross-validation* (LOOCV)

With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds

\(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} (\frac{y_i-\hat{y_i}}{1-h_i})^2\)

Where \(\hat{y_i}\) is the \(i\)th fitted value from the original least squares fit and \(h_i\) is the leverage

- This is like the ordinary MSE, except the \(ith\) residual is divided by \(1 - h_i\)
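This identity is easy to check numerically; the sketch below (simulated data, hypothetical names) compares the leverage shortcut with brute-force leave-one-out refitting:

```r
set.seed(3)
n <- 50
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

fit <- lm(y ~ x, data = dat)
h <- hatvalues(fit)                           # leverages h_i
cv_shortcut <- mean((residuals(fit) / (1 - h))^2)

cv_brute <- mean(sapply(1:n, function(i) {
  f <- lm(y ~ x, data = dat[-i, ])            # refit without observation i
  (dat$y[i] - predict(f, newdata = dat[i, , drop = FALSE]))^2
}))
all.equal(cv_shortcut, cv_brute)              # the two agree up to rounding error
```

The shortcut needs one fit plus the leverages, while the brute-force loop needs \(n\) fits, which is what makes LOOCV cheap for least squares.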

LOOCV is sometimes useful, but typically doesn't shake up the data enough: the \(n\) training sets are nearly identical, so the estimates from each fold are highly correlated and hence their average can have high variance

A better choice is \(K = 5\) or 10

```
library(caret)

# set seed
set.seed(1234)

# data
Auto <- ISLR2::Auto

# specify the cross-validation method: 10-fold CV
ctrl <- trainControl(method = "cv", number = 10)

# fit a regression model and use k-fold CV to evaluate performance
model.fit <- train(mpg ~ horsepower, data = Auto, method = "lm", trControl = ctrl)

# view summary of k-fold CV
print(model.fit)

# view final model
model.fit$finalModel

# view the held-out performance metrics for each fold
model.fit$resample
```

Since each training set is only \((K-1)/K\) as big as the original training set, the estimates of prediction error will typically be biased upward

This bias is minimized when \(K = n\) (LOOCV), but this estimate has a high variance, as noted earlier

\(K=5\) or 10 provides a good compromise for this bias-variance trade-off

We divide the data into \(K\) roughly equal-sized parts \(C_1, C_2, ..., C_K\)

Where \(C_k\) denotes the indices of the observations in part \(k\)

There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)

Compute

\(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} Err_k\)

- Where \(Err_k = \sum_{i \in C_k} I(y_i \neq \hat{y_i}) / n_k\)

The estimated standard deviation of \(CV_{(K)}\) is

\(\hat{SE}(CV_{(K)}) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \frac{(Err_k-\overline{Err})^2}{K-1}}\)

- Where \(\overline{Err}\) is the average of the \(Err_k\)

This is a useful estimate, but strictly speaking, not quite valid
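The sketch below (simulated two-class data, logistic regression as the classifier; all names hypothetical) computes the \(Err_k\), their average, and the standard-error estimate just described:

```r
set.seed(9)
n <- 200
K <- 10
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(2 * dat$x))      # class label depends on x
folds <- sample(rep(1:K, length.out = n))     # here n_k = n / K = 20 for every fold

err_k <- sapply(1:K, function(k) {
  fit <- glm(y ~ x, family = binomial, data = dat[folds != k, ])
  held <- dat[folds == k, ]
  yhat <- as.integer(predict(fit, newdata = held, type = "response") > 0.5)
  mean(yhat != held$y)                        # Err_k: misclassification rate in fold k
})
cv_K <- mean(err_k)                           # equal fold sizes, so the weights are 1/K
se_cv <- sd(err_k) / sqrt(K)                  # equivalent to the SE formula for CV_(K)
c(cv_K = cv_K, se = se_cv)
```

With equal fold sizes the weighted average reduces to a plain mean, and the SE formula simplifies to the standard deviation of the \(Err_k\) divided by \(\sqrt{K}\).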

Consider a simple classifier applied to some two-class data

Starting with 5000 predictors and 50 samples, find the 100 predictors having the largest correlation with the class labels

We then apply a classifier such as logistic regression, using only these 100 predictors

How do we estimate the test set performance of this classifier?

Can we apply cross-validation in step 2, forgetting about step 1?

NO!

This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process

It is easy to simulate realistic data with the class labels independent of the predictors, so that the true test error is 50%, but the CV error estimate that ignores Step 1 is close to zero

Wrong Way
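A simulation in this spirit (scaled down to 1,000 predictors and 10 screened variables so it runs quickly; all names hypothetical) shows how screening on the full data before cross-validating biases the error estimate:

```r
set.seed(99)
n <- 50
p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- rep(0:1, length.out = n)                 # labels independent of every predictor

# Step 1 performed on ALL the data -- this is the mistake
cors <- abs(cor(X, y))
best <- order(cors, decreasing = TRUE)[1:10]  # keep the 10 most correlated predictors

# Step 2: cross-validate a classifier using only the screened predictors
K <- 5
folds <- sample(rep(1:K, length.out = n))
err_wrong <- sapply(1:K, function(k) {
  tr <- folds != k
  d_tr <- data.frame(y = y[tr], X[tr, best, drop = FALSE])
  d_te <- data.frame(X[!tr, best, drop = FALSE])
  names(d_te) <- names(d_tr)[-1]
  fit <- suppressWarnings(glm(y ~ ., family = binomial, data = d_tr))
  yhat <- as.integer(suppressWarnings(predict(fit, newdata = d_te, type = "response")) > 0.5)
  mean(yhat != y[!tr])
})
mean(err_wrong)                               # typically far below the true 50% error rate
```

The right way is to move the screening inside the CV loop, recomputing `best` from the training folds only, so the held-out fold never influences which predictors are kept.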