\[ \text{MSE}(\lambda) = \mbox{E}\{[\widehat{Y}(\lambda) - Y]^2 \} \]
\[ \hat{\mbox{MSE}}(\lambda) = \frac{1}{N}\sum_{i = 1}^N \left\{\widehat{y}_i(\lambda) - y_i\right\}^2 \]
Imagine a world in which we could repeat the data collection over and over.
Take a large number of samples \(B\) and define:
\[ \frac{1}{B} \sum_{b=1}^B \frac{1}{N}\sum_{i=1}^N \left\{\widehat{y}_i^b(\lambda) - y_i^b\right\}^2 \]
We can’t do this in practice.
But we can try to imitate it.
\[ \mbox{MSE}(\lambda) \approx\frac{1}{B} \sum_{b = 1}^B \frac{1}{N}\sum_{i = 1}^N \left(\widehat{y}_i^b(\lambda) - y_i^b\right)^2 \]
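In a simulation, where the data-generating process is known, the repeated data collection can actually be carried out. The following Python sketch illustrates the average above under some assumptions that are not part of the text: a hypothetical process \(y = 2x + \varepsilon\), ridge regression standing in for the model indexed by \(\lambda\), and a model that is fit once and then evaluated on \(B\) freshly drawn samples. The helper names `draw_sample` and `fit_ridge` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n):
    # Hypothetical data-generating process: y = 2x + standard normal noise
    x = rng.normal(size=(n, 1))
    y = 2.0 * x[:, 0] + rng.normal(size=n)
    return x, y

def fit_ridge(x, y, lam):
    # Closed-form ridge solution (no intercept, for brevity)
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

N, B, lam = 50, 1000, 1.0

# Fit once on an initial sample of size N
x_train, y_train = draw_sample(N)
beta = fit_ridge(x_train, y_train, lam)

# "Repeat data collection" B times and average the per-sample MSE
mse_per_sample = []
for b in range(B):
    x_b, y_b = draw_sample(N)
    mse_per_sample.append(np.mean((x_b @ beta - y_b) ** 2))

mse_estimate = float(np.mean(mse_per_sample))
print(mse_estimate)
```

With real data we only have one sample, which is why the resampling ideas that follow are needed.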
We want to generate a dataset that can be thought of as an independent random sample, and do this \(B\) times.
The K in K-fold cross validation represents this number of times \(B\).
For each sample, we simply pick \(M = N/B\) observations at random and think of these as a random sample \(y_1^b, \dots, y_M^b\), starting with \(b = 1\).
We call this the validation set.
Now we can fit the model on the training set, then compute the apparent error on the independent validation set:
\[ \hat{\mbox{MSE}}_b(\lambda) = \frac{1}{M}\sum_{i = 1}^M \left(\widehat{y}_i^b(\lambda) - y_i^b\right)^2 \]
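Repeating this calculation for each of the \(B\) samples gives \(B\) estimates:
\[\hat{\mbox{MSE}}_1(\lambda),\dots, \hat{\mbox{MSE}}_B(\lambda)\]
For our final estimate, we compute their average: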
\[ \hat{\mbox{MSE}}(\lambda) = \frac{1}{B} \sum_{b = 1}^B \hat{\mbox{MSE}}_b(\lambda) \]
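The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: ridge regression is assumed as the model indexed by \(\lambda\), the data are simulated, and the names `ridge_fit` and `kfold_mse` are hypothetical.

```python
import numpy as np

def ridge_fit(x, y, lam):
    # Closed-form ridge solution (no intercept, for brevity)
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

def kfold_mse(x, y, lam, folds):
    """Average the validation MSE over the given non-overlapping folds."""
    n = len(y)
    fold_mse = []
    for val_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[val_idx] = False            # hold out the validation set
        beta = ridge_fit(x[train_mask], y[train_mask], lam)
        resid = x[val_idx] @ beta - y[val_idx]
        fold_mse.append(np.mean(resid ** 2))   # MSE_b(lambda)
    return float(np.mean(fold_mse))            # average over b = 1, ..., B

# Simulated data (assumption: linear signal plus standard normal noise)
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 5))
y = x @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=200)

# Split the N = 200 indices into B = 10 folds of M = N/B = 20 observations
B = 10
folds = np.array_split(rng.permutation(len(y)), B)

lambdas = [0.01, 1.0, 100.0, 1000.0]
cv_mse = {lam: kfold_mse(x, y, lam, folds) for lam in lambdas}
best_lam = min(cv_mse, key=cv_mse.get)
print(cv_mse, best_lam)
```

Reusing the same folds for every value of \(\lambda\) is a design choice made here so that differences in the estimates reflect the parameter rather than the random split.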
How do we pick the number of cross validation folds?
Large values of \(B\) are preferable because the training data better imitates the original dataset.
However, larger values of \(B\) are much slower to compute because the model must be fit \(B\) times: for example, 100-fold cross validation will be roughly 10 times slower than 10-fold cross validation.
For this reason, the choices of \(B = 5\) and \(B = 10\) are popular.
One way we can reduce the variance of our final estimate is to take more samples.
To do this, we would no longer require the training set to be partitioned into non-overlapping sets.
Instead, we would just pick \(B\) sets of some size at random.
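As a rough Python sketch of this variant, the validation sets below are drawn independently at random, so they may overlap from one iteration to the next. Ridge regression, the simulated data, the validation size of 20, and the names `ridge_fit` and `random_split_mse` are all assumptions made for illustration.

```python
import numpy as np

def ridge_fit(x, y, lam):
    # Closed-form ridge solution (no intercept, for brevity)
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

def random_split_mse(x, y, lam, B, val_size, rng):
    """Validation MSE over B randomly chosen, possibly overlapping, sets."""
    n = len(y)
    mses = []
    for _ in range(B):
        # No replacement within a set, but sets can share rows across iterations
        val_idx = rng.choice(n, size=val_size, replace=False)
        train_mask = np.ones(n, dtype=bool)
        train_mask[val_idx] = False
        beta = ridge_fit(x[train_mask], y[train_mask], lam)
        mses.append(np.mean((x[val_idx] @ beta - y[val_idx]) ** 2))
    return float(np.mean(mses)), float(np.std(mses, ddof=1))

# Simulated data (assumption: linear signal plus standard normal noise)
rng = np.random.default_rng(2)
x = rng.normal(size=(200, 5))
y = x @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=200)

mean_mse, sd_mse = random_split_mse(x, y, lam=1.0, B=50, val_size=20, rng=rng)
print(mean_mse, sd_mse)
```

Because \(B\) is no longer tied to the number of non-overlapping folds, it can be increased freely, at the cost of one extra model fit per sample.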
We have described how to use cross validation to optimize parameters.
However, we now have to take into account the fact that the optimization occurred on the training data. We therefore need an estimate of the error of our final algorithm based on data that was not used to optimize the choice.
Here is where we use the test set we separated early on.
We can apply cross validation again and obtain a final estimate of our expected loss.
However, note that this last round of cross validation means that our entire compute time gets multiplied by \(K\).
You will soon learn that fitting each algorithm takes time because we are performing many complex computations.
As a result, we are always looking for ways to reduce this time.
For the final evaluation, we often just use the one test set.
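The full workflow with a single test set might look like the following Python sketch: separate the test set first, choose \(\lambda\) by cross validation on the training data only, refit on all the training data, and evaluate once on the test set. Ridge regression, the simulated data, the 80/20 split, and the helper names are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(x, y, lam):
    # Closed-form ridge solution (no intercept, for brevity)
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

def mse(x, y, beta):
    return float(np.mean((x @ beta - y) ** 2))

# Simulated data (assumption: linear signal plus standard normal noise)
rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=(n, 5))
y = x @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=n)

# Separate the test set early on; it plays no part in choosing lambda
perm = rng.permutation(n)
test_idx, train_idx = perm[:60], perm[60:]
x_tr, y_tr = x[train_idx], y[train_idx]
x_te, y_te = x[test_idx], y[test_idx]

# 5-fold cross validation on the training data only
folds = np.array_split(rng.permutation(len(y_tr)), 5)

def cv_mse(lam):
    out = []
    for val in folds:
        mask = np.ones(len(y_tr), dtype=bool)
        mask[val] = False
        beta = ridge_fit(x_tr[mask], y_tr[mask], lam)
        out.append(mse(x_tr[val], y_tr[val], beta))
    return float(np.mean(out))

lambdas = [0.01, 1.0, 100.0]
best_lam = min(lambdas, key=cv_mse)

# Refit on all training data with the chosen lambda, then evaluate once
final_beta = ridge_fit(x_tr, y_tr, best_lam)
test_mse = mse(x_te, y_te, final_beta)
print(best_lam, test_mse)
```

The test set is touched exactly once here, so the final evaluation adds a single model fit rather than multiplying the compute time.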
Typically, cross-validation involves partitioning the original dataset into a training set to train the model and a testing set to evaluate it.
With the bootstrap approach, you instead create multiple different training datasets by resampling the original one.
Fitting a model to each such sample and averaging the resulting predictions is called bootstrap aggregating, or bagging.
In bootstrap resampling, we create a large number of bootstrap samples from the original training dataset.
Each bootstrap sample is created by randomly selecting observations with replacement, usually drawing as many observations as there are in the original training dataset.
For each bootstrap sample, we fit the model and compute the MSE estimate on the observations not selected in the random sampling, referred to as the out-of-bag observations.
These out-of-bag observations serve a similar role to a validation set in standard cross-validation.
We then average the MSEs obtained in the out-of-bag observations.
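A minimal Python sketch of this out-of-bag estimate follows. It is not how any particular package implements it: ridge regression, the simulated data, the choice of 25 bootstrap samples, and the names `ridge_fit` and `bootstrap_oob_mse` are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(x, y, lam):
    # Closed-form ridge solution (no intercept, for brevity)
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(p), x.T @ y)

def bootstrap_oob_mse(x, y, lam, B, rng):
    """Average out-of-bag MSE over B bootstrap samples."""
    n = len(y)
    mses, oob_fractions = [], []
    for _ in range(B):
        boot_idx = rng.integers(0, n, size=n)   # draw n rows with replacement
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[boot_idx] = False              # rows never drawn are out-of-bag
        if not oob_mask.any():
            continue                            # extremely unlikely for moderate n
        beta = ridge_fit(x[boot_idx], y[boot_idx], lam)
        resid = x[oob_mask] @ beta - y[oob_mask]
        mses.append(np.mean(resid ** 2))
        oob_fractions.append(oob_mask.mean())
    return float(np.mean(mses)), float(np.mean(oob_fractions))

# Simulated data (assumption: linear signal plus standard normal noise)
rng = np.random.default_rng(3)
x = rng.normal(size=(200, 5))
y = x @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=200)

oob_mse, oob_frac = bootstrap_oob_mse(x, y, lam=1.0, B=25, rng=rng)
print(oob_mse, oob_frac)
```

On average, a little over a third of the observations are left out of each bootstrap sample, since the chance that a given row is never drawn is \((1 - 1/N)^N \approx e^{-1} \approx 0.37\).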
This approach is the default resampling method in the caret package.
We describe how to implement resampling methods with the caret package next.