2024-12-09
There are dozens of machine learning algorithms.
Many of these algorithms are implemented in R.
However, they are distributed via different packages, developed by different authors, and often use different syntax.
The caret package tries to consolidate these differences and provide consistency.
We use the 2 or 7 example to illustrate.
Then we apply it to the larger MNIST dataset.
The train function
Functions such as lm, glm, qda, lda, knn3, rpart, and randomForest use different syntax, have different argument names, and produce objects of different types.
The caret train function lets us train different algorithms using similar syntax.
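As a minimal sketch of what "similar syntax" means in practice, the two calls below differ only in the method argument. The sketch assumes the 2 or 7 data is the mnist_27 object from the dslabs package; the object names train_glm and train_knn are illustrative.

```r
library(caret)
library(dslabs)

# Same interface for two different algorithms; only `method` changes.
train_glm <- train(y ~ ., method = "glm", data = mnist_27$train)
train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)

# Both calls return an object of class "train"
class(train_glm)
class(train_knn)
```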
The predict function
The predict function is very useful for machine learning applications.
Here is an example with regression:
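A minimal sketch of such a regression fit, assuming the mnist_27 data from dslabs and an outcome recoded as 0/1 (the names train_set, fit, p_hat, and y_hat are illustrative):

```r
library(dslabs)

# Recode the outcome as 0/1 so that lm() estimates p(x) = Pr(Y = 7 | x)
train_set <- transform(mnist_27$train, y = ifelse(y == 7, 1, 0))
fit <- lm(y ~ x_1 + x_2, data = train_set)

# predict() evaluates the fitted linear function at the test-set predictors
p_hat <- predict(fit, newdata = mnist_27$test)

# Turn the estimates into class predictions and compute the accuracy
y_hat <- factor(ifelse(p_hat > 0.5, 7, 2), levels = levels(mnist_27$test$y))
mean(y_hat == mnist_27$test$y)
```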
In this case, predict is simply computing
\[ \widehat{p}(\mathbf{x}) = \widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2 \]
for the x_1 and x_2 in the test set mnist_27$test.
The predict function does not always return objects of the same type; it depends on what type of object it is applied to.
To learn the specifics, consult the help file for the type of fit object being used.
predict is actually a special type of function in R called a generic function.
A generic function calls other functions depending on the kind of object it receives.
So if predict receives an object coming out of the lm function, it will call predict.lm.
If it receives an object coming out of glm, it calls predict.glm.
If from knn3, it calls predict.knn3, and so on.
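The dispatch described above can be checked directly in base R. This sketch uses the built-in mtcars data purely as a stand-in (it is not part of the original example); the object names are illustrative.

```r
fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# The class of the object determines which method predict() dispatches to
class(fit_lm)
class(fit_glm)

# Calling the generic gives the same result as calling the method directly
new_cars <- data.frame(wt = c(2.5, 3.5))
same_lm  <- identical(predict(fit_lm, new_cars), predict.lm(fit_lm, new_cars))
same_glm <- identical(predict(fit_glm, new_cars), predict.glm(fit_glm, new_cars))
c(same_lm, same_glm)
```

Running methods(predict) lists the predict methods available in the current session.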
There are many other versions of predict, and many machine learning algorithms define their own predict function.
As with train, caret unifies the use of predict with the function predict.train.
This function takes the output of train and produces predictions of categories or estimates of \(p(\mathbf{x})\).
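A short sketch of both kinds of output, assuming a kNN fit on the mnist_27 data from dslabs (object names are illustrative). The type argument of predict.train selects between predicted categories ("raw", the default) and class probabilities ("prob").

```r
library(caret)
library(dslabs)

train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)

# type = "raw" returns predicted categories as a factor
y_hat <- predict(train_knn, mnist_27$test, type = "raw")

# type = "prob" returns a data frame with one column of estimates per class
p_hat <- predict(train_knn, mnist_27$test, type = "prob")

# Accuracy of the predicted categories on the test set
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
```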
When an algorithm includes a tuning parameter, train automatically uses a resampling method to estimate MSE and decide among a few default candidate values.
To find out what parameter or parameters are optimized, you can read the caret manual.
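Besides the manual, caret's modelLookup function reports the tuning parameters for a given method. A brief sketch, using "knn" and "rpart" as example methods:

```r
library(caret)

# One row per tuning parameter that train() optimizes for the method
knn_params   <- modelLookup("knn")
rpart_params <- modelLookup("rpart")

knn_params
rpart_params
```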
The results of the resampling can be plotted with the ggplot function.
The argument highlight highlights the maximum.
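A sketch of such a plot, assuming a kNN fit on the mnist_27 data from dslabs with the default candidate values (the seed is arbitrary):

```r
library(caret)
library(dslabs)

set.seed(2008)  # arbitrary seed, used only to make the resampling reproducible
train_knn <- train(y ~ ., method = "knn", data = mnist_27$train)

# Estimated accuracy for each candidate k; the best value is highlighted
p <- ggplot(train_knn, highlight = TRUE)
p
```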
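By default, the resampling is performed by taking 25 bootstrap samples, each drawn with replacement from the training set and of the same size; the observations left out of each sample are used to evaluate the fit.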
We change this using the trControl argument. More on this later.
For the kNN method, the default is to try \(k=5,7,9\).
We change this using the tuneGrid argument.
Let’s try seq(5, 101, 2).
Since we are fitting \(49 \times 25 = 1225\) kNN models, running this code will take several seconds.
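A sketch of this call, assuming the mnist_27 data from dslabs; the seed is arbitrary. The tuneGrid argument expects a data frame whose column names match the tuning parameter names, here k.

```r
library(caret)
library(dslabs)

set.seed(2008)  # arbitrary seed, chosen here only for reproducibility
train_knn <- train(y ~ ., method = "knn",
                   data = mnist_27$train,
                   tuneGrid = data.frame(k = seq(5, 101, 2)))

nrow(train_knn$results)  # one row per candidate k
train_knn$bestTune       # the k with the highest estimated accuracy
```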
Note
Because resampling methods are random procedures, the same code can produce different results.
To ensure reproducible results you should set the seed, as we did at the start of this chapter.
The function predict will use this best performing model.
Here is the accuracy of the best model when applied to the test set, which we have not yet used because the cross validation was done on the training set:
Bootstrapping is not always the best approach to resampling.
If we want to change our resampling method, we can use the trainControl function.
For example, the code below runs 10-fold cross validation.
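A sketch of what this might look like, assuming the mnist_27 data from dslabs; the grid of k values and the seed are illustrative choices, not taken from the original.

```r
library(caret)
library(dslabs)

set.seed(2008)  # arbitrary seed
control <- trainControl(method = "cv", number = 10)
train_knn_cv <- train(y ~ ., method = "knn",
                      data = mnist_27$train,
                      tuneGrid = data.frame(k = seq(9, 71, 2)),
                      trControl = control)

# Mean accuracy and its standard deviation across the 10 folds, per k
train_knn_cv$results[, c("k", "Accuracy", "AccuracySD")]
```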
Note
The results component of the train output includes several summary statistics related to the variability of the cross validation estimates.
Because we want this example to run on a small laptop and in less than one hour, we will consider a subset of the dataset.
We will sample 10,000 random rows from the training set and 1,000 random rows from the test set:
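One way to sketch this sampling step is with a small helper. The function sample_rows is hypothetical (not part of any package). The commented lines assume the data comes from dslabs::read_mnist(), which downloads the files. A synthetic matrix stands in for the images so the sketch runs offline.

```r
# Hypothetical helper: draw n random rows and the matching labels
sample_rows <- function(images, labels, n) {
  index <- sample(nrow(images), n)
  list(x = images[index, , drop = FALSE], y = factor(labels[index]))
}

# With the real data (requires a download):
# mnist <- dslabs::read_mnist()
# train <- sample_rows(mnist$train$images, mnist$train$labels, 10000)
# test  <- sample_rows(mnist$test$images,  mnist$test$labels,  1000)

# Small synthetic stand-in: 50 "images" of 784 pixels each
set.seed(1)  # arbitrary seed
toy_images <- matrix(sample(0:255, 50 * 784, replace = TRUE), nrow = 50)
toy_labels <- sample(0:9, 50, replace = TRUE)
toy <- sample_rows(toy_images, toy_labels, 10)
dim(toy$x)
```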
When fitting models to large datasets, we recommend using matrices instead of data frames, as matrix operations tend to be faster.
If the matrices lack column names, you can assign names based on their position:
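A minimal sketch of position-based names, using a small random matrix as a stand-in for the predictor matrices (the names x and x_test are assumptions here):

```r
# Stand-ins for the training and test predictor matrices
x      <- matrix(rnorm(20), nrow = 4, ncol = 5)
x_test <- matrix(rnorm(10), nrow = 2, ncol = 5)

# Name each column by its position; the test set must use the same names
colnames(x) <- seq_len(ncol(x))
colnames(x_test) <- colnames(x)

colnames(x)
```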
We often transform predictors before running the machine learning algorithm.
We also remove predictors that are clearly not useful.
We call these steps preprocessing.
Examples of preprocessing include standardizing the predictors, taking the log transform of some predictors, removing predictors that are highly correlated with others, and removing predictors with very few unique values or close to zero variation.
The caret package features the preProcess function, which allows users to establish a predefined set of preprocessing operations based on a training set.
This function is designed to apply these operations to new datasets without recalculating anything on the test set, ensuring that all preprocessing steps are consistent and derived solely from the training data.
pp <- preProcess(x, method = c("nzv", "center"))
centered_subsetted_x_test <- predict(pp, newdata = x_test)
dim(centered_subsetted_x_test)
[1] 1000 252
The train function in caret includes a preProcess argument that allows users to specify which preprocessing steps to apply automatically during model training.
The predict function defaults to using the best performing algorithm fit with the entire training data:
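A sketch combining both points, assuming the smaller mnist_27 data from dslabs rather than the MNIST subset, with centering and scaling as illustrative preprocessing steps:

```r
library(caret)
library(dslabs)

set.seed(2008)  # arbitrary seed
train_knn_pp <- train(y ~ ., method = "knn",
                      data = mnist_27$train,
                      preProcess = c("center", "scale"))

# predict() applies the stored preprocessing to the new data, then uses the
# final model, which was refit on the entire training set with the best k
y_hat <- predict(train_knn_pp, mnist_27$test)
mean(y_hat == mnist_27$test$y)
```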