ML In Practice

Author

Rafael A. Irizarry

Published

December 9, 2024

Keywords

Machine Learning

Machine learning in practice

  • Now that we have learned several methods and explored them with simple examples, we will try them out on a real example: the MNIST digits.

  • We can load this dataset using the dslabs package:

library(dslabs) 
mnist <- read_mnist() 

Machine learning in practice

  • We will sample 10,000 random rows from the training set and 1,000 random rows from the test set:
set.seed(1990) 

index <- sample(nrow(mnist$train$images), 10000) 
x <- mnist$train$images[index,] 
y <- factor(mnist$train$labels[index]) 

index <- sample(nrow(mnist$test$images), 1000) 
x_test <- mnist$test$images[index,] 
y_test <- factor(mnist$test$labels[index]) 
  • Note that we convert the outcomes to factors.

Machine learning in practice

  • We assign names to the columns, as required by caret.
colnames(x) <- 1:ncol(mnist$train$images) 
colnames(x_test) <- colnames(x) 

Parallelization

  • During cross-validation or bootstrapping, the process of fitting models to different samples or using varying parameters can be performed independently.

  • Imagine you are fitting 100 models; if you had access to 100 computers, you could theoretically speed up the process by a factor of 100 by fitting each model on a separate computer and then aggregating the results.

  • Most modern computers are equipped with multiple processors that allow for such parallel execution.

Parallelization

  • The caret package is set up to run in parallel, but you have to let R know that you want to parallelize your work.

  • To do this we can use the doParallel package:

library(doParallel) 
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
nc <- detectCores() - 1 # it is convention to leave 1 core for OS 
cl <- makeCluster(nc) 
registerDoParallel(cl) 

Parallelization

  • If you do use parallelization, make sure to let R know you are done with the following lines of code:
stopCluster(cl) 
stopImplicitCluster() 

Warning
  • When parallelizing tasks across multiple processors, it’s important to consider the risk of running out of memory.

  • Each processor might require a copy of the data or substantial portions of it, which can multiply overall memory demands.

  • This is especially challenging if the data or models are large.

k-nearest neighbors

  • We therefore use cross-validation to estimate our MSE.

  • The first step is to optimize for \(k\).

train_knn <- train(x, y, method = "knn",  
                   preProcess = "nzv", 
                   trControl = trainControl("cv", 
                                            number = 20, 
                                            p = 0.95), 
                   tuneGrid = data.frame(k = seq(1, 7, 2))) 

k-nearest neighbors

  • Once we optimize our algorithm, the predict function defaults to using the model with the best performing parameters, fit to the entire training data:
y_hat_knn <- predict(train_knn, x_test, type = "raw") 
  • We achieve relatively high accuracy:
confusionMatrix(y_hat_knn, factor(y_test))$overall["Accuracy"] 
Accuracy 
   0.952 

Dimension reduction with PCA

  • An alternative to removing low variance columns directly is to use dimension reduction on the feature matrix before applying the algorithms.
pca <- prcomp(x) 
  • We can actually explain, say, 75% of the variability in the predictors with a small number of dimensions:
p <- which(cumsum(pca$sdev^2)/sum(pca$sdev^2) >= 0.75)[1] 

Dimension reduction with PCA

  • We can then re-run our algorithm on these 33 features:
fit_knn_pca <- knn3(pca$x[,1:p], y, k = train_knn$bestTune) 

Dimension reduction with PCA

  • When predicting, it is important that we not use the test set, nor any summary of it, when computing the PCs, as this could result in overtraining.

  • We therefore compute the averages needed for centering, and the rotation, on the training set:

newdata <-  sweep(x_test, 2, colMeans(x)) %*% pca$rotation[,1:p] 
y_hat_knn_pca <- predict(fit_knn_pca, newdata, type = "class") 

Dimension reduction with PCA

  • We obtain similar accuracy, while using only 33 dimensions:
confusionMatrix(y_hat_knn_pca, factor(y_test))$overall["Accuracy"] 
Accuracy 
   0.965 

Dimension reduction with PCA

  • In this example, we used the \(k\) optimized for the raw data, not the principal components.

  • Note that to obtain an unbiased MSE estimate we have to recompute the PCA for each cross-validation sample and apply to the validation set.

Dimension reduction with PCA

  • Because the train function includes PCA as one of the available preprocessing operations, we can achieve this with the following modification of the code above:
train_knn_pca <- train(x, y, method = "knn",  
                       preProcess = c("nzv", "pca"), 
                       trControl = trainControl("cv", 
                                                number = 20, 
                                                p = 0.95, 
                                                preProcOptions = 
                                                  list(pcaComp = p)), 
                       tuneGrid = data.frame(k = seq(1, 7, 2))) 
y_hat_knn_pca <- predict(train_knn_pca, x_test, type = "raw") 
confusionMatrix(y_hat_knn_pca, factor(y_test))$overall["Accuracy"] 
Accuracy 
   0.969 

Dimension reduction with PCA

  • A limitation of this approach is that we don’t get to optimize the number of PCs used in the analysis.

  • To do this we need to write our own method.
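  • For instance, here is a minimal sketch of such a method, using a simple validation split held out from the training data; the split size and the candidate numbers of PCs are illustrative choices, not part of the original analysis:

set.seed(1991)
val_index <- sample(nrow(x), 2000)   # hold out 2,000 rows for validation
pca_tr <- prcomp(x[-val_index, ])    # compute PCA on the remaining training rows only
x_val <- sweep(x[val_index, ], 2, colMeans(x[-val_index, ])) %*% pca_tr$rotation
acc <- sapply(c(10, 25, 50, 75), function(np) {   # candidate numbers of PCs
  fit <- knn3(pca_tr$x[, 1:np], y[-val_index], k = train_knn$bestTune$k)
  mean(predict(fit, x_val[, 1:np], type = "class") == y[val_index])
})
c(10, 25, 50, 75)[which.max(acc)]    # number of PCs with best validation accuracy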

Random Forest

  • With the random forest algorithm several parameters can be optimized, but the main one is mtry, the number of predictors that are randomly selected for each tree.

Random Forest

  • This is also the only tuning parameter that the caret function train permits when using the default implementation from the randomForest package.
library(randomForest) 
train_rf <- train(x, y, method = "rf",  
                  preProcess = "nzv", 
                  tuneGrid = data.frame(mtry = seq(5, 15))) 

Random Forest

  • Now that we have optimized our algorithm, we are ready to predict on the test set with the final model:
y_hat_rf <- predict(train_rf, x_test, type = "raw") 
  • As with kNN, we also achieve high accuracy:
confusionMatrix(y_hat_rf, y_test)$overall["Accuracy"] 
Accuracy 
   0.952 
  • By optimizing some of the other algorithm parameters, we can achieve even higher accuracy.
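  • For example, randomForest also has a nodesize argument, the minimum size of terminal nodes. Here is a minimal sketch of tuning it by hand, comparing out-of-bag error so the test set is not touched; the candidate values are illustrative:

nzv <- nearZeroVar(x)          # drop near-zero variance features, as preProcess = "nzv" does
nodesizes <- c(1, 5, 10, 25)   # illustrative candidate values
oob <- sapply(nodesizes, function(ns) {
  fit <- randomForest(x[, -nzv], y, mtry = train_rf$bestTune$mtry, nodesize = ns)
  tail(fit$err.rate[, "OOB"], 1)   # out-of-bag error after the last tree
})
nodesizes[which.min(oob)]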

Testing

  • The default method for estimating accuracy used by the train function is to test prediction on 25 bootstrap samples.

  • This can result in long compute times.

  • We can use the system.time function to estimate how long it takes to run the algorithm once:

nzv <- nearZeroVar(x) 
system.time({fit_rf <- randomForest(x[, -nzv], y,  mtry = 9)}) 
   user  system elapsed 
 61.121   0.412  61.549 

Testing

  • One way to reduce run time is to use k-fold cross-validation with a smaller number of folds.

  • A popular choice is 5 folds, each using 20% of the data for validation.

  • To use this we set the trControl argument in train to:

trControl = trainControl(method = "cv", number = 5, p = .8)
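  • For example, a sketch of the random forest fit above re-run with this control object, keeping the preprocessing and tuning grid as before:

train_rf <- train(x, y, method = "rf",
                  preProcess = "nzv",
                  trControl = trainControl(method = "cv", number = 5, p = .8),
                  tuneGrid = data.frame(mtry = seq(5, 15)))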

Testing

  • For random forest, we can also speed up the training step by running fewer trees per fit.

  • After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows.

Testing

  • We can see that the error rate stabilizes after about 200 trees:
plot(fit_rf) 

Testing

  • We can use this finding to speed up the cross validation procedure.

  • Specifically, because the default is 500 trees, adding the argument ntree = 200 to the call to train makes the procedure finish about 2.5 times faster.
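  • Here is a sketch of the full call with both speed-ups; ntree is passed through by train to randomForest:

train_rf <- train(x, y, method = "rf",
                  preProcess = "nzv",
                  trControl = trainControl(method = "cv", number = 5, p = .8),
                  tuneGrid = data.frame(mtry = seq(5, 15)),
                  ntree = 200)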

Variable importance

  • The following function computes the importance of each feature:
imp <- importance(fit_rf) 

Variable importance

  • We can see which features are being used most by plotting an image:
mat <- rep(0, ncol(x)) 
mat[-nzv] <- imp 
image(matrix(mat, 28, 28)) 

Diagnostics

Ensembles

  • The idea of an ensemble is similar to the idea of combining data from different pollsters to obtain a better estimate of the true support for each candidate.

  • In machine learning, one can usually greatly improve the final results by combining the results of different algorithms.

  • Here is a simple example where we compute new class probabilities by averaging the class probabilities from random forest and kNN.

Ensembles

  • We can see that the accuracy improves:
p_rf <- predict(fit_rf, x_test[,-nzv], type = "prob")   
p_rf <- p_rf / rowSums(p_rf) 
p_knn_pca  <- predict(train_knn_pca, x_test, type = "prob") 
p <- (p_rf + p_knn_pca)/2 
y_pred <- factor(apply(p, 1, which.max) - 1) 
confusionMatrix(y_pred, y_test)$overall["Accuracy"] 
Accuracy 
    0.97 

Ensembles

  • We have built an ensemble with just two algorithms.

  • By combining more algorithms that perform similarly but are uncorrelated, we can improve accuracy further, as sketched below.
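  • For instance, a sketch of how the average extends to a third model; p_third is a hypothetical matrix of class probabilities from another algorithm, not one fit in this analysis:

p <- (p_rf + p_knn_pca + p_third)/3   # p_third is hypothetical
y_pred <- factor(apply(p, 1, which.max) - 1)
confusionMatrix(y_pred, y_test)$overall["Accuracy"]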