# Problem Set 10

## Instructions

- Use only base R functions and the packages `dslabs`, `caret`, `randomForest`, and `matrixStats`.
- The dataset is provided at the link below.
- Show all code used to arrive at your answers.

## Load the data

The data for this problem set is provided at this link: https://github.com/datasciencelabs/2025/raw/refs/heads/main/data/pset-10-mnist.rds

Read this object into R:

```r
library(caret)
library(dslabs)
library(randomForest)
library(matrixStats)

fn <- tempfile()
download.file("https://github.com/datasciencelabs/2025/raw/refs/heads/main/data/pset-10-mnist.rds", fn)
dat <- readRDS(fn)
file.remove(fn)
```

The object `dat` is a list with two components: `dat$train` and `dat$test`. Each contains `images` (a matrix of pixel values) and `labels` (the true digits). Note that `dat$test$labels` is `NA` because you will predict these values.
For this problem set, we will work exclusively with the training data, creating our own train/validation/test splits.
1. How many observations are in `dat$train$images`? How many predictors (pixels) does each observation have? How many unique digits (classes) are in `dat$train$labels`?
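One way to answer questions like these is with `nrow()`, `ncol()`, and `unique()`; as a sketch, here they are applied to a small simulated stand-in for `dat$train` (the sizes here are illustrative, not the real data's):

```r
# Simulated stand-in for dat$train: 10 images of 784 pixels each
train <- list(images = matrix(rnorm(10 * 784), nrow = 10),
              labels = sample(0:9, 10, replace = TRUE))

nrow(train$images)            # number of observations
ncol(train$images)            # number of predictors (pixels)
length(unique(train$labels))  # number of unique digit classes
```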
```r
## your code here
```

2. To speed up computation, create a balanced subset of the training data with exactly 500 observations per digit (5,000 observations total). Store the images in a matrix called `x` and the labels in a factor called `y`. Use `set.seed(2025)` before sampling.

Hint: For each digit 0–9, sample 500 observations.
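The per-digit sampling pattern can be sketched on simulated labels; here `n_per_class` is 3 rather than the 500 the problem requires, and the data are random, not MNIST:

```r
# Simulated labels and images to illustrate balanced per-class sampling
labels <- rep(0:9, each = 20)
images <- matrix(rnorm(200 * 4), nrow = 200)

set.seed(2025)
idx <- unlist(lapply(0:9, function(d) {
  sample(which(labels == d), 3)  # 3 per digit here; use 500 for the pset
}))
x <- images[idx, ]
y <- factor(labels[idx])
table(y)  # balanced: 3 of each digit
```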
```r
set.seed(2025)
## your code here
```

3. Using `createDataPartition`, split your subset (`x` and `y`) into:

- 80% training set (`x_train`, `y_train`)
- 20% test set (`x_test`, `y_test`)
Use `set.seed(2025)` and report the number of observations in each set.
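The split pattern looks roughly like the sketch below, shown on simulated balanced labels rather than the actual `x` and `y`:

```r
library(caret)

y <- factor(rep(0:9, each = 20))  # simulated balanced labels
x <- matrix(rnorm(length(y) * 4), nrow = length(y))

set.seed(2025)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
x_train <- x[train_index, ];  y_train <- y[train_index]
x_test  <- x[-train_index, ]; y_test  <- y[-train_index]
c(train = length(y_train), test = length(y_test))
```

`createDataPartition` samples within each class, so the 80/20 split stays balanced across digits.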
```r
set.seed(2025)
## your code here
```

4. Fit a k-nearest neighbors model with `k = 5` using the `knn3` function from `caret`. Predict on the test set and report the overall accuracy.

Hint: Use `knn3(y_train ~ ., data = data.frame(x_train, y_train = y_train), k = 5)`, then use `predict()` with `type = "class"`.
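The `knn3` fit/predict/accuracy workflow can be sketched on the built-in `iris` data (a stand-in for the MNIST subset; the accuracy shown is not the pset answer):

```r
library(caret)

# Illustrated on iris rather than the MNIST subset
set.seed(2025)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

fit  <- knn3(Species ~ ., data = train_df, k = 5)
pred <- predict(fit, test_df, type = "class")
mean(pred == test_df$Species)  # overall accuracy
```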
```r
## your code here
```

5. Using the predictions from Question 4, create a confusion matrix with `confusionMatrix()`. Which digit is most often confused with the digit "4"? Report the digit and the count of misclassifications.

Hint: Extract the table from the confusion matrix and examine the row corresponding to digit 4.
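A sketch of reading the table, on toy predictions rather than the Question 4 output. Note that in `caret`'s table the rows are predictions and the columns are the reference labels, so the true-digit-4 cases sit in the "4" column:

```r
library(caret)

# Toy truth/predictions to show how to read a confusion-matrix table
truth <- factor(c(4, 4, 4, 9, 9, 4, 7), levels = c(4, 7, 9))
pred  <- factor(c(4, 9, 9, 9, 9, 4, 7), levels = c(4, 7, 9))

cm  <- confusionMatrix(pred, truth)
tab <- cm$table                        # rows = predictions, cols = reference
misses <- tab[, "4"]                   # how the true 4s were predicted
misses <- misses[names(misses) != "4"] # drop the correct predictions
misses[which.max(misses)]              # most common wrong prediction for "4"
```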
```r
## your code here
```

6. For the digit "7", compute the sensitivity and specificity from your kNN model (`k = 5`). Treat "7" as the positive class and all other digits as negative.

Hint: Create a binary outcome, `y_binary <- factor(ifelse(y_test == "7", "7", "other"), levels = c("7", "other"))`, and similarly for predictions, then use `sensitivity()` and `specificity()`.
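With the binary factors built as in the hint, the two calls look like this; the vectors below are toy values, and the first factor level ("7") is what `caret` treats as the positive class:

```r
library(caret)

truth <- factor(c("7", "7", "other", "other", "other", "7"),
                levels = c("7", "other"))
pred  <- factor(c("7", "other", "other", "other", "7", "7"),
                levels = c("7", "other"))

sensitivity(pred, truth)  # TP / (TP + FN)
specificity(pred, truth)  # TN / (TN + FP)
```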
```r
## your code here
```

7. Use the `train` function in `caret` to find the optimal value of `k` for kNN. Use 4-fold cross-validation and test `k` values `seq(1, 9, by = 2)`. What value of `k` maximizes accuracy? What is the cross-validated accuracy for this optimal `k`?

Use `set.seed(2025)` before calling `train`.

Note: This may take a few minutes to run.
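The `train` call has the shape sketched below, shown on `iris` so it runs in seconds (the optimal `k` and accuracy for the MNIST subset will differ):

```r
library(caret)

set.seed(2025)
fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = trainControl(method = "cv", number = 4),
             tuneGrid = data.frame(k = seq(1, 9, by = 2)))
fit$bestTune$k            # optimal k
max(fit$results$Accuracy) # cross-validated accuracy at that k
```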
```r
set.seed(2025)
## your code here
```

8. Use `nearZeroVar` to identify predictors (pixel columns) with near-zero variance in `x_train`. How many predictors are flagged for removal? Create a new training matrix `x_train_nzv` and test matrix `x_test_nzv` with these predictors removed.
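`nearZeroVar` returns the column indices of flagged predictors; a sketch on a small matrix with two constant columns (playing the role of blank border pixels):

```r
library(caret)

# Two constant (zero-variance) columns, like blank border pixels
x_train <- cbind(const1 = 0, signal1 = rnorm(50),
                 const2 = 1, signal2 = rnorm(50))

nzv <- nearZeroVar(x_train)  # indices of flagged columns
length(nzv)                  # how many predictors are removed
x_train_nzv <- x_train[, -nzv, drop = FALSE]
ncol(x_train_nzv)
```

The same `-nzv` column subset would then be applied to the test matrix so that train and test keep identical predictors.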
```r
## your code here
```

9. Fit a kNN model (using `train`) with the `k` values from Question 7 and `preProcess = "nzv"` to remove near-zero variance predictors automatically. Use 5-fold cross-validation. Does the cross-validated accuracy improve compared to Question 7? Report the optimal `k` and the corresponding accuracy.

Use `set.seed(2025)`.
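The `preProcess = "nzv"` argument slots into the same `train` call; sketched on `iris` with an artificial zero-variance column added so the filter has something to remove:

```r
library(caret)

df <- iris
df$blank <- 0  # a zero-variance column that "nzv" should drop

set.seed(2025)
fit <- train(Species ~ ., data = df, method = "knn",
             preProcess = "nzv",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = data.frame(k = seq(1, 9, by = 2)))
fit$bestTune$k
max(fit$results$Accuracy)
```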
```r
set.seed(2025)
## your code here
```

10. Fit a Random Forest model using `randomForest()` with `ntree = 100` and the near-zero variance predictors removed (use `x_train_nzv` and `x_test_nzv` from Question 8). Use `set.seed(2025)` and report the test set accuracy.
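`randomForest()` accepts the matrix-plus-factor interface directly, which matches the `x_train_nzv`/`y_train` setup; a sketch on `iris` (accuracy shown is for iris, not the pset):

```r
library(randomForest)
library(caret)

set.seed(2025)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

fit_rf  <- randomForest(x = iris[idx, 1:4], y = iris$Species[idx],
                        ntree = 100)
pred_rf <- predict(fit_rf, iris[-idx, 1:4])
mean(pred_rf == iris$Species[-idx])  # test set accuracy
```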
```r
set.seed(2025)
## your code here
```

11. You now have three models:

- kNN with optimal `k` (from Question 7 or 9)
- Random Forest (from Question 10)
- A simple baseline: predict the most common digit in the training set for all test observations

Compute the test set accuracy for all three approaches. Which performs best?
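The baseline is the only new piece here; its logic can be sketched with toy labels (the real `y_train`/`y_test` come from your split):

```r
# Baseline: always predict the most common training label
y_train <- factor(c("3", "3", "7", "1", "3"))
y_test  <- factor(c("3", "7", "3"))

most_common  <- names(which.max(table(y_train)))
baseline_pred <- factor(rep(most_common, length(y_test)),
                        levels = levels(y_test))
mean(baseline_pred == y_test)  # baseline accuracy
```

With a balanced 10-digit subset, this baseline should land near 10% accuracy, which is the floor the real models need to beat.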
```r
## your code here
```

12. Create a binary classification problem: digit "3" vs. all others. Using the training data with near-zero variance predictors removed:

- Fit a logistic regression model (`glm` with `family = "binomial"`)
- Obtain predicted probabilities for the test set
- Compute the True Positive Rate (TPR) and False Positive Rate (FPR) for cutoffs `seq(0.1, 0.9, by = 0.1)`
- Plot FPR (x-axis) vs. TPR (y-axis) to create a simple ROC curve

Report the TPR and FPR for cutoff = 0.5.
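The cutoff/TPR/FPR mechanics can be sketched on simulated data; this uses two made-up predictors and in-sample probabilities purely to illustrate the loop, whereas the pset asks for test-set probabilities from the pixel model:

```r
# Simulated "3 vs other" outcome with two illustrative predictors
set.seed(2025)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
is3 <- rbinom(n, 1, plogis(1.5 * x1 - x2))

fit   <- glm(is3 ~ x1 + x2, family = "binomial")
p_hat <- predict(fit, type = "response")

cutoffs <- seq(0.1, 0.9, by = 0.1)
rates <- t(sapply(cutoffs, function(cut) {
  pred <- as.numeric(p_hat >= cut)
  c(TPR = sum(pred == 1 & is3 == 1) / sum(is3 == 1),
    FPR = sum(pred == 1 & is3 == 0) / sum(is3 == 0))
}))
plot(rates[, "FPR"], rates[, "TPR"], type = "b",
     xlab = "FPR", ylab = "TPR", main = "Simple ROC curve")
rates[cutoffs == 0.5, ]  # TPR and FPR at cutoff 0.5
```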
```r
## your code here
```

13. Using `x_train_nzv`:

- Perform PCA using `prcomp()` (no need to scale since the pixels are already on the same scale)
- Determine how many principal components are needed to explain at least 80% of the variance
- Create a new training set using only these principal components
- Fit a kNN model (`k = 5`) on the PCA-transformed data and report the test set accuracy

Hint: Transform `x_test_nzv` using the same PCA rotation matrix from the training set.

Use `set.seed(2025)` when fitting kNN.
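The train-then-project pipeline is sketched below on `iris` in place of `x_train_nzv`/`x_test_nzv`; the key step is centering the test matrix with the training means before applying the training rotation:

```r
library(caret)

set.seed(2025)
idx  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
x_tr <- as.matrix(iris[idx, 1:4]);  y_tr <- iris$Species[idx]
x_te <- as.matrix(iris[-idx, 1:4]); y_te <- iris$Species[-idx]

pca <- prcomp(x_tr)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(var_explained >= 0.8)[1]  # PCs for >= 80% of variance

train_pca <- pca$x[, 1:n_pc, drop = FALSE]
# Center the test set with the training means, then apply the same rotation
test_pca <- sweep(x_te, 2, pca$center) %*%
  pca$rotation[, 1:n_pc, drop = FALSE]

fit  <- knn3(train_pca, y_tr, k = 5)
pred <- predict(fit, test_pca, type = "class")
mean(pred == y_te)
```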
```r
set.seed(2025)
## your code here
```

14. Create an ensemble by combining predictions from:

- Your best kNN model (from Question 9)
- Your Random Forest model (from Question 10)

Use majority vote: for each test observation, predict the digit that received the most votes from the two models. If there is a tie, use the kNN prediction.

Report the ensemble's test set accuracy. Does it improve upon the individual models?
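A generic majority-vote helper, sketched on toy prediction vectors (with only two voters, a one-one split is a tie, which the stated rule sends to kNN; the same code would extend to more models):

```r
# Majority vote across a list of prediction vectors; ties go to the
# first model's (kNN's) prediction
preds <- list(knn = c("3", "5", "8"),
              rf  = c("3", "8", "8"))

ensemble <- sapply(seq_along(preds[[1]]), function(i) {
  votes   <- table(sapply(preds, `[[`, i))
  winners <- names(votes)[votes == max(votes)]
  if (length(winners) == 1) winners else preds$knn[i]
})
ensemble
```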
```r
## your code here
```

15. Using the confusion matrix from your Random Forest model (Question 10):

- Which pair of distinct digits is most often confused with each other? That is, for the table `tab` extracted from the confusion matrix, which pair of distinct digits has the highest value of `tab[i, j] + tab[j, i]`?
- For the most-confused pair, compute the average pixel intensity for each digit in the training set and visualize both as 28×28 images using `image()` or `as.raster()`. Briefly describe (1–2 sentences) why you think these two digits are confused.

Hint: Convert the pixel vector to a 28×28 matrix and plot with `image(matrix(avg_pixels, 28, 28)[, 28:1])`.
```r
## your code here
```
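Finding the most-confused pair can be sketched on a small made-up confusion table; `tab + t(tab)` computes `tab[i, j] + tab[j, i]` for every pair at once:

```r
# Toy 3-class confusion-matrix table (rows = predicted, cols = true)
tab <- matrix(c(50,  2,  1,
                 8, 40,  3,
                 1,  5, 45),
              nrow = 3, byrow = TRUE,
              dimnames = list(pred = c("3", "5", "8"),
                              true = c("3", "5", "8")))

pair_sums <- tab + t(tab)  # symmetric: tab[i, j] + tab[j, i]
diag(pair_sums) <- 0       # ignore correct classifications
idx <- which(pair_sums == max(pair_sums), arr.ind = TRUE)[1, ]
rownames(tab)[idx]         # the most-confused pair of digits
```

For the visualization step, the hint's `image(matrix(avg_pixels, 28, 28)[, 28:1])` applies once `avg_pixels` holds the column means of the training images for one digit.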