# Problem Set 10

## Instructions

- Use only base R functions and the packages `dslabs`, `caret`, `randomForest`, and `matrixStats`.
- The dataset is provided at the link below.
- Show all code used to arrive at your answers.

## Load the data

The data for this problem set is provided at this link: https://github.com/datasciencelabs/2025/raw/refs/heads/main/data/pset-10-mnist.rds

Read this object into R:

```r
library(caret)
library(dslabs)
library(randomForest)
library(matrixStats)

fn <- tempfile()
download.file("https://github.com/datasciencelabs/2025/raw/refs/heads/main/data/pset-10-mnist.rds", fn)
dat <- readRDS(fn)
file.remove(fn)
```

The object `dat` is a list with two components: `dat$train` and `dat$test`. Each contains `images` (a matrix of pixel values) and `labels` (the true digits). Note that `dat$test$labels` is `NA` because you will predict these values.
For this problem set, we will work exclusively with the training data, creating our own train/validation/test splits.
1. How many observations are in `dat$train$images`? How many predictors (pixels) does each observation have? How many unique digits (classes) are in `dat$train$labels`?
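One way to answer questions like these is with `nrow()`, `ncol()`, and `unique()`; as a sketch, here they are applied to a small simulated stand-in for `dat$train` (the sizes here are illustrative, not the real data's):

```r
# Simulated stand-in for dat$train: 10 images of 784 pixels each
train <- list(images = matrix(rnorm(10 * 784), nrow = 10),
              labels = sample(0:9, 10, replace = TRUE))

nrow(train$images)            # number of observations
ncol(train$images)            # number of predictors (pixels)
length(unique(train$labels))  # number of unique digit classes
```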
```r
## your code here
```

2. To speed up computation, create a balanced subset of the training data with exactly 500 observations per digit (5,000 observations total). Store the images in a matrix called `x` and the labels in a factor called `y`. Use `set.seed(2025)` before sampling.

Hint: For each digit 0–9, sample 500 observations.
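The per-digit sampling pattern can be sketched on simulated labels; here `n_per_class` is 3 rather than the 500 the problem requires, and the data are random, not MNIST:

```r
# Simulated labels and images to illustrate balanced per-class sampling
labels <- rep(0:9, each = 20)
images <- matrix(rnorm(200 * 4), nrow = 200)

set.seed(2025)
idx <- unlist(lapply(0:9, function(d) {
  sample(which(labels == d), 3)  # 3 per digit here; use 500 for the pset
}))
x <- images[idx, ]
y <- factor(labels[idx])
table(y)  # balanced: 3 of each digit
```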
```r
set.seed(2025)
## your code here
```

3. Using `createDataPartition`, split your subset (`x` and `y`) into:

- 80% training set (`x_train`, `y_train`)
- 20% test set (`x_test`, `y_test`)
Use `set.seed(2025)` and report the number of observations in each set.
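The split pattern looks roughly like the sketch below, shown on simulated balanced labels rather than the actual `x` and `y`:

```r
library(caret)

y <- factor(rep(0:9, each = 20))  # simulated balanced labels
x <- matrix(rnorm(length(y) * 4), nrow = length(y))

set.seed(2025)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
x_train <- x[train_index, ];  y_train <- y[train_index]
x_test  <- x[-train_index, ]; y_test  <- y[-train_index]
c(train = length(y_train), test = length(y_test))
```

`createDataPartition` samples within each class, so the 80/20 split stays balanced across digits.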
```r
set.seed(2025)
## your code here
```

4. Fit a k-nearest neighbors model with `k = 5` using the `knn3` function from `caret`. Predict on the test set and report the overall accuracy.

Hint: Use `knn3(y_train ~ ., data = data.frame(x_train, y_train = y_train), k = 5)`, then use `predict()` with `type = "class"`.
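The `knn3` fit/predict/accuracy workflow can be sketched on the built-in `iris` data (a stand-in for the MNIST subset; the accuracy shown is not the pset answer):

```r
library(caret)

# Illustrated on iris rather than the MNIST subset
set.seed(2025)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

fit  <- knn3(Species ~ ., data = train_df, k = 5)
pred <- predict(fit, test_df, type = "class")
mean(pred == test_df$Species)  # overall accuracy
```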
```r
## your code here
```

5. Using the predictions from Question 4, create a confusion matrix with `confusionMatrix()`. Which digit is most often confused with the digit "4"? Report the digit and the count of misclassifications.

Hint: Extract the table from the confusion matrix and examine the row corresponding to digit 4.
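A sketch of reading the table, on toy predictions rather than the Question 4 output. Note that in `caret`'s table the rows are predictions and the columns are the reference labels, so the true-digit-4 cases sit in the "4" column:

```r
library(caret)

# Toy truth/predictions to show how to read a confusion-matrix table
truth <- factor(c(4, 4, 4, 9, 9, 4, 7), levels = c(4, 7, 9))
pred  <- factor(c(4, 9, 9, 9, 9, 4, 7), levels = c(4, 7, 9))

cm  <- confusionMatrix(pred, truth)
tab <- cm$table                        # rows = predictions, cols = reference
misses <- tab[, "4"]                   # how the true 4s were predicted
misses <- misses[names(misses) != "4"] # drop the correct predictions
misses[which.max(misses)]              # most common wrong prediction for "4"
```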
```r
## your code here
```

6. For the digit "7", compute the sensitivity and specificity from your kNN model (`k = 5`). Treat "7" as the positive class and all other digits as negative.

Hint: Create a binary outcome, `y_binary <- factor(ifelse(y_test == "7", "7", "other"), levels = c("7", "other"))`, and similarly for predictions, then use `sensitivity()` and `specificity()`.
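With the binary factors built as in the hint, the two calls look like this; the vectors below are toy values, and the first factor level ("7") is what `caret` treats as the positive class:

```r
library(caret)

truth <- factor(c("7", "7", "other", "other", "other", "7"),
                levels = c("7", "other"))
pred  <- factor(c("7", "other", "other", "other", "7", "7"),
                levels = c("7", "other"))

sensitivity(pred, truth)  # TP / (TP + FN)
specificity(pred, truth)  # TN / (TN + FP)
```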
```r
## your code here
```

7. Use the `train` function in `caret` to find the optimal value of `k` for kNN. Use 4-fold cross-validation and test `k` values `seq(1, 9, by = 2)`. What value of `k` maximizes accuracy? What is the cross-validated accuracy for this optimal `k`?

Use `set.seed(2025)` before calling `train`.

Note: This may take a few minutes to run.
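The `train` call has the shape sketched below, shown on `iris` so it runs in seconds (the optimal `k` and accuracy for the MNIST subset will differ):

```r
library(caret)

set.seed(2025)
fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = trainControl(method = "cv", number = 4),
             tuneGrid = data.frame(k = seq(1, 9, by = 2)))
fit$bestTune$k            # optimal k
max(fit$results$Accuracy) # cross-validated accuracy at that k
```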
```r
set.seed(2025)
## your code here
```

8. Use `nearZeroVar` to identify predictors (pixel columns) with near-zero variance in `x_train`. How many predictors are flagged for removal? Create a new training matrix `x_train_nzv` and test matrix `x_test_nzv` with these predictors removed.
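`nearZeroVar` returns the column indices of flagged predictors; a sketch on a small matrix with two constant columns (playing the role of blank border pixels):

```r
library(caret)

# Two constant (zero-variance) columns, like blank border pixels
x_train <- cbind(const1 = 0, signal1 = rnorm(50),
                 const2 = 1, signal2 = rnorm(50))

nzv <- nearZeroVar(x_train)  # indices of flagged columns
length(nzv)                  # how many predictors are removed
x_train_nzv <- x_train[, -nzv, drop = FALSE]
ncol(x_train_nzv)
```

The same `-nzv` column subset would then be applied to the test matrix so that train and test keep identical predictors.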
```r
## your code here
```

9. Fit a kNN model (using `train`) with the `k` values from Question 7 and `preProcess = "nzv"` to remove near-zero variance predictors automatically. Use 5-fold cross-validation. Does the cross-validated accuracy improve compared to Question 7? Report the optimal `k` and the corresponding accuracy.

Use `set.seed(2025)`.
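The `preProcess = "nzv"` argument slots into the same `train` call; sketched on `iris` with an artificial zero-variance column added so the filter has something to remove:

```r
library(caret)

df <- iris
df$blank <- 0  # a zero-variance column that "nzv" should drop

set.seed(2025)
fit <- train(Species ~ ., data = df, method = "knn",
             preProcess = "nzv",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = data.frame(k = seq(1, 9, by = 2)))
fit$bestTune$k
max(fit$results$Accuracy)
```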
```r
set.seed(2025)
## your code here
```

10. Fit a Random Forest model using `randomForest()` with `ntree = 100` and the near-zero variance predictors removed (use `x_train_nzv` and `x_test_nzv` from Question 8). Use `set.seed(2025)` and report the test set accuracy.
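`randomForest()` accepts the matrix-plus-factor interface directly, which matches the `x_train_nzv`/`y_train` setup; a sketch on `iris` (accuracy shown is for iris, not the pset):

```r
library(randomForest)
library(caret)

set.seed(2025)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

fit_rf  <- randomForest(x = iris[idx, 1:4], y = iris$Species[idx],
                        ntree = 100)
pred_rf <- predict(fit_rf, iris[-idx, 1:4])
mean(pred_rf == iris$Species[-idx])  # test set accuracy
```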
```r
set.seed(2025)
## your code here
```

11. You now have three models:

- kNN with optimal `k` (from Question 7 or 9)
- Random Forest (from Question 10)
- A simple baseline: predict the most common digit in the training set for all test observations

Compute the test set accuracy for all three approaches. Which performs best?
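The baseline is the only new piece here; its logic can be sketched with toy labels (the real `y_train`/`y_test` come from your split):

```r
# Baseline: always predict the most common training label
y_train <- factor(c("3", "3", "7", "1", "3"))
y_test  <- factor(c("3", "7", "3"))

most_common  <- names(which.max(table(y_train)))
baseline_pred <- factor(rep(most_common, length(y_test)),
                        levels = levels(y_test))
mean(baseline_pred == y_test)  # baseline accuracy
```

With a balanced 10-digit subset, this baseline should land near 10% accuracy, which is the floor the real models need to beat.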
```r
## your code here
```

12. Create a binary classification problem: digit "3" vs. all others. Using the training data with near-zero variance predictors removed:

- Fit a logistic regression model (`glm` with `family = "binomial"`)
- Obtain predicted probabilities for the test set
- Compute the True Positive Rate (TPR) and False Positive Rate (FPR) for cutoffs `seq(0.1, 0.9, by = 0.1)`
- Plot FPR (x-axis) vs. TPR (y-axis) to create a simple ROC curve

Report the TPR and FPR for cutoff = 0.5.
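The cutoff/TPR/FPR mechanics can be sketched on simulated data; this uses two made-up predictors and in-sample probabilities purely to illustrate the loop, whereas the pset asks for test-set probabilities from the pixel model:

```r
# Simulated "3 vs other" outcome with two illustrative predictors
set.seed(2025)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
is3 <- rbinom(n, 1, plogis(1.5 * x1 - x2))

fit   <- glm(is3 ~ x1 + x2, family = "binomial")
p_hat <- predict(fit, type = "response")

cutoffs <- seq(0.1, 0.9, by = 0.1)
rates <- t(sapply(cutoffs, function(cut) {
  pred <- as.numeric(p_hat >= cut)
  c(TPR = sum(pred == 1 & is3 == 1) / sum(is3 == 1),
    FPR = sum(pred == 1 & is3 == 0) / sum(is3 == 0))
}))
plot(rates[, "FPR"], rates[, "TPR"], type = "b",
     xlab = "FPR", ylab = "TPR", main = "Simple ROC curve")
rates[cutoffs == 0.5, ]  # TPR and FPR at cutoff 0.5
```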
```r
## your code here
```

13. Using `x_train_nzv`:

- Perform PCA using `prcomp()` (no need to scale since the pixels are already on the same scale)
- Determine how many principal components are needed to explain at least 80% of the variance
- Create a new training set using only these principal components
- Fit a kNN model (`k = 5`) on the PCA-transformed data and report the test set accuracy

Hint: Transform `x_test_nzv` using the same PCA rotation matrix from the training set.

Use `set.seed(2025)` when fitting kNN.
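The train-then-project pipeline is sketched below on `iris` in place of `x_train_nzv`/`x_test_nzv`; the key step is centering the test matrix with the training means before applying the training rotation:

```r
library(caret)

set.seed(2025)
idx  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
x_tr <- as.matrix(iris[idx, 1:4]);  y_tr <- iris$Species[idx]
x_te <- as.matrix(iris[-idx, 1:4]); y_te <- iris$Species[-idx]

pca <- prcomp(x_tr)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pc <- which(var_explained >= 0.8)[1]  # PCs for >= 80% of variance

train_pca <- pca$x[, 1:n_pc, drop = FALSE]
# Center the test set with the training means, then apply the same rotation
test_pca <- sweep(x_te, 2, pca$center) %*%
  pca$rotation[, 1:n_pc, drop = FALSE]

fit  <- knn3(train_pca, y_tr, k = 5)
pred <- predict(fit, test_pca, type = "class")
mean(pred == y_te)
```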
```r
set.seed(2025)
## your code here
```

14. Create an ensemble by combining predictions from:

- Your best kNN model (from Question 9)
- Your Random Forest model (from Question 10)

Use majority vote: for each test observation, predict the digit that received the most votes from the two models. If there is a tie, use the kNN prediction.

Report the ensemble's test set accuracy. Does it improve upon the individual models?
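A generic majority-vote helper, sketched on toy prediction vectors (with only two voters, a one-one split is a tie, which the stated rule sends to kNN; the same code would extend to more models):

```r
# Majority vote across a list of prediction vectors; ties go to the
# first model's (kNN's) prediction
preds <- list(knn = c("3", "5", "8"),
              rf  = c("3", "8", "8"))

ensemble <- sapply(seq_along(preds[[1]]), function(i) {
  votes   <- table(sapply(preds, `[[`, i))
  winners <- names(votes)[votes == max(votes)]
  if (length(winners) == 1) winners else preds$knn[i]
})
ensemble
```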
```r
## your code here
```

15. Using the confusion matrix from your Random Forest model (Question 10):

- Which pair of distinct digits is most often confused with each other? That is, for the table `tab` extracted from the confusion matrix, which pair of distinct digits has the highest value of `tab[i, j] + tab[j, i]`?
- For the most-confused pair, compute the average pixel intensity for each digit in the training set and visualize both as 28×28 images using `image()` or `as.raster()`. Briefly describe (1–2 sentences) why you think these two digits are confused.

Hint: Convert the pixel vector to a 28×28 matrix and plot with `image(matrix(avg_pixels, 28, 28)[, 28:1])`.
```r
## your code here
```
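Finding the most-confused pair can be sketched on a small made-up confusion table; `tab + t(tab)` computes `tab[i, j] + tab[j, i]` for every pair at once:

```r
# Toy 3-class confusion-matrix table (rows = predicted, cols = true)
tab <- matrix(c(50,  2,  1,
                 8, 40,  3,
                 1,  5, 45),
              nrow = 3, byrow = TRUE,
              dimnames = list(pred = c("3", "5", "8"),
                              true = c("3", "5", "8")))

pair_sums <- tab + t(tab)  # symmetric: tab[i, j] + tab[j, i]
diag(pair_sums) <- 0       # ignore correct classifications
idx <- which(pair_sums == max(pair_sums), arr.ind = TRUE)[1, ]
rownames(tab)[idx]         # the most-confused pair of digits
```

For the visualization step, the hint's `image(matrix(avg_pixels, 28, 28)[, 28:1])` applies once `avg_pixels` holds the column means of the training images for one digit.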