Algorithms

2024-12-11

Examples of algorithms

  • There are hundreds of machine learning algorithms.

  • Here we provide a few examples spanning rather different approaches.

  • We will use the mnist_27 dataset from dslabs.

library(tidyverse) 
library(caret) 
library(dslabs) 

Logistic regression

  • We previously used linear regression to predict classes by fitting the model:

\[ p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

  • We used least squares after assigning numeric values of 0 and 1 to \(y\), and used lm as if the data were continuous.

Logistic regression

  • An obvious problem with this approach is that \(\widehat{p}(\mathbf{x})\) can be negative or larger than 1:
fit_lm <- lm(y ~ x_1 + x_2, 
             data = mutate(mnist_27$train,y = ifelse(y == 7, 1, 0))) 
range(fit_lm$fitted) 
[1] -0.22  1.92

Logistic regression

  • To avoid this, we can apply an approach more appropriate for binary data:

\[ \log \frac{p(\mathbf{x})}{1-p(\mathbf{x})} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

Logistic regression

We can use the glm function to fit the model:

fit_glm <- glm(y ~ x_1 + x_2, data = mnist_27$train, family = "binomial") 
p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response") 
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2)) 
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
Accuracy 
   0.775 
  • We see that logistic regression performs similarly to linear regression.

Logistic regression

  • This is not surprising given that the estimate \(\widehat{p}(\mathbf{x})\) looks similar to the one obtained with linear regression:

Logistic regression

  • Just like regression, the decision rule is a line. Here \(g(p) = \log \frac{p}{1-p}\) is the logit link used by logistic regression, so:

\[ g^{-1}(\widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2) = 0.5 \implies\\ \widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2 = g(0.5) = 0 \implies \\ x_2 = -\widehat{\beta}_0/\widehat{\beta}_2 - (\widehat{\beta}_1/\widehat{\beta}_2)\, x_1 \]
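For example, we can draw this boundary over the training data using the fitted coefficients (a quick sketch; this plot is not part of the original slides):

beta <- coef(fit_glm) 
# solve beta0 + beta1*x_1 + beta2*x_2 = 0 for x_2 
intercept <- -beta["(Intercept)"]/beta["x_2"] 
slope <- -beta["x_1"]/beta["x_2"] 
mnist_27$train |> 
  ggplot(aes(x_1, x_2, color = y)) + 
  geom_point() + 
  geom_abline(intercept = intercept, slope = slope) 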

Generative models

  • With binary outcomes the smallest true error we can achieve is determined by Bayes’ rule, which is a decision rule based on the true conditional probability:

\[ p(\mathbf{x}) = \mbox{Pr}(Y = 1 \mid \mathbf{X}=\mathbf{x}) \]

  • We have described approaches to estimating \(p(\mathbf{x})\).

Generative models

  • In all these approaches, we estimate the conditional probability directly and do not consider the distribution of the predictors.

  • These are referred to as discriminative approaches.

  • However, Bayes’ theorem tells us that knowing the distribution of the predictors \(\mathbf{X}\) may be useful.

  • Methods that model the joint distribution of \(Y\) and \(\mathbf{X}\) are referred to as generative models.

  • We model how \(Y\) and \(\mathbf{X}\) are generated.

Naive Bayes

  • Recall that Bayes' theorem tells us that we can rewrite \(p(\mathbf{x})\) as follows:

\[ p(\mathbf{x}) = \mbox{Pr}(Y = 1|\mathbf{X}=\mathbf{x}) = \frac{f_{\mathbf{X}|Y = 1}(\mathbf{x}) \mbox{Pr}(Y = 1)} { f_{\mathbf{X}|Y = 0}(\mathbf{x})\mbox{Pr}(Y = 0) + f_{\mathbf{X}|Y = 1}(\mathbf{x})\mbox{Pr}(Y = 1) } \]

  • with \(f_{\mathbf{X}|Y = 1}\) and \(f_{\mathbf{X}|Y = 0}\) representing the distribution functions of the predictor \(\mathbf{X}\) for the two classes \(Y = 1\) and \(Y = 0\).

Controlling prevalence

  • One useful feature of the Naive Bayes approach is that it includes a parameter to account for differences in prevalence.

  • Using our sample, we estimate \(f_{X|Y = 1}\), \(f_{X|Y = 0}\), and the prevalence \(\pi = \mbox{Pr}(Y = 1)\).

  • If we use hats to denote the estimates, we can write \(\widehat{p}(x)\) as:

\[ \widehat{p}(x)= \frac{\widehat{f}_{X|Y = 1}(x) \widehat{\pi}} { \widehat{f}_{X|Y = 0}(x)(1-\widehat{\pi}) + \widehat{f}_{X|Y = 1}(x)\widehat{\pi} } \]

  • We can change the prevalence by plugging in other values in place of \(\widehat{\pi}\), as sketched below.
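A minimal sketch of this calculation, assuming a single predictor (here x_1 from mnist_27$train, with 7 playing the role of \(Y = 1\)) and approximately normal conditional distributions; the object names are just for illustration:

# Estimate f_{X|Y=1}, f_{X|Y=0} (normal approximations) and the prevalence pi 
params_x1 <- mnist_27$train |> 
  group_by(y) |> 
  summarize(avg = mean(x_1), sd = sd(x_1)) 
pi_hat <- mean(mnist_27$train$y == 7) 
x <- mnist_27$test$x_1 
f1 <- dnorm(x, params_x1$avg[params_x1$y == 7], params_x1$sd[params_x1$y == 7]) 
f0 <- dnorm(x, params_x1$avg[params_x1$y == 2], params_x1$sd[params_x1$y == 2]) 
p_hat <- f1*pi_hat / (f1*pi_hat + f0*(1 - pi_hat)) 
# To impose a different prevalence, plug in another value in place of pi_hat: 
p_hat_half <- f1*0.5 / (f1*0.5 + f0*0.5) 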

Naive Bayes

  • If we can estimate these conditional distributions of the predictors, we can develop a powerful decision rule.

  • However, this is a big if.

  • When \(\mathbf{X}\) has many dimensions and we do not have much information about the distribution, Naive Bayes will be practically impossible to implement.

  • However, with a small number of predictors and many categories, generative models can be quite powerful.

Quadratic discriminant analysis

  • Quadratic Discriminant Analysis (QDA) is a version of Naive Bayes in which we assume that the distributions \(f_{\mathbf{X}|Y = 1}(\mathbf{x})\) and \(f_{\mathbf{X}|Y = 0}(\mathbf{x})\) are multivariate normal.

  • The simple example we described in the previous section is actually QDA.

  • Let’s now look at a slightly more complicated case: the 2 or 7 example.

  • In this example, we have two predictors, so we assume their joint conditional distribution is bivariate normal within each class.

Quadratic discriminant analysis

  • We can estimate the parameters from the data:
params <- mnist_27$train |>  
  group_by(y) |>  
  summarize(avg_1 = mean(x_1), avg_2 = mean(x_2),  
            sd_1= sd(x_1), sd_2 = sd(x_2),  
            r = cor(x_1, x_2)) 
params 
# A tibble: 2 × 6
  y     avg_1 avg_2   sd_1   sd_2     r
  <fct> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 2     0.136 0.287 0.0670 0.0600 0.415
2 7     0.238 0.290 0.0745 0.104  0.468

Quadratic discriminant analysis

  • With these estimates in place, all we need is the prevalence \(\pi\) to compute:

\[ \widehat{p}(\mathbf{x})= \frac{\widehat{f}_{\mathbf{X}|Y = 1}(\mathbf{x}) \widehat{\pi}} { \widehat{f}_{\mathbf{X}|Y = 0}(\mathbf{x})(1-\widehat{\pi}) + \widehat{f}_{\mathbf{X}|Y = 1}(\mathbf{x})\widehat{\pi} } \]
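As a sketch of what this computation looks like done by hand with the estimates in params above (assuming the mvtnorm package is available; the qda function used below handles all of this for us, and make_sigma is just an illustrative helper):

library(mvtnorm) 
# Illustrative helper: build a covariance matrix from the estimated SDs and correlation 
make_sigma <- function(sd_1, sd_2, r) { 
  matrix(c(sd_1^2, r*sd_1*sd_2, r*sd_1*sd_2, sd_2^2), 2, 2) 
} 
p7 <- filter(params, y == 7) 
p2 <- filter(params, y == 2) 
pi_hat <- mean(mnist_27$train$y == 7) 
x <- as.matrix(mnist_27$test[, c("x_1", "x_2")]) 
f1 <- dmvnorm(x, mean = c(p7$avg_1, p7$avg_2), sigma = make_sigma(p7$sd_1, p7$sd_2, p7$r)) 
f0 <- dmvnorm(x, mean = c(p2$avg_1, p2$avg_2), sigma = make_sigma(p2$sd_1, p2$sd_2, p2$r)) 
p_hat <- f1*pi_hat / (f1*pi_hat + f0*(1 - pi_hat)) 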

Quadratic discriminant analysis

Here is the fitted model:

Quadratic discriminant analysis

  • We can fit QDA using the qda function from the MASS package:
train_qda <- MASS::qda(y ~ ., data = mnist_27$train) 
y_hat <- predict(train_qda, mnist_27$test)$class 
  • We see that we obtain relatively good accuracy:
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]  
Accuracy 
   0.815 

Quadratic discriminant analysis

  • The conditional probability looks relatively good, although it does not fit as well as the kernel smoothers:

Quadratic discriminant analysis

  • One reason QDA does not work as well as the kernel methods is because the assumption of normality does not quite hold:

Quadratic discriminant analysis

  • QDA can work well here, but it becomes harder to use as the number of predictors increases.

  • Here we have 2 predictors and had to compute 4 means, 4 SDs, and 2 correlations.

  • Notice that if we have 10 predictors, we have 45 correlations for each class.

  • The formula is \(K\times p(p-1)/2\), which gets big fast.

  • Once the number of parameters approaches the size of our data, the method becomes impractical due to overfitting.

Linear discriminant analysis

  • A relatively simple solution to QDA’s problem of having too many parameters is to assume that the correlation structure is the same for all classes.

  • This reduces the number of parameters we need to estimate.

Linear discriminant analysis

The estimated distributions look like this:

Linear discriminant analysis

  • We can fit LDA using the lda function from the MASS package (a sketch of the fit follows below).
  • This added constraint lowers the number of parameters, but the added rigidity lowers our accuracy:
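A minimal sketch of the fit, mirroring the qda call above (the original slide does not show this code):

train_lda <- MASS::lda(y ~ ., data = mnist_27$train) 
y_hat <- predict(train_lda, mnist_27$test)$class 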
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"] 
Accuracy 
   0.775 

Linear discriminant analysis

In fact we can show that the boundary is a line:

Connection to distance

  • The normal density is:

\[ f(x) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left\{ -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}\right\} \]

  • If we remove the constant \(1/(\sqrt{2\pi} \sigma)\) and then take the log, we get:

\[ -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \]

Connection to distance

  • Note this is the negative of (one half) the squared distance between \(x\) and \(\mu\), scaled by the standard deviation:

\[ \left(\frac{x-\mu}{\sigma}\right)^2 \]

  • For higher dimensions, the same is true except the scaling is more complex and involves correlations.
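Specifically, for a multivariate normal with mean \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\), the exponent becomes the (negative half) Mahalanobis distance, with the off-diagonal entries of \(\Sigma\) encoding the correlations:

\[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \]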

CART

  • Anything based on distances or distributions will face the curse of dimensionality.

  • In high dimensions the nearest neighbors will actually define a large region.

  • This makes it hard to estimate local non-linearities.

  • Regression trees use a completely different approach: they directly partition the predictor space.

CART motivation

  • To motivate this section, we will use a new dataset:
names(olive) 
 [1] "region"      "area"        "palmitic"    "palmitoleic" "stearic"    
 [6] "oleic"       "linoleic"    "linolenic"   "arachidic"   "eicosenoic" 

CART motivation

  • We try to predict the region using the fatty acid composition:
table(olive$region) 

Northern Italy       Sardinia Southern Italy 
           151             98            323 
  • We remove the area column because we won’t use it as a predictor.
olive <- select(olive, -area) 

CART motivation

  • Using kNN, we can achieve the following accuracy:
library(caret) 
fit <- train(region ~ .,  method = "knn",  
             tuneGrid = data.frame(k = seq(1, 15, 2)),  
             data = olive) 
fit$results[1,2]
[1] 0.969

CART motivation

  • However, a bit of data exploration reveals that we should be able to do even better:
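A sketch of one way to do this exploration (the original slide shows the resulting plot):

olive |> 
  pivot_longer(-region, names_to = "fatty_acid", values_to = "percentage") |> 
  ggplot(aes(region, percentage)) + 
  geom_boxplot() + 
  facet_wrap(~ fatty_acid, scales = "free_y") 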

CART motivation

  • We should be able to predict perfectly:
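For example, a scatterplot of just two of the fatty acids shows the three regions separating cleanly (a sketch; the pair plotted here, eicosenoic and linoleic, is an assumption based on the accompanying figure):

olive |> 
  ggplot(aes(eicosenoic, linoleic, color = region)) + 
  geom_point() 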

CART motivation

CART motivation

  • Decision trees like this are often used in practice.

Regression trees

  • When the outcome is continuous, we call this approach a regression tree.

  • To introduce regression trees, we will use the 2008 poll data used in previous sections to describe the basic idea of how we build these algorithms.

  • As with other machine learning algorithms, we will try to estimate the conditional expectation \(f(x) = \mbox{E}(Y | X = x)\) with \(Y\) the poll margin and \(x\) the day.

Regression trees

Regression trees

We can fit a regression tree with the rpart function:

library(rpart) 
fit <- rpart(margin ~ ., data = polls_2008) 

The algorithm uses rules to decide when to stop partitioning, controlled through parameters such as the complexity parameter cp and the minimum node size minsplit (see the sketch below).
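A minimal sketch of setting these stopping parameters explicitly; the values shown are rpart's documented defaults, so this fit matches the one above:

# cp: a split must improve the fit by at least this factor to be attempted 
# minsplit: minimum number of observations in a node for a split to be attempted 
fit <- rpart(margin ~ ., data = polls_2008, 
             control = rpart.control(cp = 0.01, minsplit = 20)) 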

Regression trees

plot(fit, margin = 0.1) 
text(fit, cex = 0.75) 

Regression trees

Regression trees

  • If we let the tree keep splitting all the way down, for example by setting cp = 0 and minsplit = 2, every point ends up in its own partition:

Regression trees

  • We can use caret's train function to pick the complexity parameter cp that controls when to stop:
library(caret) 
train_rpart <- train(margin ~ .,  
                     method = "rpart", 
                     tuneGrid = data.frame(cp = seq(0, 0.05, len = 25)), 
                     data = polls_2008) 
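To see which value of cp was selected, we can inspect the resampling results (a quick sketch):

# Plot performance across the cp grid, highlighting the chosen value 
ggplot(train_rpart, highlight = TRUE) 
# The value of cp selected by train 
train_rpart$bestTune 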

Regression trees

Regression trees

  • We can also grow a deep tree and then prune it back with a larger cp:
fit <- rpart(margin ~ ., data = polls_2008, 
             control = rpart.control(cp = 0)) 
pruned_fit <- prune(fit, cp = 0.01) 

Classification (decision) trees

  • When the outcome is categorical, we call these classification or decision trees. Applying the approach to the 2 or 7 data:
train_rpart <- train(y ~ ., 
                     method = "rpart", 
                     tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 25)), 
                     data = mnist_27$train) 
y_hat <- predict(train_rpart, mnist_27$test) 
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
Accuracy 
    0.81 
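We can also look at the final tree selected (a quick sketch, using the same plotting calls as before):

plot(train_rpart$finalModel, margin = 0.1) 
text(train_rpart$finalModel, cex = 0.75) 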

Classification (decision) trees

Here is the estimate of \(p(\mathbf{x})\):

Random forests

  • Random forests improve prediction by averaging many trees, each fit to a bootstrap sample of the data. Applying the approach to the polls data:
library(randomForest) 
fit <- randomForest(margin ~ ., data = polls_2008)  

Random forests

How many trees do we need? The plot method shows the error as a function of the number of trees:

rafalib::mypar() 
plot(fit) 

Random forests

  • The resulting estimate is obtained with:
y_hat <-  predict(fit, newdata = polls_2008)

Random forests

Final estimate:
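A sketch of how the plot on this slide can be produced:

polls_2008 |> 
  mutate(y_hat = predict(fit, newdata = polls_2008)) |> 
  ggplot(aes(day, margin)) + 
  geom_point() + 
  geom_line(aes(y = y_hat), col = "red") 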

Random forests

Random forests

Applying random forests to the 2 or 7 data:

library(randomForest) 
train_rf <- randomForest(y ~ ., data = mnist_27$train) 
confusionMatrix(predict(train_rf, mnist_27$test), 
                mnist_27$test$y)$overall["Accuracy"]
Accuracy 
    0.82 

Random forests

Here is the estimate of \(p(\mathbf{x})\):

Random forests

We increase smoothness with the nodesize parameter:

train_rf_31 <- randomForest(y ~ ., data = mnist_27$train, nodesize = 31) 

Random forests

This provides a better estimate:
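As a sketch of how such a plot can be made, using the grid of x_1 and x_2 values stored in mnist_27$true_p (part of the dslabs object):

# Predicted conditional probability of being a 7 on the grid 
p_hat <- predict(train_rf_31, newdata = mnist_27$true_p, type = "prob")[, "7"] 
mnist_27$true_p |> 
  mutate(p_hat = p_hat) |> 
  ggplot(aes(x_1, x_2, z = p_hat, fill = p_hat)) + 
  geom_raster() + 
  stat_contour(breaks = 0.5, color = "black") 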