Algorithms

2024-12-11

Examples of algorithms

  • There are hundreds of machine learning algorithms.

  • Here we provide a few examples spanning rather different approaches.

  • We will use the mnist_27 dataset from dslabs.

library(tidyverse) 
library(caret) 
library(dslabs) 

Logistic regression

  • We previously used linear regression to predict classes by fitting the model:

\[ p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

  • We used least squares after assigning numeric values of 0 and 1 to \(y\), and used lm as if the data were continuous.

Logistic regression

  • An obvious problem with this approach is that \(\widehat{p}(\mathbf{x})\) can be negative or larger than 1:
fit_lm <- lm(y ~ x_1 + x_2, 
             data = mutate(mnist_27$train,y = ifelse(y == 7, 1, 0))) 
range(fit_lm$fitted) 
[1] -0.22  1.92

Logistic regression

  • To avoid this, we can apply an approach more appropriate for binary data:

\[ \log \frac{p(\mathbf{x})}{1-p(\mathbf{x})} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

Logistic regression

We can use the glm function to fit the model:

fit_glm <- glm(y ~ x_1 + x_2, data = mnist_27$train, family = "binomial") 
p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response") 
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2)) 
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]
Accuracy 
   0.775 
  • We see that logistic regression performs similarly to linear regression.

Logistic regression

  • This is not surprising given that the estimate \(\widehat{p}(\mathbf{x})\) looks similar to the one obtained with linear regression:

Logistic regression

  • Just like regression, the decision rule is a line. Here \(g(p) = \log \frac{p}{1-p}\) is the logit link used by logistic regression, so:

\[ g^{-1}(\widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2) = 0.5 \implies\\ \widehat{\beta}_0 + \widehat{\beta}_1 x_1 + \widehat{\beta}_2 x_2 = g(0.5) = 0 \implies \\ x_2 = -\widehat{\beta}_0/\widehat{\beta}_2 - (\widehat{\beta}_1/\widehat{\beta}_2)\, x_1 \]
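For example, we can draw this boundary over the training data using the fitted coefficients (a quick sketch; this plot is not part of the original slides):

beta <- coef(fit_glm) 
# solve beta0 + beta1*x_1 + beta2*x_2 = 0 for x_2 
intercept <- -beta["(Intercept)"]/beta["x_2"] 
slope <- -beta["x_1"]/beta["x_2"] 
mnist_27$train |> 
  ggplot(aes(x_1, x_2, color = y)) + 
  geom_point() + 
  geom_abline(intercept = intercept, slope = slope) 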

Generative models

  • With binary outcomes the smallest true error we can achieve is determined by Bayes’ rule, which is a decision rule based on the true conditional probability:

\[ p(\mathbf{x}) = \mbox{Pr}(Y = 1 \mid \mathbf{X}=\mathbf{x}) \]

  • We have described approaches to estimating \(p(\mathbf{x})\).

Generative models

  • In all these approaches, we estimate the conditional probability directly and do not consider the distribution of the predictors.

  • These are referred to as discriminative approaches.

  • However, Bayes’ theorem tells us that knowing the distribution of the predictors \(\mathbf{X}\) may be useful.

  • Methods that model the joint distribution of \(Y\) and \(\mathbf{X}\) are referred to as generative models.

  • We model how \(Y\) and \(\mathbf{X}\) are generated.

Naive Bayes

  • Recall that Bayes' theorem tells us that we can rewrite \(p(\mathbf{x})\) as follows:

\[ p(\mathbf{x}) = \mbox{Pr}(Y = 1|\mathbf{X}=\mathbf{x}) = \frac{f_{\mathbf{X}|Y = 1}(\mathbf{x}) \mbox{Pr}(Y = 1)} { f_{\mathbf{X}|Y = 0}(\mathbf{x})\mbox{Pr}(Y = 0) + f_{\mathbf{X}|Y = 1}(\mathbf{x})\mbox{Pr}(Y = 1) } \]

  • with \(f_{\mathbf{X}|Y = 1}\) and \(f_{\mathbf{X}|Y = 0}\) representing the distribution functions of the predictor \(\mathbf{X}\) for the two classes \(Y = 1\) and \(Y = 0\).

Controlling prevalence

  • One useful feature of the Naive Bayes approach is that it includes a parameter to account for differences in prevalence.

  • Using our sample, we estimate \(f_{X|Y = 1}\), \(f_{X|Y = 0}\), and the prevalence \(\pi = \mbox{Pr}(Y = 1)\).

  • If we use hats to denote the estimates, we can write \(\widehat{p}(x)\) as:

\[ \widehat{p}(x)= \frac{\widehat{f}_{X|Y = 1}(x) \widehat{\pi}} { \widehat{f}_{X|Y = 0}(x)(1-\widehat{\pi}) + \widehat{f}_{X|Y = 1}(x)\widehat{\pi} } \]

  • We can change the prevalence by plugging in other values in place of \(\widehat{\pi}\), as sketched below.
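A minimal sketch of this calculation, assuming a single predictor (here x_1 from mnist_27$train, with 7 playing the role of \(Y = 1\)) and approximately normal conditional distributions; the object names are just for illustration:

# Estimate f_{X|Y=1}, f_{X|Y=0} (normal approximations) and the prevalence pi 
params_x1 <- mnist_27$train |> 
  group_by(y) |> 
  summarize(avg = mean(x_1), sd = sd(x_1)) 
pi_hat <- mean(mnist_27$train$y == 7) 
x <- mnist_27$test$x_1 
f1 <- dnorm(x, params_x1$avg[params_x1$y == 7], params_x1$sd[params_x1$y == 7]) 
f0 <- dnorm(x, params_x1$avg[params_x1$y == 2], params_x1$sd[params_x1$y == 2]) 
p_hat <- f1*pi_hat / (f1*pi_hat + f0*(1 - pi_hat)) 
# To impose a different prevalence, plug in another value in place of pi_hat: 
p_hat_half <- f1*0.5 / (f1*0.5 + f0*0.5) 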

Naive Bayes

  • If we can estimate these conditional distributions of the predictors, we can develop a powerful decision rule.

  • However, this is a big if.

  • When \(\mathbf{X}\) has many dimensions and we do not have much information about the distribution, Naive Bayes will be practically impossible to implement.

  • However, with a small number of predictors and many categories, generative models can be quite powerful.

Quadratic discriminant analysis

  • Quadratic Discriminant Analysis (QDA) is a version of Naive Bayes in which we assume that the distributions \(f_{\mathbf{X}|Y = 1}(\mathbf{x})\) and \(f_{\mathbf{X}|Y = 0}(\mathbf{x})\) are multivariate normal.

  • The simple example we described in the previous section is actually QDA.

  • Let’s now look at a slightly more complicated case: the 2 or 7 example.

  • In this example, we have two predictors, so we assume their joint conditional distribution is bivariate normal within each class.

Quadratic discriminant analysis

  • We can estimate the parameters from the data:
params <- mnist_27$train |>  
  group_by(y) |>  
  summarize(avg_1 = mean(x_1), avg_2 = mean(x_2),  
            sd_1= sd(x_1), sd_2 = sd(x_2),  
            r = cor(x_1, x_2)) 
params 
# A tibble: 2 × 6
  y     avg_1 avg_2   sd_1   sd_2     r
  <fct> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 2     0.136 0.287 0.0670 0.0600 0.415
2 7     0.238 0.290 0.0745 0.104  0.468

Quadratic discriminant analysis

  • With these estimates in place, all we need is the prevalence \(\pi\) to compute:

\[ \widehat{p}(\mathbf{x})= \frac{\widehat{f}_{\mathbf{X}|Y = 1}(\mathbf{x}) \widehat{\pi}} { \widehat{f}_{\mathbf{X}|Y = 0}(\mathbf{x})(1-\widehat{\pi}) + \widehat{f}_{\mathbf{X}|Y = 1}(\mathbf{x})\widehat{\pi} } \]
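As a sketch of what this computation looks like done by hand with the estimates in params above (assuming the mvtnorm package is available; the qda function used below handles all of this for us, and make_sigma is just an illustrative helper):

library(mvtnorm) 
# Illustrative helper: build a covariance matrix from the estimated SDs and correlation 
make_sigma <- function(sd_1, sd_2, r) { 
  matrix(c(sd_1^2, r*sd_1*sd_2, r*sd_1*sd_2, sd_2^2), 2, 2) 
} 
p7 <- filter(params, y == 7) 
p2 <- filter(params, y == 2) 
pi_hat <- mean(mnist_27$train$y == 7) 
x <- as.matrix(mnist_27$test[, c("x_1", "x_2")]) 
f1 <- dmvnorm(x, mean = c(p7$avg_1, p7$avg_2), sigma = make_sigma(p7$sd_1, p7$sd_2, p7$r)) 
f0 <- dmvnorm(x, mean = c(p2$avg_1, p2$avg_2), sigma = make_sigma(p2$sd_1, p2$sd_2, p2$r)) 
p_hat <- f1*pi_hat / (f1*pi_hat + f0*(1 - pi_hat)) 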

Quadratic discriminant analysis

Here is the fitted model:

Quadratic discriminant analysis

  • We can fit QDA using the qda function from the MASS package:
train_qda <- MASS::qda(y ~ ., data = mnist_27$train) 
y_hat <- predict(train_qda, mnist_27$test)$class 
  • We see that we obtain relatively good accuracy:
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]  
Accuracy 
   0.815 

Quadratic discriminant analysis

  • The conditional probability looks relatively good, although it does not fit as well as the kernel smoothers:

Quadratic discriminant analysis

  • One reason QDA does not work as well as the kernel methods is because the assumption of normality does not quite hold:

Quadratic discriminant analysis

  • QDA can work well here, but it becomes harder to use as the number of predictors increases.

  • Here we have 2 predictors and had to compute 4 means, 4 SDs, and 2 correlations.

  • Notice that if we have 10 predictors, we have 45 correlations for each class.

  • The formula is \(K\times p(p-1)/2\), which gets big fast.

  • Once the number of parameters approaches the size of our data, the method becomes impractical due to overfitting.

Linear discriminant analysis

  • A relatively simple solution to QDA’s problem of having too many parameters is to assume that the correlation structure is the same for all classes.

  • This reduces the number of parameters we need to estimate.

Linear discriminant analysis

The estimated distributions look like this:

Linear discriminant analysis

  • We can fit LDA using the lda function from the MASS package (a sketch of the fit follows below).
  • This added constraint lowers the number of parameters, but the added rigidity lowers our accuracy:
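A minimal sketch of the fit, mirroring the qda call above (the original slide does not show this code):

train_lda <- MASS::lda(y ~ ., data = mnist_27$train) 
y_hat <- predict(train_lda, mnist_27$test)$class 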
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"] 
Accuracy 
   0.775 

Linear discriminant analysis

In fact we can show that the boundary is a line:

Connection to distance

  • The normal density is:

\[ f(x) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left\{ -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}\right\} \]

  • If we remove the constant \(1/(\sqrt{2\pi} \sigma)\) and then take the log, we get:

\[ -\frac{1}{2} \frac{(x-\mu)^2}{\sigma^2} \]

Connection to distance

  • Note this is the negative of (one half) the squared distance between \(x\) and \(\mu\), scaled by the standard deviation:

\[ \left(\frac{x-\mu}{\sigma}\right)^2 \]

  • For higher dimensions, the same is true except the scaling is more complex and involves correlations.
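Specifically, for a multivariate normal with mean \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\), the exponent becomes the (negative half) Mahalanobis distance, with the off-diagonal entries of \(\Sigma\) encoding the correlations:

\[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \]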

CART

  • Anything based on distances or distributions will face the curse of dimensionality.

  • In high dimensions the nearest neighbors will actually define a large region.

  • This makes it hard to estimate local non-linearities.

  • Regression trees use a completely different approach: they directly partition the predictor space.

CART motivation

  • To motivate this section, we will use a new dataset:
names(olive) 
 [1] "region"      "area"        "palmitic"    "palmitoleic" "stearic"    
 [6] "oleic"       "linoleic"    "linolenic"   "arachidic"   "eicosenoic" 

CART motivation

  • We try to predict the region using the fatty acid composition:
table(olive$region) 

Northern Italy       Sardinia Southern Italy 
           151             98            323 
  • We remove the area column because we won’t use it as a predictor.
olive <- select(olive, -area) 

CART motivation

  • Using kNN, we can achieve the following accuracy:
library(caret) 
fit <- train(region ~ .,  method = "knn",  
             tuneGrid = data.frame(k = seq(1, 15, 2)),  
             data = olive) 
fit$results[1,2]
[1] 0.969

CART motivation

  • However, a bit of data exploration reveals that we should be able to do even better:
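A sketch of one way to do this exploration (the original slide shows the resulting plot):

olive |> 
  pivot_longer(-region, names_to = "fatty_acid", values_to = "percentage") |> 
  ggplot(aes(region, percentage)) + 
  geom_boxplot() + 
  facet_wrap(~ fatty_acid, scales = "free_y") 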

CART motivation

  • We should be able to predict perfectly:
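For example, a scatterplot of just two of the fatty acids shows the three regions separating cleanly (a sketch; the pair plotted here, eicosenoic and linoleic, is an assumption based on the accompanying figure):

olive |> 
  ggplot(aes(eicosenoic, linoleic, color = region)) + 
  geom_point() 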

CART motivation

CART motivation

  • Decision trees like this are often used in practice.

Regression trees

  • When the outcome is continuous, we call this approach a regression tree.

  • To introduce regression trees, we will use the 2008 poll data used in previous sections to describe the basic idea of how we build these algorithms.

  • As with other machine learning algorithms, we will try to estimate the conditional expectation \(f(x) = \mbox{E}(Y | X = x)\) with \(Y\) the poll margin and \(x\) the day.

Regression trees

Regression trees

We can fit a regression tree with the rpart function:

library(rpart) 
fit <- rpart(margin ~ ., data = polls_2008) 

The algorithm uses rules to decide when to stop partitioning, controlled through parameters such as the complexity parameter cp and the minimum node size minsplit (see the sketch below).
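A minimal sketch of setting these stopping parameters explicitly; the values shown are rpart's documented defaults, so this fit matches the one above:

# cp: a split must improve the fit by at least this factor to be attempted 
# minsplit: minimum number of observations in a node for a split to be attempted 
fit <- rpart(margin ~ ., data = polls_2008, 
             control = rpart.control(cp = 0.01, minsplit = 20)) 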

Regression trees

plot(fit, margin = 0.1) 
text(fit, cex = 0.75) 

Regression trees

Regression trees

  • If we let the tree keep splitting all the way down, for example by setting cp = 0 and minsplit = 2, every point ends up in its own partition:

Regression trees

  • We can use caret's train function to pick the complexity parameter cp that controls when to stop:
library(caret) 
train_rpart <- train(margin ~ .,  
                     method = "rpart", 
                     tuneGrid = data.frame(cp = seq(0, 0.05, len = 25)), 
                     data = polls_2008) 
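To see which value of cp was selected, we can inspect the resampling results (a quick sketch):

# Plot performance across the cp grid, highlighting the chosen value 
ggplot(train_rpart, highlight = TRUE) 
# The value of cp selected by train 
train_rpart$bestTune 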

Regression trees

Regression trees

  • We can also grow a deep tree and then prune it back with a larger cp:
fit <- rpart(margin ~ ., data = polls_2008, 
             control = rpart.control(cp = 0)) 
pruned_fit <- prune(fit, cp = 0.01) 

Classification (decision) trees

  • When the outcome is categorical, we call these classification or decision trees. Applying the approach to the 2 or 7 data:
train_rpart <- train(y ~ ., 
                     method = "rpart", 
                     tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 25)), 
                     data = mnist_27$train) 
y_hat <- predict(train_rpart, mnist_27$test) 
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
Accuracy 
    0.81 
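We can also look at the final tree selected (a quick sketch, using the same plotting calls as before):

plot(train_rpart$finalModel, margin = 0.1) 
text(train_rpart$finalModel, cex = 0.75) 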

Classification (decision) trees

Here is the estimate of \(p(\mathbf{x})\):

Random forests

  • Random forests improve prediction by averaging many trees, each fit to a bootstrap sample of the data. Applying the approach to the polls data:
library(randomForest) 
fit <- randomForest(margin ~ ., data = polls_2008)  

Random forests

How many trees do we need? The plot method shows the error as a function of the number of trees:

rafalib::mypar() 
plot(fit) 

Random forests

  • The resulting estimate is obtained with:
y_hat <-  predict(fit, newdata = polls_2008)

Random forests

Final estimate:
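A sketch of how the plot on this slide can be produced:

polls_2008 |> 
  mutate(y_hat = predict(fit, newdata = polls_2008)) |> 
  ggplot(aes(day, margin)) + 
  geom_point() + 
  geom_line(aes(y = y_hat), col = "red") 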

Random forests

Random forests

Applying random forests to the 2 or 7 data:

library(randomForest) 
train_rf <- randomForest(y ~ ., data = mnist_27$train) 
confusionMatrix(predict(train_rf, mnist_27$test), 
                mnist_27$test$y)$overall["Accuracy"]
Accuracy 
    0.82 

Random forests

Here is the estimate of \(p(\mathbf{x})\):

Random forests

We increase smoothness with the nodesize parameter:

train_rf_31 <- randomForest(y ~ ., data = mnist_27$train, nodesize = 31) 

Random forests

This provides a better estimate:
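As a sketch of how such a plot can be made, using the grid of x_1 and x_2 values stored in mnist_27$true_p (part of the dslabs object):

# Predicted conditional probability of being a 7 on the grid 
p_hat <- predict(train_rf_31, newdata = mnist_27$true_p, type = "prob")[, "7"] 
mnist_27$true_p |> 
  mutate(p_hat = p_hat) |> 
  ggplot(aes(x_1, x_2, z = p_hat, fill = p_hat)) + 
  geom_raster() + 
  stat_contour(breaks = 0.5, color = "black") 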