With binary outcomes, the smallest true error we can achieve is determined by Bayes’ rule, which is a decision rule based on the true conditional probability:

\[
p(\mathbf{x}) = \mbox{Pr}(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \frac{f_{\mathbf{X}|Y = 1}(\mathbf{x})\,\pi}{f_{\mathbf{X}|Y = 1}(\mathbf{x})\,\pi + f_{\mathbf{X}|Y = 0}(\mathbf{x})\,(1 - \pi)}
\]

with \(f_{\mathbf{X}|Y = 1}\) and \(f_{\mathbf{X}|Y = 0}\) representing the distribution functions of the predictor \(\mathbf{X}\) for the two classes \(Y = 1\) and \(Y = 0\), and \(\pi = \mbox{Pr}(Y = 1)\) the prevalence of class 1.
Naive Bayes

If we can estimate these conditional distributions of the predictors, we can develop a powerful decision rule. However, this is a big if. When \(\mathbf{X}\) has many dimensions and we do not have much information about the distribution, Naive Bayes will be practically impossible to implement. However, with a small number of predictors and many categories, generative models can be quite powerful.

Controlling prevalence

One useful feature of the Naive Bayes approach is that it includes a parameter to account for differences in prevalence. Using our sample, we estimate \(f_{X|Y = 1}\), \(f_{X|Y = 0}\), and \(\pi\). If we use hats to denote the estimates, we can write \(\widehat{p}(x)\) as:

\[
\widehat{p}(x) = \frac{\widehat{f}_{X|Y = 1}(x)\,\widehat{\pi}}{\widehat{f}_{X|Y = 1}(x)\,\widehat{\pi} + \widehat{f}_{X|Y = 0}(x)\,(1 - \widehat{\pi})}
\]

We can change the prevalence by plugging in other values in place of \(\widehat{\pi}\).
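To make this concrete, here is a minimal sketch (not from the original slides) that assumes a hypothetical training set train_set with a single numeric predictor x, a binary outcome y coded 0/1, normal class-conditional densities, and hypothetical new predictor values new_x:

# hypothetical data: train_set with columns x (numeric) and y (0/1)
avg1 <- mean(train_set$x[train_set$y == 1])
sd1  <- sd(train_set$x[train_set$y == 1])
avg0 <- mean(train_set$x[train_set$y == 0])
sd0  <- sd(train_set$x[train_set$y == 0])
pi_hat <- mean(train_set$y == 1)   # estimated prevalence

# estimated conditional probability, with the prevalence as a plug-in parameter
p_hat <- function(x, prevalence = pi_hat) {
  f1 <- dnorm(x, avg1, sd1)   # estimate of f_{X|Y=1}
  f0 <- dnorm(x, avg0, sd0)   # estimate of f_{X|Y=0}
  f1 * prevalence / (f1 * prevalence + f0 * (1 - prevalence))
}

p_hat(new_x)                       # uses the sample prevalence
p_hat(new_x, prevalence = 0.05)    # plug in a different prevalence

The second call illustrates the point above: the estimated densities stay the same and only the prevalence plug-in changes.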
Quadratic discriminant analysis
Quadratic Discriminant Analysis (QDA) is a version of Naive Bayes in which we assume that the distributions \(f_{\mathbf{X}|Y = 1}(\mathbf{x})\) and \(f_{\mathbf{X}|Y = 0}(\mathbf{x})\) are multivariate normal.
The simple example we described in the previous section is actually QDA.
Let’s now look at a slightly more complicated case: the 2 or 7 example.
In this example, we have two predictors, so we assume that their joint distribution within each class is bivariate normal.
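As a sketch (not from the slides; it assumes the caret package and the mnist_27 data used later in this deck), QDA can be fit and evaluated like this:

library(caret)
train_qda <- train(y ~ ., method = "qda", data = mnist_27$train)   # QDA via caret
y_hat <- predict(train_qda, mnist_27$test)
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]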
CART motivation

To motivate classification and regression trees (CART), we use a dataset with the fatty acid composition of Italian olive oils and try to predict the region of origin from that composition:
table(olive$region)
Northern Italy Sardinia Southern Italy
151 98 323
We remove the area column because we won’t use it as a predictor.
olive <- select(olive, -area)
Using kNN, we can achieve the following accuracy:
library(caret)
fit <- train(region ~ ., method = "knn",
             tuneGrid = data.frame(k = seq(1, 15, 2)),
             data = olive)
fit$results[1,2]   # cross-validated accuracy for the first value of k
[1] 0.969
However, a bit of data exploration reveals that we should be able to do even better.
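For example, a sketch of this kind of exploration (not from the slides; it assumes the tidyr and ggplot2 packages) is to plot the distribution of each fatty acid stratified by region:

library(tidyr)
library(ggplot2)
olive |>
  pivot_longer(-region, names_to = "fatty_acid", values_to = "percentage") |>
  ggplot(aes(region, percentage, fill = region)) +
  geom_boxplot() +
  facet_wrap(~ fatty_acid, scales = "free_y") +
  theme(axis.text.x = element_blank())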
In fact, eicosenoic acid is present only in Southern Italy and linoleic acid separates Northern Italy from Sardinia, so we should be able to predict perfectly.
Decision trees, which make predictions by applying a sequence of yes/no splits on the predictors, are often used in practice.
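As a sketch (not from the slides), such a tree can be fit directly with rpart:

library(rpart)
fit_olive <- rpart(region ~ ., data = olive)   # classification tree for region
plot(fit_olive, margin = 0.1)
text(fit_olive, cex = 0.75)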
Regression trees
When the outcome is continuous, we call the tree-based approach a regression tree.
To introduce regression trees, we will use the 2008 poll data from previous sections to describe the basic idea of how these algorithms are built.
As with other machine learning algorithms, we will try to estimate the conditional expectation \(f(x) = \mbox{E}(Y | X = x)\) with \(Y\) the poll margin and \(x\) the day.
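To make the idea concrete (this formalization is added here, not taken from the slides): the tree partitions the predictor space into non-overlapping regions \(R_1, \dots, R_J\), and for any \(x\) that falls in region \(R_j\) the prediction is the average outcome of the training observations in that region:

\[
\widehat{f}(x) = \frac{1}{N_j} \sum_{i:\, x_i \in R_j} y_i, \qquad N_j = \#\{i : x_i \in R_j\}.
\]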
The following code fits a regression tree to these data:
library(rpart)
fit <- rpart(margin ~ ., data = polls_2008)
There are rules that determine when to stop splitting; in rpart they are controlled by parameters such as the complexity parameter cp, minsplit, and minbucket.
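As a sketch, these can be set explicitly through rpart.control; the values shown here are rpart's defaults, so this fit is equivalent to the one above:

fit <- rpart(margin ~ ., data = polls_2008,
             control = rpart.control(cp = 0.01, minsplit = 20))   # rpart defaults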
# visualize the fitted tree
plot(fit, margin = 0.1)
text(fit, cex = 0.75)
If we let the splitting continue until each observation is its own partition, the estimate fits the training data perfectly and overfits.
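A sketch of what that looks like: set cp = 0 and minsplit = 2 so that splitting never stops early:

fit_full <- rpart(margin ~ ., data = polls_2008,
                  control = rpart.control(cp = 0, minsplit = 2))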
We can use cross-validation to pick the complexity parameter cp that controls when to stop splitting:
library(caret)
train_rpart <- train(margin ~ ., method = "rpart",
                     tuneGrid = data.frame(cp = seq(0, 0.05, len = 25)),
                     data = polls_2008)
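A sketch of how to inspect the result using standard caret fields:

train_rpart$bestTune                           # cp chosen by cross-validation
plot(train_rpart$finalModel, margin = 0.1)     # tree refit with the chosen cp
text(train_rpart$finalModel, cex = 0.75)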
We can also grow a deep tree and then prune it back using a larger complexity parameter:
fit <- rpart(margin ~ ., data = polls_2008, control = rpart.control(cp = 0))   # grow a full tree
pruned_fit <- prune(fit, cp = 0.01)                                            # prune it back
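A sketch of how to compare the pruned fit to the data:

o <- order(polls_2008$day)
plot(polls_2008$day, polls_2008$margin)
lines(polls_2008$day[o], predict(pruned_fit, polls_2008)[o], col = "red", lwd = 2)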
Classification (decision) trees
Classification (or decision) trees are used when the outcome is categorical: instead of the residual sum of squares, splits are chosen using a measure such as the Gini index or entropy, and each partition predicts its most common class. We can apply this to the 2 or 7 data:
train_rpart <- train(y ~ ., method = "rpart",
                     tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 25)),
                     data = mnist_27$train)
y_hat <- predict(train_rpart, mnist_27$test)
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]
Accuracy
0.81
Plotting the estimate of \(p(\mathbf{x})\) reveals a limitation of classification trees: because the tree partitions the predictor space, the estimate is piecewise constant and the decision boundary cannot be smooth.
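A sketch of how to produce such a plot (ggplot2 is assumed, and the grid ranges here are illustrative assumptions):

library(ggplot2)
grid <- expand.grid(x_1 = seq(0, 0.5, len = 200),
                    x_2 = seq(0, 0.6, len = 200))
grid$p_hat <- predict(train_rpart, grid, type = "prob")[, "7"]   # estimated Pr(Y = 7 | x)
ggplot(grid, aes(x_1, x_2, fill = p_hat)) + geom_raster()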
Random forests
Random forests improve on single trees by averaging the predictions of many trees, each fit to a bootstrap sample of the training data with a random subset of the predictors considered at each split. We can apply this to the polls data:
library(randomForest)
fit <- randomForest(margin ~ ., data = polls_2008)
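A sketch of how to inspect the fit:

plot(fit)   # estimated error as a function of the number of trees
o <- order(polls_2008$day)
plot(polls_2008$day, polls_2008$margin)
lines(polls_2008$day[o], predict(fit, polls_2008)[o], col = "red", lwd = 2)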