32 Conditional probabilities and expectations
In machine learning applications, we rarely can predict outcomes perfectly. For example,
- spam detectors often miss emails that are clearly spam,
- Siri often misunderstands the words we are saying, and
- your bank at times thinks your card was stolen when it was not.
The most common reason we are unable to build perfect algorithms is that, for most problems, perfect prediction is impossible.
To see this, note that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.
Because our prediction rules are functions, equal inputs (the same predictor values) imply equal outputs (the same prediction).
Therefore, in a challenge in which the same predictors are associated with different outcomes across different observations, it is impossible to predict correctly for all of these cases.
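To make this concrete, here is a minimal sketch with made-up data (not from the text): two observations share the same predictor value but have different outcomes, so even the best possible rule must miss one of them.

```python
# Toy illustration: two observations share the predictor value x = 2 but
# have different outcomes. Any prediction rule is a function of x, so it
# must return the same prediction for both rows and miss at least one.
data = [(1, 0), (2, 0), (2, 1), (3, 1)]  # (predictor x, outcome y), made up

def best_rule(x):
    # Predict the majority outcome within each stratum of x; for x = 2
    # either choice misses one of the two observations.
    return 0 if x <= 2 else 1

accuracy = sum(best_rule(x) == y for x, y in data) / len(data)
print(accuracy)  # 0.75: perfect accuracy is unattainable on this data
```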
32.1 Conditional probabilities
We use the notation \((X_1 = x_1,\dots,X_p=x_p)\) to represent the fact that we have observed values \(x_1,\dots,x_p\) for features \(X_1, \dots, X_p\).
This does not imply that the outcome \(Y\) will take a specific value. Instead, it implies a specific probability.
We denote the conditional probabilities for each class \(k\) with:
\[ \mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K \]
- We will use bold letters to represent the vector of predictors and its observed values: \(\mathbf{X} \equiv (X_1,\dots,X_p)^\top\) and \(\mathbf{x} \equiv (x_1,\dots,x_p)^\top\).
- We will also use the following notation for the conditional probability of being class \(k\):
\[ p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K \]
Do not confuse the \(p\) in \(p_k(\mathbf{x})\) with the \(p\) that represents the number of predictors.
These probabilities guide the construction of an algorithm that makes the best prediction: for any given \(\mathbf{x}\), we will predict the class \(k\) with the largest probability among \(p_1(\mathbf{x}), p_2(\mathbf{x}), \dots, p_K(\mathbf{x})\).
In mathematical notation, we write it like this:
\[\hat{Y} = \arg\max_k p_k(\mathbf{x})\]
In machine learning, we refer to this as Bayes’ Rule.
But this is a theoretical rule since, in practice, we don’t know \(p_k(\mathbf{x}), k=1,\dots,K\).
Estimating these conditional probabilities can be thought of as the main challenge of machine learning.
The better our probability estimates \(\hat{p}_k(\mathbf{x})\), the better our predictor \(\hat{Y}\).
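As an illustrative sketch (the estimated probabilities below are hypothetical, not from the text), the rule simply picks the class index with the largest \(\hat{p}_k(\mathbf{x})\):

```python
# Prediction by picking the class with the largest estimated conditional
# probability. p_hat[k - 1] plays the role of p_hat_k(x) for k = 1, ..., K.
def predict_class(p_hat):
    # return the (1-based) class index k that maximizes p_hat_k(x)
    return max(range(1, len(p_hat) + 1), key=lambda k: p_hat[k - 1])

p_hat = [0.1, 0.7, 0.2]      # hypothetical estimates for K = 3 classes
print(predict_class(p_hat))  # 2
```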
So how well we predict depends on two things:
- how close \(\max_k p_k(\mathbf{x})\) is to 1 (perfect certainty), and
- how close our estimates \(\hat{p}_k(\mathbf{x})\) are to \(p_k(\mathbf{x})\).
We can’t do anything about the first restriction, as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate the conditional probabilities.
The first restriction does imply that there are limits to how well even the best possible algorithm can perform.
In some challenges, such as reading digits, we will be able to achieve almost perfect accuracy; in others, such as movie recommendations, our success is restricted by the randomness of the process.
It is important to remember that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context.
As discussed above, sensitivity and specificity may differ in importance.
But even in these cases, having a good estimate of \(p_k(\mathbf{x}), k=1,\dots,K\) will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish.
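Here is a minimal sketch of this idea, using made-up estimates and outcomes: in the binary case, moving the cutoff applied to the estimated probability of class 1 shifts the balance between sensitivity and specificity.

```python
# Binary case: moving the cutoff applied to the estimated probability of
# class 1 trades sensitivity against specificity.
p_hat = [0.9, 0.6, 0.4, 0.2, 0.7]  # hypothetical estimates of Pr(Y = 1 | X = x)
y     = [1,   1,   1,   0,   0]    # made-up true outcomes

def sens_spec(cutoff):
    pred = [1 if p > cutoff else 0 for p in p_hat]
    sens = sum(p == 1 for p, t in zip(pred, y) if t == 1) / sum(y)
    spec = sum(p == 0 for p, t in zip(pred, y) if t == 0) / y.count(0)
    return sens, spec

for cutoff in (0.3, 0.5, 0.8):
    print(cutoff, sens_spec(cutoff))  # lower cutoff: higher sensitivity, lower specificity
```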
32.2 Conditional expectations
For binary data, you can think of the probability \(\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})\) as the proportion of 1s in the stratum of the population for which \(\mathbf{X}=\mathbf{x}\).
Many of the algorithms we will learn can be applied to both categorical and continuous data due to the connection between conditional probabilities and conditional expectations.
Because the expectation is the average of the values \(y_1,\dots,y_n\) in the population, when the \(y\)s are 0 or 1 the average is simply the proportion of 1s. Therefore, in this case:
\[ \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})=\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}). \]
As a result, we often use only the expectation to denote both the conditional probability and the conditional expectation.
In both the categorical and continuous case, we assume the outcome follows a conditional distribution given \(\mathbf{X}=\mathbf{x}\).
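For intuition, here is a minimal sketch with made-up 0/1 data showing that the stratum average is exactly the proportion of 1s in that stratum:

```python
# With 0/1 outcomes, the stratum average of y is the proportion of 1s,
# so an estimate of E(Y | X = x) is also an estimate of Pr(Y = 1 | X = x).
data = [(1, 0), (1, 1), (1, 1), (2, 0), (2, 0), (2, 1)]  # made-up (x, y) pairs

def stratum_average(x0):
    ys = [y for x, y in data if x == x0]
    return sum(ys) / len(ys)  # mean of 0s and 1s = proportion of 1s

print(stratum_average(1))  # 0.666... ~ Pr(Y = 1 | X = 1) in this toy population
print(stratum_average(2))  # 0.333... ~ Pr(Y = 1 | X = 2)
```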
32.3 Conditional expectation minimizes the squared loss function
Why do we care about the conditional expectation in machine learning?
This is because the expected value has an attractive mathematical property: it minimizes the mean squared error (MSE). Specifically, of all possible predictions \(\hat{Y}\),
\[ \hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x}) \, \mbox{ minimizes } \, \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} \]
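To see why, add and subtract \(\mbox{E}(Y \mid \mathbf{X}=\mathbf{x})\) inside the square; treating \(\hat{Y}\) as a fixed prediction for the given \(\mathbf{x}\), the cross term has conditional expectation zero, leaving:
\[ \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} = \{\hat{Y} - \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})\}^2 + \mbox{Var}(Y \mid \mathbf{X}=\mathbf{x}) \]
The second term does not depend on \(\hat{Y}\), so the expression is minimized by setting \(\hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})\), which makes the first term zero.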
Due to this property, a succinct description of the main task of machine learning is that we use data to estimate:
\[ f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} ) \]
for any set of features \(\mathbf{x} = (x_1, \dots, x_p)^\top\).
This is easier said than done, since this function can take any shape and \(p\) can be very large.
Consider a case in which we only have one predictor \(x\). The expectation \(\mbox{E}\{ Y \mid X=x \}\) can be any function of \(x\): a line, a parabola, a sine wave, a step function, anything.
It gets even more complicated when we consider instances with large \(p\), in which case \(f(\mathbf{x})\) is a function of a multidimensional vector \(\mathbf{x}\). In the digit reader case, for example, \(p = 784\)!
The main way in which competing machine learning algorithms differ is in their approach to estimating this conditional expectation.