

Rafael A. Irizarry


December 4, 2024


Machine Learning

Conditional probabilities and expectations

  • In machine learning applications, we rarely can predict outcomes perfectly.

  • The most common reason we cannot build perfect algorithms is that perfect prediction is impossible.

  • To see this, consider that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.

Conditional probabilities and expectations

  • Because our prediction rules are functions, equal inputs (the predictors) imply equal outputs (the predictions).

  • Therefore, for a challenge in which the same predictors are associated with different outcomes across different individual observations, it is impossible to predict correctly for all these cases.

  • It therefore makes sense to consider conditional probabilities:

\[ \mbox{Pr}(Y=k \mid X_1 = x_1,\dots,X_p=x_p), \, \mbox{for}\,k=1,\dots,K \]
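The argument above can be illustrated with a minimal sketch. The toy dataset below is made up for illustration and is not from the original slides. It groups observations with identical predictors, computes the empirical version of the conditional probabilities, and counts how many outcomes the best possible function of the predictors could get right.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each entry is ((x1, x2), y).
data = [
    ((0, 1), 1), ((0, 1), 1), ((0, 1), 0),
    ((1, 1), 0), ((1, 1), 0),
    ((1, 0), 1), ((1, 0), 0),
]

# Count outcomes within each group of identical predictor values.
counts = defaultdict(Counter)
for x, y in data:
    counts[x][y] += 1

# Empirical conditional probabilities Pr(Y = k | X = x).
cond_prob = {
    x: {k: c / sum(cnt.values()) for k, c in cnt.items()}
    for x, cnt in counts.items()
}

# A function of x returns one prediction per group, so at best it
# matches the most common outcome in each group.
best_correct = sum(max(cnt.values()) for cnt in counts.values())
best_accuracy = best_correct / len(data)
```

In this toy example, `best_accuracy` is 5/7: no rule that depends only on the predictors can do better, because two of the three groups contain both outcomes.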

Conditional probabilities

  • We will also use the following notation for the conditional probability of being class \(k\):

\[ p_k(\mathbf{x}) = \mbox{Pr}(Y=k \mid \mathbf{X}=\mathbf{x}), \, \mbox{for}\, k=1,\dots,K \]

  • Notice that the \(p_k(\mathbf{x})\) have to add up to 1 for each \(\mathbf{x}\), so once we know \(K-1\) of them, we know them all.

Conditional probabilities

  • When the outcome is binary, we only need to know one of these probabilities, so we drop the \(k\) and use the notation:

\[p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})\]

  • Do not be confused by the fact that we use \(p\) for two different things: the conditional probability \(p(\mathbf{x})\) and the number of predictors \(p\).

Conditional probabilities

  • These probabilities guide the construction of an algorithm that makes the best prediction:

\[\hat{Y} = \arg\max_k p_k(\mathbf{x})\]

  • In machine learning, we refer to this as Bayes’ Rule.

  • But this is a theoretical rule since, in practice, we don’t know \(p_k(\mathbf{x}), k=1,\dots,K\).
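As a minimal sketch of this rule, suppose, purely for illustration, that the conditional probabilities at some \(\mathbf{x}\) were known. The function name `bayes_rule` and the example probabilities are assumptions, not part of the original slides.

```python
def bayes_rule(p):
    """Return the class k (numbered from 1) with the largest p_k(x).

    `p` is a list [p_1(x), ..., p_K(x)] of conditional probabilities
    at a single x; in practice these are unknown and must be estimated.
    """
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("conditional probabilities must sum to 1")
    return max(range(1, len(p) + 1), key=lambda k: p[k - 1])


# Hypothetical conditional probabilities for K = 3 classes.
p_x = [0.2, 0.5, 0.3]
y_hat = bayes_rule(p_x)  # class 2 has the largest probability
```

The prediction is the class label, not the probability itself: the rule returns the index of the maximum rather than its value.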

Conditional probabilities

  • Estimating these conditional probabilities can be thought of as the main challenge of machine learning.

  • The better our probability estimates \(\hat{p}_k(\mathbf{x})\), the better our predictor \(\hat{Y}\).

Conditional probabilities

How well we predict depends on two things:

  • how close the \(\max_k p_k(\mathbf{x})\) are to 1 (perfect certainty) and
  • how close our estimates \(\hat{p}_k(\mathbf{x})\) are to \(p_k(\mathbf{x})\).

We can’t do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities.
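Both factors can be seen in a small binary sketch. The strata, the probabilities, and the helper `expected_accuracy` are hypothetical, and the strata are assumed to be equally likely.

```python
def expected_accuracy(p_true, p_hat):
    """Expected accuracy of predicting 1 whenever p_hat > 0.5.

    p_true[i] is the true Pr(Y = 1 | X = x_i) and p_hat[i] is its estimate.
    Strata are assumed equally likely (an illustrative simplification).
    """
    total = 0.0
    for p, ph in zip(p_true, p_hat):
        # Predicting 1 is correct with probability p, predicting 0 with 1 - p.
        total += p if ph > 0.5 else 1 - p
    return total / len(p_true)


# Factor 1: the nature of the problem sets the ceiling.
p_easy = [0.9, 0.6, 0.2]
p_hard = [0.55, 0.5, 0.45]
ceiling_easy = expected_accuracy(p_easy, p_easy)  # using the true p(x)
ceiling_hard = expected_accuracy(p_hard, p_hard)

# Factor 2: worse estimates lose accuracy relative to that ceiling.
p_hat_good = [0.8, 0.55, 0.3]  # every estimate on the correct side of 0.5
p_hat_bad = [0.8, 0.4, 0.3]    # second estimate on the wrong side of 0.5
acc_good = expected_accuracy(p_easy, p_hat_good)
acc_bad = expected_accuracy(p_easy, p_hat_bad)
```

Here `acc_good` equals `ceiling_easy` even though the estimates are not exact, because for classification only the side of 0.5 matters. `acc_bad` falls below the ceiling, and `ceiling_hard` is lower than `ceiling_easy` however good the estimates are.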

Conditional expectations

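  • For binary data, you can think of the probability \(\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})\) as the proportion of 1s in the stratum of the population for which \(\mathbf{X}=\mathbf{x}\).

  • Because the average of a set of 0s and 1s is the proportion of 1s, the conditional expectation equals this conditional probability: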

\[ \mbox{E}(Y \mid \mathbf{X}=\mathbf{x})=\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}). \]

  • As a result, we often use the expectation alone to denote both the conditional probability and the conditional expectation.

Conditional expectations

  • Why do we care about the conditional expectation in machine learning?

  • We care because the conditional expectation has an attractive mathematical property: among all possible predictions, it minimizes the mean squared error (MSE).

\[ \hat{Y} = \mbox{E}(Y \mid \mathbf{X}=\mathbf{x}) \, \mbox{ minimizes } \, \mbox{E}\{ (\hat{Y} - Y)^2 \mid \mathbf{X}=\mathbf{x} \} \]
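A short sketch of why this holds, using two symbols that are not in the original slides: \(c\) for any candidate prediction at \(\mathbf{x}\), and \(m\) as shorthand for \(\mbox{E}(Y \mid \mathbf{X}=\mathbf{x})\). Writing \(c - Y = (c - m) + (m - Y)\) and expanding the square gives

\[ \mbox{E}\{ (c - Y)^2 \mid \mathbf{X}=\mathbf{x} \} = (c - m)^2 + 2(c - m)\,\mbox{E}\{ (m - Y) \mid \mathbf{X}=\mathbf{x} \} + \mbox{E}\{ (m - Y)^2 \mid \mathbf{X}=\mathbf{x} \} \]

The middle term is 0 because \(\mbox{E}\{ (m - Y) \mid \mathbf{X}=\mathbf{x} \} = m - m = 0\), so

\[ \mbox{E}\{ (c - Y)^2 \mid \mathbf{X}=\mathbf{x} \} = (c - m)^2 + \mbox{E}\{ (m - Y)^2 \mid \mathbf{X}=\mathbf{x} \} \]

The second term does not depend on \(c\), and the first is smallest, namely 0, when \(c = m\).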

Conditional expectations

  • Due to this property, a succinct description of the main task of machine learning is that we use data to estimate:

\[ f(\mathbf{x}) \equiv \mbox{E}( Y \mid \mathbf{X}=\mathbf{x} ) \]

  • for any set of features \(\mathbf{x} = (x_1, \dots, x_p)^\top\).

  • This is easier said than done, since this function can take any shape and \(p\) can be very large.
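One way to see the difficulty is a naive estimator that averages the outcomes within each stratum of identical features. This is a sketch under stated assumptions, not a recommended method: the function `fit_stratum_means`, the toy data, and the choice of binary features are all hypothetical.

```python
from collections import defaultdict


def fit_stratum_means(X, y):
    """Estimate f(x) = E(Y | X = x) by the average of y within each
    group of observations that share exactly the same feature values."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, yi in zip(X, y):
        key = tuple(x)
        sums[key] += yi
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical toy data with p = 2 binary features.
X = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 1)]
y = [0, 1, 1, 1, 0, 1]
f_hat = fit_stratum_means(X, y)

# The stratum (1, 0) never appears, so this estimator says nothing about it.
unseen = (1, 0) not in f_hat

# With p binary features there are 2**p possible strata.
n_strata_30 = 2 ** 30
```

Even with only 30 binary features there are more than a billion possible strata, so most will contain no observations at all. Practical methods therefore have to make assumptions about the shape of \(f\).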