2024-10-21
Many data generation procedures can be effectively modeled as draws from an urn.
We can model the process of polling likely voters as drawing 0s for one party and 1s for the other.
Epidemiologists assume subjects in their studies are a random sample from the population of interest.
Similarly, in experimental research, we often assume that the individual organisms we are studying, for example worms, flies, or mice, are a random sample from a larger population.
Randomized experiments can also be modeled by draws from an urn, reflecting the way individuals are assigned to groups: in effect, each individual draws their group at random.
Sampling models are therefore ubiquitous in data science.
Casino games offer a plethora of real-world cases in which sampling models are used to answer specific questions.
We will therefore start with these examples.
Suppose a very small casino hires you to consult on whether they should set up roulette wheels.
We will assume that 1,000 people will play, and that the only game available on the roulette wheel is to bet on red or black.
The casino wants you to predict how much money they will make or lose.
They want a range of values and, in particular, they want to know what’s the chance of losing money.
If this probability is too high, they will decide against installing roulette wheels.
We are going to define a random variable \(S\) that will represent the casino’s total winnings.
This is a roulette wheel.
Let’s start by constructing the urn.
A roulette wheel has 18 red pockets, 18 black pockets and 2 green ones.
So playing a color in one game of roulette is equivalent to drawing from this urn:
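One way to represent this urn in R is as a vector with one entry per pocket; the name color is just an illustrative choice:

```r
# The urn: one entry per pocket of the roulette wheel
color <- rep(c("black", "red", "green"), c(18, 18, 2))
```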
The 1,000 outcomes from 1,000 people playing are independent draws from this urn.
If red comes up, the gambler wins and the casino loses a dollar, so the random variable takes the value -$1.
Otherwise, the casino wins a dollar, and the random variable is $1.
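A minimal sketch of these 1,000 draws, assuming the color vector defined above:

```r
# -1 when the pocket is red (casino loses a dollar), +1 otherwise
n <- 1000
x <- sample(ifelse(color == "red", -1, 1), n, replace = TRUE)
x[1:10]
```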
This is a sampling model, as it models the random behavior through the sampling of draws from an urn.
The total winnings \(S\) is simply the sum of these 1,000 independent draws:
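Continuing the sketch above, with x holding the 1,000 draws:

```r
# The casino's total winnings for this realization of 1,000 games
S <- sum(x)
S
```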
If you rerun the code above, you see that \(S\) changes every time.
\(S\) is a random variable.
The probability distribution of a random variable informs us about the probability of the observed value falling in any given interval.
For example, if we want to know the probability that we lose money, we are asking the probability that \(S\) is in the interval \((-\infty,0)\).
If we can define a cumulative distribution function \(F(a) = \mbox{Pr}(S\leq a)\), we can answer any question about the probability of events defined by \(S\).
We call this \(F\) the random variable’s distribution function.
Probability and Statistics classes dedicate much time to calculating or approximating these.
We can also estimate the distribution function for \(S\) using a Monte Carlo simulation.
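A sketch of such a simulation, reusing the urn defined above; the number of repetitions B and the object name s are illustrative choices:

```r
# Simulate the total winnings from 1,000 games, B times
B <- 10000
s <- replicate(B, {
  x <- sample(ifelse(color == "red", -1, 1), n, replace = TRUE)
  sum(x)
})
```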
For any value \(a\), the proportion of simulated values of \(S\) that are less than or equal to \(a\) will be a very good approximation of \(F(a)\).
This allows us to easily answer the casino’s question: How likely is it that we will lose money?
It is quite low:
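Using the simulated values stored in s:

```r
# Estimated probability that the casino loses money
mean(s < 0)
```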
We see that the distribution appears to be approximately normal.
A QQ-plot confirms that the normal approximation is nearly perfect.
Remember, if the distribution is normal, all we need to define it are the average and the standard deviation (SD).
Since we have the original values from which the distribution is created, we can easily compute these with mean(s) and sd(s).
The blue curve added to the histogram is a normal density with this average and standard deviation.
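A sketch of how such a figure could be produced with base R graphics; the number of bins is an arbitrary choice:

```r
# Histogram of the simulated totals with a normal density overlaid in blue
hist(s, breaks = 30, freq = FALSE, xlab = "S", main = "")
curve(dnorm(x, mean = mean(s), sd = sd(s)), add = TRUE, col = "blue")
```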
Note
Statistical theory offers a method to derive the distribution of random variables defined as the sum of independent random draws of numbers from an urn.
Specifically, in our example above, we can demonstrate that \((S+n)/2\) follows a binomial distribution.
We therefore do not need to run Monte Carlo simulations to determine the probability distribution of \(S\).
The simulations were conducted for illustrative purposes.
We can use the functions dbinom and pbinom to compute the probabilities exactly.
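For instance, since \((S+n)/2\) counts the number of games the casino wins and each game is won with probability \(20/38 = 10/19\), the chance of losing money can be computed exactly (a sketch, assuming \(n = 1000\)):

```r
# Casino loses money when it wins fewer than 500 of the 1,000 games
n <- 1000
pbinom(n / 2 - 1, size = n, prob = 10 / 19)
```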
Before we continue, let’s establish an important distinction and connection between the distribution of a list of numbers and a probability distribution.
Any list of numbers \(x_1,\dots,x_n\) has a distribution.
It does not have a probability distribution because the numbers are fixed, not random.
We define \(F(a)\) as the function that indicates what proportion of the list is less than or equal to \(a\).
Given their usefulness as summaries when the distribution is approximately normal, we also define the average and standard deviation.
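For a list stored in a vector x, these summaries can be computed directly (a sketch; the object names are illustrative, and note that the standard deviation below divides by \(n\) rather than the \(n-1\) used by R's sd function):

```r
# Distribution function of the list: proportion of entries less than or equal to a
Fa <- function(a) mean(x <= a)

# Average and standard deviation of the list
avg <- sum(x) / length(x)
stdev <- sqrt(sum((x - avg)^2) / length(x))
```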
A random variable \(X\) has a distribution function.
To define this, we do not need a list of numbers; it is a theoretical concept.
We define the distribution as the \(F(a)\) that answers the question: What is the probability that \(X\) is less than or equal to \(a\)?
However, if \(X\) is defined by drawing from an urn containing numbers, then there is a list: the list of numbers inside the urn.
In this case, the distribution of that list is the probability distribution of \(X\), and the average and standard deviation of that list are the expected value and standard error of the random variable.
When we run a Monte Carlo simulation, we generate many outcomes of \(X\). These outcomes form a list of numbers, and the distribution of this list will be a very good approximation of the probability distribution of \(X\).
The longer the list, the better the approximation.
The average and standard deviation of this list will approximate the expected value and standard error of the random variable.
In statistical textbooks, upper case letters denote random variables, and we will adhere to this convention.
Lower case letters are used for observed values.
You will see some notation that includes both.
For example, you will see events defined as \(X \leq x\).
Here \(X\) is a random variable and \(x\) is an arbitrary value and not random.
So, for example, \(X\) might represent the number on a die roll and \(x\) will represent an actual value we see: 1, 2, 3, 4, 5, or 6.
In this case, the probability of \(X=x\) is 1/6 regardless of the observed value \(x\).
We can discuss what we expect \(X\) to be, what values are probable, but we can’t discuss what value \(X\) is.
Once we have data, we do see a realization of \(X\).
Therefore, data analysts often speak of what could have been after observing what actually happened.
We will now review the mathematical theory that allows us to approximate the probability distributions for the sum of draws.
The same approach we use for the sum of draws will be useful for describing the distribution of averages and proportions, which we will need to understand how polls work.
The first important concept to learn is the expected value.
In statistics books, it is common to denote the expected value of a random variable \(X\) as:
\[\mbox{E}[X]\]
A random variable will vary around its expected value in a manner that if you take the average of many, many draws, the average will approximate the expected value.
This approximation improves as you take more draws, making the expected value a useful quantity to compute.
For a discrete random variable with possible outcomes \(x_1,\dots,x_n\), the expected value is defined as:
\[ \mbox{E}[X] = \sum_{i=1}^n x_i \,\mbox{Pr}(X = x_i) \]
If \(X\) is defined by a draw from an urn containing the numbers \(x_1,\dots,x_n\), each equally likely, this reduces to the average of the numbers in the urn:
\[ \mbox{E}[X] = \frac{1}{n}\sum_{i=1}^n x_i \]
For a continuous random variable with probability density \(f(x)\) on the interval \([a,b]\), the analogous definition is:
\[ \mbox{E}[X] = \int_a^b x f(x)\,\mathrm{d}x \]
In the urn used to model betting on red in roulette, there are 20 one dollars and 18 negative one dollars, so the expected value of one draw is:
\[ \mbox{E}[X] = (20 + -18)/38 \]
which is about 5 cents.
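A quick check of this value, together with a Monte Carlo confirmation that the average of many draws approaches it (reusing the color urn defined earlier):

```r
# Exact expected value of one draw
(20 - 18) / 38

# The average of a large number of draws approximates the expected value
mean(sample(ifelse(color == "red", -1, 1), 10^6, replace = TRUE))
```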
You might consider it a bit counterintuitive to say that \(X\) varies around 0.05 when it only takes the values 1 and -1.
One way to make sense of the expected value in this context is to realize that, if we play the game over and over, the casino wins, on average, 5 cents per game.
In general, if an urn contains just two possible outcomes, say \(a\) and \(b\), with proportions \(p\) and \(1-p\) respectively, the expected value of one draw is:
\[\mbox{E}[X] = ap + b(1-p)\]
We define the expected value because this mathematical definition turns out to be useful for approximating the probability distributions of sums.
This, in turn, is useful for describing the distribution of averages and proportions.
The first useful fact is that the expected value of the sum of the draws is the number of draws \(\times\) the average of the numbers in the urn.
Therefore, if 1,000 people play roulette, the casino expects to win, on average, about 1,000 \(\times\) $0.05 = $50.
However, this is an expected value.
How different can one observation be from the expected value? The casino really needs to know this.
What is the range of possibilities? If negative numbers are too likely, they will not install roulette wheels.
Statistical theory once again answers this question.
The standard error (SE) gives us an idea of the size of the variation around the expected value.
In statistics books, it’s common to use:
\[\mbox{SE}[X] = \sqrt{\mbox{Var}[X]}\]
to denote the standard error of a random variable.
For a discrete random variable with possible outcomes \(x_1,\dots,x_n\), the standard error is defined as:
\[ \mbox{SE}[X] = \sqrt{\sum_{i=1}^n \left(x_i - E[X]\right)^2 \,\mbox{Pr}(X = x_i)}, \]
If \(X\) is defined by a draw from an urn containing the numbers \(x_1,\dots,x_n\), each equally likely, this is simply the standard deviation of the numbers in the urn:
\[ \mbox{SE}[X] = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2} \mbox{ with } \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \]
For a continuous random variable with probability density \(f(x)\) on \([a,b]\), the analogous definition is:
\[ \mbox{SE}[X] = \sqrt{\int_a^b \left(x-\mbox{E}[X]\right)^2 f(x)\,\mathrm{d}x} \]
In the special case of an urn with just two outcomes, \(a\) and \(b\), with proportions \(p\) and \(1-p\) respectively, the standard deviation of the numbers in the urn, and therefore the standard error of one draw, is:
\[| b - a |\sqrt{p(1-p)}\]
In our roulette example, the standard error of one draw is therefore:
\[ | 1 - (-1) | \sqrt{10/19 \times 9/19} \]
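This works out to just under 1:

```r
# SE of one draw: |1 - (-1)| * sqrt(p * (1 - p)) with p = 10/19
2 * sqrt(90) / 19
```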
Since one draw is obviously the sum of just one draw, we can use the formula above to calculate that the random variable defined by one draw has an expected value of 0.05 and a SE of about 1.
This makes sense since we obtain either 1 or -1, with 1 slightly favored over -1.
A widely used result is that, if the draws are independent, the standard error of their sum is:
\[ \sqrt{\mbox{number of draws}} \times \mbox{ SD of the numbers in the urn} \]
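Applying this formula to the 1,000 roulette bets:

```r
# Standard error of the sum of 1,000 independent draws
n <- 1000
sqrt(n) * 2 * sqrt(90) / 19
```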
As a result, when 1,000 people bet on red, the casino is expected to win $50 with a standard error of $32.
It therefore seems like a safe bet to install the roulette wheels.
But we still haven’t answered the question: How likely is the casino to lose money? The CLT will help in this regard.
Note
The exact probability for the casino winnings can be computed precisely, rather than approximately, using the binomial distribution.
However, here we focus on the CLT, which can be applied more broadly to sums of random variables in a way that the binomial distribution cannot.
The Central Limit Theorem (CLT) tells us that when the number of draws, also called the sample size, is large, the probability distribution of the sum of the independent draws is approximately normal.
Given that sampling models are used for so many data generation processes, the CLT is considered one of the most important mathematical insights in history.
Previously, we discussed that if we know that the distribution of a list of numbers is approximated by the normal distribution, all we need to describe the list are the average and standard deviation.
We also know that the same applies to probability distributions.
The Central Limit Theorem (CLT) tells us that the sum \(S\) is approximated by a normal distribution.
Using the formulas above, we know the expected value and standard error of \(S\).
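A sketch of the computation, together with the normal (CLT) approximation of the probability that the casino loses money:

```r
n <- 1000
mu <- n * (20 - 18) / 38           # expected value of S
se <- sqrt(n) * 2 * sqrt(90) / 19  # standard error of S
mu
se

# CLT approximation of Pr(S < 0)
pnorm(0, mu, se)
```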
The CLT works when the number of draws is large, but “large” is a relative term.
In many circumstances, as few as 30 draws is enough to make the CLT useful.
In some specific instances, as few as 10 is enough.
However, these should not be considered general rules.
Note that when the probability of success is very small, much larger sample sizes are needed.
By way of illustration, let’s consider the lottery.
In the lottery, the chances of winning are less than 1 in a million.
Thousands of people play so the number of draws is very large.
Yet the number of winners, the sum of the draws, ranges between 0 and 4.
This sum is certainly not well approximated by a normal distribution, so the CLT does not apply, even with the very large sample size.
This is generally true when the probability of a success is very low.
In these cases, the Poisson distribution is more appropriate.
You can explore the properties of the Poisson distribution using dpois and ppois, and you can generate random variables following this distribution with rpois.
However, we won’t cover the theory here.
There are several useful mathematical results that we used above and often employ when working with data.
We list them below.
The expected value of the sum of random variables is the sum of each random variable’s expected value.
We can write it like this:
\[ \mbox{E}[X_1+X_2+\dots+X_n] = \mbox{E}[X_1] + \mbox{E}[X_2]+\dots+\mbox{E}[X_n] \]
If \(X_1, X_2, \dots, X_n\) are independent draws from the same urn, then they all have the same expected value.
Let’s denote the expected value with \(\mu\) and rewrite the equation as:
\[ \mbox{E}[X_1+X_2+\dots+X_n]= n\mu \]
The expected value of a non-random constant times a random variable is the non-random constant times the expected value of a random variable.
This is easier to explain with symbols:
\[ \mbox{E}[aX] = a\times\mbox{E}[X] \]
To understand why this is intuitive, consider changing units.
If we change the units of a random variable, such as from dollars to cents, the expectation should change in the same way.
A consequence of these two properties is that the expected value of the average of independent draws from the same urn is \(\mu\), the average of the urn:
\[ \mbox{E}[(X_1+X_2+\dots+X_n) / n]= \mbox{E}[X_1+X_2+\dots+X_n] / n = n\mu/n = \mu \]
The square of the standard error of the sum of independent random variables is the sum of the squares of each random variable's standard error.
This one is easier to understand in math form:
\[ \mbox{SE}[X_1+X_2+\dots+X_n] = \sqrt{\mbox{SE}[X_1]^2 + \mbox{SE}[X_2]^2+\dots+\mbox{SE}[X_n]^2 } \]
The square of the standard error is referred to as the variance in statistical textbooks.
Note that this particular property is not as intuitive as the previous three, and more in-depth explanations can be found in statistics textbooks.
The standard error of a non-random constant times a random variable is the non-random constant times the random variable’s standard error.
As with the expectation:
\[ \mbox{SE}[aX] = a \times \mbox{SE}[X] \]
Combining the previous properties, we can derive the standard error of the average of \(n\) independent draws from the same urn, where \(\sigma\) denotes the standard error of a single draw:
\[ \begin{aligned} \mbox{SE}[(X_1+X_2+\dots+X_n) / n] &= \mbox{SE}[X_1+X_2+\dots+X_n]/n \\ &= \sqrt{\mbox{SE}[X_1]^2+\mbox{SE}[X_2]^2+\dots+\mbox{SE}[X_n]^2}/n \\ &= \sqrt{\sigma^2+\sigma^2+\dots+\sigma^2}/n\\ &= \sqrt{n\sigma^2}/n\\ &= \sigma / \sqrt{n} \end{aligned} \]
If \(X\) is a normally distributed random variable, then if \(a\) and \(b\) are non-random constants, \(aX + b\) is also a normally distributed random variable.
All we are doing is changing the units of the random variable by multiplying by \(a\), then shifting the center by \(b\).
Note that statistical textbooks use the Greek letters \(\mu\) and \(\sigma\) to denote the expected value and standard error, respectively.
This is because \(\mu\) is the Greek letter for \(m\), the first letter of mean, which is another term used for expected value.
Similarly, \(\sigma\) is the Greek letter for \(s\), the first letter of standard error.
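A quick Monte Carlo check of this result for the roulette urn, with \(\sigma\) the SD of a single draw (about 1) and reusing the color vector from earlier:

```r
# The SD of many sample averages should be close to sigma / sqrt(n)
n <- 1000
sigma <- 2 * sqrt(90) / 19
avgs <- replicate(10000, mean(sample(ifelse(color == "red", -1, 1), n, replace = TRUE)))
sd(avgs)         # Monte Carlo estimate
sigma / sqrt(n)  # theoretical value
```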
This last equation, \(\mbox{SE}(\bar{X}) = \sigma/\sqrt{n}\), reveals crucial insights for practical scenarios. Specifically, it tells us that the standard error can be reduced by increasing the sample size \(n\), and it quantifies exactly how fast that reduction happens.
However, this principle holds true only when the variables \(X_1, X_2, \dots, X_n\) are independent.
If they are not, the estimated standard error can be significantly off.
We later introduce the concept of correlation, which quantifies the degree to which variables are interdependent.
If the correlation coefficient among the \(X\) variables is \(\rho\), the standard error of their average is:
\[ \mbox{SE}\left(\bar{X}\right) = \sigma \sqrt{\frac{1 + (n-1) \rho}{n}} \]
The key observation here is that as \(\rho\) approaches its upper limit of 1, the standard error increases.
Notably, in the situation where \(\rho = 1\), the standard error, \(\mbox{SE}(\bar{X})\), equals \(\sigma\), and it becomes unaffected by the sample size \(n\).
An important implication of result 4 above is that the standard error of the average becomes smaller and smaller as \(n\) grows larger.
When \(n\) is very large, then the standard error is practically 0 and the average of the draws converges to the average of the urn.
This is known in statistical textbooks as the law of large numbers or the law of averages.
The law of averages is sometimes misinterpreted.
For example, if you toss a coin 5 times and see a head each time, you might hear someone argue that the next toss is probably a tail because of the law of averages: on average we should see 50% heads and 50% tails.
A similar argument would be to say that red “is due” on the roulette wheel after seeing black come up five times in a row.
Yet these events are independent so the chance of a coin landing heads is 50%, regardless of the previous 5.
The same principle applies to the roulette outcome.
The law of averages applies only when the number of draws is very large and not in small samples.
After a million tosses, you will definitely see about 50% heads regardless of the outcome of the first five tosses.
Another funny misuse of the law of averages is in sports when TV sportscasters predict a player is about to succeed because they have failed a few times in a row.