2024-12-16
"Association is not causation" is perhaps the most important lesson one can learn in a statistics class. "Correlation is not causation" is another way to say the same thing.
Throughout the statistics part of the book, we have described tools useful for quantifying associations between variables.
However, we must be careful not to over-interpret these associations.
There are many reasons that a variable \(X\) can be correlated with a variable \(Y\), without having any direct effect on \(Y\).
Today we describe three: spurious correlation, reverse causation, and confounders.
For many entertaining examples of spurious correlations, see http://tylervigen.com/spurious-correlations
Spurious correlations like these often arise through data dredging, also referred to as data fishing or data snooping. It is essentially a form of what in the US is called cherry picking: looking through many results produced by a random process and reporting only the ones that support the claim you want to make.
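To illustrate how data dredging produces impressive-looking correlations from pure noise, here is a minimal Monte Carlo sketch. The group size of 25 and the variable names are our own choices; the idea is to generate many groups of completely independent data, compute the correlation within each group, and then look only at the top results, which yields output like the one shown below.

```r
library(tidyverse)

# Generate 1,000,000 groups of 25 observations each, with x and y
# drawn independently, so the true correlation is 0 in every group
N <- 25
g <- 1000000
sim_data <- tibble(group = rep(1:g, each = N),
                   x = rnorm(N * g),
                   y = rnorm(N * g))

# Compute the within-group sample correlations and sort from highest
# to lowest: the top groups look strongly correlated by chance alone
res <- sim_data |>
  group_by(group) |>
  summarize(r = cor(x, y)) |>
  arrange(desc(r))
res
```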
# A tibble: 1,000,000 × 2
    group     r
    <int> <dbl>
 1 710822 0.805
 2 226015 0.792
 3 841121 0.783
 4 838373 0.776
 5 131974 0.774
 6 372183 0.770
 7 572463 0.767
 8 924637 0.766
 9 804367 0.760
10 544498 0.759
# ℹ 999,990 more rows
This particular form of data dredging is referred to as p-hacking.
P-hacking is a topic of much discussion because it poses a problem in scientific publications.
Since publishers tend to reward statistically significant results over negative results, there is an incentive to report significant results.
This does not necessarily happen due to unethical behavior, but rather as a result of statistical ignorance or wishful thinking.
In advanced statistics courses, you can learn methods to adjust for these multiple comparisons.
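As a small taste of what those methods do, base R's p.adjust function implements several standard corrections. Here is a sketch using simulated p-values; the number of tests, the sample sizes, and the 0.05 cutoff are arbitrary choices of ours.

```r
set.seed(1)

# 1,000 t-tests in which the null hypothesis is true in every case
pvals <- replicate(1000, t.test(rnorm(25), rnorm(25))$p.value)

# With no adjustment, roughly 5% come out "significant" by chance
sum(pvals <= 0.05)

# Bonferroni and Benjamini-Hochberg adjustments account for the
# fact that we performed 1,000 comparisons
sum(p.adjust(pvals, method = "bonferroni") <= 0.05)
sum(p.adjust(pvals, method = "BH") <= 0.05)
```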
This high correlation is driven by the single outlier; the sample correlation is not robust to extreme values. There is an alternative to the sample correlation for estimating the population correlation that is robust to outliers: the Spearman correlation, which is computed on the ranks of the values rather than the values themselves.
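A minimal sketch of the difference, using two independent variables with one planted extreme point (the sample size and the outlier value are arbitrary):

```r
set.seed(1)

# Two independent variables, so the true correlation is 0
x <- rnorm(100)
y <- rnorm(100)

# Plant a single extreme point in both variables
x[1] <- 25
y[1] <- 25

# The Pearson (sample) correlation is inflated by the outlier;
# the Spearman correlation, based on ranks, is barely affected
cor(x, y)
cor(x, y, method = "spearman")
```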
Another way association is confused with causation is when the cause and effect are reversed.
An example of this is claiming that tutoring makes students perform worse because tutored students test lower than peers who are not tutored. In this case, the tutoring is not causing the low test scores; rather, students are being tutored because they were already struggling.
A related quote from the NY Times:
When we examined whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of a family’s social class, racial or ethnic background, or a child’s grade level, consistent homework help almost never improved test scores or grades…
We can make the reversed interpretation concrete by fitting a model that regresses the cause \(X\) on the outcome \(Y\):

\[ X_i = \beta_0 + \beta_1 Y_i + \varepsilon_i, \quad i=1, \dots, N \]
The model fits the data very well.
The model is technically correct.
The estimates and p-values were obtained correctly as well.
What is wrong here is the interpretation.
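A brief simulation sketch makes the point (all names and parameter values are our own): we generate data in which \(X\) causes \(Y\), then fit the reversed model above and still obtain an excellent fit with a highly significant coefficient.

```r
set.seed(1)
N <- 1000

# Generate data in which x causes y
x <- rnorm(N)
y <- 2 * x + rnorm(N)

# Fit the reversed model: regress the cause x on the effect y.
# The fit is good and the coefficient highly significant, even
# though y has no causal effect on x
summary(lm(x ~ y))
```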
Confounders are perhaps the most common reason that associations are misinterpreted.
If \(X\) and \(Y\) are correlated, we call \(Z\) a confounder if changes in \(Z\) cause changes in both \(X\) and \(Y\).
Earlier, when studying baseball data, we saw how Home Runs were a confounder that resulted in a higher correlation than expected when studying the relationship between Bases on Balls and Runs.
In some cases, we can use linear models to account for confounders, as sketched below. However, this is not always possible.
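Here is a minimal sketch of that adjustment for the baseball example, using the Teams table from the Lahman package; the year range and the per-game scaling are our own choices.

```r
library(Lahman)
library(dplyr)

# Per-game rates for Bases on Balls (BB), Home Runs (HR), and Runs (R)
dat <- Teams |>
  filter(yearID %in% 1961:2001) |>
  mutate(BB = BB / G, HR = HR / G, R = R / G)

# Unadjusted: the BB coefficient is inflated because HR is
# associated with both more walks and more runs
lm(R ~ BB, data = dat)

# Adjusted: including the confounder HR shrinks the BB coefficient
lm(R ~ BB + HR, data = dat)
```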
Incorrect interpretation due to confounders is ubiquitous in the lay press, and confounders are often hard to detect.
Here, we present a widely used example related to college admissions.
# admitted is a percentage; convert it to counts of admitted applicants
two_by_two <- admissions |>
  group_by(gender) |>
  summarize(total_admitted = round(sum(admitted / 100 * applicants)),
            not_admitted = sum(applicants) - total_admitted)
two_by_two |>
  mutate(percent = total_admitted / (total_admitted + not_admitted) * 100)
# A tibble: 2 × 4
  gender total_admitted not_admitted percent
  <chr>           <dbl>        <dbl>   <dbl>
1 men              1198         1493    44.5
2 women             557         1278    30.4
The totals show that men were admitted at a substantially higher rate than women. What's going on? This actually can happen when an unaccounted-for confounder is driving most of the variability.
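Assuming the admissions table also contains a major column, as in the dslabs version of these data, we can repeat the comparison within each major; a sketch:

```r
library(tidyverse)
library(dslabs)
data(admissions)

# Admission percentages side by side for each gender, one row per major
admissions |>
  select(major, gender, admitted) |>
  pivot_wider(names_from = gender, values_from = admitted) |>
  mutate(women_minus_men = women - men)
```

Within majors, the gap largely disappears, and in several majors it favors women. The confounder is the major itself: men and women applied to different majors, and the more selective majors received a larger share of the women's applications.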
The case we have just covered is an example of Simpson's paradox. It is called a paradox because we see the sign of the correlation flip when comparing the entire population to specific strata.
You can see that \(X\) and \(Y\) are negatively correlated overall. However, once we stratify by \(Z\), another pattern emerges: it is really \(Z\) that is negatively correlated with \(X\). If we stratify by \(Z\) (for example, by coloring points in a scatterplot by their value of \(Z\)), \(X\) and \(Y\) are actually positively correlated within each stratum.
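Here is a minimal simulation sketch of this kind of Simpson's paradox; the number of strata, the means, and the slopes are all arbitrary choices of ours.

```r
library(tidyverse)

set.seed(1)

# Five strata; each stratum z shifts x up and y down, but within
# each stratum y increases with x
sim <- map_df(1:5, function(z) {
  tibble(z = z, x = rnorm(100, mean = z, sd = 0.4)) |>
    mutate(y = x - 3 * z + rnorm(100, sd = 0.4))
})

# Overall, the correlation is negative...
cor(sim$x, sim$y)

# ...but within every stratum of z it is positive
sim |> group_by(z) |> summarize(r = cor(x, y))

# Coloring points by z shows both patterns at once
sim |> ggplot(aes(x, y, color = factor(z))) + geom_point()
```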