2024-12-16
"Association is not causation" is perhaps the most important lesson one can learn in a statistics class. "Correlation is not causation" is another way to say the same thing.
Throughout the statistics part of the book, we have described tools useful for quantifying associations between variables.
However, we must be careful not to over-interpret these associations.
There are many reasons that a variable \(X\) can be correlated with a variable \(Y\), without having any direct effect on \(Y\).
Today we describe three: spurious correlation, reverse causation, and confounders.
For many entertaining examples of spurious correlations, see http://tylervigen.com/spurious-correlations
Spurious correlations like these often arise through data dredging, also referred to as data fishing or data snooping. It is essentially a form of what in the US is called cherry picking: looking through many results produced by a random process and reporting only the ones that support the claim you want to make.
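To illustrate how data dredging produces impressive-looking correlations from pure noise, here is a minimal Monte Carlo sketch. The group size of 25 and the variable names are our own choices; the idea is to generate many groups of completely independent data, compute the correlation within each group, and then look only at the top results, which yields output like the one shown below.

```r
library(tidyverse)

# Generate 1,000,000 groups of 25 observations each, with x and y
# drawn independently, so the true correlation is 0 in every group
N <- 25
g <- 1000000
sim_data <- tibble(group = rep(1:g, each = N),
                   x = rnorm(N * g),
                   y = rnorm(N * g))

# Compute the within-group sample correlations and sort from highest
# to lowest: the top groups look strongly correlated by chance alone
res <- sim_data |>
  group_by(group) |>
  summarize(r = cor(x, y)) |>
  arrange(desc(r))
res
```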
# A tibble: 1,000,000 × 2
    group     r
    <int> <dbl>
 1 710822 0.805
 2 226015 0.792
 3 841121 0.783
 4 838373 0.776
 5 131974 0.774
 6 372183 0.770
 7 572463 0.767
 8 924637 0.766
 9 804367 0.760
10 544498 0.759
# ℹ 999,990 more rows
This particular form of data dredging is referred to as p-hacking.
P-hacking is a topic of much discussion because it poses a problem in scientific publications.
Since publishers tend to reward statistically significant results over negative results, there is an incentive to report significant results.
This does not necessarily happen due to unethical behavior, but rather as a result of statistical ignorance or wishful thinking.
In advanced statistics courses, you can learn methods to adjust for these multiple comparisons.
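As a small taste of what those methods do, base R's p.adjust function implements several standard corrections. Here is a sketch using simulated p-values; the number of tests, the sample sizes, and the 0.05 cutoff are arbitrary choices of ours.

```r
set.seed(1)

# 1,000 t-tests in which the null hypothesis is true in every case
pvals <- replicate(1000, t.test(rnorm(25), rnorm(25))$p.value)

# With no adjustment, roughly 5% come out "significant" by chance
sum(pvals <= 0.05)

# Bonferroni and Benjamini-Hochberg adjustments account for the
# fact that we performed 1,000 comparisons
sum(p.adjust(pvals, method = "bonferroni") <= 0.05)
sum(p.adjust(pvals, method = "BH") <= 0.05)
```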
This high correlation is driven by the single outlier; the sample correlation is not robust to extreme values. There is an alternative to the sample correlation for estimating the population correlation that is robust to outliers: the Spearman correlation, which is computed on the ranks of the values rather than the values themselves.
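A minimal sketch of the difference, using two independent variables with one planted extreme point (the sample size and the outlier value are arbitrary):

```r
set.seed(1)

# Two independent variables, so the true correlation is 0
x <- rnorm(100)
y <- rnorm(100)

# Plant a single extreme point in both variables
x[1] <- 25
y[1] <- 25

# The Pearson (sample) correlation is inflated by the outlier;
# the Spearman correlation, based on ranks, is barely affected
cor(x, y)
cor(x, y, method = "spearman")
```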
Another way association is confused with causation is when the cause and effect are reversed.
An example of this is claiming that tutoring makes students perform worse because tutored students test lower than peers who are not tutored. In this case, the tutoring is not causing the low test scores; rather, students are being tutored because they were already struggling.
A related quote from the NY Times:
When we examined whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of a family’s social class, racial or ethnic background, or a child’s grade level, consistent homework help almost never improved test scores or grades…
We can make the reversed interpretation concrete by fitting a model that regresses the cause \(X\) on the outcome \(Y\):

\[ X_i = \beta_0 + \beta_1 Y_i + \varepsilon_i, \quad i=1, \dots, N \]
The model fits the data very well.
The model is technically correct.
The estimates and p-values were obtained correctly as well.
What is wrong here is the interpretation.
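A brief simulation sketch makes the point (all names and parameter values are our own): we generate data in which \(X\) causes \(Y\), then fit the reversed model above and still obtain an excellent fit with a highly significant coefficient.

```r
set.seed(1)
N <- 1000

# Generate data in which x causes y
x <- rnorm(N)
y <- 2 * x + rnorm(N)

# Fit the reversed model: regress the cause x on the effect y.
# The fit is good and the coefficient highly significant, even
# though y has no causal effect on x
summary(lm(x ~ y))
```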
Confounders are perhaps the most common reason that associations are misinterpreted.
If \(X\) and \(Y\) are correlated, we call \(Z\) a confounder if changes in \(Z\) cause changes in both \(X\) and \(Y\).
Earlier, when studying baseball data, we saw how Home Runs were a confounder that resulted in a higher correlation than expected when studying the relationship between Bases on Balls and Runs.
In some cases, we can use linear models to account for confounders, as sketched below. However, this is not always possible.
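Here is a minimal sketch of that adjustment for the baseball example, using the Teams table from the Lahman package; the year range and the per-game scaling are our own choices.

```r
library(Lahman)
library(dplyr)

# Per-game rates for Bases on Balls (BB), Home Runs (HR), and Runs (R)
dat <- Teams |>
  filter(yearID %in% 1961:2001) |>
  mutate(BB = BB / G, HR = HR / G, R = R / G)

# Unadjusted: the BB coefficient is inflated because HR is
# associated with both more walks and more runs
lm(R ~ BB, data = dat)

# Adjusted: including the confounder HR shrinks the BB coefficient
lm(R ~ BB + HR, data = dat)
```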
Incorrect interpretation due to confounders is ubiquitous in the lay press, and confounders are often hard to detect.
Here, we present a widely used example related to college admissions.
# admitted is a percentage; convert it to counts of admitted applicants
two_by_two <- admissions |>
  group_by(gender) |>
  summarize(total_admitted = round(sum(admitted / 100 * applicants)),
            not_admitted = sum(applicants) - total_admitted)
two_by_two |>
  mutate(percent = total_admitted / (total_admitted + not_admitted) * 100)
# A tibble: 2 × 4
  gender total_admitted not_admitted percent
  <chr>           <dbl>        <dbl>   <dbl>
1 men              1198         1493    44.5
2 women             557         1278    30.4
The totals show that men were admitted at a substantially higher rate than women. What's going on? This actually can happen when an unaccounted-for confounder is driving most of the variability.
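Assuming the admissions table also contains a major column, as in the dslabs version of these data, we can repeat the comparison within each major; a sketch:

```r
library(tidyverse)
library(dslabs)
data(admissions)

# Admission percentages side by side for each gender, one row per major
admissions |>
  select(major, gender, admitted) |>
  pivot_wider(names_from = gender, values_from = admitted) |>
  mutate(women_minus_men = women - men)
```

Within majors, the gap largely disappears, and in several majors it favors women. The confounder is the major itself: men and women applied to different majors, and the more selective majors received a larger share of the women's applications.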
The case we have just covered is an example of Simpson's paradox. It is called a paradox because we see the sign of the correlation flip when comparing the entire population to specific strata.
You can see that \(X\) and \(Y\) are negatively correlated overall. However, once we stratify by \(Z\), another pattern emerges: it is really \(Z\) that is negatively correlated with \(X\). If we stratify by \(Z\) (for example, by coloring points in a scatterplot by their value of \(Z\)), \(X\) and \(Y\) are actually positively correlated within each stratum.
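Here is a minimal simulation sketch of this kind of Simpson's paradox; the number of strata, the means, and the slopes are all arbitrary choices of ours.

```r
library(tidyverse)

set.seed(1)

# Five strata; each stratum z shifts x up and y down, but within
# each stratum y increases with x
sim <- map_df(1:5, function(z) {
  tibble(z = z, x = rnorm(100, mean = z, sd = 0.4)) |>
    mutate(y = x - 3 * z + rnorm(100, sd = 0.4))
})

# Overall, the correlation is negative...
cor(sim$x, sim$y)

# ...but within every stratum of z it is positive
sim |> group_by(z) |> summarize(r = cor(x, y))

# Coloring points by z shows both patterns at once
sim |> ggplot(aes(x, y, color = factor(z))) + geom_point()
```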