23  Association tests

The statistical models studied up to now are appropriate for continuous outcomes. We have not yet discussed inference for binary, categorical, and ordinal data. To give a very specific example, we will consider a case study examining funding success rates in the Netherlands, by gender.

23.1 Case study: funding success rates

A 2014 PNAS paper1 analyzed success rates from funding agencies in the Netherlands and concluded that their:

results reveal gender bias favoring male applicants over female applicants in the prioritization of their “quality of researcher” (but not “quality of proposal”) evaluations and success rates, as well as in the language use in instructional and evaluation materials.

The main evidence for this conclusion comes down to a comparison of the percentages. Table S1 in the paper includes the information we need. Here are the three columns showing the overall outcomes:

          discipline applications_total success_rates_total
1  Chemical sciences                122                26.2
2  Physical sciences                174                20.1
3            Physics                 76                26.3
4         Humanities                396                16.4
5 Technical sciences                251                17.1
6  Interdisciplinary                183                15.8

We have these values for each gender:

names(research_funding_rates)
 [1] "discipline"          "applications_total"  "applications_men"   
 [4] "applications_women"  "awards_total"        "awards_men"         
 [7] "awards_women"        "success_rates_total" "success_rates_men"  
[10] "success_rates_women"

We can compute the totals that were successful and the totals that were not as follows:

totals <- research_funding_rates |> 
  select(-discipline) |> 
  summarize_all(sum) |>
  summarize(yes_men = awards_men, 
            no_men = applications_men - awards_men, 
            yes_women = awards_women, 
            no_women = applications_women - awards_women) 

So we see that a larger percent of men than women received awards:

totals |> summarize(percent_men = yes_men/(yes_men+no_men),
                    percent_women = yes_women/(yes_women+no_women))
  percent_men percent_women
1     0.17737     0.1489899

But could this be due just to random variability? Here we learn how to perform inference for this type of data.

23.2 Chi-square Test

Imagine we have 290, 1,345, 177, 1,011 applicants, some are men and some are women and some get funded, whereas others don’t. We saw that the success rates for men and woman were:

totals |> summarize(percent_men = yes_men/(yes_men+no_men),
                    percent_women = yes_women/(yes_women+no_women))
  percent_men percent_women
1     0.17737     0.1489899

respectively. Would we see this again if we randomly assign funding at the overall rate:

rate <- with(totals, (yes_men + yes_women))/sum(totals)
rate
[1] 0.1654269

The Chi-square test answers this question. The first step is to create the two-by-two data table:

two_by_two <- with(totals, data.frame(awarded = c("no", "yes"), 
                                      men = c(no_men, yes_men),
                                      women = c(no_women, yes_women)))
two_by_two
  awarded  men women
1      no 1345  1011
2     yes  290   177

The general idea of the Chi-square test is to compare this two-by-two table to what you expect to see, which would be:

with(totals, data.frame(awarded = c("no", "yes"), 
                        men = (no_men + yes_men) * c(1 - rate, rate),
                        women = (no_women + yes_women) * c(1 - rate, rate)))
  awarded       men    women
1      no 1364.5271 991.4729
2     yes  270.4729 196.5271

We can see that more men than expected and fewer women than expected received funding. However, under the null hypothesis these observations are random variables. The Chi-square test tells us how likely it is to see a deviation this large or larger. This test uses an asymptotic result, similar to the CLT, related to the sums of independent binary outcomes. The R function chisq.test takes a two-by-two table and returns the results from the test:

chisq_test <- chisq.test(two_by_two[, -1])

We see that the p-value is 0.0509:

chisq_test$p.value
[1] 0.05091372

23.3 Generalized linear models

  • We presented a way to perform hypothesis testing for determining if there is association between two binary outcome, but we have not yet described how to quantify effects.

  • Can we estimate the effect of being a woman in funding success in the Netherlands? Note that if our outcomes are binary, then the linear models presented in the treatment effect chapter are not appropriate because the \(\beta\)s and \(\varepsilon\) are continuous.

*An adaptation of these methods, that is widely used in, for example, medical studies, gives us a way to estimate effects along with their standard errors.

  • The idea is to model a transformation of the expected value of the outcomes with a linear model.

\[ g\{\mbox{E}(Y_i)\} = \beta_0 + \beta_1 x_i \]

To finish describing the model we impose a distribution on \(Y\) such as binomial or Poisson. These are referred to as generalized linear models.

  • We define \(Y_i\) to be 1 if person \(i\) received funding and 0 otherwise and \(x_i\) to be 1 for person \(i\) is a women and 0 for men.

  • For this data the expected value of \(Y_i\) is the probability of funding for person \(i\) \(\mbox{Pr}(Y_i=1)\).

  • We assume the outcomes \(Y_i\) are binomial with \(N=1\) and probability \(p_i\).

  • For binomial data, the most widely used transformation is the logit function \(g(p) = \log \{p / (1-p)\}\) which takes numbers between 0 and 1 to any continuous number. The model looks like this:

\[ \log \frac{\mbox{Pr}(Y_i=1)}{1-\mbox{Pr}(Y_i=1)} = \beta_0 + \beta_1 x_i \]

23.3.1 The odds ratio

  • To understand how \(\beta_1\) can be used to quantify the effect of being a woman on success rates, first note that

\[ \mbox{Pr}(Y_i=1)/\{1-\mbox{Pr}(Y_i=1)\} = \mbox{Pr}(Y_i=1)/\mbox{Pr}(Y_i=0) \]

is the odds of person \(i\) getting funding.

  • This implies that \(e^{\beta_0}\) is the odds for men and \(e^{\beta_0}e^{\beta_1}\) is the odds for women, which implies \(e^{\beta_1}\) is the odds for women divided by the odds for men. This quantity is called the odds ratio.

  • To see this not that if use \(p_1\) and \(p_0\) to denote the probability of success for women and men, respectively, then \(e^\{beta_1\) can be rewritten as

\[ e^{\beta_1} = \frac{p_1}{1-p_1} \, / \, \frac{p_0}{1-p_0} \]

  • \(\beta_1\) therefore quantifies the log odds ratio.

  • Least squares is no longer an optimal way of estimating the parameters and instead we use an approach called maximum likelihood estimation (MLE).

  • A version of the central limit theorem applies and the estimates obtained this way are approximately normal when th number of observations is large.

  • The theory also provides a way to calculate standard errors for the estimates of the \(\beta\)s.

23.3.2 Fitting the model

  • To obtain the maximum likelihood estimates using R we can use the glm function with the family argument set to binomial. This defaults to using the logit transformation.

  • We do not have the individual level data, but because we our model assumes the probability of success is the same for all women and all men, then the number of success can be modeled as binomial with \(N_1\) trials and probability \(p_1\) for women and binomial with \(N_0\) trials and probability \(p_0\) for men, with \(N_1\) and \(N_0\) the total number of women and men.

  • In this case the glm function is used like this:

success <- with(totals, c(yes_men, yes_women))
failure <- with(totals, c(no_men, no_women))
gender <- factor(c("men", "women"))
fit <- glm(cbind(success, failure) ~ gender, family = "binomial") 
coefficients(summary(fit))
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) -1.5342684 0.06474387 -23.697508 3.825454e-124
genderwomen -0.2082771 0.10407015  -2.001315  4.535850e-02
  • The estimate of the odds ratio is 0.811982 which is interpreted as the odds being lowered by 20% for women as compared to men.

  • But is this due to chance?

  • We already noted that the p-value is about 0.05, but the GLM approach also permits us to compute confidence intervals using the confint function:

exp(confint(fit, 2))
    2.5 %    97.5 % 
0.6613153 0.9946601 
Note

We have used a simple version of GLMs in which the only variable is binary. However, the method can be expanded to use multiple variables including continuous ones. However, in these contexts the log odds ratio interpretation becomes more complex. Also note that we have shown just one version of GLM appropriate for binomial data using a logit transformation. This version is referred to often referred to as logistic regression. However, GLM can be used with other transformation and distributions. You can learn more by consulting a GLM text book.

23.3.3 Simple standard error approximation for two-by-two table odds ratio

  • Using glm we can obtain estimates, standard errors, and confidence intervals for a wide range of models. To do this we use a rather complex algorithms. In the case of two-by-two tables we can obtain a standard error for the log odds ratio using a simple approximation.

  • If our two-by-two tables has the following entries:

Men Women
Awarded a b
Not Awarded c d
  • In this case, the odds ratio is simply

\[ \frac{a/c}{b/d} = \frac{ad}{bc} \].

  • We can confirm we obtain the same estimate as when using glm:
or <- with(two_by_two, women[2]/sum(women) / (women[1]/sum(women)) / ((men[2]/sum(men)) / (men[1]/sum(men))))
c(log(or), fit$coef[2])
            genderwomen 
 -0.2082771  -0.2082771 
  • Statistical theory tells us that when all four entries of the two-by-two table are large enough, then the log odds ratio is approximately normal with standard error

\[ \sqrt{1/a + 1/b + 1/c + 1/d} \]

  • This implies that a 95% confidence interval for the log odds ratio can be formed by:

\[ \log\left(\frac{ad}{bc}\right) \pm 1.96 \sqrt{1/a + 1/b + 1/c + 1/d} \]

  • By exponentiating these two numbers we can construct a confidence interval of the odds ratio.

  • Using R we can compute this confidence interval as follows:

se <- two_by_two |> select(-awarded) |>
  summarize(se = sqrt(sum(1/men) + sum(1/women))) |>
  pull(se)
exp(log(or) + c(-1,1) * qnorm(0.975) * se)
[1] 0.6621581 0.9957060
  • Note that 1 is not included in the confidence interval which must mean that the p-value is smaller than 0.05. We can confirm this using:
2*(1 - pnorm(abs(log(or)), 0, se))
[1] 0.0453586
Warning

Note that the p-values obtained with chisq.test, glm and this simple approximation are all slightly different. This is because these are both based on different approximation approaches.

23.4 Large samples, small p-values

  • As mentioned earlier, reporting only p-values is not an appropriate way to report the results of data analysis.

  • In scientific journals, for example, some studies seem to overemphasize p-values.

  • Some of these studies have large sample sizes and report impressively small p-values.

  • Yet when one looks closely at the results, we realize odds ratios are quite modest: barely bigger than 1.

  • In this case the difference may not be practically significant or scientifically significant.

  • The relationship between odds ratio and p-value is not one-to-one.

  • It depends on the sample size. So a very small p-value does not necessarily mean a very large odds ratio.

  • Notice what happens to the p-value if we multiply our two-by-two table by 10, which does not change the odds ratio:

two_by_two_x_10 <- two_by_two |> 
  select(-awarded) |>
  mutate(men = men*10, women = women*10) 
chisq.test(two_by_two_x_10)$p.value
[1] 2.625423e-10

:::{.callout-note title = “Small count correction”} Note that the log odds ratio is not defined if any of the cells of the two-by-two table is 0. This is because if \(a\), \(b\), \(c\), or \(d\) is 0, the \(\log(\frac{ad}{bc})\) is either the log of 0 or has a 0 in the denominator. For this situation, it is common practice to avoid 0s by adding 0.5 to each cell. This is referred to as the Haldane–Anscombe correction and has been shown, both in practice and theory, to work well. :::

23.5 Exercises

  1. A famous athlete has an impressive career, winning 70% of her 500 career matches. However, this athlete gets criticized because in important events, such as the Olympics, she has a losing record of 8 wins and 9 losses. Perform a Chi-square test to determine if this losing record can be simply due to chance as opposed to not performing well under pressure.
mat <- matrix(c(500*.3, 500*.7, 9,8), 2, 2, byrow = TRUE)
chisq.test(mat, correct = FALSE)

    Pearson's Chi-squared test

data:  mat
X-squared = 4.0631, df = 1, p-value = 0.04383
  1. Why did we use the Chi-square test instead of Fisher’s exact test in the previous exercise?
  1. It actually does not matter, since they give the exact same p-value.
  2. Fisher’s exact and the Chi-square are different names for the same test.
  3. Because the sum of the rows and columns of the two-by-two table are not fixed so the hypergeometric distribution is not an appropriate assumption for the null hypothesis. For this reason, Fisher’s exact test is rarely applicable with observational data.
  4. Because the Chi-square test runs faster.

Answer: c

  1. Compute the odds ratio of “losing under pressure” along with a confidence interval.
lor <- log((mat[1,1]*mat[2,2])/(mat[1,2]*mat[2,1]))
se <- sqrt(sum(1/mat))
exp(lor + c(-1,1)*1.96*se)
[1] 0.1442096 1.0063459
2*pnorm(-abs(lor/se))
[1] 0.05150641
  1. Notice that p-values obtained with the four approaches that produce three different p-values:
chisq.test(mat)$p.value
[1] 0.08037602
chisq.test(mat, correct = FALSE)$p.value
[1] 0.0438292
2*pnorm(-abs(lor/se))
[1] 0.05150641
summary(glm(mat ~ c(1,0), family = "binomial"))$coef[2,4]
[1] 0.0515064

Why is this?

  1. We made a mistake in our code.
  2. These are based on t-statistics so the connection between p-value and confidence intervals does not apply.
  3. Different approximations are used for the p-value and the confidence interval calculation. If we had a larger sample size the match would be better.
  4. We should use the Fisher exact test to get confidence intervals.

Answer: c

  1. Multiply the two-by-two table by 2 and see if the p-value and confidence retrieval are a better match.

23.6 Optional exercises (won’t appear in the midterm)

  1. When analyzing the trump tweets we computed sentiment_counts like this:
library(tidytext)
links <- "https://t.co/[A-Za-z\\d]+|&amp;"
nrc <- get_sentiments("nrc") |> select(word, sentiment)
android_iphone <- trump_tweets |> 
  extract(source, "source", "Twitter for (.*)") |>
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") & 
           created_at < ymd("2016-11-08")) |>
  filter(!is_retweet) |>
  arrange(created_at) |> 
  mutate(text = str_replace_all(text, links, ""))  |>
  unnest_tokens(word, text) |>
  filter(!word %in% stop_words$word &
           !str_detect(word, "^\\d+$")) |>
  mutate(word = str_replace(word, "^'", "")) |>
  filter(!word %in% stop_words$word)

sentiment_counts <- android_iphone |>
  left_join(nrc, by = "word", relationship = "many-to-many") |>
  count(source, sentiment) |>
  pivot_wider(names_from = "source", values_from = "n") |>
  mutate(sentiment = replace_na(sentiment, replace = "none"))

Compute an odds ratio comparing Android to iPhone for each sentiment and add it to the table.

  1. Compute a 95% confidence interval for each odds ratio.

  2. Generate a plot showing the estimated odds ratios along with their confidence intervals.

  3. Test the null hypothesis that there is no difference between tweets from Android and iPhone and report the sentiments with p-values less than 0.05 and more likely to come from Android.

  4. For each sentiment, find the words assigned to that sentiment, keep words that appear at least 25 times, compute the odd ratio for each, and show a barplot for those with odds ratio larger than 2 or smaller than 1/2.


  1. http://www.pnas.org/content/112/40/12349.abstract↩︎