23 Association tests
The statistical models studied up to now are appropriate for continuous outcomes. We have not yet discussed inference for binary, categorical, and ordinal data. To give a very specific example, we will consider a case study examining funding success rates in the Netherlands, by gender.
23.1 Case study: funding success rates
A 2014 PNAS paper1 analyzed success rates from funding agencies in the Netherlands and concluded that their:
results reveal gender bias favoring male applicants over female applicants in the prioritization of their “quality of researcher” (but not “quality of proposal”) evaluations and success rates, as well as in the language use in instructional and evaluation materials.
The main evidence for this conclusion comes down to a comparison of percentages. Table S1 in the paper includes the information we need. Here are the three columns showing the overall outcomes:
          discipline applications_total success_rates_total
1  Chemical sciences                122                26.2
2  Physical sciences                174                20.1
3            Physics                 76                26.3
4         Humanities                396                16.4
5 Technical sciences                251                17.1
6  Interdisciplinary                183                15.8
We have these values for each gender:
names(research_funding_rates)
[1] "discipline" "applications_total" "applications_men"
[4] "applications_women" "awards_total" "awards_men"
[7] "awards_women" "success_rates_total" "success_rates_men"
[10] "success_rates_women"
We can compute the totals that were successful and the totals that were not as follows:
totals <- research_funding_rates |>
  select(-discipline) |>
  summarize_all(sum) |>
  summarize(yes_men = awards_men,
            no_men = applications_men - awards_men,
            yes_women = awards_women,
            no_women = applications_women - awards_women)
So we see that a larger percentage of men than women received awards:
totals |> summarize(percent_men = yes_men/(yes_men + no_men),
                    percent_women = yes_women/(yes_women + no_women))
  percent_men percent_women
1     0.17737     0.1489899
But could this be due just to random variability? Here we learn how to perform inference for this type of data.
23.2 Chi-square test
Imagine we have 290 + 1,345 + 177 + 1,011 = 2,823 applicants; some are men and some are women, and some get funded while others don't. We saw that the success rates for men and women were:
totals |> summarize(percent_men = yes_men/(yes_men + no_men),
                    percent_women = yes_women/(yes_women + no_women))
  percent_men percent_women
1     0.17737     0.1489899
respectively. Would we see this again if we randomly assigned funding at the overall rate:
rate <- with(totals, (yes_men + yes_women))/sum(totals)
rate
[1] 0.1654269
The Chi-square test answers this question. The first step is to create the two-by-two data table:
two_by_two <- with(totals, data.frame(awarded = c("no", "yes"),
                                      men = c(no_men, yes_men),
                                      women = c(no_women, yes_women)))
two_by_two
  awarded  men women
1      no 1345  1011
2     yes  290   177
The general idea of the Chi-square test is to compare this two-by-two table to what you expect to see, which would be:
with(totals, data.frame(awarded = c("no", "yes"),
                        men = (no_men + yes_men) * c(1 - rate, rate),
                        women = (no_women + yes_women) * c(1 - rate, rate)))
awarded men women
1 no 1364.5271 991.4729
2 yes 270.4729 196.5271
We can see that more men than expected and fewer women than expected received funding. However, under the null hypothesis these observations are random variables. The Chi-square test tells us how likely it is to see a deviation this large or larger. This test uses an asymptotic result, similar to the CLT, related to sums of independent binary outcomes. The R function chisq.test takes a two-by-two table and returns the results of the test:
chisq_test <- chisq.test(two_by_two[, -1])
We see that the p-value is 0.0509:
chisq_test$p.value
[1] 0.05091372
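To build intuition for what the test is assessing, here is a small Monte Carlo sketch, not part of the original analysis, that assigns awards at random at the overall rate and checks how often a gender gap in success rates as large as the observed one arises by chance. The objects totals and rate are from above; B and the seed are our choices:
set.seed(1)
B <- 10000
N_men <- with(totals, yes_men + no_men)        # 1,635 applications from men
N_women <- with(totals, yes_women + no_women)  # 1,188 applications from women
obs <- with(totals, yes_men/N_men - yes_women/N_women)  # observed gap
null <- rbinom(B, N_men, rate)/N_men - rbinom(B, N_women, rate)/N_women
mean(abs(null) >= abs(obs))  # should be close to the chi-square p-value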
23.3 Generalized linear models
We presented a way to perform hypothesis testing for determining if there is an association between two binary outcomes, but we have not yet described how to quantify effects.
Can we estimate the effect of being a woman on funding success in the Netherlands? Note that if our outcomes are binary, then the linear models presented in the treatment effect chapter are not appropriate because they assume the \(\beta\)s and \(\varepsilon\)s are continuous, so the predictions they produce are not constrained to the two possible outcome values.
An adaptation of these methods, widely used in, for example, medical studies, gives us a way to estimate effects along with their standard errors. The idea is to model a transformation of the expected value of the outcome with a linear model:
\[ g\{\mbox{E}(Y_i)\} = \beta_0 + \beta_1 x_i \]
To finish describing the model, we impose a distribution on \(Y\), such as binomial or Poisson. These are referred to as generalized linear models (GLMs).
We define \(Y_i\) to be 1 if person \(i\) received funding and 0 otherwise, and \(x_i\) to be 1 if person \(i\) is a woman and 0 if a man.
For this data, the expected value of \(Y_i\) is the probability of funding for person \(i\): \(\mbox{Pr}(Y_i=1)\).
We assume the outcomes \(Y_i\) are binomial with \(N=1\) and probability \(p_i\).
For binomial data, the most widely used transformation is the logit function \(g(p) = \log \{p / (1-p)\}\), which maps numbers between 0 and 1 to the entire real line. The model looks like this:
\[ \log \frac{\mbox{Pr}(Y_i=1)}{1-\mbox{Pr}(Y_i=1)} = \beta_0 + \beta_1 x_i \]
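Before fitting anything, it may help to see the logit and its inverse in action; this is a small illustration of ours, not part of the case study:
logit <- function(p) log(p/(1 - p))
logistic <- function(x) exp(x)/(1 + exp(x))  # the inverse transformation
logit(c(0.1, 0.5, 0.9))
# [1] -2.197225  0.000000  2.197225
logistic(-2.197225)  # maps back to 0.1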
23.3.1 The odds ratio
To understand how \(\beta_1\) can be used to quantify the effect of being a woman on success rates, first note that
\[ \mbox{Pr}(Y_i=1)/\{1-\mbox{Pr}(Y_i=1)\} = \mbox{Pr}(Y_i=1)/\mbox{Pr}(Y_i=0) \]
is the odds of person \(i\) getting funding.
This implies that \(e^{\beta_0}\) is the odds for men and \(e^{\beta_0}e^{\beta_1}\) is the odds for women, which implies \(e^{\beta_1}\) is the odds for women divided by the odds for men. This quantity is called the odds ratio.
To see this, note that if we use \(p_1\) and \(p_0\) to denote the probability of success for women and men, respectively, then \(e^{\beta_1}\) can be rewritten as
\[ e^{\beta_1} = \frac{p_1}{1-p_1} \, / \, \frac{p_0}{1-p_0} \]
\(\beta_1\) therefore quantifies the log odds ratio.
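As a quick numeric check, here is a sketch of ours using the success rates computed earlier:
p_0 <- 0.17737    # success rate for men, from above
p_1 <- 0.1489899  # success rate for women, from above
(p_1/(1 - p_1)) / (p_0/(1 - p_0))  # the odds ratio, about 0.812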
Least squares is no longer an optimal way of estimating the parameters; instead we use an approach called maximum likelihood estimation (MLE). A version of the central limit theorem applies, and the estimates obtained this way are approximately normal when the number of observations is large.
The theory also provides a way to calculate standard errors for the estimates of the \(\beta\)s.
23.3.2 Fitting the model
To obtain the maximum likelihood estimates using R, we can use the glm function with the family argument set to binomial. This defaults to using the logit transformation. We do not have the individual-level data, but because our model assumes the probability of success is the same for all women and all men, the number of successes can be modeled as binomial with \(N_1\) trials and probability \(p_1\) for women, and as binomial with \(N_0\) trials and probability \(p_0\) for men, with \(N_1\) and \(N_0\) the total numbers of women and men. In this case, the glm function is used like this:
success <- with(totals, c(yes_men, yes_women))
failure <- with(totals, c(no_men, no_women))
gender <- factor(c("men", "women"))
fit <- glm(cbind(success, failure) ~ gender, family = "binomial")
coefficients(summary(fit))
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) -1.5342684 0.06474387 -23.697508 3.825454e-124
genderwomen -0.2082771 0.10407015  -2.001315  4.535850e-02
The estimate of the odds ratio is 0.811982, which is interpreted as the odds being about 20% lower for women compared to men.
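We can extract this estimate directly from the fit object defined above; a one-line sketch:
exp(coef(fit)["genderwomen"])
# genderwomen
#    0.811982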
But is this due to chance? We already noted that the p-value is about 0.05, but the GLM approach also permits us to compute confidence intervals using the confint function:
exp(confint(fit, 2))
2.5 % 97.5 %
0.6613153 0.9946601
We have used a simple version of GLMs in which the only variable is binary. The method can be expanded to include multiple variables, including continuous ones, although in these contexts the log odds ratio interpretation becomes more complex. Also note that we have shown just one version of GLM, appropriate for binomial data and using a logit transformation; this version is often referred to as logistic regression. GLMs can be used with other transformations and distributions. You can learn more by consulting a GLM textbook.
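For instance, here is a sketch of ours, not part of the paper's analysis, of a model that adjusts for discipline using the per-discipline columns of research_funding_rates shown earlier; the names dat and fit2 are our choices:
dat <- with(research_funding_rates,
            data.frame(success = c(awards_men, awards_women),
                       failure = c(applications_men - awards_men,
                                   applications_women - awards_women),
                       gender = rep(c("men", "women"), each = length(discipline)),
                       discipline = rep(discipline, 2)))
fit2 <- glm(cbind(success, failure) ~ gender + discipline,
            family = "binomial", data = dat)
summary(fit2)$coef["genderwomen", ]  # gender effect, adjusted for discipline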
23.3.3 Simple standard error approximation for two-by-two table odds ratio
Using glm, we can obtain estimates, standard errors, and confidence intervals for a wide range of models, but this relies on a rather complex algorithm. In the case of two-by-two tables, we can obtain a standard error for the log odds ratio using a simple approximation. If our two-by-two table has the following entries:
|             | Men | Women |
|-------------|-----|-------|
| Awarded     | a   | b     |
| Not awarded | c   | d     |
In this case, the odds ratio is simply
\[ \frac{a/c}{b/d} = \frac{ad}{bc} \]
We can confirm that we obtain the same estimate as when using glm:
or <- with(two_by_two, (women[2]/sum(women)) / (women[1]/sum(women)) /
             ((men[2]/sum(men)) / (men[1]/sum(men))))
c(log(or), fit$coef[2])
            genderwomen 
 -0.2082771  -0.2082771
Statistical theory tells us that when all four entries of the two-by-two table are large enough, the log odds ratio is approximately normal with standard error
\[ \sqrt{1/a + 1/b + 1/c + 1/d} \]
This implies that a 95% confidence interval for the log odds ratio can be formed by:
\[ \log\left(\frac{ad}{bc}\right) \pm 1.96 \sqrt{1/a + 1/b + 1/c + 1/d} \]
By exponentiating these two numbers, we can construct a confidence interval for the odds ratio.
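Plugging in the counts from our table, \(a = 290\), \(b = 177\), \(c = 1345\), and \(d = 1011\), gives

\[ \sqrt{1/290 + 1/177 + 1/1345 + 1/1011} \approx 0.104, \]

which matches the standard error reported by glm above.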
Using R we can compute this confidence interval as follows:
se <- two_by_two |> select(-awarded) |>
  summarize(se = sqrt(sum(1/men) + sum(1/women))) |>
  pull(se)
exp(log(or) + c(-1,1) * qnorm(0.975) * se)
[1] 0.6621581 0.9957060
Note that 1 is not included in the confidence interval, which implies that the p-value is smaller than 0.05. We can confirm this using:
2*(1 - pnorm(abs(log(or)), 0, se))
[1] 0.0453586
Note that the p-values obtained with chisq.test, glm, and this simple approximation are all slightly different. This is because each is based on a different approximation.
23.4 Large samples, small p-values
As mentioned earlier, reporting only p-values is not an appropriate way to summarize the results of a data analysis.
In scientific journals, for example, some studies seem to overemphasize p-values.
Some of these studies have large sample sizes and report impressively small p-values.
Yet when one looks closely at the results, one realizes the odds ratios are quite modest: barely bigger than 1.
In such cases, the difference may not be practically or scientifically significant.
The relationship between odds ratio and p-value is not one-to-one; it depends on the sample size. So a very small p-value does not necessarily mean a very large odds ratio.
Notice what happens to the p-value if we multiply our two-by-two table by 10, which does not change the odds ratio:
two_by_two_x_10 <- two_by_two |>
  select(-awarded) |>
  mutate(men = men*10, women = women*10)
chisq.test(two_by_two_x_10)$p.value
[1] 2.625423e-10
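We can also verify that the odds ratio itself is unchanged; a quick check of ours:
with(two_by_two_x_10, (women[2]/women[1]) / (men[2]/men[1]))
# [1] 0.811982, the same as before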
:::{.callout-note title="Small count correction"}
Note that the log odds ratio is not defined if any of the cells of the two-by-two table is 0. This is because if \(a\), \(b\), \(c\), or \(d\) is 0, then \(\log\left(\frac{ad}{bc}\right)\) is either the log of 0 or has a 0 in the denominator. For this situation, it is common practice to avoid 0s by adding 0.5 to each cell. This is referred to as the Haldane–Anscombe correction and has been shown, both in practice and theory, to work well.
:::
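When a zero cell does occur, the correction is easy to apply. Here is a minimal sketch with hypothetical counts; the table tab and its values are made up for illustration:
tab <- data.frame(men = c(20, 0), women = c(15, 5))  # rows: not awarded, awarded
tab <- tab + 0.5  # Haldane-Anscombe correction: add 0.5 to every cell
or_c <- (tab$women[2]/tab$women[1]) / (tab$men[2]/tab$men[1])
se_c <- sqrt(sum(1/tab$men) + sum(1/tab$women))
exp(log(or_c) + c(-1, 1) * qnorm(0.975) * se_c)  # 95% CI for the odds ratio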
23.5 Exercises
- A famous athlete has an impressive career, winning 70% of her 500 career matches. However, this athlete gets criticized because in important events, such as the Olympics, she has a losing record of 8 wins and 9 losses. Perform a Chi-square test to determine if this losing record can be simply due to chance as opposed to not performing well under pressure.
mat <- matrix(c(500*.3, 500*.7, 9, 8), 2, 2, byrow = TRUE)
chisq.test(mat, correct = FALSE)
Pearson's Chi-squared test
data: mat
X-squared = 4.0631, df = 1, p-value = 0.04383
- Why did we use the Chi-square test instead of Fisher’s exact test in the previous exercise?
- It actually does not matter, since they give the exact same p-value.
- Fisher’s exact and the Chi-square are different names for the same test.
- Because the sum of the rows and columns of the two-by-two table are not fixed so the hypergeometric distribution is not an appropriate assumption for the null hypothesis. For this reason, Fisher’s exact test is rarely applicable with observational data.
- Because the Chi-square test runs faster.
Answer: c
- Compute the odds ratio of “losing under pressure” along with a confidence interval.
lor <- log((mat[1,1]*mat[2,2])/(mat[1,2]*mat[2,1]))
se <- sqrt(sum(1/mat))
exp(lor + c(-1,1)*1.96*se)
[1] 0.1442096 1.0063459
2*pnorm(-abs(lor/se))
[1] 0.05150641
- Notice that the four approaches produce three different p-values:
chisq.test(mat)$p.value
[1] 0.08037602
chisq.test(mat, correct = FALSE)$p.value
[1] 0.0438292
2*pnorm(-abs(lor/se))
[1] 0.05150641
summary(glm(mat ~ c(1,0), family = "binomial"))$coef[2,4]
[1] 0.0515064
Why is this?
- We made a mistake in our code.
- These are based on t-statistics so the connection between p-value and confidence intervals does not apply.
- Different approximations are used for the p-value and the confidence interval calculation. If we had a larger sample size the match would be better.
- We should use the Fisher exact test to get confidence intervals.
Answer: c
- Multiply the two-by-two table by 2 and see if the p-value and confidence interval are a better match.
23.6 Optional exercises (won’t appear in the midterm)
- When analyzing the Trump tweets, we computed sentiment_counts like this:
library(tidyverse)
library(lubridate)
library(dslabs)
library(tidytext)
links <- "https://t.co/[A-Za-z\\d]+|&"
nrc <- get_sentiments("nrc") |> select(word, sentiment)
android_iphone <- trump_tweets |>
  extract(source, "source", "Twitter for (.*)") |>
  filter(source %in% c("Android", "iPhone") &
           created_at >= ymd("2015-06-17") &
           created_at < ymd("2016-11-08")) |>
  filter(!is_retweet) |>
  arrange(created_at) |>
  mutate(text = str_replace_all(text, links, "")) |>
  unnest_tokens(word, text) |>
  filter(!word %in% stop_words$word &
           !str_detect(word, "^\\d+$")) |>
  mutate(word = str_replace(word, "^'", "")) |>
  filter(!word %in% stop_words$word)
sentiment_counts <- android_iphone |>
  left_join(nrc, by = "word", relationship = "many-to-many") |>
  count(source, sentiment) |>
  pivot_wider(names_from = "source", values_from = "n") |>
  mutate(sentiment = replace_na(sentiment, replace = "none"))
- Compute an odds ratio comparing Android to iPhone for each sentiment and add it to the table.
- Compute a 95% confidence interval for each odds ratio.
- Generate a plot showing the estimated odds ratios along with their confidence intervals.
- Test the null hypothesis that there is no difference between tweets from Android and iPhone, and report the sentiments with p-values less than 0.05 that are more likely to come from Android.
- For each sentiment, find the words assigned to that sentiment, keep words that appear at least 25 times, compute the odds ratio for each, and show a barplot for those with odds ratios larger than 2 or smaller than 1/2.
http://www.pnas.org/content/112/40/12349.abstract↩︎