Up to now, all our linear models have been applied to two or more continuous random variables.
We assume the random variables are multivariate normal and use this to motivate a linear model.
This approach covers many real-life examples of linear regression.
Linear models have many other applications.
One of the most popular is to quantify treatment effects in randomized and controlled experiments.
One of the first applications was in agriculture, where different plots of land were treated with different combinations of fertilizers.
The use of \(Y\) for the outcome in statistics is due to the mathematical theory having been developed with crop yield as the outcome.
The same ideas have been applied in other areas, for example in randomized trials designed to determine whether drugs cure or prevent diseases, or whether policies have an effect on social or educational outcomes.
In the latter case, we think of the policy intervention as a treatment, so we can follow the same mathematical descriptions.
The analyses used in A/B testing are based on treatment effects models.
Here is some terminology we will use:
Factors: Categorical variables used to define subgroups.
Levels: The distinct categories or groups within each factor.
Treatment: The factor we are interested in; its levels are often called treatment (or experimental) and control.
These models have also been applied in observational studies.
We use linear models to estimate effects of interest while accounting for potential confounders.
For example, to estimate the effect of a diet high in fruits and vegetables on blood pressure, we would have to account for factors such as age, sex, and smoking status.
To illustrate, we use a dataset from an experiment in which mice were randomly selected and divided into two groups: one group received a high-fat (hf) diet, while the other, fed the usual chow diet, served as the control.
The data are included in the dslabs package:
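A minimal sketch of how to load the data and compute the average body weight in each diet group (the exact code chunk is not shown here; this assumes the dslabs and dplyr packages are installed):

```r
library(dslabs)
library(dplyr)

# number of mice in each diet group (394 chow, 386 hf)
table(mice_weights$diet)

# average body weight by diet group
mice_weights |>
  group_by(diet) |>
  summarize(average = mean(body_weight))
```

# A tibble: 2 × 2
  diet  average
  <fct>   <dbl>
1 chow     31.5
2 hf       36.7

The mice on the high-fat diet are heavier on average.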
Is it possible that the observed difference is simply due to chance?
We can apply the ideas we learned in the statistical inference lectures.
Define \(\mu_1\) and \(\mu_0\) as the population averages for the treatment and control groups, respectively.
We want to know if \(\mu_1 - \mu_0 > 0\).
We know that the difference in sample averages \(\bar{X}_1 - \bar{X}_0\) follows a normal distribution, with expected value \(\mu_1-\mu_0\) and standard error:
\[ \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_0^2}{N_0}} \]
This implies that, under the null hypothesis that \(\mu_1 = \mu_0\), the quantity
\[ \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_0^2}{N_0}}} \]
approximately follows a standard normal distribution.
Note that we can’t compute this quantity in practice because the \(\sigma_1\) and \(\sigma_0\) are unknown.
However, if we estimate them with the sample standard deviations \(s_1\) and \(s_0\), the CLT still holds and the statistic
\[ t = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}}} \]
is approximately standard normal. We can compute it from the data:
# compute group averages, standard deviations, and sample sizes, then the t-statistic
stats <- mice_weights |>
  group_by(diet) |>
  summarize(xbar = mean(body_weight), s = sd(body_weight), n = n())
t_stat <- with(stats, (xbar[2] - xbar[1]) / sqrt(s[2]^2/n[2] + s[1]^2/n[1]))
t_stat
[1] 9.339096
The p-value is then:

1 - pnorm(t_stat)

which, as we know, will be very small. Note that when \(N\) is not large enough, the CLT does not apply and we use the t-distribution instead.
So the calculation of the p-value is the same except that we use pt instead of pnorm.
Specifically, we use 1 - pt(t_stat, with(stats, n[2] + n[1] - 2)), that is, a t-distribution with \(N_1 + N_0 - 2\) degrees of freedom.
Warning
In the computation above, we computed the probability of t being as large as what we observed.
However, when our interest spans both directions, for example either an increase or a decrease in weight, we need to compute the probability of t being as extreme, in either direction, as what we observed.
The formula changes to use the absolute value and both tails: 2*(1 - pnorm(abs(t_stat))) or 2*(1 - pt(abs(t_stat), with(stats, n[2] + n[1] - 2))).
Although the t-test is useful for cases in which we compare two treatments, it is common to have other variables affect our outcomes.
For example, what if more male mice received the high-fat diet?
What if there is a cage effect?
Linear models permit hypothesis testing in these more general situations.
Our two-group comparison can be written as the linear model
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
Letting \(x_i\) be 0 or 1, rather than a continuous variable, allows us to use the model in this new context: \(x_i = 1\) if mouse \(i\) received the high-fat diet and \(x_i = 0\) if it received the chow diet.
Now \(\beta_0\) represents the population average weight of the mice on the chow diet, and
\(\beta_0 + \beta_1\) represents the population average weight of the mice on the high-fat diet.
A nice feature of this model is that \(\beta_1\) represents the treatment effect of receiving the high-fat diet.
The null hypothesis that the high-fat diet has no effect can be quantified as \(\beta_1 = 0\).
Because diet is a factor with two levels, the lm function knows to fit the linear model above with \(x_i\) as an indicator variable.
The summary function shows us the resulting estimate, standard error, and p-value:
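A sketch of this fit (the exact code chunk is not shown in the text; it assumes the mice_weights data loaded above):

```r
fit <- lm(body_weight ~ diet, data = mice_weights)
summary(fit)$coefficients
```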
Using the broom function tidy, we can obtain the same information in tibble format:
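For example (a sketch, reusing the fit object from the previous chunk):

```r
library(broom)
tidy(fit, conf.int = TRUE)
```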
The statistic computed here is the estimate divided by its standard error: \[\hat{\beta}_1 / \hat{\mbox{SE}}(\hat{\beta}_1)\]
Note
In the linear model description provided here, we assumed \(\varepsilon\) follows a normal distribution.
This assumption permits us to show that the statistics formed by dividing estimates by their estimated standard errors follow a t-distribution, which in turn allows us to compute p-values and confidence intervals.
However, note that we do not need this assumption to compute the expected value and standard error of the least squares estimates.
Furthermore, if the number of observations is large enough, then the central limit theorem applies and we can obtain p-values and confidence intervals even without the normal distribution assumption for the errors.
Note that this experiment included male and female mice, and male mice are known to be heavier.
This explains why the residuals depend on the sex variable:
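One way to see this (a sketch, not necessarily the plot used in the text) is to stratify the residuals of the diet-only fit by sex:

```r
library(ggplot2)
# residuals from the fit that ignores sex, stratified by sex
mice_weights |>
  mutate(residual = resid(lm(body_weight ~ diet, data = mice_weights))) |>
  ggplot(aes(sex, residual)) +
  geom_boxplot()
```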
This misspecification can have real implications; for instance, if more male mice received the high-fat diet, then this could explain part of the observed increase in weight.
Conversely, if fewer received it, we might underestimate the diet effect.
Sex could be a confounder, indicating that our model can certainly be improved.
One way to account for sex is to include an indicator for each of the four diet-sex combinations:
\[ Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \varepsilon_i \]
with \(x_{i,j}\) equal to 1 when mouse \(i\) belongs to group \(j\) and 0 otherwise.
Note that with this representation we allow the diet effect to be different for males and females.
However, with this representation, none of the \(\beta\)s represents the effect of interest: the diet effect.
A powerful feature of linear models is that we can rewrite the model so that the expected value for each group remains the same, but the parameters represent the effects we are interested in.
For example, letting \(x_{i,1}\) be an indicator for the high-fat diet and \(x_{i,2}\) an indicator for male mice, we can rewrite the model as
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1} x_{i,2} + \varepsilon_i \]
In statistics, \(\beta_3\) is referred to as an interaction effect.
The \(\beta_0\) is considered the baseline value, which is the average weight of females on the chow diet.
Statistical textbooks describe several other ways in which the model can be rewritten to obtain other types of interpretations.
For example, we might want \(\beta_1\) to represent the overall diet effect (the average of the female and male diet effects) rather than the diet effect on females.
This is achieved by defining what contrasts we are interested in.
In R, we can specify the linear model above using the following:
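A sketch of the fit (the call to tidy with conf.int = TRUE and the removal of the intercept row are assumptions made to match the table shown below):

```r
fit <- lm(body_weight ~ diet * sex, data = mice_weights)
tidy(fit, conf.int = TRUE) |> filter(term != "(Intercept)")
```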
The * in the formula implies that the term that multiplies \(x_{i,1}\) and \(x_{i,2}\) should be included, along with the \(x_{i,1}\) and \(x_{i,2}\) terms. The resulting estimates are:

# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 diethf 3.88 0.624 6.22 8.02e-10 2.66 5.10
2 sexM 7.53 0.627 12.0 1.27e-30 6.30 8.76
3 diethf:sexM 2.66 0.891 2.99 2.91e- 3 0.912 4.41
If we assume the diet effect is the same for both sexes, we can fit the simpler model
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i \]
In R, we do this by using + instead of * in the formula:
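A sketch of the corresponding call (summarizing with broom's tidy is an assumption, not shown in the text):

```r
fit <- lm(body_weight ~ diet + sex, data = mice_weights)
tidy(fit, conf.int = TRUE)
```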
Scientific studies, particularly within epidemiology and the social sciences, frequently omit interaction terms from models due to the high number of variables.
Adding interactions necessitates numerous parameters, which in extreme cases may prevent the model from fitting.
However, this approach assumes that the interaction terms are zero and, if that assumption is incorrect, it can skew the interpretation of the results.
Conversely, when this assumption is valid, models excluding interactions are simpler to interpret, as parameters are typically viewed as the extent to which the outcome increases with the assigned treatment.
Tip
Linear models are highly flexible and applicable in many contexts.
For example, we can include many more factors than just 2.
We have only just scratched the surface of how linear models can be used to estimate treatment effects.
We highly recommend learning more about this by exploring linear model textbooks and R manuals that cover the use of functions such as lm, contrasts, and model.matrix.
In the examples we have examined, each factor had only two levels: diet was chow or high-fat, and sex was female or male.
However, variables of interest often have more than two levels.
For example, we might have tested a third diet on the mice.
In statistics textbooks, these variables are referred to as factors, and the groups in each factor are called its levels.
When a factor is included in the formula, the default behavior for lm is to define the intercept term as the expected value for the first level, and the other coefficients represent the difference, or contrast, between each of the other levels and the first.
We can see this when we estimate the sex effect with lm like this:
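A sketch of such a fit, using the default (treatment) contrasts:

```r
fit <- lm(body_weight ~ sex, data = mice_weights)
tidy(fit, conf.int = TRUE)
```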
To understand why a constraint is needed, consider writing the model with an indicator for each level:
\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i \]
with \(x_{i,1}\) an indicator for females and \(x_{i,2}\) an indicator for males.
Unfortunately, this representation has a problem.
Note that the mean for females and males are represented by \(\beta_0 + \beta_1\) and \(\beta_0 + \beta_2\), respectively.
This is a problem because the expected value for each group is just one number, say \(\mu_f\) and \(\mu_m\), and there are infinitely many ways to satisfy \(\beta_0 + \beta_1 = \mu_f\) and \(\beta_0 +\beta_2 = \mu_m\) (three unknowns with two equations).
This implies that we can't obtain unique least squares estimates: the model, or rather its parameters, is unidentifiable.
The default behavior in R solves this problem by requiring \(\beta_1 = 0\), forcing \(\beta_0 = \mu_f\), which permits us to solve the system of equations.
Keep in mind that this is not the only constraint that permits estimation of the parameters.
Any linear constraint will do as it adds a third equation to our system.
A widely used constraint is to require \(\beta_1 + \beta_2 = 0\).
To achieve this in R, we can use the argument contrasts in the following way:
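A sketch, using contr.sum from base R to impose the sum-to-zero constraint:

```r
fit <- lm(body_weight ~ sex, data = mice_weights,
          contrasts = list(sex = contr.sum))
tidy(fit, conf.int = TRUE)
```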
We see that the intercept is now larger, reflecting the overall mean rather than just the mean for females.
The other coefficient, \(\beta_1\), represents the contrast between females and the overall mean in our model.
The coefficient for males is not shown because it is redundant: \(\beta_2 = -\beta_1\).
The use of this alternative constraint is more practical when a factor has more than two levels and choosing a baseline becomes less convenient.
Furthermore, we might be more interested in the variance of the coefficients rather than the contrasts between groups and the reference level.
To see how this generalizes, consider a factor with \(J\) levels and the model
\[ Y_i = \beta_0 + \sum_{j=1}^J \beta_j x_{i,j} + \varepsilon_i \]
with \(x_{i,j}\) an indicator for level \(j\). The sum-to-zero constraint requires the effects to average out to zero:
\[ \frac{1}{J} \sum_{j=1}^J \beta_j = 0 \implies \sum_{j=1}^J \beta_j = 0. \]
With this constraint, \(\beta_0\) represents the overall mean and each \(\beta_j\) the effect of level \(j\), so we can summarize the variability across levels with
\[ \sigma^2_{\text{gen}} \equiv \frac{1}{J}\sum_{j=1}^J \beta_j^2 \]
Here the subscript refers to the gen factor in our dataset, the mouse generation, which has \(J = 5\) levels. We can fit this model in R as follows:
library(emmeans)
# fit with the sum-to-zero constraint and contrast each level with the overall mean
fit <- lm(body_weight ~ gen, data = mice_weights,
          contrasts = list(gen = contr.sum))
contrast(emmeans(fit, ~gen))
contrast estimate SE df t.ratio p.value
gen4 effect -0.122 0.705 775 -0.174 0.8621
gen7 effect -0.812 0.542 775 -1.497 0.3372
gen8 effect -0.113 0.544 775 -0.207 0.8621
gen9 effect 0.149 0.705 775 0.212 0.8621
gen11 effect 0.897 0.540 775 1.663 0.3372
P value adjustment: fdr method for 5 tests
When a factor has more than two levels, it is common to want to determine if there is significant variability across the levels, rather than a specific difference between any given pair of levels.
Analysis of variance (ANOVA) provides tools to do this.
ANOVA provides an estimate of \(\sigma^2_{\text{gen}}\) and a statistical test for the null hypothesis that the factor contributes no variability: \(\sigma^2_{\text{gen}} =0\).
Once a linear model is fit using one or more factors, the aov function can be used to perform ANOVA.
aov fits the model described above; we do not need to specify the constraint because ANOVA requires the sum-to-zero constraint for the results to be interpretable.
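A sketch of the aov call for the gen factor (the exact code chunk is not shown in the text):

```r
summary(aov(body_weight ~ gen, data = mice_weights))
```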
This analysis indicates that generation is not statistically significant.
Note
We do not include many details here, for example, on how the summary statistics and p-values shown by aov are defined and motivated.
There are several books dedicated to the analysis of variance, and textbooks on linear models often include chapters on this topic.
Those interested in learning more about these topics can consult one of these textbooks.
ANOVA was developed to analyze agricultural data, which included several factors.
We can perform ANOVA with multiple factors:
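A sketch of a call that produces a table like the one below (assuming the mice_weights data):

```r
summary(aov(body_weight ~ sex + diet + gen, data = mice_weights))
```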
Df Sum Sq Mean Sq F value Pr(>F)
sex 1 15165 15165 389.799 <2e-16 ***
diet 1 5238 5238 134.642 <2e-16 ***
gen 4 295 74 1.895 0.109
Residuals 773 30074 39
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Warning
One of the key aspects of ANOVA is its ability to decompose the total variability in the data, as measured by the total sum of squares \(\sum_{i=1}^n (Y_i - \bar{Y})^2\), into individual contributions attributable to each factor in the study.
However, for the mathematical underpinnings of ANOVA to be valid, the experimental design must be balanced.
This means that for every level of any given factor, there must be an equal representation of the levels of all other factors.
In our study involving mice, the design is unbalanced, requiring a cautious approach in the interpretation of the ANOVA results.
With two factors, the model being fit is
\[ Y_i = \beta_0 + \sum_{j=1}^J \beta_j x_{i,j} + \sum_{k=1}^K \beta_{J+k} x_{i,J+k} + \varepsilon_i \\ \mbox{ with }\sum_{j=1}^J \beta_j=0 \mbox{ and } \sum_{k=1}^K \beta_{J+k} = 0. \]
Specifically, in
\[ Y_i = \beta_0 + \sum_{j=1}^J \beta_j x_{i,j} + \sum_{k=1}^K \beta_{J+k} x_{i,J+k} + \varepsilon_i \]
the \(x_{i,1},\dots,x_{i,J}\) are indicator functions for the \(J\) levels of the first factor and the \(x_{i,J+1},\dots,x_{i,J+K}\) are indicator functions for the \(K\) levels of the second factor.
Statistics textbooks often write this same model using the alternative notation
\[ Y_{i,j,k} = \mu + \alpha_j + \beta_k + \varepsilon_{i,j,k} \]
with
\[ \sum_{j=1}^J \alpha_j = 0 \text{ and } \sum_{k=1}^K \beta_k = 0, \]
where \(Y_{i,j,k}\) denotes the \(i\)-th observation at level \(j\) of the first factor and level \(k\) of the second.