12 Multiple tests – why and when to adjust p-values
In many experiments, researchers have multiple treatments or multiple responses. This results in multiple tests against the null. For example, in an experiment with four treatment levels, there are six pairwise contrasts (A - B, A - C, A - D, B - C, B - D, C - D), and a researcher might be interested in all six p-values. Or, in an experiment with only two levels, there could be six response variables (expression levels of six different genes, or six different phenotypes). Here, there are six contrasts (B - A for each of the six response variables) and so six tests. Researchers commonly adjust for multiple testing in examples like these (the exception is multiple phenotypic responses, for which, curiously, researchers almost never adjust). What is multiple testing and why do we want to adjust? And, something that doesn’t seem to be considered much in the experimental bench biology literature, when do we want to adjust?
1. If your experiment has a single response variable (that is, you’re not measuring the effect of treatment on multiple phenotypes or gene expression levels), then you probably don’t want to adjust for multiple tests.
2. Adjusting for multiple tests is nevertheless the norm in the experimental bench biology literature, because researchers are confused about what a “family” of tests is.
3. Despite No. 2, researchers almost always fail to adjust for multiple tests when they should, specifically, when there are multiple phenotypes or gene expression levels that answer the same experimental question.
4. When you do adjust as in No. 3, you should probably use a method that controls the False Discovery Rate (FDR) and not the Family-Wise Error Rate (FWER). The reason isn’t that the FDR is more liberal; it’s that the FDR is more consistent with the goals of the research.
12.1 In a family of multiple tests, what is the probability of at least one p-value less than 0.05?
A family of tests is a set of tests that are used to make some inference about one question. For example, “In this mouse model of diabetes, are any genes differentially expressed between diabetic mice and wildtype mice?” \(p < 0.05\) for any of the tested genes answers the question. If a researcher tests only one gene, then the probability of \(p < 0.05\) if there is no expression difference between phenotypes is 5%. But if a researcher tests two genes, there is a higher probability than 5% that at least one will have \(p < 0.05\) if there is no expression difference between phenotypes for both genes. And, if a researcher tests ten genes, there is a much higher probability than 5% that at least one will have \(p < 0.05\) if there is no expression difference between phenotypes for any of the ten genes. What are these probabilities?
The probability of a significant p-value in one test is 0.05, so the probability of no significant p-value in one test is \(1 - 0.05\). The probability of no significant p-values in two tests is \((1 - 0.05)^2\), so the probability of at least one significant p-value is \(1 - (1 - 0.05)^2\). The probability of no significant p-values in ten tests is \((1 - 0.05)^{10}\), so the probability of at least one significant p-value in ten tests is \(1 - (1 - 0.05)^{10}\). Let’s compute these probabilities, and more.
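These probabilities are a one-liner in R:

```r
# P(at least one p < 0.05) among n independent tests when all nulls are true
n_tests <- c(1, 2, 5, 10, 100)
round(1 - (1 - 0.05)^n_tests, 2)
```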
Number of tests | Prob(at least 1 < 0.05) |
---|---|
1 | 0.05 |
2 | 0.10 |
5 | 0.23 |
10 | 0.40 |
100 | 0.99 |

The second column is the family-wise error rate (FWER), the probability of at least one \(p < 0.05\) in a family of tests in which there are no true effects. So the FWER for a family of 10 tests is 40% if our p-values are not adjusted for multiple testing. We adjust p-values in experiments with multiple tests in order to control the FWER.
An assumption of these probabilities is that the tests are independent. Importantly, this independence assumption is also made by most, but not all, of the adjustment methods.
12.2 Example 1 – In many experiments, we don’t want to adjust for multiple tests (exfig2d data)
In many experiments, we don’t want to adjust for multiple tests because there is only a single test in each family, even if there are multiple tests in the experiment. Researchers adjust anyway because of confusion about what a family is. The cost of this adjustment is more conservative (higher) p-values, which leads to less power, which leads to more failed discoveries or more research costs (mice, time) to make a discovery by p-value.
Let’s explore what a family is using data from the article
Figure: Extended Figure 2d
The import code is in Section 12.7.1 below.
In this study, the researchers investigate the control of fat absorption in the intestine by the dorsal motor nucleus of the vagus nerve (DMV) and show that the plant metabolite puerarin mimics this DMV suppression.
The experiment in Extended Figure 2d follows up initial experiments from Figure 1 that show that inactivation of the DMV in mice decreased weight gain on a high fat diet relative to control mice and that DMV-inactive mice had lower plasma triglycerides, higher fat content in the feces, and lower fat absorption in the jejunum part of the small intestine. The DMV was inactivated by Clozapine N-oxide (CNO) binding to a human-designed hM4D(Gi) inhibitory receptor whose expression is controlled by the researchers.
In the experiment in Extended Figure 2d, the response is fecal NEFA (non-esterified fatty acid).
The three treatments are
- “con” is a negative control with a designer receptor that doesn’t bind CNO. The researchers expect fecal NEFA equal to that of mice with no designer receptors.
- “3q” is the focal treatment using a designer receptor that should activate the DMV. If the DMV regulation of fat absorption is working as thought, the researchers expect increased fat absorption and therefore lower fecal NEFA levels relative to “con”.
- “4i” is a positive control using the DMV-inactivating receptor in the previous experiments. The researchers expect higher NEFA levels relative to “con”, as in the experiments in Fig. 1.
There are three contrasts of single treatments (“3q - con”, “4i - con”, “4i - 3q”), so three tests. The researchers follow the norm in experimental bench biology research and report p-values adjusted for multiple tests. Specifically, the researchers adjusted for the two contrasts comparing a treatment to the negative control (“3q - con” and “4i - con”). This adjustment seems to follow the logic that a family is the set of tests of interest to the researchers (the contrast “4i - 3q” is not of interest). But “of interest to the researchers” doesn’t define a family of tests. Instead, “same question” defines a family of tests. If we were to include both contrasts in a family, what is the question? It would have to be something like “if we manipulate the activity of the DMV, do we get a change in fecal NEFA?”.
This isn’t the question addressed by the research. The researchers have one focal question: “does activating the DMV increase fat absorption and, consequently, decrease fecal NEFA levels?” Only one test (the contrast “3q - con”) answers this question, so the family has a single test and there is no need to adjust. The researchers are interested in a second question as a check on the experiment itself. This question is something like “Are we actually getting the known response, given the intervention?” Only one test (the contrast “4i - con”) answers this question, so there is no need to adjust.
Let’s look at the cost of adjustment. The researchers used the Holm-Sidak method to adjust the two p-values.
```r
m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "trt.vs.ctrl",
                     adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()
m1_pairs[, holm_sidak := holm_sidak(p.value)]
m1_pvals <- m1_pairs[, .SD, .SDcols = c("contrast", "estimate", "p.value", "holm_sidak")]
colnames(m1_pvals) <- c("Contrast", "Estimate", "Unadjusted P", "Holm-Sidak P")
m1_pvals |>
  kable(digits = c(1, 3, 5, 5)) |>
  kable_styling(full_width = FALSE)
```
Contrast | Estimate | Unadjusted P | Holm-Sidak P |
---|---|---|---|
3q - con | -1.090 | 0.01261 | 0.02506 |
4i - con | 1.048 | 0.01361 | 0.02506 |
The cost is about a doubling of the p-value because there are two tests. If we use \(p < 0.05\) to decide whether to act as if there is an effect, then the adjustment costs us nothing here. Adjusting for a small number of tests (two to three) is not going to have much impact on the quality of science either way. Nevertheless
- adjusting is unnecessary in experiments like this, where there are multiple questions addressed by the experiment and each question is answered by a single test. This is the case for many, many experiments in experimental bench biology.
- although small, the increased power from not adjusting is free.
12.2.1 Or, maybe we do want to adjust, but not in the way that is usually done
Extended figure 2d presents only one of six response variables that were measured on each mouse. The other response variables were body weight, fecal triglyceride, jejunal triglyceride, duodenal triglyceride, and ileal triglyceride. Some thoughts:
- Fecal NEFA and Fecal triglyceride effectively answer the same focal question “does activating the DMV increase fat absorption and consequently, decrease fecal fat levels?” So, adjusting for two tests is probably justified but the two tests are the same contrast (“3q - con”) with two different response variables and not different contrasts for the same response variable.
- Jejunal, duodenal, and ileal triglyceride effectively answer the same focal question “does activating the DMV increase fat absorption in a specific part of the small intestine?” So adjusting for three tests is probably justified here, but again, the adjustment is for the same contrast (“3q - con”) from the model fit to three different response variables.
- Adjustment for multiple responses is common in some subfields of experimental bench biology that measure thousands of responses (gene expression, functional MRI), but I haven’t seen adjustment for multiple phenotypic responses from a single experiment.
12.3 Example 2 – Tukey adjustment is extremely common, and (mostly) unjustified (fig5l data)
A very common research design in experimental bench biology is a design with two factors, each with two levels, with mice assigned to all 2 x 2 = 4 combinations of the levels. Analysis of these models for two (or more) factors is the focus of Chapter 13. The experiment in Figure 5l is a good example of a 2 x 2 design.
Figure: Figure 5l
The import code is in Section 12.7.2 below.
The experiment in Fig. 5 follows up experiments that show that the plant metabolite puerarin binds to the GABA receptor encoded by Gabra1 and increases GABA’s inhibitory effect on the target neuron. In the experiment in Fig. 5, the researchers are investigating the mechanism of the effect of DMV-vagus suppression on fat absorption in the jejunum. Figure 5c shows that suppressing the DMV using CNO activation of the designer 4i inhibitory receptor (as above) results in decreased jejunal microvilli length – the reduced length decreases the surface area for absorption. Figure 5d shows that suppressing the DMV results in reduced expression of four genes involved in microvilli growth and organization. The design of the experiment in Fig. 5c, d is the same as that in Extended Fig. 2d above.
Fig. 5l uses a 2 x 2 crossed design to investigate the effect of puerarin with and without a functional Gabra1 GABA receptor.
Here, we’ll focus on the first gene, ezr, which encodes the protein ezrin.
The two factors are
- genotype, with levels “WT” and “KO”. KO is the conditional knockout of Gabra1 in the DMV.
- treatment, with levels “Vehicle” and “Puerarin”. Infusion of a saline solution (Vehicle) or a solution with Puerarin.
All levels of both factors are combined to create the four experimental combinations (the design is fully factorial)
- “WT Vehicle” is the combination “WT” + “Vehicle”. This is the negative control.
- “WT Puerarin” is the combination “WT” + “Puerarin”.
- “KO Vehicle” is the combination “KO” + “Vehicle”.
- “KO Puerarin” is the combination “KO” + “Puerarin”.
The primary questions of the experiment are:
- does addition of puerarin reduce RNA expression in mice with functional Gabra1 GABA receptor?
- does deletion of Gabra1 remove this effect of puerarin?
The researchers answer these questions using these two contrasts
- WT Puerarin - WT Vehicle
- KO Puerarin - KO Vehicle
So, there are two research questions, each answered by a single contrast. There is nothing to adjust because each family consists of only a single test.
Look at question 2 closely – it’s a comparison of contrast 1 and contrast 2. Ultimately, the researchers are asking about the effect of puerarin with and without a functional Gabra1 GABA receptor, so there is a single primary contrast of interest
(KO Puerarin - KO Vehicle) - (WT Puerarin - WT Vehicle)
This contrast is called the interaction effect or simply the interaction. The interaction is the difference between contrasts 1 and 2 – a difference of differences. The estimation of interaction effects almost always answers the focal question in 2 x 2 designs but is almost never done in experimental bench biology. More information on designs for two factors and interaction effects is in Chapter 13.
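Although the researchers didn’t estimate the interaction, it is easy to get with emmeans. Here is a minimal sketch using the same model that is fit in the next code chunk; note that the direction of each difference depends on the order of the factor levels.

```r
# sketch: estimate the interaction (difference of differences) directly
m1 <- lm(ezrin ~ genotype * treatment, data = fig5l)
m1_emm <- emmeans(m1, specs = c("genotype", "treatment"))
# one row: the revpairwise contrast of the revpairwise contrasts
contrast(m1_emm, interaction = "revpairwise") |>
  summary(infer = TRUE)
```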
What did the researchers do? Of the six pairwise contrasts, the researchers report the four “simple effects” in the figure and in the archived Excel file, although they focus on only two of these in the text. These are:
- (WT Puerarin) - (WT Vehicle) answers the question “how does microvilli-related gene expression change when we add puerarin to mice with functional Gabra1?” This is a positive control for contrast 2.
- (KO Puerarin) - (KO Vehicle) answers the question “how does microvilli-related gene expression change when we add puerarin to mice with knocked out Gabra1?” This is a focal contrast.
- (KO Vehicle) - (WT Vehicle) answers the question “how does microvilli-related gene expression change when we knock out Gabra1 and don’t give puerarin?” This is a positive control for contrast 4.
- (KO Puerarin) - (WT Puerarin) answers the question “how does microvilli-related gene expression change when we knock out Gabra1 but give puerarin?” This is a focal contrast.
Instead of adjusting for the two tests of interest, or the four computed tests, the researchers followed the norm in experimental bench biology and used the Tukey HSD adjustment for a single family of six tests (all six pairwise tests). Perhaps the norm is following a definition of family as “all tests within an experiment”. Regardless, the adjustment logic in Fig. 5l is inconsistent with that in Extended Fig. 2d, where the researchers adjusted for the two tests of interest instead of the set of all three pairwise comparisons in the experiment. In Fig. 5l, the researchers adjust for all six pairwise comparisons, show only four contrasts in the figure and in the Excel file of the data, but seem interested in only the two contrasts that answer the focal questions.
What is the cost of adjustment? In Table 12.2 below, for the two focal tests, I give the unadjusted p-values, the Tukey p-values based on all six pairwise tests, and the Holm p-values based on the four simple effects. The costs of the Tukey and Holm adjustments are given in the last two columns as the ratio of the adjusted to unadjusted p-value. This loss of power is unnecessary and results from confusion in the experimental bench biology community about what a family of tests means. Loss of power leads to more time, more dollars, more animals, and more failed discoveries.
```r
m1 <- lm(ezrin ~ genotype * treatment, data = fig5l)
m1_emm <- emmeans(m1, specs = c("genotype", "treatment"))
m1_pairs <- contrast(m1_emm,
                     method = "revpairwise",
                     adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()
m1_tukey <- contrast(m1_emm,
                     method = "revpairwise") |>
  summary(infer = TRUE) |>
  data.table()
m1_simple <- contrast(m1_emm,
                      method = "revpairwise",
                      simple = "each",
                      combine = TRUE,
                      adjust = "holm") |>
  summary(infer = TRUE) |>
  data.table()

m1_pairs[, tukey := m1_tukey$p.value]
m1_pairs[c(1,2,5,6), holm := m1_simple[c(1,3,4,2), p.value]]
m1_pairs[, cost_tukey := tukey/p.value]
m1_pairs[, cost_holm := holm/p.value]

m1_pvals <- m1_pairs[, .SD, .SDcols = c("contrast", "estimate", "p.value", "tukey", "holm", "cost_tukey", "cost_holm")]
colnames(m1_pvals) <- c("Contrast", "Estimate", "Unadjusted P", "Tukey P", "Holm P", "Tukey/Unadjusted", "Holm/Unadjusted")
# only the simple effects are of interest. These are rows 1, 2, 5, 6
m1_pvals[c(2,5), ] |>
  kable(digits = c(1, 3, 4, 4, 4, 1, 1)) |>
  kable_styling(full_width = FALSE)
```
Contrast | Estimate | Unadjusted P | Tukey P | Holm P | Tukey/Unadjusted | Holm/Unadjusted |
---|---|---|---|---|---|---|
WT Puerarin - WT Vehicle | -0.554 | 0.0032 | 0.0159 | 0.0096 | 5.0 | 3 |
KO Puerarin - KO Vehicle | -0.020 | 0.9085 | 0.9994 | 0.9085 | 1.1 | 1 |
12.3.1 Or, maybe we do want to adjust, but not in the way that is usually done
For each of the contrasts in the Fig. 5l example above, the researchers measured the response in four microvilli-related genes. Each of these genes answers the question posed for a contrast. For example, all four genes answer the question for contrast 1, which is “how does microvilli-related gene expression change when we add puerarin to mice with functional Gabra1?” The family is the four tests of the contrast (WT Puerarin - WT Vehicle), one for each of the four genes. The researchers should adjust for the four tests within this family. The next section shows how this is done.
12.4 Example 3 – When we do want to adjust, we often want to use a method that controls the False Discovery Rate (FDR) (lipocalin fig3 data)
When \(p < 0.05\), researchers act as if they have discovered or confirmed something, so \(p < 0.05\) is evidence of a discovery. Consider these two situations
- \(p < 0.05\) and the Null Hypothesis is false – there is a true effect of treatment, or, there is a true difference in means between the two treatment levels. This is a true discovery.
- \(p < 0.05\) but the Null Hypothesis is true – there is no true effect of treatment, or, there is no true difference in means between the two treatment levels. This is a false discovery.
The false discovery rate (FDR) in a batch of tests is
\[ FDR = \frac{number\;of\;false\;discoveries}{number\;of\;false\;discoveries + number\;of\;true\;discoveries} \]
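Table 12.5 below also reports the false nondiscovery rate (FNDR), defined analogously over the nondiscoveries (this matches how the simulation in Section 12.5 computes it):
\[ FNDR = \frac{number\;of\;false\;nondiscoveries}{number\;of\;false\;nondiscoveries + number\;of\;true\;nondiscoveries} \]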
Remember that the family-wise error rate (FWER) is the long-term probability of at least one \(p < 0.05\) in a batch of tests in which all Null Hypotheses are true – there are no true effects (Section 12.1).
Note that the FWER for a single batch of tests is either 1 (there was at least one false positive) or 0 (there were no false positives). By contrast, an omniscient researcher can compute the FDR for a single batch of tests.
For any batch of tests in which some effects are real, and some are not
- A FWER method of p-value adjustment controls the long-term FWER at \(\leq\) 0.05.
- An FDR method of p-value adjustment controls the long-term FDR at \(\leq\) 0.05.
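To see the two kinds of control side by side, here is a toy batch of ten made-up p-values run through base R’s p.adjust() – the values are invented purely for illustration.

```r
# holm (FWER control) vs BH (FDR control) on the same toy p-values
p <- c(0.001, 0.008, 0.012, 0.02, 0.03, 0.21, 0.35, 0.48, 0.60, 0.90)
data.frame(unadjusted = p,
           holm = round(p.adjust(p, "holm"), 3),
           BH = round(p.adjust(p, "BH"), 3))
```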
Figure: Figure 3
The import code is in Section 12.7.3 below.
The researchers are investigating the role of Lipocalin-2 (LCN2) on mitochondrial membrane function. Using a LCN2 knockout and the thermogenic molecule CL316,243, the researchers scanned for differences in levels of 246 lipids between WT and KO mice with and without CL316,243 treatment. Of the different classes of lipids, the biggest differences were found in cardiolipins. The panels in Fig 3d-u show cardiolipin levels for each of the four groups for 18 of the 26 cardiolipins with at least one significant p-value for a difference between WT and KO mice.
The two factors are
- genotype, with levels “WT” and “KO”, where KO is the LCN2 knockout.
- treatment, with levels “Saline” and “CL316243”, where CL316243 is administration of CL316,243.
All levels of both factors are combined to create the four treatment combinations
- “WT Saline” is the combination “WT” + “Saline”. This is the negative control.
- “WT CL316243” is the combination “WT” + “CL316,243”.
- “KO Saline” is the combination “KO” + “Saline”.
- “KO CL316243” is the combination “KO” + “CL316,243”
The researchers do not have a model of how any specific cardiolipin will respond to any specific treatment combination. Instead, the research question seems to be something like, “do we see differences in CL synthesis under different combinations of LCN2 activity and CL316,243?” For each cardiolipin, the four simple effect contrasts answer this question. But the question remains the same for each cardiolipin, so there are 104 (4 x 26) tests in the family. Even though the researchers only present results for 18 cardiolipins, we have to include the 8 non-presented cardiolipins because these were also analyzed.
Now this is an example of an experiment where we should adjust p-values for multiple testing! Two candidate adjustment methods:
- Holm. This modification of Bonferroni adjusts p-values to control the family-wise error rate (FWER) and would be relevant if the researchers were asking the question “does at least one cardiolipin have a different synthesis level under any combination of genotype and treatment?”
- BH (Benjamini-Hochberg). This method adjusts p-values to control the false discovery rate (FDR) and would be relevant if the researchers were asking “which cardiolipins have different synthesis levels, and under which combinations of genotype and treatment?”
We control FWER if we want to show that “we have evidence of something happening in this system”. We control FDR if we want to show that “we have evidence that we’ve discovered a specific component in the system”.
I don’t think the FWER question is very relevant to most experimental bench biology.
In a sense, BH is more liberal and Holm is more conservative, but this isn’t a good way to think about these methods because it compares apples and oranges. Instead of using BH to “gain power” or Holm to “be conservative”, just use a different alpha. If the cost of a false discovery is cheap but the potential reward is high, increase alpha. If the cost of a false discovery is expensive, decrease alpha.
```r
cl_cols <- names(fig3)[which(substr(names(fig3), 1, 11) == "cardiolipin")]
n_cardiolipins <- length(cl_cols)

p_table <- data.table(NULL)
for(i in 1:n_cardiolipins){
  lipid_id <- cl_cols[i]
  formula_i <- paste(lipid_id, "~", "group") |>
    formula()
  m1 <- lm(formula_i, data = fig3)
  m1_pairs <- emmeans(m1, specs = "group") |>
    contrast(method = "revpairwise",
             adjust = "none") |>
    summary() |>
    data.table()
  m1_pairs <- m1_pairs[c(1,6,2,5),]
  p1 <- t.test(fig3[group == "WT-Saline", get(lipid_id)],
               fig3[group == "WT-CL", get(lipid_id)],
               var.equal = FALSE)$p.value
  p2 <- t.test(fig3[group == "KO-Saline", get(lipid_id)],
               fig3[group == "KO-CL", get(lipid_id)],
               var.equal = TRUE)$p.value
  p3 <- t.test(fig3[group == "WT-Saline", get(lipid_id)],
               fig3[group == "KO-Saline", get(lipid_id)],
               var.equal = TRUE)$p.value
  p4 <- t.test(fig3[group == "WT-CL", get(lipid_id)],
               fig3[group == "KO-CL", get(lipid_id)],
               var.equal = TRUE)$p.value
  p_table <- rbind(
    p_table,
    data.table(
      "lipid_id" = lipid_id,
      m1_pairs[, .SD, .SDcols = c("contrast", "p.value")],
      ttest = c(p1, p2, p3, p4)
    )
  )
}

# adjust over all contrasts
pool_contrast <- TRUE
if(pool_contrast == TRUE){
  # add 8 more cardiolipins not included in the plot
  # use lm pvalue instead of t test pvalue because t is less efficient
  p_set <- c(p_table[, p.value], rep(0.99, 8 * 4)) # 8 cardiolipins * 4 contrasts
  p_holm <- p.adjust(p_set, "holm")[1:(18 * 4)]
  p_bh <- p.adjust(p_set, "BH")[1:(18 * 4)]
  p_by <- p.adjust(p_set, "BY")[1:(18 * 4)]

  p_table[, holm := p_holm]
  p_table[, fdr_bh := p_bh]
  p_table[, fdr_by := p_by]
}

# adjust within each contrast, not over all contrasts
by_contrast <- FALSE
if(by_contrast == TRUE){
  contrast_list <- unique(p_table$contrast)
  for(contrast_i in 1:length(contrast_list)){
    p_set <- p_table[contrast == contrast_list[contrast_i], ttest]
    p_set <- c(p_set, rep(0.99, 8))
    p_holm <- p.adjust(p_set, "holm")[1:18]
    p_bh <- p.adjust(p_set, "BH")[1:18]
    p_by <- p.adjust(p_set, "BY")[1:18]
    p_table[contrast == contrast_list[contrast_i], holm := p_holm]
    p_table[contrast == contrast_list[contrast_i], fdr_bh := p_bh]
    p_table[contrast == contrast_list[contrast_i], fdr_by := p_by]
  }
}

# p_table_order <- doBy::orderBy(~ contrast + lipid_id, p_table)
# p_table_order[, .SD, .SDcols = setdiff(names(p_table), "contrast")] |>
#   setnames("lipid_id", "") |>
#   kable(digits = 3) |>
#   kable_styling(full_width = FALSE) |>
#   pack_rows(contrast_list[1], 1, 18) |>
#   pack_rows(contrast_list[2], 19, 36) |>
#   pack_rows(contrast_list[3], 37, 54)

# researcher results
# set1 d-h
#   WT: cl lower
#   KO: cl not lower
# set2 i-m
#   WT: cl not lower
#   KO: cl lower
# set3 n-u
#   WT: cl higher
#   KO: cl higher in 3/8

data.table(
  "Method" = c("lm (unadjusted)", "t test (unadjusted)", "BH (FDR)", "BY (FDR)", "Holm (FWER)"),
  "N (p<0.05)" = c(sum(p_table[, p.value] < 0.05),
                   sum(p_table[, ttest] < 0.05),
                   sum(p_table[, fdr_bh] < 0.05),
                   sum(p_table[, fdr_by] < 0.05),
                   sum(p_table[, holm] < 0.05))
) |>
  kable() |>
  kable_styling(full_width = FALSE)
```
Method | N (p<0.05) |
---|---|
lm (unadjusted) | 28 |
t test (unadjusted) | 26 |
BH (FDR) | 16 |
BY (FDR) | 9 |
Holm (FWER) | 7 |
Notes on Table 12.4
- For these data, there are more discoveries (p < 0.05) using the linear model with all four groups than using the pairwise t-tests used by the researchers. The linear model is slightly more powerful than pairwise t-tests because of a smaller SE, but differences in discovery rate between the two methods in any one batch will depend more on patterns of sample variances.
- The BH adjustment reduced the number of discoveries by nearly half. The BY adjustment reduced the number of discoveries by nearly two-thirds. BY is an FDR method, based on Benjamini & Yekutieli, that is modified for non-independent tests. The tests here are not independent because the cardiolipin levels are correlated. Consequently, BY will have better control of the FDR than BH (a quick check on the size of the BY penalty is below these notes).
- The Holm adjustment reduced the number of discoveries by three-fourths. I include it here simply for comparison. As argued above, Holm, or any FWER method, doesn’t answer the right question.
- The researchers didn’t adjust. Should they have? If the goal is discovery of how combinations of LCN2 and CL316,243 affect cardiolipin synthesis, and they don’t have working models of the system, then yes, they should adjust.
- One way that R, or any programming language, is more efficient than menu-driven statistical software is the ability to write scripts to combine analyses, which is what we need to do to adjust p-values from multiple fit statistical models.
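A quick check on the size of the BY penalty mentioned above: the BY-adjusted p-values are the BH-adjusted p-values multiplied by the harmonic sum \(\sum_{i=1}^{m} 1/i\) over the m tests (before capping at 1), which for the 104 tests here is roughly a five-fold penalty.

```r
# BY inflates BH-adjusted p-values by the harmonic sum over the m tests
m <- 104 # 26 cardiolipins x 4 contrasts
sum(1 / (1:m)) # ~5.2
```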
12.5 Better Know FDR
Use this simulation to explore FDR, FWER, and Type I error. Here, I use it to explore how the BH adjustment behaves in a batch of 100 tests in which only 20% have true effects, and a batch in which 80% have true effects. I use BH because the tests are independent in this simulation.
```r
set.seed(1)
fileout_path <- here("output", "chap_mult_tests_fdr_sim.Rds")
do_it <- FALSE

n_iter <- 10000
n_tests <- 100
alpha <- 0.05

if(do_it){
  out_table <- data.table(NULL)
  for(true_fraction in c(.2, 0.8)){
    # true_fraction <- .2 # fraction of tests with true effect
    false_fraction <- 1 - true_fraction # fraction of tests with no effect
    n_true <- true_fraction * n_tests # number of tests with true effect
    n_false <- false_fraction * n_tests # number of tests with no effect
    power_exp <- .8
    delta <- power.t.test(n = 10, sd = 1, power = power_exp, sig.level = alpha)$delta
    p_true <- numeric(n_true)
    p_false <- numeric(n_false)

    true_discoveries_none <- 0.0
    false_nondiscoveries_none <- 0.0
    false_discoveries_none <- 0.0
    true_nondiscoveries_none <- 0.0

    true_discoveries_holm <- 0.0
    false_nondiscoveries_holm <- 0.0
    false_discoveries_holm <- 0.0
    true_nondiscoveries_holm <- 0.0

    true_discoveries_fdr <- 0.0
    false_nondiscoveries_fdr <- 0.0
    false_discoveries_fdr <- 0.0
    true_nondiscoveries_fdr <- 0.0

    FWE_none <- 0.0
    FWE_holm <- 0.0
    FWE_fdr <- 0.0
    type_1 <- 0.0
    power <- 0.0

    for(iter in 1:n_iter){
      for(j in 1:n_true){
        p_true[j] <- t.test(rnorm(n = 10), rnorm(n = 10, mean = delta), var.equal = TRUE)$p.value
      }
      for(j in 1:n_false){
        p_false[j] <- t.test(rnorm(n = 10), rnorm(n = 10), var.equal = TRUE)$p.value
      }
      type_1 <- type_1 + sum(p_false < 0.05)/n_false
      power <- power + sum(p_true < 0.05)/n_true

      p_none <- c(p_true, p_false)
      p_holm <- p.adjust(p_none, method = "holm")
      p_fdr <- p.adjust(p_none, method = "BH")
      p_true_holm <- p_holm[1:n_true]
      p_true_fdr <- p_fdr[1:n_true]
      p_false_holm <- p_holm[(n_true + 1):n_tests]
      p_false_fdr <- p_fdr[(n_true + 1):n_tests]

      if(sum(p_false < 0.05) > 0){FWE_none <- FWE_none + 1}
      true_discoveries_none <- true_discoveries_none + sum(p_true < 0.05)
      false_nondiscoveries_none <- false_nondiscoveries_none + sum(p_true > 0.05)
      false_discoveries_none <- false_discoveries_none + sum(p_false < 0.05)
      true_nondiscoveries_none <- true_nondiscoveries_none + sum(p_false > 0.05)

      if(sum(p_false_holm < 0.05) > 0){FWE_holm <- FWE_holm + 1}
      true_discoveries_holm <- true_discoveries_holm + sum(p_true_holm < 0.05)
      false_nondiscoveries_holm <- false_nondiscoveries_holm + sum(p_true_holm > 0.05)
      false_discoveries_holm <- false_discoveries_holm + sum(p_false_holm < 0.05)
      true_nondiscoveries_holm <- true_nondiscoveries_holm + sum(p_false_holm > 0.05)

      if(sum(p_false_fdr < 0.05) > 0){FWE_fdr <- FWE_fdr + 1}
      true_discoveries_fdr <- true_discoveries_fdr + sum(p_true_fdr < 0.05)
      false_nondiscoveries_fdr <- false_nondiscoveries_fdr + sum(p_true_fdr > 0.05)
      false_discoveries_fdr <- false_discoveries_fdr + sum(p_false_fdr < 0.05)
      true_nondiscoveries_fdr <- true_nondiscoveries_fdr + sum(p_false_fdr > 0.05)
    }

    total_discoveries_none <- true_discoveries_none + false_discoveries_none
    total_non_discoveries_none <- false_nondiscoveries_none + true_nondiscoveries_none
    TDR_none <- true_discoveries_none/total_discoveries_none
    FDR_none <- false_discoveries_none/total_discoveries_none
    FNDR_none <- false_nondiscoveries_none/total_non_discoveries_none
    type_1_none <- false_discoveries_none/n_false/n_iter
    power_none <- true_discoveries_none/n_true/n_iter
    FWER_none <- FWE_none/n_iter

    total_discoveries_holm <- true_discoveries_holm + false_discoveries_holm
    total_non_discoveries_holm <- false_nondiscoveries_holm + true_nondiscoveries_holm
    TDR_holm <- true_discoveries_holm/total_discoveries_holm
    FDR_holm <- false_discoveries_holm/total_discoveries_holm
    FNDR_holm <- false_nondiscoveries_holm/total_non_discoveries_holm
    type_1_holm <- false_discoveries_holm/n_false/n_iter
    power_holm <- true_discoveries_holm/n_true/n_iter
    FWER_holm <- FWE_holm/n_iter

    total_discoveries_fdr <- true_discoveries_fdr + false_discoveries_fdr
    total_non_discoveries_fdr <- false_nondiscoveries_fdr + true_nondiscoveries_fdr
    TDR_fdr <- true_discoveries_fdr/total_discoveries_fdr
    FDR_fdr <- false_discoveries_fdr/total_discoveries_fdr
    FNDR_fdr <- false_nondiscoveries_fdr/total_non_discoveries_fdr
    type_1_fdr <- false_discoveries_fdr/n_false/n_iter
    power_fdr <- true_discoveries_fdr/n_true/n_iter
    FWER_fdr <- FWE_fdr/n_iter

    out_table <- rbind(
      out_table,
      data.table(
        true_frac = true_fraction,
        Stat = c("TD", "FD", "TDR", "FDR", "FNDR", "Type1", "power", "FWER"),
        None = c(true_discoveries_none, false_discoveries_none, TDR_none,
                 FDR_none, FNDR_none, type_1_none, power_none, FWER_none),
        Holm = c(true_discoveries_holm, false_discoveries_holm, TDR_holm,
                 FDR_holm, FNDR_holm, type_1_holm, power_holm, FWER_holm),
        FDR = c(true_discoveries_fdr, false_discoveries_fdr, TDR_fdr,
                FDR_fdr, FNDR_fdr, type_1_fdr, power_fdr, FWER_fdr)
      )
    )
  }
  saveRDS(out_table, fileout_path)
} else{
  out_table <- readRDS(fileout_path)
}
```
```r
out_table_keep <- out_table[Stat %in% c("FDR", "FNDR", "Type1", "power", "FWER"), ]

row_digits <- function(df, d){
  # df is a data.frame or data.table
  # d is a single value for all rows or vector of digits for each row
  inc <- which(unlist(lapply(df, is.numeric)))
  x <- round(Filter(is.numeric, df), d) |>
    apply(2, as.character) |>
    data.table()
  df_out <- apply(df, 2, as.character) |>
    data.table()
  df_out[, inc] <- x
  return(df_out)
}

options(knitr.kable.NA = '')
# out_table_keep[, .SD, .SDcols = c("Stat", "None", "Holm", "FDR")] |>
out_table_keep[, true_frac := " "]
setnames(out_table_keep, "true_frac", " ")
out_table_keep |>
  kable(digits = c(1, 1, 3, 3, 3)) |>
  kable_styling(full_width = FALSE) |>
  pack_rows("True Fraction = 0.2", 1, 5) |>
  pack_rows("True Fraction = 0.8", 6, 10)
```
Stat | None | Holm | FDR |
---|---|---|---|
True Fraction = 0.2 | | | |
FDR | 0.200 | 0.012 | 0.047 |
FNDR | 0.050 | 0.173 | 0.126 |
Type1 | 0.050 | 0.001 | 0.005 |
power | 0.801 | 0.164 | 0.424 |
FWER | 0.985 | 0.040 | 0.324 |
True Fraction = 0.8 | | | |
FDR | 0.015 | 0.001 | 0.010 |
FNDR | 0.457 | 0.768 | 0.536 |
Type1 | 0.050 | 0.001 | 0.030 |
power | 0.801 | 0.174 | 0.720 |
FWER | 0.641 | 0.012 | 0.449 |
Notes on Table 12.5
- Note that it’s not the Type I error rate that increases with multiple tests, it’s the FWER that increases. It’s important to understand this difference. If every test answers a different question, all we care about is the Type I error rate.
- If a batch of tests contains both true effects and null (zero) effects, the fraction of tests with true effects doesn’t affect the Type I error rate, because the denominator is not the number of tests but the number of tests with no true effect.
- If a batch of tests contains both true effects and null effects, the fraction of tests with true effects does affect the FDR, because there are more false discoveries relative to true discoveries as the fraction of tests with true effects goes to zero. Remember, the denominator is the sum of false and true discoveries. If there are no true effects, 100% of the discoveries are false! If there are only true effects, 0% of the discoveries are false! (A back-of-the-envelope check on the unadjusted FDR is below.)
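As a check on the unadjusted (“None”) FDR rows of the table, the expected unadjusted FDR can be computed from alpha and power alone, using the simulation’s settings (alpha = 0.05, power = 0.8, 100 tests).

```r
# expected unadjusted FDR from expected false and true discoveries
expected_fdr <- function(true_fraction, m = 100, alpha = 0.05, power = 0.8){
  fd <- (1 - true_fraction) * m * alpha # expected false discoveries
  td <- true_fraction * m * power       # expected true discoveries
  fd / (fd + td)
}
round(c(expected_fdr(0.2), expected_fdr(0.8)), 3) # 0.200 and 0.015, matching Table 12.5
```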
12.6 Working in R
12.6.1 The contrast() function has different default adjustment methods for each of the contrast methods
- Tukey HSD is the default for all pairwise contrasts using either method = "pairwise" or method = "revpairwise".
```r
# use the exfig2 data to explore this
m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "revpairwise") |>
  summary(infer = TRUE)
m1_pairs
```
```
 contrast estimate    SE df lower.CL upper.CL t.ratio p.value
 3q - con    -1.09 0.407 26  -2.1011  -0.0793  -2.680  0.0327
 4i - con     1.05 0.396 26   0.0642   2.0320   2.647  0.0351
 4i - 3q      2.14 0.407 26   1.1274   3.1491   5.256  <.0001

Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
P value adjustment: tukey method for comparing a family of 3 estimates
```
Notes:
- The adjustment method for both the confidence interval and the p-value is given in the contrast object.
- Dunnett’s test is the default for comparing all treatments (non-reference groups) to the control (the reference group). Both method = "trt.vs.ctrl" and method = "dunnett" work.
```r
# use the exfig2 data to explore this
m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "trt.vs.ctrl") |>
  summary(infer = TRUE)
m1_pairs
```
```
 contrast estimate    SE df lower.CL upper.CL t.ratio p.value
 3q - con    -1.09 0.407 26   -2.047   -0.133  -2.680  0.0241
 4i - con     1.05 0.396 26    0.117    1.979   2.647  0.0260

Confidence level used: 0.95
Conf-level adjustment: dunnettx method for 2 estimates
P value adjustment: dunnettx method for 2 tests
```
12.6.2 For non-default adjustment, use the adjust = argument
For no adjustment, use adjust = "none".
```r
# use the exfig2 data to explore this
m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "revpairwise",
                     adjust = "none") |>
  summary(infer = TRUE)
m1_pairs
```
```
 contrast estimate    SE df lower.CL upper.CL t.ratio p.value
 3q - con    -1.09 0.407 26   -1.926   -0.254  -2.680  0.0126
 4i - con     1.05 0.396 26    0.234    1.862   2.647  0.0136
 4i - 3q      2.14 0.407 26    1.302    2.974   5.256  <.0001

Confidence level used: 0.95
```
12.6.3 Options in adjust =
- “none” – no adjustment.
Options for controlling FDR
- “BH” or “fdr” – controls the false discovery rate using the logic of Benjamini & Hochberg. The BH method assumes independent tests. This or BY should be used for most uses of p-value adjustment in experimental bench biology.
- “BY” – a modification of BH using the logic of Benjamini & Yekutieli. The BY method relaxes the independent-tests assumption and consequently is more conservative than BH. This or BH should be used for most uses of p-value adjustment in experimental bench biology.
General options for controlling FWER
- “holm” – Holm-Bonferroni is the Holm modification of Bonferroni and is used to control the FWER. It is more powerful than Bonferroni while still holding the FWER at alpha. For most uses of p-value adjustment in experimental bench biology, we are less interested in controlling the FWER than the FDR.
- “bonferroni” – Bonferroni is the original, general-purpose adjustment of a set of p-values to control the FWER. It has less power than the Holm modification, so its use should be historic only.
- “mvt” – based on the multivariate t distribution and using covariance structure of the variables.
Options for controlling FWER that are specific to experimental designs
- “dunnettx” – Dunnett’s test is a method used when comparing all treatments to a single control.
- “tukey” – Tukey’s HSD method is a method used to compare all pairwise comparisons.
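To see how the choice changes the same set of contrasts, here is a small loop over the exfig2 pairwise contrasts – a sketch, assuming the m1_emm object from Section 12.6.1 above.

```r
# the same three pairwise contrasts under different adjust = options
for(adjust_i in c("none", "bonferroni", "holm", "mvt", "tukey")){
  p_i <- contrast(m1_emm, method = "revpairwise", adjust = adjust_i) |>
    summary() |>
    data.table()
  cat(adjust_i, ":", round(p_i$p.value, 4), "\n")
}
```

Note that "mvt" is computed by simulation, so its p-values will vary slightly from run to run.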
12.6.4 Using the p.adjust() function
We use the base R p.adjust() function to adjust a set of p-values that were computed from different models, as in Example 3 above. But to show a very simple example of using the p.adjust() function, let’s use the lipocalin Fig3 data. First get the p-values for the four simple effects on the cardiolipin in panel d. Be sure to use the adjust = "none" argument because we want to adjust unadjusted p-values!
```r
m1 <- lm(cardiolipin_d ~ genotype * treatment, data = fig3)
m1_emm <- emmeans(m1, specs = c("genotype", "treatment"))
# compute the four simple effect p-values
m1_simple <- contrast(m1_emm,
                      method = "revpairwise",
                      simple = "each",
                      combine = TRUE,
                      adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()
# extract the p-values
p_fig3d <- m1_simple[, p.value]
p_fig3d
```

```
[1] 0.034802754 0.497327407 0.003147072 0.668086400
```
Now, use the p.adjust() function:

```r
p_fdr_BH_fig3d <- p.adjust(p_fig3d, "BH")
p_fdr_BH_fig3d
```

```
[1] 0.06960551 0.66310321 0.01258829 0.66808640
```
Notes:
- Again, we wouldn’t typically do this; we would simply use the adjust = argument within the contrast() function and not bother to extract the p-values and then adjust them, as done here.
- We would want to extract the p-values and adjust them ourselves when we have multiple responses for the same experiment and these responses answer the same question, which is common in experimental bench biology and is the case for all three examples above. A sketch of this workflow is below.
12.6.4.1 Options in p.adjust()
- “BH” or “fdr” – controls the false discovery rate using the logic of Benjamini & Hochberg. The BH method assumes independent tests. This or BY should be used for most uses of p-value adjustment in experimental bench biology.
- “BY” – a modification of BH using the logic of Benjamini & Yekutieli. The BY method relaxes the independent-tests assumption and consequently is more conservative than BH. This or BH should be used for most uses of p-value adjustment in experimental bench biology.
- “holm” – Holm-Bonferroni is the Holm modification of Bonferroni and is used to control the FWER. It is more powerful than Bonferroni while still holding the FWER at alpha. For most uses of p-value adjustment in experimental bench biology, we are less interested in controlling the FWER than the FDR.
- “bonferroni” – Bonferroni is the original, general-purpose adjustment of a set of p-values to control the FWER. It has less power than the Holm modification, so its use should be historic only.
12.6.5 A Holm-Sidak function for reproducing GraphPad Prism results
```r
holm_sidak <- function(p){
  # assume alpha = 0.05
  n <- length(p)
  rank_p <- frank(p)
  inv_rank_p <- n - rank_p + 1

  # put pvalues and the inv rankings in rank order
  p_order <- p[order(p)]
  inv_rank_order <- inv_rank_p[order(p)]

  # compute the adjusted p-value
  p_holm_sidak_order <- (1 - (1 - p_order)^(inv_rank_order))

  # adjusted p-values of higher rank cannot be less than adjusted p-value of lower rank
  # so find and adjust these
  for(j in 2:n){
    if(p_holm_sidak_order[j] < p_holm_sidak_order[j-1]){
      p_holm_sidak_order[j] <- p_holm_sidak_order[j-1]
    }
  }

  # any adjusted > 1 make 1
  p_holm_sidak_order[p_holm_sidak_order > 1.0] <- 1.0

  # put back into original order
  p_holm_sidak <- p_holm_sidak_order[rank_p]
  return(p_holm_sidak)
}
```
Notes:
- This function only computes the Holm-Sidak p-values and not the adjusted CIs. The emmeans package uses Bonferroni intervals for the Holm adjustment, and these should be used for the Holm-Sidak adjustment. The function RHSDB::rh.sd.sidak computes CIs in addition to the p-values, but I haven’t sleuthed out what these are.
- This function is used exactly like any other R function. Here I compare both Holm and Holm-Sidak for the lipocalin Fig3 dataset. Just as the Sidak adjustment is slightly more powerful than Bonferroni, the Holm-Sidak is slightly more powerful than Holm. These tiny differences are rather moot, given the TL;DR at the start of this chapter (Chapter 12).
```r
m1 <- lm(cardiolipin_d ~ genotype * treatment, data = fig3)
m1_emm <- emmeans(m1, specs = c("genotype", "treatment"))
# compute the four simple effect p-values
m1_simple <- contrast(m1_emm,
                      method = "revpairwise",
                      simple = "each",
                      combine = TRUE,
                      adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()
# compute holm
m1_simple[, holm := p.adjust(p.value, "holm")]
# compute holm-sidak
m1_simple[, holm_sidak := holm_sidak(p.value)]
m1_simple |>
  kable(digits = c(1,1,1,3,3,1,3,3,1,4,4,4)) |>
  kable_styling()
```
treatment | genotype | contrast | estimate | SE | df | lower.CL | upper.CL | t.ratio | p.value | holm | holm_sidak |
---|---|---|---|---|---|---|---|---|---|---|---|
Saline | . | KO - WT | -0.023 | 0.010 | 14 | -0.043 | -0.002 | -2.3 | 0.0348 | 0.1044 | 0.1008 |
CL | . | KO - WT | 0.006 | 0.009 | 14 | -0.012 | 0.024 | 0.7 | 0.4973 | 0.9947 | 0.7473 |
. | WT | CL - Saline | -0.033 | 0.009 | 14 | -0.052 | -0.013 | -3.6 | 0.0031 | 0.0126 | 0.0125 |
. | KO | CL - Saline | -0.004 | 0.009 | 14 | -0.024 | 0.016 | -0.4 | 0.6681 | 0.9947 | 0.7473 |