12  Multiple tests – why and when to adjust p-values

In many experiments, researchers have multiple treatments or multiple responses. This results in multiple tests against the null. For example, in an experiment with four treatment levels, there are six combinations of contrasts (A - B, A - C, A - D, B - C, B - D, C - D) and a researcher might be interested in all six p-values. Or, in an experiment with only two levels, there could be six response variables (expression level in six different genes. Or, six different phenotypes). Here, there are six contrasts (B - A for all six response variables) and six tests. Researchers commonly adjust for multiple testing in examples like these (the exception is multiple phenotype responses. Curiously, researchers almost never adjust for this). What is multiple testing and why do we want to adjust? And, something that doesn’t seem to be considered much in the experimental bench biology literature, when do we want to adjust?

TL;DR
  1. If your experiment has a single response variable (that is, you’re not measuring the effect of treatment on multiple phenotypes or gene expression levels), then you probably don’t want to adjust for multiple tests.
  2. Adjusting for multiple tests in the experimental bench biology literature is the norm, because researchers are confused on what a “family” is.
  3. Despite No. 2, researchers almost always fail to adjust for multiple tests when they should, specifically, when there are multiple phenotypes or gene expression levels that answer the same experimental question.
  4. When you do adjust as in No. 3, you should probably use a method that controls the False Discovery Rate (FDR) and not the Family-Wise Error Rate (FWER). The reason isn’t because the FDR is more liberal, it’s because it’s more consistent with the goals of the research.

12.1 In a family of multiple tests, what is the probability of at least one p-value less than 0.05?

A family of tests is a set of tests that are used to make some inference about one question. For example, “In this mouse model of diabetes, are any genes differently expressed between diabetes mice and wildtype mice”? \(p < 0.05\) for any of the tested genes answers the question. If a researcher only tests one gene, then the probability of \(p < 0.05\) if there is no expression difference between phenotypes is 5%. But if a researcher tests two genes, there is a higher probability than 5% that at least one will have \(p < 0.05\) if there is no expression difference between phenotypes for both genes. And, if a researcher tests ten genes, there is a much higher probability than 5% that at least one will have \(p < 0.05\) if there is no expression difference between phenotypes for any of the ten genes. What are these probabilities?

The probability of a significant p-value in one test is 0.05, so the probability of no non-significant p-values in one test is 1 - 0.05. The probability of no non-significant p-values in two tests is \((1 - 0.05)^2\), so the probability of at least one significant p-value is \(1 - (1 - 0.05)^2\). The probability of no non-significant p-values in ten tests is \((1 - 0.05)^{10}\), so the probability of at least one significant p-value in ten tests is \(1 - (1 - 0.05)^{10}\). Let’s compute these probabilities, and more.

Family-wise error rate for unadjusted p-values
Number of tests Prob(at least 1 < 0.05)
1 0.05
2 0.10
5 0.23
10 0.40
100 0.99

The second column is the family-wise error rate (FWER), the probability of at least one \(p < 0.05\) in a family of tests in which there are no true effects. So the FWER for a family of 10 tests is 40% if our p-values are not adjusted for multiple testing. We adjust p values in experiments with multiple tests in order to control the FWER.

Note

An assumption of these probabilities is the tests are independent. Importantly, this independence assumption is also true for most, but not all, of the adjustment methods.

12.2 Example 1 – In many experiments, we don’t want to adjust for multiple tests (exfig2d data)

In many experiments, we don’t want to adjust for multiple tests because there is only a single test in the family even if there are multiple tests in the experiment. Researchers are adjusting because of a confusion on what a family is. The cost of this adjustment is more conservative (higher) p-values, which leads to less power, which leads to more failed discoveries or more research costs (mice, time) to discover by p-value.

Let’s explore what a family is using data from the article

Source: Lyu, Q., Xue, W., Liu, R. et al. A brain-to-gut signal controls intestinal fat absorption. Nature (2024). https://doi.org/10.1038/s41586-024-07929-5

Data source

Figure: Extended Figure 2d

The import code is in Section 12.7.1 below.

In this study, the researchers investigate the control of fat absorption in the intestine by the dorsal motor nucleus of the vagus nerve (DMV) and show that the plant metabolite puerarin mimics this DMV suppression.

The experiment in Extended Figure 2d follows up initial experiments from Figure 1 that show that inactivation of the DMV in mice decreased weight gain on a high fat diet relative to control mice and that DMV-inactive mice had lower plasma triglycerides, higher fat content in the feces, and lower fat absorption in the jejunum part of the small intestine. The DMV was inactivated by Clozapine N-oxide (CNO) binding to a human-designed hM4D(Gi) inhibitory receptor whose expression is controlled by the researchers.

In the experiment in Extended Figure 2d, the response is fecal NEFA (non-esterified fatty acid).

The three treatments are

  1. “con” is a negative control with a designer receptor that doesn’t bind CNO. The researchers expect fecal NEFA equal to mice with no designed receptors.
  2. “3q” is the focal treatment using a designer receptor that should activate the DMV. If the DMV regulation of fat absorption is working as thought, the researchers expect increased fat absorption so lower fecal NEFA levels relative to “con”.
  3. “4i” is a positive control using the DMV-inactivating receptor in the previous experiments. The researchers expect higher NEFA levels relative to “con”, as in the experiments in Fig. 1.

There are three contrasts of single treatments (“3q - con”, “4i - con”, “4i - 3q”) so three tests. The researchers follow the norm in experimental bench biology research and report p-values adjusted for multiple tests. Specifically, the researchers adjusted for the two contrasts comparing a treatment to the negative control (“3q - con” and “4i - con”). This adjustment seems to follow the logic that a family is the set of tests of interest to the researchers (the contrast “4i - 3q” is not of interest). But “of interest to the researchers” doesn’t define a family of tests. Instead, “same question” defines a family of tests. If were were to include both contrasts in a family, what is the question? It would have to be something like “if we manipulate the activity of the DMV, do we get a change in fecal NEFA?”.

This isn’t the question addressed by the research. The researchers have one focal question: “does activating the DMV increase fat absorption and consequently, decrease fecal NEFA levels?” Only one test (the contrast “3q - con”) answers this question so the family has a single test and there is no need to adjust. The researchers are interested in a second question as a check on the experiment itself. This question is something like “Are we actually getting the known response given the intervention”. Only one test (the contrast “4i - con”) answers this question so there is no need to adjust.

Let’s look at the cost of adjustment. The researchers used the Holm-Sidák method to adjust the two p-values.

m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "trt.vs.ctrl",
                     adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()

m1_pairs[, holm_sidak := holm_sidak(p.value)]
m1_pvals <- m1_pairs[, .SD, .SDcols = c("contrast", "estimate", "p.value", "holm_sidak")]
colnames(m1_pvals) <- c("Contrast", "Estimate", "Unadjusted P", "Holm-Sidak P")

m1_pvals |>
  kable(digits = c(1, 3, 5, 5)) |>
  kable_styling(full_width = FALSE)
Table 12.1: Unadjusted p-value and Holm-Sidak adjusted p-value for the experiment in Extended Figure 2d
Contrast Estimate Unadjusted P Holm-Sidak P
3q - con -1.090 0.01261 0.02506
4i - con 1.048 0.01361 0.02506

The cost is about double because there are two tests. If we use \(p < 0.05\) to act as if there is an effect, then the adjustment costs us nothing here. Adjusting for a small (two to three) number of tests is not going to have much impact on the quality of science either way. Nevertheless

  1. adjusting is unnecessary in experiments like this, where there are multiple questions addressed by the experiment and each question is answered by a single test. This is the case for many, many experiments in experimental bench biology.
  2. although small, the increased power with not adjusting is free.

12.2.1 Or, maybe we do want to adjust, but not in the way that is usually done

Extended figure 2d presents only one of six response variables that were measured on each mouse. The other response variables were body weight, fecal triglyceride, jejunal triglyceride, duodenal triglyceride, and ileal triglyceride. Some thoughts:

  1. Fecal NEFA and Fecal triglyceride effectively answer the same focal question “does activating the DMV increase fat absorption and consequently, decrease fecal fat levels?” So, adjusting for two tests is probably justified but the two tests are the same contrast (“3q - con”) with two different response variables and not different contrasts for the same response variable.
  2. jejunal, duodenal, and ileal triglyceride effectively answer the same focal question “does activating the DMV increase fat absorption in a specific part of the small intestine”. So adjusting for three tests is probably justified here, but again, the adjustment is for the same contrast (“3q - con”) from the model fit to three different response variables.
  3. Adjustment for multiple responses is common in some subfields of experimental bench biology that are measuring thousands of responses (gene expression, functional MRI) but I I haven’t seen adjustment for multiple phenotypic responses from a single experiment.

12.3 Example 2 – Tukey adjustment is extremely common, and (mostly) unjustified (fig5l data)

A very common research design in experimental bench biology is a design with two factors, each with two levels, with mice assigned to all 2 x 2 = 4 combinations of the levels. Analysis of these models for two (or more) factors is the focus of Chapter 13. The experiment in Figure 5l is a good example of a 2 x 2 design.

Data source

Figure: Figure 5l

The import code is in Section 12.7.2 below

The experiment in Fig. 5 follows up experiments that show that the plant metabolite puerarin binds to the GABA receptor encoded by Gabra1 and increases GABA’s inhibitory effect on the target neuron. In the experiment in Fig. 5, the researchers are investigating the mechanism of the effect of DVM-vagus supression on fat absorption in the jejunum. Figure 5c shows that suppressing DMV using COS activation of the designer 4i inhibitory receptor (as above) results in decreased jejunal microvilli length – the reduced length decreases surface area for absorption. Figure 5d shows that suppressing DMV results in reduced expression of four genes involved in microvilli growth and organization. The design of the experiment in Fig. 5c, d is the same as that in Extended fig 2d above.

Fig. 5l uses a 2 x 2 crossed design to investigate the effect of puerarin with and without a functional Gabra1 GABA receptor.

Here, we’ll focus on the first gene, ezr, which encodes the protein ezrin.

The two factors are

  1. genotype, with levels “WT” and “KO”. KO is the conditional knockout of Gabra1 in the DMV.
  2. treatment, with levels “Vehicle” and “Puerarin”. Infusion of a saline solution (Vehicle) or a solution with Puerarin.

All levels of both factors are combined to create the four experimental combinations (the design is fully factorial)

  1. “WT Vehicle” is the combination “WT” + “Vehicle”. This is the negative control.
  2. “WT Puerarin” is the combination “WT” + “Puerarin”.
  3. “KO Vehicle” is the combination “KO” + “Vehicle”.
  4. “KO Puerarin” is the combination “KO” + “Puerarin”

The primary questions of the experiment are:

  1. does addition of puerarin reduce RNA expression in mice with functional Gabra1 GABA receptor?
  2. does deletion of Gabra1 remove this effect of puerarin?

The researchers answer these questions using these two contrasts

  1. WT Puerarin - WT Vehicle
  2. KO Puerarin - KO Vehicle

So, there are two research questions, each answered by a single contrast. There is nothing to adjust because each family consists of only a single test.

Note

Look at question 2 closely – its a comparison of contrast 1 and contrast 2. Ultimately, the researchers are asking about the effects of Puerarin with and without a functional Gabra1 GABA receptor, so there is a single primary contrast of interest

(KO Puerarin - KO Vehicle) - (WT Puerarin - WT Vehicle)

This contrast is called the interaction effect or simply the interaction. The interaction is the difference in contrasts 1 and 2. An interaction is a difference of differences. The estimation of interaction effects almost always answers the focal question in 2 x 2 designs but is almost never estimated in experimental bench biology. More information on designs for two factors and interaction effects is in chapter (two-factors?)

What did the researchers do? Of the six pairwise contrasts, only two of which they focussed on in the text, the researchers report the four “simple effects” in the figure and in the archived Excel file. These are:

  1. (WT Puerarin) - (WT Vehicle) answers the question “how does microvilli-related gene expression change when we add puerarin to mice with fuctional Gabra1?”. This is a positive control for contrast 2.
  2. (KO Puerarin) - (KO Vehicle) answers the question “how does microvilli-related gene expression change when we add puerarin to mice with knocked out Gabra1?”. This is a focal contrast.
  3. (KO Vehicle)) - (WT Vehicle) answers the question “how does microvilli-related gene expression change when we knock out Gabra1 and don’t give puerarin administration? This is a positive control for contrast 4.
  4. (KO Puerarin) - (WT Puerarin) answers the question “how does microvilli-related gene expression change when we knock out Gabra1 but give puerarin administration? This is a focal contrast.

Instead of adjusting for the two tests of interest, or the four computed tests, the researchers followed the norm in experimental bench biology and used the Tukey HSD adjustment for a single family of six tests (all six pairwise tests). Perhaps the norm is following a definition of family as “all tests within an experiment”. Regardless, the adjustment logic in Fig. 5l is inconsistent with that in Fig 2d., where the researchers adjusted for the two tests of interest instead of the set of all three pairwise comparisons in the experiment. In Fig. 5l, the researchers adjust for all six pairwise comparisons, show only four contrasts in the figure and in the Excel file of the data, but only seem interested in the two contrasts that answer the focal questions.

What is the cost of adjustment? In Table 12.2 below, for the two focal tests, I give the unadjusted p-values, the Tukey p-values based on all six pairwise tests, and the Holm p-values based on the four simple effects. The cost of the Tukey and Holm adjustments are given in the last two columns as the ratio of the adjusted to unadjusted p-value. This loss of power is an unnecessary and results from a confusion in the experimental bench biology community of what a family of tests mean. Loss of power leads to more time, more dollars, more animals, and more failed discoveries.

m1 <- lm(ezrin ~ genotype * treatment, data = fig5l)
m1_emm <- emmeans(m1, specs = c("genotype", "treatment"))
m1_pairs <- contrast(m1_emm,
                     method = "revpairwise",
                     adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()

m1_tukey <- contrast(m1_emm,
                     method = "revpairwise") |>
  summary(infer = TRUE) |>
  data.table()

m1_simple <- contrast(m1_emm,
                      method = "revpairwise",
                      simple = "each",
                      combine = TRUE,
                      adjust = "holm") |>
  summary(infer = TRUE) |>
  data.table()

m1_pairs[, tukey := m1_tukey$p.value]
m1_pairs[c(1,2,5,6), holm := m1_simple[c(1,3,4,2), p.value]]

m1_pairs[ , cost_tukey := tukey/p.value]
m1_pairs[ , cost_holm := holm/p.value]

m1_pvals <- m1_pairs[, .SD, .SDcols = c("contrast", "estimate", "p.value", "tukey", "holm", "cost_tukey", "cost_holm")]
colnames(m1_pvals) <- c("Contrast", "Estimate", "Unadjusted P", "Tukey P", "Holm P", "Tukey/Unadjusted", "Holm/Unadjusted")

# only the simple effects are of interest. These are rows 1, 2, 5, 6
m1_pvals[c(2,5), ] |>
  kable(digits = c(1, 3, 4, 4, 4, 1, 1)) |>
  kable_styling(full_width = FALSE)
Table 12.2: Unadjusted p-value and Tukey adjusted p-value for the experiment in Fig. 5l
Contrast Estimate Unadjusted P Tukey P Holm P Tukey/Unadjusted Holm/Unadjusted
WT Puerarin - WT Vehicle -0.554 0.0032 0.0159 0.0096 5.0 3
KO Puerarin - KO Vehicle -0.020 0.9085 0.9994 0.9085 1.1 1

12.3.1 Or, maybe we do want to adjust, but not in the way that is usually done

For each of the contrasts in the Fig. 5l example above, the researchers measured the response in four microvilli-related genes. Each of these genes answers the question posed for a contrast. For example, all four genes answer the question for contrast 1, which is “how does microvilli-related gene expression change when we add puerarin to mice with functional Gabra1?” The family is the four contrasts (WT Puerarin - WT Vehicle) for each of the four genes. The researchers should adjust for the four tests within this family. The next section shows how this is done.

Table 12.3

12.4 Example 3 – When we do want to adjust, we often want to use a method that controls the False Discovery Rate (FDR) (lipocalin fig3 data)

Wait What?

When \(p < 0.05\), researchers act as if they have discovered or confirmed something, so \(p < 0.05\) is evidence of a discovery. Consider these two situations

  1. \(p < 0.05\) and the Null Hypothesis is false – there is a true effect of treatment, or, there is a true difference in means between the two treatment levels. This is a true discovery.
  2. \(p < 0.05\) but the Null Hypothesis is true – there is no true effect of treatment, or, there is no true difference in means between the two treatment levels. This is a false discovery.

The false discovery rate (FDR) in a batch of tests is

\[ FDR = \frac{number\;of\;false\;discoveries}{number\;of\;false\;discoveries + number\;of\;true\;discoveries} \]

Remember that the family wise error rate (FWER) is the long term probability of at least one \(p < 0.05\) in a batch of tests in which all Null Hypothesis are true – there are no true effects (Section 12.1).

Note that the FWER for a single batch of tests is either 1 (there was at least one false positive) or 0 (there were no false positives). By contrast, an omniscient researcher can compute the FDR for a single batch of tests.

For any batch of tests in which some effects are real, and some are not

  1. A FWER method of p-value adjustment controls the long-term FWER at \(\leq\) 0.05.
  2. An FDR method of p-value adjustment controls the long-term FDR at \(\leq\) 0.05.

Source: Su, H., Guo, H., Qiu, X. et al. Lipocalin 2 regulates mitochondrial phospholipidome remodeling, dynamics, and function in brown adipose tissue in male mice. Nat Commun 14, 6729 (2023). https://doi.org/10.1038/s41467-023-42473-2

Data source

Figure: Figure 3

The import code is in Section 12.7.3 below

The researchers are investigating the role of Lipocalin-2 (LCN2) on mitochondrial membrane function. Using a LCN2 knockout and the thermogenic molecule CL316,243, the researchers scanned for differences in levels of 246 lipids between WT and KO mice with and without CL316,243 treatment. Of the different classes of lipids, the biggest differences were found in cardiolipins. The panels in Fig 3d-u show cardiolipin levels for each of the four groups for 18 of the 26 cardiolipins with at least one significant p-value for a difference between WT and KO mice.

The two factors are

  1. genotype, with levels “WT” and “KO”, where KO is the LCN2 knockout.
  2. treatment, with levels “Saline” and “CL316243”, where CL316243 is administration of CL316,243.

All levels of both factors are combined to create the four treatment combinations

  1. “WT Saline” is the combination “WT” + “Saline”. This is the negative control.
  2. “WT CL316243” is the combination “WT” + “CL316,243”.
  3. “KO Saline” is the combination “KO” + “Saline”.
  4. “KO CL316243” is the combination “KO” + “CL316,243”

The researchers do not have a model of how any specific cardiolipin will respond to any specific treatment combination. Instead, the research question seems to be something like, “do we see differences in CL synthesis under different combinations of LCN2 activity and CL316,243?” For each cardiolipin, the four simple effect contrasts answer this question. But the question remains the same for each cardiolipin, so there are 104 (4 x 26) tests in the family. Even though the researchers only present results for 18 cardiolipins, we have to include the 8 non-presented cardiolipins because these were also analyzed.

Now this is an example of an experiment where we should adjust p-values for multiple testing!

Which adjustment method?
  1. Holm. This modification of Bonferroni adjusts p-values to control the family-wise error rate (FWER) and would be relevant if the researchers were asking the question “does at least one cardiolipin have different synthesis level under any combination of genotype and treatment?”
  2. BH (Benjamini-Hochberg). This method based on Benjamini-Hochberg adjusts p-values to control the false discovery rate (FDR) and would be relevant if the researchers were asking “which cardiolipins have different synthesis levels and under which combination of genotype and treatment?”

We control FWER if we want to show that “we have evidence of something happening in this system”. We control FDR if we want to show that “we have evidence that we’ve discovered a specific component in the system”.

I don’t think the FWER question is very relevant to most experimental bench biology.

In a sense, BH is more liberal and Holm is more conservative but this isn’t a good way to think about the these because this is comparing apples and oranges. Instead of using BH to “gain power” or Holm to “be conservative”, just use a different alpha. If the cost of false discovery is cheap but the potential for reward is high, increase alpha. If the cost of false discovery is expensive, decrease alpha.

cl_cols <- names(fig3)[which(substr(names(fig3), 1, 11) == "cardiolipin")]
n_cardiolipins <- length(cl_cols)

p_table <- data.table(NULL)
for(i in 1:n_cardiolipins){
  lipid_id <- cl_cols[i]
  formula_i <- paste(lipid_id, "~", "group") |>
    formula()
  m1 <- lm(formula_i, data = fig3)
  m1_pairs <- emmeans(m1, specs = "group") |>
    contrast(method = "revpairwise",
             adjust = "none") |>
    summary() |>
    data.table()
  m1_pairs <- m1_pairs[c(1,6,2,5),]
  p1 <- t.test(fig3[group == "WT-Saline", get(lipid_id)],
               fig3[group == "WT-CL", get(lipid_id)],
               var.equal = FALSE)$p.value
  p2 <- t.test(fig3[group == "KO-Saline", get(lipid_id)],
               fig3[group == "KO-CL", get(lipid_id)],
               var.equal = TRUE)$p.value
  p3 <- t.test(fig3[group == "WT-Saline", get(lipid_id)],
               fig3[group == "KO-Saline", get(lipid_id)],
               var.equal = TRUE)$p.value
  p4 <- t.test(fig3[group == "WT-CL", get(lipid_id)],
               fig3[group == "KO-CL", get(lipid_id)],
               var.equal = TRUE)$p.value
  p_table <- rbind(
    p_table,
    data.table(
      "lipid_id" = lipid_id,
      m1_pairs[, .SD, .SDcols = c("contrast", "p.value")],
      ttest = c(p1, p2, p3, p4)
    )
  )
}

# adjust over all contrasts
pool_contrast <- TRUE
if(pool_contrast == TRUE){
  # add 8 more cardiolipins not included in the plot
  # us lm pvalue instead of t test pvalue because t is less efficient
  p_set <- c(p_table[, p.value], rep(0.99, 8 * 4)) # 8 cardiolipins * 4 contrasts
  p_holm <- p.adjust(p_set, "holm")[1:(18 * 4)]
  p_bh <- p.adjust(p_set, "BH")[1:(18 * 4)]
  p_by <- p.adjust(p_set, "BY")[1:(18 * 4)]
  
  p_table[, holm := p_holm]
  p_table[, fdr_bh := p_bh]
  p_table[, fdr_by := p_by]
}

# adjust within each contrast, not over all contrasts
by_contrast <- FALSE
if(by_contrast == TRUE){
  contrast_list <- unique(p_table$contrast)
  for(contrast_i in 1:length(contrast_list)){
    p_set <- p_table[contrast == contrast_list[contrast_i], ttest]
    p_set <- c(p_set, rep(0.99, 8))
    p_holm <- p.adjust(p_set, "holm")[1:18]
    p_bh <- p.adjust(p_set, "BH")[1:18]
    p_by <- p.adjust(p_set, "BY")[1:18]
    p_table[contrast == contrast_list[contrast_i],
            holm := p_holm]
    p_table[contrast == contrast_list[contrast_i],
            fdr_bh := p_bh]
    p_table[contrast == contrast_list[contrast_i],
            fdr_by := p_by]
  }
}


# p_table_order <- doBy::orderBy(~ contrast + lipid_id, p_table)
# p_table_order[, .SD, .SDcols = setdiff(names(p_table), "contrast")] |>
#   setnames("lipid_id", "") |>
#   kable(digits = 3) |>
#   kable_styling(full_width = FALSE) |>
#   pack_rows(contrast_list[1], 1, 18) |>
#   pack_rows(contrast_list[2], 19, 36) |>
#   pack_rows(contrast_list[3], 37, 54)

# researcher results
# set1 d-h
#WT: cl lower
#KO: cl not lower
# set2 i-m
#WT cl not lower
#KO: cl lower
# set3 n-u
#WT cl higher
#KO: cl higher in 3/8

data.table(
  "Method" = c("lm (unadjusted)", "t test (unadjusted)", "BH (FDR)", "BY (FDR)", "Holm (FWER)"),
  "N (p<0.05)" = c(sum(p_table[, p.value] < 0.05),
             sum(p_table[, ttest] < 0.05),
             sum(p_table[, fdr_bh] < 0.05),
             sum(p_table[, fdr_by] < 0.05),
             sum(p_table[, holm] < 0.05)
  )
) |>
  kable() |>
  kable_styling(full_width = FALSE)
Table 12.4: Number of discoveries (p < 0.05) among the 104 tests
Method N (p<0.05)
lm (unadjusted) 28
t test (unadjusted) 26
BH (FDR) 16
BY (FDR) 9
Holm (FWER) 7

Notes on Table 12.4

  1. For these data, there are more discoveries (p < 0.05) using the linear model with all four groups than using the pairwise t-tests used by the researchers. The linear model is slightly more powerful than pairwise t-tests because of a smaller SE, but differences in discovery rate between the two methods in any one batch will depend more on patterns of sample variances.
  2. The BH adjustment reduced the number of discoveries by nearly half. The BY adjustment reduced the number of discoveries by nearly 2/3. BY is an FDR method based on Benjamini & Yekutieli that is modified for non-independent tests. The tests here are not independent because the cardiolipin levels are correlated. Consequently, BY will have better control of FDR than BH.
  3. The Holm adjustment reduced the number of discoveries by three-fourths. I include it here simply for comparison. As argued above, Holm, or any FWER, doesn’t answer the right question.
  4. The researchers didn’t adjust. Should they have? If the goal is discovery on how combinations of LCN2 and CL316,243 effect cardiolipin synthesis, and they don’t have working models of the system, then yes, they should adjust.
  5. One way that R, or any programming language, is more efficient than menu-driven statistical software is the ability to write scripts to combine analyses, which is what we need to do to adjust p-values from multiple fit statistical models.

12.5 Better Know FDR {sec-mult-tests-fdr}

Use this simulation to explore FDR, FWER, and Type I error. Here, I use it to explore how the BH adjustment behaves in a batch of 100 tests in which only 20% have true effects, and a batch in which 80% have true effects. I use BH because the tests are independent in this simulation.

set.seed(1)
fileout_path <- here("output", "chap_mult_tests_fdr_sim.Rds")
do_it <- FALSE

n_iter <- 10000
n_tests <- 100
alpha <- 0.05

if(do_it){
  out_table <- data.table(NULL)
  for(true_fraction in c(.2, 0.8)){
    # true_fraction <- .2 # fraction of tests with true effect
    false_fraction <- 1 - true_fraction # fraction of tests with no effect
    n_true <- true_fraction * n_tests # number of tests with true effect
    n_false <- false_fraction * n_tests # number of tests with no effect
    power_exp <- .8
    delta <- power.t.test(n = 10, sd = 1, power = power_exp, sig.level = alpha)$delta
    p_true <- numeric(n_true)
    p_false <- numeric(n_false)
    
    true_discoveries_none <- 0.0
    false_nondiscoveries_none <- 0.0
    false_discoveries_none <- 0.0
    true_nondiscoveries_none <- 0.0
    
    true_discoveries_holm <- 0.0
    false_nondiscoveries_holm <- 0.0
    false_discoveries_holm <- 0.0
    true_nondiscoveries_holm <- 0.0
    
    true_discoveries_fdr <- 0.0
    false_nondiscoveries_fdr <- 0.0
    false_discoveries_fdr <- 0.0
    true_nondiscoveries_fdr <- 0.0
    
    FWE_none <- 0.0
    FWE_holm <- 0.0
    FWE_fdr <- 0.0
    type_1 <- 0.0
    power <- 0.0
    
    for(iter in 1:n_iter){
      for(j in 1:n_true){
        p_true[j] <- t.test(rnorm(n = 10), rnorm(n = 10, mean = delta), var.equal = TRUE)$p.value
      }
      for(j in 1:n_false){
        p_false[j] <- t.test(rnorm(n = 10), rnorm(n = 10), var.equal = TRUE)$p.value
      }
      type_1 <- type_1 + sum(p_false < 0.05)/n_false
      power <- power + sum(p_true < 0.05)/n_true
      
      p_none <- c(p_true, p_false)
      p_holm <- p.adjust(p_none, method = "holm")
      p_fdr <- p.adjust(p_none, method = "BH")
      p_true_holm <- p_holm[1:n_true]
      p_true_fdr <- p_fdr[1:n_true]
      p_false_holm <- p_holm[(n_true + 1):n_tests]
      p_false_fdr <- p_fdr[(n_true + 1):n_tests]
      
      if(sum(p_false < 0.05) > 0){FWE_none <- FWE_none + 1}
      true_discoveries_none <- true_discoveries_none + sum(p_true < 0.05)
      false_nondiscoveries_none <- false_nondiscoveries_none + sum(p_true > 0.05)
      false_discoveries_none <- false_discoveries_none + sum(p_false < 0.05)
      true_nondiscoveries_none <- true_nondiscoveries_none + sum(p_false > 0.05)
      
      if(sum(p_false_holm < 0.05) > 0){FWE_holm <- FWE_holm + 1}
      true_discoveries_holm <- true_discoveries_holm + sum(p_true_holm < 0.05)
      false_nondiscoveries_holm <- false_nondiscoveries_holm + sum(p_true_holm > 0.05)
      false_discoveries_holm <- false_discoveries_holm + sum(p_false_holm < 0.05)
      true_nondiscoveries_holm <- true_nondiscoveries_holm + sum(p_false_holm > 0.05)
      
      if(sum(p_false_fdr < 0.05) > 0){FWE_fdr <- FWE_fdr + 1}
      true_discoveries_fdr <- true_discoveries_fdr + sum(p_true_fdr < 0.05)
      false_nondiscoveries_fdr <- false_nondiscoveries_fdr + sum(p_true_fdr > 0.05)
      false_discoveries_fdr <- false_discoveries_fdr + sum(p_false_fdr < 0.05)
      true_nondiscoveries_fdr <- true_nondiscoveries_fdr + sum(p_false_fdr > 0.05)
    }
    
    total_discoveries_none <- true_discoveries_none + false_discoveries_none
    total_non_discoveries_none <- false_nondiscoveries_none + true_nondiscoveries_none
    TDR_none <- true_discoveries_none/total_discoveries_none
    FDR_none <- false_discoveries_none/total_discoveries_none
    FNDR_none <- false_nondiscoveries_none/total_non_discoveries_none
    type_1_none <- false_discoveries_none/n_false/n_iter
    power_none <- true_discoveries_none/n_true/n_iter
    FWER_none <- FWE_none/n_iter
    
    total_discoveries_holm <- true_discoveries_holm + false_discoveries_holm
    total_non_discoveries_holm <- false_nondiscoveries_holm + true_nondiscoveries_holm
    TDR_holm <- true_discoveries_holm/total_discoveries_holm
    FDR_holm <- false_discoveries_holm/total_discoveries_holm
    FNDR_holm <- false_nondiscoveries_holm/total_non_discoveries_holm
    type_1_holm <- false_discoveries_holm/n_false/n_iter
    power_holm <- true_discoveries_holm/n_true/n_iter
    FWER_holm <- FWE_holm/n_iter
    
    total_discoveries_fdr <- true_discoveries_fdr + false_discoveries_fdr
    total_non_discoveries_fdr <- false_nondiscoveries_fdr + true_nondiscoveries_fdr
    TDR_fdr <- true_discoveries_fdr/total_discoveries_fdr
    FDR_fdr <- false_discoveries_fdr/total_discoveries_fdr
    FNDR_fdr <- false_nondiscoveries_fdr/total_non_discoveries_fdr
    type_1_fdr <- false_discoveries_fdr/n_false/n_iter
    power_fdr <- true_discoveries_fdr/n_true/n_iter
    FWER_fdr <- FWE_fdr/n_iter
    
    out_table <- rbind(
      out_table,
      data.table(
        true_frac = true_fraction,
        Stat = c("TD", "FD", "TDR", "FDR", "FNDR", "Type1", "power", "FWER"),
        None = c(true_discoveries_none, false_discoveries_none, TDR_none,
                 FDR_none, FNDR_none, type_1_none, power_none, FWER_none),
        Holm = c(true_discoveries_holm, false_discoveries_holm, TDR_holm,
                 FDR_holm, FNDR_holm, type_1_holm, power_holm, FWER_holm),
        FDR = c(true_discoveries_fdr, false_discoveries_fdr, TDR_fdr,
                FDR_fdr, FNDR_fdr, type_1_fdr, power_fdr, FWER_fdr)
      )
    )
    saveRDS(out_table, fileout_path)
  } 
}else{
  out_table <- readRDS(fileout_path)
}

  out_table_keep <- out_table[Stat %in% c("FDR", "FNDR", "Type1", "power", "FWER"), ]

  row_digits <- function(df, d){
    # df is a data.frame or data.table
    # d is a single value for all rows or vector of digits for each row
    
    inc <- which(unlist(lapply(df, is.numeric)))
    x <- round(Filter(is.numeric, df), d)|>
      apply(2, as.character) |>
      data.table()
    df_out <- apply(df, 2, as.character) |>
      data.table()
    df_out[, inc] <- x
    return(df_out)
  }
  
    options(knitr.kable.NA = '')

#  out_table_keep[, .SD, .SDcols = c("Stat", "None", "Holm", "FDR")] |>
  out_table_keep[, true_frac := " "]
  setnames(out_table_keep, "true_frac", " ")
  out_table_keep |>
    kable(digits = c(1, 1,3,3,3)) |>
    kable_styling(full_width = FALSE) |>
    pack_rows("True Fraction = 0.2", 1, 5) |>
    pack_rows("True Fraction = 0.8", 6, 10)
Table 12.5: Performance statistics for no adjustment, BH adjustment, and Holm adjustment for a batch of 100 independent tests with two different fractions of true effects. FNDR is the False Non-Discovery Rate, the fraction of false non-discoveries (p > 0.05 but the effect is true) to the total number of non-discoveries (p > 0.05). We want this fraction to be low.
Stat None Holm FDR
True Fraction = 0.2
FDR 0.200 0.012 0.047
FNDR 0.050 0.173 0.126
Type1 0.050 0.001 0.005
power 0.801 0.164 0.424
FWER 0.985 0.040 0.324
True Fraction = 0.8
FDR 0.015 0.001 0.010
FNDR 0.457 0.768 0.536
Type1 0.050 0.001 0.030
power 0.801 0.174 0.720
FWER 0.641 0.012 0.449

Notes on Table 12.5

  1. Note that it’s not the Type I error rate that increases with multiple tests, it’s the FWER that increases. It’s important to understand this difference. If every test answers a different question, all we care about is the Type I error rate.
  2. If a batch of tests contains both true effects and null (zero) effects, the fraction of tests with true effects doesn’t effect the Type I error rate because the denominator is not the number of tests but the number of tests with no true effect.
  3. If a batch of tests contains both true effects and null effects, the fraction of tests with true effects does effect the FDR because there are more false discoveries relative to true discoveries as the fraction of tests with true effects goes to zero. Remember, the denominator is the sum of false and true discoveries. If there are no true effects, 100% of the discoveries are false! If there are only true effects, 0% of the discoveries are false!

12.6 Working in R

12.6.1 The contrast() function has different default adjustment methods for each of the contrast methods

  1. Tukey HSD is the default for all pairwise contrasts using either method = "pairwise" or method = "revpairwise"
# use the exfig2 data to explore this
m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "revpairwise") |>
  summary(infer = TRUE)
m1_pairs
 contrast estimate    SE df lower.CL upper.CL t.ratio p.value
 3q - con    -1.09 0.407 26  -2.1011  -0.0793  -2.680  0.0327
 4i - con     1.05 0.396 26   0.0642   2.0320   2.647  0.0351
 4i - 3q      2.14 0.407 26   1.1274   3.1491   5.256  <.0001

Confidence level used: 0.95 
Conf-level adjustment: tukey method for comparing a family of 3 estimates 
P value adjustment: tukey method for comparing a family of 3 estimates 

Notes:

  1. The adjustment method for both the confidence interval and the p-value is given in the contrast object.
  2. Dunnett’s test is the default for comparing all treatments (non-reference groups) to the control (the reference group). Both method = "trt.vs.ctrl" and method = "dunnett" work.
# use the exfig2 data to explore this
m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "trt.vs.ctrl") |>
  summary(infer = TRUE)
m1_pairs
 contrast estimate    SE df lower.CL upper.CL t.ratio p.value
 3q - con    -1.09 0.407 26   -2.047   -0.133  -2.680  0.0241
 4i - con     1.05 0.396 26    0.117    1.979   2.647  0.0260

Confidence level used: 0.95 
Conf-level adjustment: dunnettx method for 2 estimates 
P value adjustment: dunnettx method for 2 tests 

12.6.2 For non-default adjustment, use the adjust = argument

For no adjustment, use adjust = "none"

# use the exfig2 data to explore this
m1 <- lm(nefa ~ treatment, data = exfig2)
m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
                     method = "revpairwise",
                     adjust = "none") |>
  summary(infer = TRUE)
m1_pairs
 contrast estimate    SE df lower.CL upper.CL t.ratio p.value
 3q - con    -1.09 0.407 26   -1.926   -0.254  -2.680  0.0126
 4i - con     1.05 0.396 26    0.234    1.862   2.647  0.0136
 4i - 3q      2.14 0.407 26    1.302    2.974   5.256  <.0001

Confidence level used: 0.95 

12.6.3 Options in adjust =

  1. “none” – no adjustment.

Options for controlling FDR

  1. “BH” or “fdr” – controls the false discovery rate using the logic of Benjamini & Hochberg. The BH method assumes independent tests. This or BY should be used for most uses of p-value adjustment in experimental bench biology.
  2. “BY” – is a modification of BH using the logic of Benjamini & Yekutieli. The BY method relaxes the independent test assumption and consequently is more consertive than BH. This or BY should be used for most uses of p-value adjustment in experimental bench biology.

General options for controlling FWER

  1. “holm” – Holm-Bonferroni is the Holm modification of Bonferroni and is used to control the FWER. It is more powerful than Bonferroni while still holding FWER at alpha. For most uses of p-value adjustment in experimental bench biology, we are less interested in controling the FWER than the FDR.
  2. “bonferroni” – Bonferroni is the original, general-purpose adjustment for a set of p-values to control the FWER. It has less power than the Holm modification, so it’s use should be historic only.
  3. “mvt” – based on the multivariate t distribution and using covariance structure of the variables.

Options for controlling FWER that are specific to experimental designs

  1. “dunnettx” – Dunnett’s test is a method used when comparing all treatments to a single control.
  2. “tukey” – Tukey’s HSD method is a method used to compare all pairwise comparisons.

12.6.4 Using the p.adjust() function

We use the base R p.adjust() function to adjust a set of p-values that were computed from different models, as in Example 3 above.

But to show a very simple example of using the p.adjust() function, let’s use the lipocalin Fig3 data. First get the p values for the four simple effects on the cardiolipin in panel d. Be sure to use the adjust = "none argument because we want to adjust unadjusted p-values!

m1 <- lm(cardiolipin_d ~ genotype * treatment, data = fig3)
m1_emm <- emmeans(m1, specs = c("genotype", "treatment"))
# compute the four simple effect p-values
m1_simple <- contrast(m1_emm,
                      method = "revpairwise",
                      simple = "each",
                      combine = TRUE,
                      adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()
# extract the p-values

p_fig3d <- m1_simple[, p.value]
p_fig3d
[1] 0.034802754 0.497327407 0.003147072 0.668086400

Now, use the p.adjust() function

p_fdr_BH_fig3d <- p.adjust(p_fig3d, "BH")
p_fdr_BH_fig3d
[1] 0.06960551 0.66310321 0.01258829 0.66808640

Notes:

  1. Again, we wouldn’t typically do this, we would simply use the “adjust =” argument within the contrast() function and not bother to extract the p-values and then adjust these, as done here.
  2. When we would want to extract the p-values and adjust is when we have multiple responses for the same experiment and these responses answer the same question, which is common in experimental bench biology and is the case for all three examples above.

12.6.4.1 Options in p.adjust()

  1. “BH” or “fdr” – controls the false discovery rate using the logic of Benjamini & Hochberg. The BH method assumes independent tests. This or BY should be used for most uses of p-value adjustment in experimental bench biology.
  2. “BY” – is a modification of BH using the logic of Benjamini & Yekutieli. The BY method relaxes the independent test assumption and consequently is more consertive than BH. This or BY should be used for most uses of p-value adjustment in experimental bench biology.
  3. “holm” – Holm-Bonferroni is the Holm modification of Bonferroni and is used to control the FWER. It is more powerful than Bonferroni while still holding FWER at alpha. For most uses of p-value adjustment in experimental bench biology, we are less interested in controling the FWER than the FDR.
  4. “bonferroni” – Bonferroni is the original, general-purpose adjustment for a set of p-values to control the FWER. It has less power than the Holm modification, so it’s use should be historic only.

12.6.5 A Holm-Sidak function for reproducing GraphPad Prism results

holm_sidak <- function(p){
  # assume alpha = 0.05
  n <- length(p)
  rank_p <- frank(p)
  inv_rank_p <- n - rank_p + 1
  
  # put pvalues and the inv rankings in rank order
  p_order <- p[order(p)]
  inv_rank_order <- inv_rank_p[order(p)]
  
  # compute the adjusted p-value
  p_holm_sidak_order <- (1 - (1 - p_order)^(inv_rank_order))
  
  # adjusted p-values of higher rank cannot be less than adjusted p-value of lower rank
  # so find and adjust these
  for(j in 2:n){
    if(p_holm_sidak_order[j] < p_holm_sidak_order[j-1]){
      p_holm_sidak_order[j] <- p_holm_sidak_order[j-1]
    }
  }
  
  # any adjusted > 1 make 1
  p_holm_sidak_order[p_holm_sidak_order > 1.0] <- 1.0
  
  # put back into original order
  p_holm_sidak <- p_holm_sidak_order[rank_p]
  return(p_holm_sidak)
}

Notes:

  1. This function only computes the Holm-Sidak p-value and not the adjusted CIs. The emmeans package uses Bonferroni intervals for the Holm adjustment, and these should be used for the Holm-Sidak adjustment. The function RHSDB::rh.sd.sidak computes CIs in addition to the p-values but I haven’t sleuthed out what these are.
  2. This function is used exactly like any other R function. Here I compare both Holm and Holm-Sidak for the Lipocalin Fig3 dataset. Just like the Sidak adjustment is slightly more powerful than Bonferroni, the Holm-Sidak is slightly more powerful than Holm. These tiny differences are rather moot, given the TL;DR at the start of this chapter (Chapter 12).
m1 <- lm(cardiolipin_d ~ genotype * treatment, data = fig3)
m1_emm <- emmeans(m1, specs = c("genotype", "treatment"))
# compute the four simple effect p-values
m1_simple <- contrast(m1_emm,
                      method = "revpairwise",
                      simple = "each",
                      combine = TRUE,
                      adjust = "none") |>
  summary(infer = TRUE) |>
  data.table()

# compute holm
m1_simple[, holm := p.adjust(p.value, "holm")]

# compute holm-sidak
m1_simple[, holm_sidak := holm_sidak(p.value)]
m1_simple |>
  kable(digits = c(1,1,1,3,3,1,3,3,1,4,4,4)) |>
  kable_styling()
treatment genotype contrast estimate SE df lower.CL upper.CL t.ratio p.value holm holm_sidak
Saline . KO - WT -0.023 0.010 14 -0.043 -0.002 -2.3 0.0348 0.1044 0.1008
CL . KO - WT 0.006 0.009 14 -0.012 0.024 0.7 0.4973 0.9947 0.7473
. WT CL - Saline -0.033 0.009 14 -0.052 -0.013 -3.6 0.0031 0.0126 0.0125
. KO CL - Saline -0.004 0.009 14 -0.024 0.016 -0.4 0.6681 0.9947 0.7473

12.7 Hidden Import Code

12.7.1 Example 1 (exfig2 data)

data_from <- "A brain-to-gut signal controls intestinal fat absorption"
file_name <- "41586_2024_7929_MOESM13_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)
exfig2_import <- function(phenotype = "xxx", range_in = "xxx"){
  exfig2_part <- read_excel(file_path,
                            sheet = "Sheet1",
                            range = range_in,
                            col_names = TRUE) |>
    data.table() |>
    melt(variable.name = "treatment",
         value.name = phenotype)
  return(exfig2_part)
}

exfig2 <- exfig2_import(phenotype = "nefa",
                        range = "A47:C57")
exfig2 <- cbind(exfig2,
                "fecal_tg" = exfig2_import(phenotype = "fecal_tg",
                                           range = "A64:C74")$fecal_tg)
exfig2 <- cbind(exfig2,
                "jejunal_tg" = exfig2_import(phenotype = "jejunal_tg",
                                             range = "A81:C91")$jejunal_tg)
exfig2[, treatment := factor(treatment,
                             levels = unique(treatment))]

12.7.2 Example 2 (fig5l data)

data_from <- "A brain-to-gut signal controls intestinal fat absorption"
file_name <- "41586_2024_7929_MOESM11_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)
fig5l <- read_excel(file_path,
                    sheet = "Sheet1",
                    range = "A133:AG137",
                    col_names = FALSE) |>
  data.table() |>
  transpose(make.names = 1)
colnames(fig5l)[1] <- c("treatment_orig")
fig5l[, ezrin := as.numeric(ezrin)]
fig5l[, cdc42 := as.numeric(cdc42)]
fig5l[, eps8 := as.numeric(eps8)]
fig5l[, vil1 := as.numeric(vil1)]
fig5l[, treatment_orig := fill_down(treatment_orig)]
fig5l[treatment_orig == "flox-veh", group := "WT Vehicle"]
fig5l[treatment_orig == "flox-Pue", group := "WT Puerarin"]
fig5l[treatment_orig == "KO-veh", group := "KO Vehicle"]
fig5l[treatment_orig == "KO-Pue", group := "KO Puerarin"]
fig5l[, group := factor(group, levels = unique(group))]

fig5l[, c("genotype", "treatment") := tstrsplit(group, " ")]
fig5l[, genotype := factor(genotype, levels = c("WT", "KO"))]
fig5l[, treatment := factor(treatment, levels = c("Vehicle", "Puerarin"))]

12.7.3 Example 3 (fig3 data)

data_from <- "Lipocalin 2 regulates mitochondrial phospholipidome remodeling, dynamics, and function in brown adipose tissue in male mice"

file_name <- "41467_2023_42473_MOESM7_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

# note I had to insert rows in the excel file 
# in between some of the
# sets because of inconsistency in the file
fig3_import <- function(range_in = "xxx", lipid_id = "xxx"){
  fig3_wide <- read_excel(file_path,
                          sheet = "Fig. 3",
                          range = range_in) |>
    data.table()
  fig3_long <- melt(fig3_wide,
                    measure.vars = colnames(fig3_wide),
                    variable.name = "group",
                    value.name = paste0("cardiolipin_", lipid_id))
  fig3_long[, group := factor(group,
                              levels = colnames(fig3_wide))]
  fig3_long[, mouse_id := .I]
  return(fig3_long)
}

fig3 <- data.table(NULL)
row_1 <- 12
for(fig_id in letters[4:21]){
  row_2 <- row_1 + 5
  range_in <- paste0(
    "B",row_1,":E",row_2
  )
  if(fig_id == letters[4]){
    fig3 <- fig3_import(range_in, fig_id)
  }else{
    fig3_part <- fig3_import(range_in, fig_id)
    fig3 <- merge(fig3, fig3_part, by = c("mouse_id", "group"))
  }
  # fig3 <- rbind(
  #   fig3,
  #   data.table(
  #     figure = fig_id,
  #     fig3_import(range_in))
  # )
  row_1 <- row_2 + 4
}

# row 5 and 15 have all missing data
fig3 <- fig3[-c(5,15), ]

fig3[, c("genotype", "treatment") := tstrsplit(group, "-")]
y_cols <- which(substr(names(fig3), 1, 11) == "cardiolipin")
fig3 <- fig3[, .SD, .SDcols = c("mouse_id", "group", "genotype", "treatment", names(fig3)[y_cols])]

fig3[, group := factor(group, levels = unique(group))]
fig3[, genotype := factor(genotype, levels = c("WT", "KO"))]
fig3[, treatment := factor(treatment, levels = c("Saline", "CL"))]