Using radEmu with a reference taxon

First, we will install radEmu, if we haven’t already.

# if (!require("remotes", quietly = TRUE))
#     install.packages("remotes")
# 
# remotes::install_github("statdivlab/radEmu")

Next, we can load radEmu as well as the tidyverse package suite.

library(magrittr)
library(dplyr)
library(ggplot2)
library(stringr)
library(radEmu)

Introduction

In this vignette, we will use radEmu with a reference taxon as the identifiability constraint, instead of radEmu’s default of using the “typical” log fold-difference across taxa as the identifiability constraint. This is useful if you have a specific taxon that you want to compare all other taxa to, or if your experimental design includes a spiked-in synthetic taxon or internal standard. In this vignette we will provide an example of setting a reference taxon, and discuss how this changes the interpretation of radEmu parameter estimates and test results. This vignette assumes familiarity with the basic usage of radEmu, and we suggest reviewing the “intro_radEmu.Rmd” vignette before working through this vignette.

Loading and preparing data

We’ll use a dataset published by Wirbel et al. (2019), see “intro_radEmu.Rmd” for more information about this data.

data("wirbel_sample")
data("wirbel_otu")
mOTU_names <- colnames(wirbel_otu)

We will now prepare this data to obtain our abundance table and covariate data to use with emuFit. Again, see “intro_radEmu.Rmd” for an explanation of each of the steps in the code below.

wirbel_sample$Group <- factor(wirbel_sample$Group, levels = c("CTR","CRC"))

chosen_genera <- c("Eubacterium", "Faecalibacterium", "Fusobacterium", "Porphyromonas")
mOTU_name_df <- data.frame(name = mOTU_names) %>% 
  mutate(base_name = stringr::str_remove(mOTU_names, "unknown ") %>%
                      stringr::str_remove("uncultured ")) %>%
  mutate(genus_name = stringr::word(base_name, 1))
restricted_mOTU_names <- mOTU_name_df %>%
  filter(genus_name %in% chosen_genera) %>%
  pull(name)

ch_study_obs <- which(wirbel_sample$Country %in% c("CHI"))

small_Y <- wirbel_otu[ch_study_obs, restricted_mOTU_names]
sum(rowSums(small_Y) == 0) # no samples have a count sum of 0 
#> [1] 0
sum(colSums(small_Y) == 0) # one category has a count sum of 0
#> [1] 1

category_to_rm <- which(colSums(small_Y) == 0)
small_Y <- small_Y[, -category_to_rm]
sum(colSums(small_Y) == 0)
#> [1] 0

Fitting a model

Now we will use the function emuFit() to fit our model. We will start by fitting radEmu’s default model, which uses the “typical” log fold-difference across taxa as the identifiability constraint.

ch_fit_default <- emuFit(formula = ~ Group, 
                         data = wirbel_sample[ch_study_obs, ],
                         Y = small_Y,
                         run_score_tests = FALSE)

Let’s look at the first few estimates.

head(ch_fit_default$coef)
#>   covariate                                                category
#> 1  GroupCRC Fusobacterium nucleatum s. vincentii [ref_mOTU_v2_0754]
#> 2  GroupCRC  Fusobacterium nucleatum s. animalis [ref_mOTU_v2_0776]
#> 3  GroupCRC Fusobacterium nucleatum s. nucleatum [ref_mOTU_v2_0777]
#> 4  GroupCRC         Faecalibacterium prausnitzii [ref_mOTU_v2_1379]
#> 5  GroupCRC                  Eubacterium siraeum [ref_mOTU_v2_1387]
#> 6  GroupCRC                      Eubacterium sp. [ref_mOTU_v2_1395]
#>   category_num    estimate        se      lower     upper score_stat pval
#> 1            1  1.48637593 0.8965269 -0.2707846 3.2435364         NA   NA
#> 2            2  2.31316834 0.8270260  0.6922272 3.9341095         NA   NA
#> 3            3  3.08640201 1.0274037  1.0727278 5.1000763         NA   NA
#> 4            4 -0.34307844 0.3520048 -1.0329952 0.3468384         NA   NA
#> 5            5  0.03891827 0.4779880 -0.8979210 0.9757575         NA   NA
#> 6            6  1.12952082 0.9562351 -0.7446656 3.0037072         NA   NA

The first taxon reported is F. nucleatum s. vincentii, and it has an “estimate” of $1.49$ . This means that the abundance of F. nucleatum s. vincentii is, on average, $e^{1.49}=4.4$ times greater in colorectal cancer cases compared to controls compared to the typical fold-difference of taxa in cases compared to controls.

Let’s look at the distribution of log fold-difference estimates.

ch_fit_default$coef %>% ggplot(aes(x = estimate)) + geom_boxplot() + theme_bw()

We can see that the log fold-difference estimates in this analysis range from $\approx -2.5$ to $\approx 6.5$ . Furthermore, the median log fold-difference is $\approx 0$ , which is a consequence of the default identifability constraint… and why we need to add “compared to the typical fold-difference of taxa in cases compared to controls” in the above interpretation. (Yes, it’s annoying… but it’s necessary to avoid making stronger assumptions.)

Now, we will instead use a reference taxon to give the identifiability constraint. Here’s ~~two~~ three scenarios where this might be useful:

You know that a taxon is not changing in average abundance across your comparison groups. Maybe you’ve looked at comparable datasets, or maybe you know this mechanistically given the system you’re studying. Either way, you know it.
You spiked a taxon at equal abundance into your samples. So, you know it’s not changing in average abundance across your comparison groups. (This is just #1, but you made it happen. Go you!)
You just want to compare the fold-differences in abundance to a specific taxon’s fold-difference in abundance, without assuming that this taxon is not different in average abundance.

Note: Setting 3 isn’t different from 1 or 2 in its code, but the interpretations of estimates and inference will be different – with the comparison looking more like the median-comparison, but now with a taxon-specific comparison.

Let’s suppose you’re in Setting 1 or 2, and that F. nucleatum s. animalis is that taxon.

ref_taxon <- which(stringr::str_detect(colnames(small_Y), "Fusobacterium nucleatum s. animalis"))
ref_taxon # the index of F. nucleatum s. animalis in our Y matrix
#> [1] 2

We suggest when using a reference taxon for covariates, that you still use the “typical” log fold-difference as a constraint for the intercept in the model. This means that we will have different constraint functions for the intercept and for the covariates. In order to create a list of constraint functions to pass on to emuFit(), we can use the function make_reference_constraints().

# use `make_design_matrix` to find p, the number of columns in our design matrix
head(make_design_matrix(formula = ~ Group, data = wirbel_sample[ch_study_obs, ])) # p = 2
#>            (Intercept) GroupCRC
#> ERR1018185           1        1
#> ERR1018186           1        1
#> ERR1018187           1        1
#> ERR1018188           1        1
#> ERR1018189           1        1
#> ERR1018190           1        1
reference_constraint_lists <- make_reference_constraints(p = 2, j = ref_taxon)

We will pass on the outputs of make_reference_constraints in to emuFit().

ch_fit_ref <- emuFit(formula = ~ Group, 
                     data = wirbel_sample[ch_study_obs, ],
                     Y = small_Y,
                     run_score_tests = FALSE,
                     constraint_fn = reference_constraint_lists$constraint_list,
                     constraint_grad_fn = reference_constraint_lists$constraint_grad_list)

Again, let’s look at the first few estimates.

head(ch_fit_ref$coef)
#>   covariate                                                category
#> 1  GroupCRC Fusobacterium nucleatum s. vincentii [ref_mOTU_v2_0754]
#> 2  GroupCRC  Fusobacterium nucleatum s. animalis [ref_mOTU_v2_0776]
#> 3  GroupCRC Fusobacterium nucleatum s. nucleatum [ref_mOTU_v2_0777]
#> 4  GroupCRC         Faecalibacterium prausnitzii [ref_mOTU_v2_1379]
#> 5  GroupCRC                  Eubacterium siraeum [ref_mOTU_v2_1387]
#> 6  GroupCRC                      Eubacterium sp. [ref_mOTU_v2_1395]
#>   category_num   estimate        se      lower       upper score_stat pval
#> 1            1 -0.8267924 0.4581539 -1.7247576  0.07117277         NA   NA
#> 2            2  0.0000000 0.0000000  0.0000000  0.00000000         NA   NA
#> 3            3  0.7732337 0.6760004 -0.5517028  2.09817017         NA   NA
#> 4            4 -2.6562468 0.8742719 -4.3697882 -0.94270535         NA   NA
#> 5            5 -2.2742501 1.0329916 -4.2988764 -0.24962375         NA   NA
#> 6            6 -1.1836475 1.1804619 -3.4973103  1.13001527         NA   NA

First, note that the estimate for F. nucleatum s. animalis is now $0$ , because we have set it as the reference taxon. Great. It did what it was told.

The estimate for F. nucleatum s. vincentii is $-0.83$ . The interpretation of this estimate under Setting 1 or 2 is that the average abundance of F. nucleatum s. vincentii in cancer cases is $exp(-0.83) = 0.44$ times that of non-cancer controls. Alternatively, its average abundance is $exp(0.83) = 2.29$ times greater in non-cancer controls compared to cases.

How did this work? Because we assumed/knew information about one taxon, we could “pin down” the location of all estimates. Cool!

If we were instead in Setting 3, our interpretation would be slightly different. In this setting, the fold-difference in average abundance of F. nucleatum s. vincentii in cancer cases compared to non-cancer controls is $exp(-0.83) = 0.44$ times the fold-difference in average abundance of F. nucleatum s. animalis in cases compared to controls.

Let’s look at the distribution of log fold-difference estimates.

ch_fit_ref$coef %>% ggplot(aes(x = estimate)) + geom_boxplot() + theme_bw()

Here, we can see a distribution that has the same shape as the distribution of estimates when using the default constraint. This is because the result of changing the identifiability constraint is to shift all estimates up or down by the same value, where this value is determined by the constraint. In this plot, the log fold-difference estimates range from $\approx -4.5$ to $\approx 4$ , and the median log fold-difference is $\approx -2.5$ . This shows that while the majority of taxa have ratios of average abundances of cases to controls that are smaller than that of F. nucleatum s. animalis, there are some taxa that have larger ratios and are even more enriched in cases compared to controls than our chosen reference taxon.

Hypothesis testing

Finally, let’s test hypotheses about the taxon F. nucleatum s. vincentii using both identifiability constraints. First consider the default constraint. We will run a robust score test.

ch_test_default <- emuFit(formula = ~ Group, 
                         data = wirbel_sample[ch_study_obs, ],
                         Y = small_Y,
                         fitted_model = ch_fit_default, # provide our previous fit
                         refit = FALSE, # avoid refitting model
                         # test the first taxon, F. nucleatum s. vincentii
                         test_kj = data.frame(j = 1, 
                         # test the second column in the design matrix
                         # associated with the case/control covariate
                                              k = 2))

Now let’s look at the results.

head(ch_test_default$coef)
#>   covariate                                                category
#> 1  GroupCRC Fusobacterium nucleatum s. vincentii [ref_mOTU_v2_0754]
#> 2  GroupCRC  Fusobacterium nucleatum s. animalis [ref_mOTU_v2_0776]
#> 3  GroupCRC Fusobacterium nucleatum s. nucleatum [ref_mOTU_v2_0777]
#> 4  GroupCRC         Faecalibacterium prausnitzii [ref_mOTU_v2_1379]
#> 5  GroupCRC                  Eubacterium siraeum [ref_mOTU_v2_1387]
#> 6  GroupCRC                      Eubacterium sp. [ref_mOTU_v2_1395]
#>   category_num    estimate        se      lower     upper score_stat       pval
#> 1            1  1.48637593 0.8965269 -0.2707846 3.2435364   4.245031 0.03936535
#> 2            2  2.31316834 0.8270260  0.6922272 3.9341095         NA         NA
#> 3            3  3.08640201 1.0274037  1.0727278 5.1000763         NA         NA
#> 4            4 -0.34307844 0.3520048 -1.0329952 0.3468384         NA         NA
#> 5            5  0.03891827 0.4779880 -0.8979210 0.9757575         NA         NA
#> 6            6  1.12952082 0.9562351 -0.7446656 3.0037072         NA         NA

We have a p-value of $0.039$ . This means that with an alpha level of $0.05$ , we reject the null hypothesis that the expected log fold-difference in abundance of F. nucleatum s. vincentii in cases compared to controls is equal to the typical log fold-difference in abundance of cases to controls across all taxa in the analysis. We conclude that F. nucleatum s. vincentii is enriched in cases compared to controls, relative to the typical taxon in the analysis.

Now, instead consider a test using our reference taxon constraint.

ch_test_ref <- emuFit(formula = ~ Group, 
                      data = wirbel_sample[ch_study_obs, ],
                      Y = small_Y,
                      # constraint functions from above 
                      constraint_fn = reference_constraint_lists$constraint_list,
                      constraint_grad_fn = reference_constraint_lists$constraint_grad_list, 
                      fitted_model = ch_fit_ref, # provide our previous fit
                      refit = FALSE, # avoid refitting model
                      # test the first taxon, F. nucleatum s. vincentii
                      test_kj = data.frame(j = 1, 
                      # test the second column in the design matrix
                      # associated with the case/control covariate
                                           k = 2))

Now let’s look at the results.

head(ch_test_ref$coef)
#>   covariate                                                category
#> 1  GroupCRC Fusobacterium nucleatum s. vincentii [ref_mOTU_v2_0754]
#> 2  GroupCRC  Fusobacterium nucleatum s. animalis [ref_mOTU_v2_0776]
#> 3  GroupCRC Fusobacterium nucleatum s. nucleatum [ref_mOTU_v2_0777]
#> 4  GroupCRC         Faecalibacterium prausnitzii [ref_mOTU_v2_1379]
#> 5  GroupCRC                  Eubacterium siraeum [ref_mOTU_v2_1387]
#> 6  GroupCRC                      Eubacterium sp. [ref_mOTU_v2_1395]
#>   category_num   estimate        se      lower       upper score_stat     pval
#> 1            1 -0.8267924 0.4581539 -1.7247576  0.07117277   1.109802 0.292125
#> 2            2  0.0000000 0.0000000  0.0000000  0.00000000         NA       NA
#> 3            3  0.7732337 0.6760004 -0.5517028  2.09817017         NA       NA
#> 4            4 -2.6562468 0.8742719 -4.3697882 -0.94270535         NA       NA
#> 5            5 -2.2742501 1.0329916 -4.2988764 -0.24962375         NA       NA
#> 6            6 -1.1836475 1.1804619 -3.4973103  1.13001527         NA       NA

Here, the p-value is $0.292$ . Therefore, we fail to reject the null hypothesis that the expected log fold-difference in the abundance of F. nucleatum s. vincentii in cases compared to controls is equal to the expected log fold-difference in the abundance of F. nucleatum s. animalis in cases compared to controls. We do not have sufficient evidence to support the conclusion that F. nucleatum s. vincentii and F. nucleatum s. animalis have significantly different fold-differences in abundances of cases compared to controls.

Important: These hypothesis tests are testing different hypotheses!!! So, they should give different results. Phew.

Note that while this example uses a model with a single categorical covariate, you can follow the same approach with multiple covariates as well as cluster-correlated (e.g. longitudinal) data. Don’t forget that your parameter interpretations change when you adjust for multiple covariates!

We hope you found this example helpful!! If you have any questions, feel free to post an issue to let us know!

Sarah Teichman and Amy Willis

2025-07-28

Introduction

Loading and preparing data

Fitting a model

Hypothesis testing