Introduction

paramedic jointly models two data types: relative abundance data (e.g., 16S data), and absolute abundance data (e.g., qPCR data). Importantly, the model accounts for the fact that taxa may be detected with differing efficiency across the two datatypes. In the paper, we describe a hierarchical model that can be used in the absence of sample covariates or batch covariates. However, many users are likely to have covariate data available to improve modelling. The purpose of this vignette is to show you how to customise paramedic to incorporate your covariate data and batch data. We also discuss how to modify the absolute abundance data’s generating process from Poisson to Negative Binomial. In all examples below, we use the example data referenced in the main vignette. Additionally, we use a very small number of total and burn-in iterations for illustration only; we stress that a larger number of iterations should be used in practice.

library("rstan")
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
library("paramedic")
## Loading required package: Rcpp
## paramedic version 0.1.3.5: Predicting Absolute and Relative Abundance by Modeling Efficiency to Derive Intervals and Concentrations
## Package created on 2021-06-28
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(example_16S_data)
data(example_qPCR_data)

Hierarchical models currently available in paramedic

The default hierarchical model

The default hierarchical model in paramedic places a Poisson distribution on \(V\) and a multinomial distribution on \(W\); a normal distribution on \(\log \mu\), where the distribution depends on a taxon-specific intercept \(\beta_0\) and a covariance matrix \(\Sigma\); and a normal distribution on \(\log e\). The likelihood is (for all \(i\) and \(j\)): \[V_{ij} \mid \mu_{ij} \sim Poisson(\mu_{ij}) \\ p_{ij} = \frac{\mu_{ij}e_j}{\sum_{\ell = 1}^q \mu_{i\ell}e_\ell} \\ W_{i\cdot} \mid M_i, \mu_{i\cdot}, e \sim Multinomial(M_i, p_{i\cdot}).\]

We place the following priors on \(\mu\) and \(e\): \[\log\mu_{i\cdot} \sim N_q(\beta_0, \Sigma) \\ \beta_0 \sim N_q(0, \sigma^2_\beta) \\ \Sigma_{jj} \sim Lognormal(0, \sigma^2_\Sigma) \\ e_j \sim Lognormal(0, \sigma^2_e) \\ \sigma^2_e \sim InvGamma(\alpha_\sigma, \kappa_\sigma).\]

The following code fits this model in paramedic for the example data:

mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data,
                                n_iter = 30, n_burnin = 25, n_chains = 1,
                                stan_seed = 4747, control = list(max_treedepth = 15))

Allowing for covariates

It may be of interest to allow covariates to affect the mean for each taxon. In this case, we may replace the prior on \(\mu\) with the following: \[\log\mu_{i\cdot} \sim N_q(\beta_0 + X_i\beta_1, \Sigma) \\ \beta_0 \sim N_q(0, \sigma^2_\beta) \\ \beta_{1,k} \sim N_q(0, 1),\] where \(\beta_1\) is a matrix with number of rows equal to the number of covariates and number of columns equal to \(q\).

You can implement this model by passing a tibble of covariates \(X\) to run_paramedic. \(X\) must have number of rows equal to the sample size, and any number of columns.

Suppose that we have a binary variable that specifies case status (i.e., 1 denotes a case and 0 denotes a control) for some disease. For example,

# generate a random case/control status variable
set.seed(4747)
case_control <- sample(0:1, size = nrow(example_qPCR_data), replace = TRUE)
library("tibble")
X <- tibble(sample_id = example_qPCR_data$sample_id, case = case_control)

If we want to adjust for this covariate in the example data analysis using paramedic, we use the following code:

mod_with_covariates <- paramedic::run_paramedic(W = example_16S_data[, 1:8],
                                                V = example_qPCR_data, X = X,
                                                n_iter = 30, n_burnin = 25,
                                                n_chains = 1, stan_seed = 4747,
                                                control = list(max_treedepth = 15))

Allowing for potential overdispersion in \(V\)

It may be of interest to allow for overdispersion in \(V\). In this case, we may replace the likelihood on \(V\) with the following: \[V_{ij} \sim NegBin(\mu_{ij}, \phi_i),\] where \(\phi_i\) is a subject-specific overdispersion parameter. We add the following prior on \(\phi\): for all \(i\), \[\phi_i \sim Gamma(\alpha_\phi, \beta_\phi).\]

You can implement this model by passing nonzero values of alpha_phi and beta_phi to run_paramedic.

In the example data, we can run the following code to implement this analysis:

mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data,
                                alpha_phi = 1, beta_phi = 10, n_iter = 30,
                                n_burnin = 25, n_chains = 1, stan_seed = 4747,
                                control = list(max_treedepth = 15))

Allowing for batch covariates

In some cases, the data under consideration are collected in multiple experiments (or batches). Here, we could add a layer to the hierarchical model to describe this data-generating mechanism. For example, if \(i\) indexes the sample, \(j\) indexes the taxon, and \(k\) indexes the study (or batch), the efficiencies could be modeled as \[e_{jk} \sim Lognormal(\xi_j, \sigma^2_\xi) \\ \xi_j \sim Lognormal(0, \sigma^2_e)\] in order that each taxon’s efficiency in each experiment can vary around an overall efficiency for that taxon.

Suppose that in the example data, we had replicates from multiple batches. For example,

# generate technical replicates
K <- 2
set.seed(12345)
W_diff <- replicate(ncol(example_16S_data) - 1, runif(n = nrow(example_16S_data), min = 0, max = 100))
V_diff <- replicate(ncol(example_qPCR_data) - 1, runif(n = nrow(example_qPCR_data), min = 0, max = 100))
W_replicate <- example_16S_data
W_replicate[, -1] <- W_replicate[, -1] + as.integer(W_diff)
W <- bind_rows(example_16S_data %>% mutate(batch = 1), W_replicate %>% mutate(batch = 2))
V_replicate <- example_qPCR_data
V_replicate[, -1] <- V_replicate[, -1] + as.integer(V_diff)
V <- bind_rows(example_qPCR_data %>% mutate(batch = 1), V_replicate %>% mutate(batch = 2))

We can run the following code to implement this analysis:

mod_batches <- paramedic::run_paramedic(W = W[, c(1:10, 434)], V = V,
                                        k = K, sigma_beta = sigma_beta,
                                        sigma_Sigma = sigma_Sigma,
                                        alpha_sigma = alpha_sigma,
                                        kappa_sigma = kappa_sigma,
                                        n_iter = 50, n_burnin = 30,
                                        n_chains = 1, stan_seed = 4747,
                                        control = list(adapt_delta = 0.9,
                                                       max_treedepth = 15))

Requesting new hierarchical modeling options

If you would like to see a new hierarchical modeling choice in paramedic, please do one of the following:

  1. File an issue specifying exactly the distribution that you would like to be added to paramedic (i.e., provide the mathematical formula and describe all parameters/hyperparameters). Please state your justification for requesting this distribution.
  2. Create a pull request after modifying the following files: update src/stan/variable_efficiency.stan and src/stan/variable_efficiency_centered.stan by adding the new option (please follow the style already implemented for various options); then update R/run_paramedic.R with any new arguments that are required.