paramedic
vignettes/hierarchical_model_specification.Rmd
hierarchical_model_specification.Rmd
paramedic
jointly models two data types: relative abundance data (e.g., 16S data), and absolute abundance data (e.g., qPCR data). Importantly, the model accounts for the fact that taxa may be detected with differing efficiency across the two datatypes. In the paper, we describe a hierarchical model that can be used in the absence of sample covariates or batch covariates. However, many users are likely to have covariate data available to improve modelling. The purpose of this vignette is to show you how to customise paramedic
to incorporate your covariate data and batch data. We also discuss how to modify the absolute abundance data’s generating process from Poisson to Negative Binomial. In all examples below, we use the example data referenced in the main vignette. Additionally, we use a very small number of total and burn-in iterations for illustration only; we stress that a larger number of iterations should be used in practice.
library("rstan")
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
library("paramedic")
## Loading required package: Rcpp
## paramedic version 0.1.3.5: Predicting Absolute and Relative Abundance by Modeling Efficiency to Derive Intervals and Concentrations
## Package created on 2021-06-28
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
paramedic
The default hierarchical model in paramedic
places a Poisson distribution on \(V\) and a multinomial distribution on \(W\); a normal distribution on \(\log \mu\), where the distribution depends on a taxon-specific intercept \(\beta_0\) and a covariance matrix \(\Sigma\); and a normal distribution on \(\log e\). The likelihood is (for all \(i\) and \(j\)): \[V_{ij} \mid \mu_{ij} \sim Poisson(\mu_{ij}) \\ p_{ij} = \frac{\mu_{ij}e_j}{\sum_{\ell = 1}^q \mu_{i\ell}e_\ell} \\ W_{i\cdot} \mid M_i, \mu_{i\cdot}, e \sim Multinomial(M_i, p_{i\cdot}).\]
We place the following priors on \(\mu\) and \(e\): \[\log\mu_{i\cdot} \sim N_q(\beta_0, \Sigma) \\ \beta_0 \sim N_q(0, \sigma^2_\beta) \\ \Sigma_{jj} \sim Lognormal(0, \sigma^2_\Sigma) \\ e_j \sim Lognormal(0, \sigma^2_e) \\ \sigma^2_e \sim InvGamma(\alpha_\sigma, \kappa_\sigma).\]
The following code fits this model in paramedic
for the example data:
mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data, n_iter = 30, n_burnin = 25, n_chains = 1, stan_seed = 4747, control = list(max_treedepth = 15))
It may be of interest to allow covariates to affect the mean for each taxon. In this case, we may replace the prior on \(\mu\) with the following: \[\log\mu_{i\cdot} \sim N_q(\beta_0 + X_i\beta_1, \Sigma) \\ \beta_0 \sim N_q(0, \sigma^2_\beta) \\ \beta_{1,k} \sim N_q(0, 1),\] where \(\beta_1\) is a matrix with number of rows equal to the number of covariates and number of columns equal to \(q\).
You can implement this model by passing a tibble of covariates \(X\) to run_paramedic
. \(X\) must have number of rows equal to the sample size, and any number of columns.
Suppose that we have a binary variable that specifies case status (i.e., 1 denotes a case and 0 denotes a control) for some disease. For example,
# generate a random case/control status variable set.seed(4747) case_control <- sample(0:1, size = nrow(example_qPCR_data), replace = TRUE) library("tibble") X <- tibble(sample_id = example_qPCR_data$sample_id, case = case_control)
If we want to adjust for this covariate in the example data analysis using paramedic
, we use the following code:
mod_with_covariates <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data, X = X, n_iter = 30, n_burnin = 25, n_chains = 1, stan_seed = 4747, control = list(max_treedepth = 15))
It may be of interest to allow for overdispersion in \(V\). In this case, we may replace the likelihood on \(V\) with the following: \[V_{ij} \sim NegBin(\mu_{ij}, \phi_i),\] where \(\phi_i\) is a subject-specific overdispersion parameter. We add the following prior on \(\phi\): for all \(i\), \[\phi_i \sim Gamma(\alpha_\phi, \beta_\phi).\]
You can implement this model by passing nonzero values of alpha_phi
and beta_phi
to run_paramedic
.
In the example data, we can run the following code to implement this analysis:
mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data, alpha_phi = 1, beta_phi = 10, n_iter = 30, n_burnin = 25, n_chains = 1, stan_seed = 4747, control = list(max_treedepth = 15))
In some cases, the data under consideration are collected in multiple experiments (or batches). Here, we could add a layer to the hierarchical model to describe this data-generating mechanism. For example, if \(i\) indexes the sample, \(j\) indexes the taxon, and \(k\) indexes the study (or batch), the efficiencies could be modeled as \[e_{jk} \sim Lognormal(\xi_j, \sigma^2_\xi) \\ \xi_j \sim Lognormal(0, \sigma^2_e)\] in order that each taxon’s efficiency in each experiment can vary around an overall efficiency for that taxon.
Suppose that in the example data, we had replicates from multiple batches. For example,
# generate technical replicates K <- 2 set.seed(12345) W_diff <- replicate(ncol(example_16S_data) - 1, runif(n = nrow(example_16S_data), min = 0, max = 100)) V_diff <- replicate(ncol(example_qPCR_data) - 1, runif(n = nrow(example_qPCR_data), min = 0, max = 100)) W_replicate <- example_16S_data W_replicate[, -1] <- W_replicate[, -1] + as.integer(W_diff) W <- bind_rows(example_16S_data %>% mutate(batch = 1), W_replicate %>% mutate(batch = 2)) V_replicate <- example_qPCR_data V_replicate[, -1] <- V_replicate[, -1] + as.integer(V_diff) V <- bind_rows(example_qPCR_data %>% mutate(batch = 1), V_replicate %>% mutate(batch = 2))
We can run the following code to implement this analysis:
mod_batches <- paramedic::run_paramedic(W = W[, c(1:10, 434)], V = V, k = K, sigma_beta = sigma_beta, sigma_Sigma = sigma_Sigma, alpha_sigma = alpha_sigma, kappa_sigma = kappa_sigma, n_iter = 50, n_burnin = 30, n_chains = 1, stan_seed = 4747, control = list(adapt_delta = 0.9, max_treedepth = 15))
If you would like to see a new hierarchical modeling choice in paramedic
, please do one of the following:
paramedic
(i.e., provide the mathematical formula and describe all parameters/hyperparameters). Please state your justification for requesting this distribution.src/stan/variable_efficiency.stan
and src/stan/variable_efficiency_centered.stan
by adding the new option (please follow the style already implemented for various options); then update R/run_paramedic.R
with any new arguments that are required.