paramedic
vignettes/hierarchical_model_specification.Rmd
hierarchical_model_specification.Rmd
paramedic
jointly models two data types: relative
abundance data (e.g., 16S data), and absolute abundance data (e.g., qPCR
data). Importantly, the model accounts for the fact that taxa may be
detected with differing efficiency across the two datatypes. In the paper,
we describe a hierarchical model that can be used in the absence of
sample covariates or batch covariates. However, many users are likely to
have covariate data available to improve modelling. The purpose of this
vignette is to show you how to customise paramedic
to
incorporate your covariate data and batch data. We also discuss how to
modify the absolute abundance data’s generating process from Poisson to
Negative Binomial. In all examples below, we use the example data
referenced in the main
vignette. Additionally, we use a very small number of total and
burn-in iterations for illustration only; we stress that a larger number
of iterations should be used in practice.
## Warning: package 'rstan' was built under R version 4.3.2
## Loading required package: StanHeaders
## Warning: package 'StanHeaders' was built under R version 4.3.2
##
## rstan version 2.32.5 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
## Loading required package: Rcpp
## paramedic version 0.1.3.5: Predicting Absolute and Relative Abundance by Modeling
## Efficiency to Derive Intervals and Concentrations
## Package created on 2023-12-26
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
paramedic
The default hierarchical model in paramedic
places a
Poisson distribution on \(V\) and a
multinomial distribution on \(W\); a
normal distribution on \(\log \mu\),
where the distribution depends on a taxon-specific intercept \(\beta_0\) and a covariance matrix \(\Sigma\); and a normal distribution on
\(\log e\). The likelihood is (for all
\(i\) and \(j\)): \[V_{ij}
\mid \mu_{ij} \sim Poisson(\mu_{ij}) \\ p_{ij} =
\frac{\mu_{ij}e_j}{\sum_{\ell = 1}^q \mu_{i\ell}e_\ell} \\ W_{i\cdot}
\mid M_i, \mu_{i\cdot}, e \sim Multinomial(M_i,
p_{i\cdot}).\]
We place the following priors on \(\mu\) and \(e\): \[\log\mu_{i\cdot} \sim N_q(\beta_0, \Sigma) \\ \beta_0 \sim N_q(0, \sigma^2_\beta) \\ \Sigma_{jj} \sim Lognormal(0, \sigma^2_\Sigma) \\ e_j \sim Lognormal(0, \sigma^2_e) \\ \sigma^2_e \sim InvGamma(\alpha_\sigma, \kappa_\sigma).\]
The following code fits this model in paramedic
for the
example data:
mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data,
n_iter = 30, n_burnin = 25, n_chains = 1,
stan_seed = 4747, control = list(max_treedepth = 15))
It may be of interest to allow covariates to affect the mean for each taxon. In this case, we may replace the prior on \(\mu\) with the following: \[\log\mu_{i\cdot} \sim N_q(\beta_0 + X_i\beta_1, \Sigma) \\ \beta_0 \sim N_q(0, \sigma^2_\beta) \\ \beta_{1,k} \sim N_q(0, 1),\] where \(\beta_1\) is a matrix with number of rows equal to the number of covariates and number of columns equal to \(q\).
You can implement this model by passing a tibble of covariates \(X\) to run_paramedic
. \(X\) must have number of rows equal to the
sample size, and any number of columns.
Suppose that we have a binary variable that specifies case status (i.e., 1 denotes a case and 0 denotes a control) for some disease. For example,
# generate a random case/control status variable
set.seed(4747)
case_control <- sample(0:1, size = nrow(example_qPCR_data), replace = TRUE)
library("tibble")
X <- tibble(sample_id = example_qPCR_data$sample_id, case = case_control)
If we want to adjust for this covariate in the example data analysis
using paramedic
, we use the following code:
mod_with_covariates <- paramedic::run_paramedic(W = example_16S_data[, 1:8],
V = example_qPCR_data, X = X,
n_iter = 30, n_burnin = 25,
n_chains = 1, stan_seed = 4747,
control = list(max_treedepth = 15))
It may be of interest to allow for overdispersion in \(V\). In this case, we may replace the likelihood on \(V\) with the following: \[V_{ij} \sim NegBin(\mu_{ij}, \phi_i),\] where \(\phi_i\) is a subject-specific overdispersion parameter. We add the following prior on \(\phi\): for all \(i\), \[\phi_i \sim Gamma(\alpha_\phi, \beta_\phi).\]
You can implement this model by passing nonzero values of
alpha_phi
and beta_phi
to
run_paramedic
.
In the example data, we can run the following code to implement this analysis:
mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data,
alpha_phi = 1, beta_phi = 10, n_iter = 30,
n_burnin = 25, n_chains = 1, stan_seed = 4747,
control = list(max_treedepth = 15))
In some cases, the data under consideration are collected in multiple experiments (or batches). Here, we could add a layer to the hierarchical model to describe this data-generating mechanism. For example, if \(i\) indexes the sample, \(j\) indexes the taxon, and \(k\) indexes the study (or batch), the efficiencies could be modeled as \[e_{jk} \sim Lognormal(\xi_j, \sigma^2_\xi) \\ \xi_j \sim Lognormal(0, \sigma^2_e)\] in order that each taxon’s efficiency in each experiment can vary around an overall efficiency for that taxon.
Suppose that in the example data, we had replicates from multiple batches. For example,
# generate technical replicates
K <- 2
set.seed(12345)
W_diff <- replicate(ncol(example_16S_data) - 1, runif(n = nrow(example_16S_data), min = 0, max = 100))
V_diff <- replicate(ncol(example_qPCR_data) - 1, runif(n = nrow(example_qPCR_data), min = 0, max = 100))
W_replicate <- example_16S_data
W_replicate[, -1] <- W_replicate[, -1] + as.integer(W_diff)
W <- bind_rows(example_16S_data %>% mutate(batch = 1), W_replicate %>% mutate(batch = 2))
V_replicate <- example_qPCR_data
V_replicate[, -1] <- V_replicate[, -1] + as.integer(V_diff)
V <- bind_rows(example_qPCR_data %>% mutate(batch = 1), V_replicate %>% mutate(batch = 2))
We can run the following code to implement this analysis:
mod_batches <- paramedic::run_paramedic(W = W[, c(1:10, 434)], V = V,
k = K, sigma_beta = sigma_beta,
sigma_Sigma = sigma_Sigma,
alpha_sigma = alpha_sigma,
kappa_sigma = kappa_sigma,
n_iter = 50, n_burnin = 30,
n_chains = 1, stan_seed = 4747,
control = list(adapt_delta = 0.9,
max_treedepth = 15))
If you would like to see a new hierarchical modeling choice in
paramedic
, please do one of the following:
paramedic
(i.e., provide the mathematical formula
and describe all parameters/hyperparameters). Please state your
justification for requesting this distribution.src/stan/variable_efficiency.stan
and
src/stan/variable_efficiency_centered.stan
by adding the
new option (please follow the style already implemented for various
options); then update R/run_paramedic.R
with any new
arguments that are required.