paramedic
vignettes/introduction_to_paramedic.Rmd
introduction_to_paramedic.Rmd
paramedic
is a R
package for estimating
microbial concentration. paramedic
uses information from
16S count data (compositional data on all taxa) and absolute data on a
subset of taxa (e.g., qPCR or flow cytometry) to estimate the absolute
abundance of all taxa. The method accounts for differing taxon detection
efficiencies between the two methods, and produces prediction and
confidence intervals as well as point estimates of the absolute
abundance of all taxa.
The author and maintainer of the paramedic
package is Brian Williamson. For details
on the method, check out our paper.
Currently, the package may only be downloaded and installed from
GitHub using the devtools
package. Type the following
commands in your R console to install devtools
and
paramedic
as needed:
# install.packages("devtools")
# devtools::install_github("statdivlab/paramedic@main")
This section should serve as a quick guide to using the
paramedic
package – we will cover the main functions using
a data example.
First, load the paramedic
and rstan
packages:
## Warning: package 'rstan' was built under R version 4.3.2
## Loading required package: StanHeaders
## Warning: package 'StanHeaders' was built under R version 4.3.2
##
## rstan version 2.32.5 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
## Loading required package: Rcpp
## paramedic version 0.1.3.5: Predicting Absolute and Relative Abundance by Modeling
## Efficiency to Derive Intervals and Concentrations
## Package created on 2023-12-26
Next, we load in some example data:
example_16S_data
is a matrix of q = 433
taxa observed from 16S sequencing in n = 20
samples. We do
not have primers to observe the concentration of all taxa, but we can do
this for q_obs = 7
taxa. The observed concentration of the
7 taxa in the 20 samples (in 16S gene copies per unit volume) is in
example_qPCR_data
. These data mimic data observed by the
University of Washington (UW) Sexually Transmitted Infections
Cooperative Research Center (STICRC).
We are now ready to estimate concentrations. We pass the data to
run_paramedic
, the workhorse function of the package:
## -------------------------------------------------------------
## Obtain estimates of concentration for the first seven taxa
## -------------------------------------------------------------
## this is a small number of iterations, only for illustration
## also, shows how to use control parameters for rstan::stan
mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data,
n_iter = 30, n_burnin = 25, n_chains = 1, stan_seed = 4747,
control = list(max_treedepth = 15))
##
## SAMPLING FOR MODEL 'variable_efficiency' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000154 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.54 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1: three stages of adaptation as currently configured.
## Chain 1: Reducing each adaptation stage to 15%/75%/10% of
## Chain 1: the given number of warmup iterations:
## Chain 1: init_buffer = 3
## Chain 1: adapt_window = 20
## Chain 1: term_buffer = 2
## Chain 1:
## Chain 1: Iteration: 1 / 30 [ 3%] (Warmup)
## Chain 1: Iteration: 3 / 30 [ 10%] (Warmup)
## Chain 1: Iteration: 6 / 30 [ 20%] (Warmup)
## Chain 1: Iteration: 9 / 30 [ 30%] (Warmup)
## Chain 1: Iteration: 12 / 30 [ 40%] (Warmup)
## Chain 1: Iteration: 15 / 30 [ 50%] (Warmup)
## Chain 1: Iteration: 18 / 30 [ 60%] (Warmup)
## Chain 1: Iteration: 21 / 30 [ 70%] (Warmup)
## Chain 1: Iteration: 24 / 30 [ 80%] (Warmup)
## Chain 1: Iteration: 26 / 30 [ 86%] (Sampling)
## Chain 1: Iteration: 28 / 30 [ 93%] (Sampling)
## Chain 1: Iteration: 30 / 30 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.019 seconds (Warm-up)
## Chain 1: 0.002 seconds (Sampling)
## Chain 1: 0.021 seconds (Total)
## Chain 1:
## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.9, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
mod_summ <- summary(mod$stan_fit, probs = c(0.025, 0.975))$summary
mod_samps <- extract(mod$stan_fit)
The output from run_paramedic
may be manipulated as any
output from rstan::stan
might; however, there are certain
posterior summaries that are of particular interest in this problem,
namely the posterior mean estimates of concentrations and efficiencies,
along with interval estimates for these quantities. The next function
extracts these point and interval estimators:
## -------------------------------------------------------------
## Extract posterior estimates for the first three taxa
## -------------------------------------------------------------
summs <- paramedic::extract_posterior_summaries(stan_mod = mod_summ,
stan_samps = mod_samps,
taxa_of_interest = 1:3, mult_num = 1, level = 0.95,
interval_type = "wald")
and we may finally print them out:
## -------------------------------------------------------------
## Print out the posterior mean efficiencies and concentrations
## -------------------------------------------------------------
summs$estimates
## [,1] [,2] [,3]
## 1 3.332132e-01 3.689357e+02 2.106022e-01
## 2 1.298799e-01 2.284346e+02 6.373446e-02
## 3 1.555831e+00 1.455176e+02 1.036781e+00
## 4 2.461307e+01 2.225623e-03 1.729249e-02
## 5 7.215584e+01 4.040974e+01 7.283377e-01
## 6 2.593184e+02 1.042068e+00 2.322630e-02
## 7 4.024012e+01 8.902963e-03 2.728638e-02
## 8 1.487867e-03 2.225860e-03 1.564062e+00
## 9 4.014350e-01 7.589127e+03 1.876758e-03
## 10 2.479431e-04 2.225652e-03 1.943784e+00
## 11 1.200725e+02 3.636011e+02 6.481326e+00
## 12 3.687596e+02 3.899757e+01 8.026380e+00
## 13 3.305934e-04 8.645593e+03 2.069066e-05
## 14 4.439689e-02 2.225763e-03 3.342122e+00
## 15 1.652900e-04 4.451806e-03 3.650459e+00
## 16 7.306072e+00 4.452018e-03 1.588019e+00
## 17 1.821682e+02 8.459541e-02 4.385349e-01
## 18 2.479394e-04 2.225623e-03 1.193138e-06
## 19 3.305873e-04 2.225753e-03 1.112984e+00
## 20 5.347797e-01 4.451798e-03 2.236127e+00
summs$est_efficiency
## e[1] e[2] e[3] e[4] e[5] e[6] e[7] sigma_e
## 0.1815300 0.1583407 0.1388907 0.5119174 0.2135972 0.1740900 1.7257625 3.2414754
This output shows that we have point estimates for the true concentration of each taxon within each participant, and the efficiencies of each taxon comparing br16S to qPCR.
For details on the hierarchical modelling options available in
paramedic
, see the vignette on
hierarchical model specificaion.
RStan
supports parallelization, which we can use within
paramedic
to speed up computation. In the following
example, we re-analyze the data referenced above, but run the models in
parallel. Depending on the number of cores on your machine, this can
result in significant reductions in computation time.
# parallel::detectCores() will run with multiple cores if available
# follow the steps in the RStan documentation
# (https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started#how-to-use-rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
## -------------------------------------------------------------
## Obtain estimates of concentration for the first seven taxa
## -------------------------------------------------------------
## this is a small number of iterations, only for illustration
## also, shows how to use control parameters for rstan::stan
mod <- paramedic::run_paramedic(W = example_16S_data[, 1:8], V = example_qPCR_data,
n_iter = 30, n_burnin = 25, n_chains = 1, stan_seed = 4747,
control = list(max_treedepth = 15))
##
## SAMPLING FOR MODEL 'variable_efficiency' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000174 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.74 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1: three stages of adaptation as currently configured.
## Chain 1: Reducing each adaptation stage to 15%/75%/10% of
## Chain 1: the given number of warmup iterations:
## Chain 1: init_buffer = 3
## Chain 1: adapt_window = 20
## Chain 1: term_buffer = 2
## Chain 1:
## Chain 1: Iteration: 1 / 30 [ 3%] (Warmup)
## Chain 1: Iteration: 3 / 30 [ 10%] (Warmup)
## Chain 1: Iteration: 6 / 30 [ 20%] (Warmup)
## Chain 1: Iteration: 9 / 30 [ 30%] (Warmup)
## Chain 1: Iteration: 12 / 30 [ 40%] (Warmup)
## Chain 1: Iteration: 15 / 30 [ 50%] (Warmup)
## Chain 1: Iteration: 18 / 30 [ 60%] (Warmup)
## Chain 1: Iteration: 21 / 30 [ 70%] (Warmup)
## Chain 1: Iteration: 24 / 30 [ 80%] (Warmup)
## Chain 1: Iteration: 26 / 30 [ 86%] (Sampling)
## Chain 1: Iteration: 28 / 30 [ 93%] (Sampling)
## Chain 1: Iteration: 30 / 30 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.019 seconds (Warm-up)
## Chain 1: 0.002 seconds (Sampling)
## Chain 1: 0.021 seconds (Total)
## Chain 1:
## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
## https://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.9, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat