Fit radEmu model
Usage
emuFit(
Y,
X = NULL,
formula = NULL,
data = NULL,
assay_name = NULL,
cluster = NULL,
penalize = TRUE,
B = NULL,
B_null_list = NULL,
fitted_model = NULL,
refit = TRUE,
test_kj = NULL,
alpha = 0.05,
return_wald_p = FALSE,
compute_cis = TRUE,
run_score_tests = TRUE,
use_fullmodel_info = FALSE,
use_fullmodel_cov = FALSE,
use_both_cov = FALSE,
constraint_fn = pseudohuber_center,
constraint_grad_fn = dpseudohuber_center_dx,
constraint_param = 0.1,
verbose = FALSE,
tolerance = 1e-04,
B_null_tol = 0.001,
rho_init = 1,
inner_tol = 1,
ntries = 4,
tau = 2,
kappa = 0.8,
constraint_tol = 1e-05,
c1 = 1e-04,
maxit = 1000,
inner_maxit = 25,
max_step = 1,
trackB = FALSE,
return_nullB = FALSE,
return_both_score_pvals = FALSE,
remove_zero_comparison_pvals = 0.01,
unobserved_taxon_error = TRUE
)
Arguments
- Y
an n x J matrix or dataframe of nonnegative observations, or a phyloseq object containing an otu table and sample data.
- X
an n x p matrix or dataframe of covariates (optional)
- formula
a one-sided formula specifying the form of the mean model to be fit
- data
an n x p data frame containing variables given in
formula
- assay_name
a string containing the desired assay name within a
TreeSummarizedExperiment
object. This is only required if Y is aTreeSummarizedExperiment
object, otherwise this argument does nothing and can be ignored.- cluster
a vector giving cluster membership for each row of Y to be used in computing GEE test statistics. Default is NULL, in which case rows of Y are treated as independent.
- penalize
logical: should Firth penalty be used in fitting model? Default is TRUE.
- B
starting value of coefficient matrix (p x J). If not provided, B will be initiated as a zero matrix.
- B_null_list
list of starting values of coefficient matrix (p x J) for null estimation. This should either be a list with the same length as
test_kj
. If you only want to provide starting values for some tests, include the other elements of the list asNULL
.- fitted_model
a fitted model produced by a separate call to emuFit; to be provided if score tests are to be run without refitting the full unrestricted model. Default is NULL.
- refit
logical: if B or fitted_model is provided, should full model be fit (TRUE) or should fitting step be skipped (FALSE), e.g., if score tests are to be run on an already fitted model. Default is TRUE.
- test_kj
a data frame whose rows give coordinates (in category j and covariate k) of elements of B to construct hypothesis tests for. If
test_kj
is not provided, all elements of B save the intercept row will be tested.- alpha
nominal type 1 error level to be used to construct confidence intervals. Default is 0.05 (corresponding to 95% confidence intervals)
- return_wald_p
logical: return p-values from Wald tests? Default is FALSE.
- compute_cis
logical: compute and return Wald CIs? Default is TRUE.
- run_score_tests
logical: perform robust score testing? Default is TRUE.
- use_fullmodel_info
logical: TODO? Default is FALSE.
- use_fullmodel_cov
logical: use information matrix and empirical score covariance computed for full model fit? Defaults to FALSE, in which case these quantities are recomputed for each null model fit for score testing.
- use_both_cov
logical: should score tests be run using information and empirical score covariance evaluated both under the null and full models? Used in simulations
- constraint_fn
function g defining a constraint on rows of B; g(B_k) = 0 for rows k = 1, ..., p of B. Default function is a smoothed median (minimizer of pseudohuber loss). If a number is provided a single category constraint will be used with the provided category as a reference category.
- constraint_grad_fn
derivative of constraint_fn with respect to its arguments (i.e., elements of a row of B)
- constraint_param
If pseudohuber centering is used (this is the default), parameter controlling relative weighting of elements closer and further from center. (Limit as
constraint_param
approaches infinity is the mean; as this parameter approaches zero, the minimizer of the pseudo-Huber loss approaches the median.)- verbose
provide updates as model is being fitted? Defaults to FALSE.
- tolerance
tolerance for stopping criterion in full model fitting; once no element of B is updated by more than this value in a single step, we exit optimization. Defaults to 1e-3.
- B_null_tol
numeric: convergence tolerance for null model fits for score testing (if max of absolute difference in B across outer iterations is below this threshold, we declare convergence). Default is 0.01.
- rho_init
numeric: value at which to initiate rho parameter in augmented Lagrangian algorithm. Default is 1.
- inner_tol
numeric: convergence tolerance for augmented Lagrangian subproblems within null model fitting. Default is 1.
- ntries
numeric: how many times should optimization be tried in null models where at least one optimization attempt fails? Default is 4.
- tau
numeric: value to scale rho by in each iteration of augmented Lagrangian algorithm that does not move estimate toward zero sufficiently. Default is 2.
- kappa
numeric: value between 0 and 1 that determines the cutoff on the ratio of current distance from feasibility over distance in last iteration triggering scaling of rho. If this ratio is above kappa, rho is scaled by tau to encourage estimate to move toward feasibility.
- constraint_tol
numeric: constraint tolerance for fits under null hypotheses (tested element of B must be equal to constraint function to within this tolerance for a fit to be accepted as a solution to constrained optimization problem). Default is 1e-5.
- c1
numeric: parameter for Armijo line search. Default is 1e-4.
- maxit
maximum number of outer iterations of augmented lagrangian algorithm to perform before exiting optimization. Default is 1000.
- inner_maxit
maximum number of coordinate descent passes through columns of B to make within each outer iteration of augmented lagrangian algorithm before exiting inner loop
- max_step
maximum stepsize; update directions computed during optimization will be rescaled if a step in any parameter exceeds this value. Defaults to 0.5.
- trackB
logical: should values of B be recorded across optimization iterations and be returned? Primarily used for debugging. Default is FALSE.
- return_nullB
logical: should values of B under null hypothesis be returned. Primarily used for debugging. Default is FALSE.
- return_both_score_pvals
logical: should score p-values be returned using both information matrix computed from full model fit and from null model fits? Default is FALSE. This parameter is used for simulations - in any applied analysis, type of p-value to be used should be chosen before conducting tests.
- remove_zero_comparison_pvals
Should score p-values be replaced with NA for zero-comparison parameters? These parameters occur for categorical covariates with three or more levels, and represent parameters that compare a covariate level to the reference level for a category in which the comparison level and reference level both have 0 counts in all samples. These parameters can have misleadingly small p-values and are not thought to have scientifically interesting signals. We recommend removing them before analyzing data further. If TRUE, all zero-comparison parameter p-values will be set to NA. If FALSE no zero-comparison parameter p-values will be set to NA. If a value between 0 and 1, all zero-comparison p-values below the value will be set to NA. Default is
0.01
.- unobserved_taxon_error
logical: should an error be thrown if Y includes taxa that have 0 counts for all samples? Default is TRUE.
Value
A list containing elements 'coef', 'B', 'penalized', 'Y_augmented',
'z_hat', 'I', 'Dy', and 'score_test_hyperparams' if score tests are run.
Parameter estimates by covariate and outcome category (e.g., taxon for microbiome data),
as well as optionally confidence intervals and p-values, are contained in 'coef'.
Any robust score statistics and score test p-values are also included in 'coef'.
If there are any zero-comparison parameters in the model, a column 'zero_comparison'
is also included, which is TRUE for any parameters that compare the level of a categorical
covariate to a reference level for a category with only zero counts for both the comparison
level and the reference level. This check is currently implemented for an arbitrary design matrix
generated using the formula
and data
arguments, and for a design matrix with no more
than one categorical covariate if the design matrix X
is input directly.
'B' contains parameter estimates in matrix format (rows indexing covariates and
columns indexing outcome category / taxon).
'penalized' is equal to TRUE f Firth penalty is used in estimation (default) and FALSE otherwise.
'z_hat' returns the nuisance parameters calculated in Equation 7 of the radEmu manuscript,
corresponding to either 'Y_augmented' or 'Y' if the 'penalized' is equal to TRUE
or FALSE, respectively.
I' and 'Dy' contain an information matrix and empirical score covariance matrix computed under the
full model.
'score_test_hyperparams' contains parameters and hyperparameters related to estimation under the null,
including whether or not the algorithm converged, which can be helpful for debugging.