Fit radEmu model
Usage
emuFit(
Y,
X = NULL,
formula = NULL,
data = NULL,
assay_name = NULL,
cluster = NULL,
constraint_fn = pseudohuber_median,
constraint_grad_fn = dpseudohuber_median_dx,
constraint_param = 0.1,
verbose = FALSE,
match_row_names = TRUE,
unobserved_taxon_error = TRUE,
penalize = TRUE,
B = NULL,
fitted_model = NULL,
refit = TRUE,
tolerance = 1e-04,
maxit = 1000,
alpha = 0.05,
return_wald_p = FALSE,
compute_cis = TRUE,
max_abs_B = 250,
run_score_tests = TRUE,
test_kj = NULL,
null_fit_alg = NULL,
B_null_list = NULL,
maxit_null = 1000,
tol_lik = 1e-05,
tol_test_stat = 0.01,
tol_discrete = 0.01,
null_window = 5,
null_diagnostic_plots = FALSE,
remove_zero_comparison_pvals = 0.01,
control = NULL,
...
)Arguments
- Y
an n x J matrix or dataframe of nonnegative observations, or a phyloseq object containing an otu table and sample data.
- X
an n x p design matrix (either provide
Xordataandformula)- formula
a one-sided formula specifying the form of the mean model to be fit (used with
data)- data
an n x p data frame containing variables given in
formula- assay_name
a string containing the desired assay name within a
TreeSummarizedExperimentobject. This is only required if Y is aTreeSummarizedExperimentobject, otherwise this argument can be ignored.- cluster
a vector giving cluster membership for each row of Y to be used in computing GEE test statistics. Default is NULL, in which case rows of Y are treated as independent.
- constraint_fn
(Optional) User-provided constraint function, if default behavior of comparing log fold-difference parameters to smoothed median over all categories is not desired. If a number is provided a single category constraint will be used with the provided category as a reference category. This argument can either be a single constraint function to be used for all rows of B, or a list of length p of constraints to be used for each row of B.
- constraint_grad_fn
(Optional) User-provided derivative of constraint function, if default behavior of comparing log fold-difference parameters to smoothed median over all categories is not desired. If
constraint_fnis a list of constraint functions, then this argument must also be a list. Ifconstraint_fnis a single number, or a list that includes a single number, then the correspondingconstraint_grad_fncan be set toNULL, and will be appropriately set within the function.- constraint_param
(Optional) If the smoothed median is used as a constraint (this is the default), parameter controlling relative weighting of elements closer and further from center. (Limit as
constraint_paramapproaches infinity is the mean; as this parameter approaches zero, the minimizer of the pseudo-Huber loss approaches the median.) If constraint function is not smoothed median (implemented inradEmu::pseudohuber_median()) then this argument will be ignored.- verbose
provide updates as model is being fitted? Defaults to
FALSE. If user setsverbose = TRUE, then key messages about algorithm progress will be displayed. If user setsverbose = "development", then key messages and technical messages about convergence will be displayed. Most users who want status updates should setverbose = TRUE.- match_row_names
logical: If
TRUE, make sure rows on covariate data and response data correspond to the same sample by comparing row names and subsetting/reordering if necessary. Default isTRUE.- unobserved_taxon_error
logical: should an error be thrown if Y includes taxa that have 0 counts for all samples? Default is TRUE.
- penalize
logical: should Firth penalty be used? Default is
TRUE. Used in estimation.- B
starting value of coefficient matrix (p x J) for estimation. If not provided, B will be initiated as a zero matrix. Used in estimation.
- fitted_model
a fitted model produced by a separate call to emuFit; to be provided if score tests are to be run without refitting the full unrestricted model. Default is
NULL.- refit
logical: if
Borfitted_modelis provided, should full model be fit (TRUE) or should fitting step be skipped (FALSE), e.g., if score tests are to be run on an already fitted model. Default isTRUE.- tolerance
tolerance for stopping criterion in estimation; once no element of B is updated by more than this value in a single step, we exit optimization. Defaults to
1e-3. Used in estimation.- maxit
maximum number of outer iterations to perform before exiting optimization. Default is
1000. Used in estimation.- alpha
nominal type 1 error level to be used to construct confidence intervals. Default is
0.05(corresponding to 95% confidence intervals)- return_wald_p
logical: return p-values from Wald tests? Default is
FALSE.- compute_cis
logical: compute and return Wald CIs? Default is
TRUE.- max_abs_B
maximum allowed value for elements of B (in absolute value) in full model fitting. In most cases this is not needed as Firth penalty will prevent infinite estimates under separation. However, such a threshold may be helpful in very poorly conditioned problems (e.g., with many nearly collinear regressors). Default is
250.- run_score_tests
logical: perform robust score testing? Default is TRUE.
- test_kj
a data frame whose rows give coordinates (in category j and covariate k) of elements of B to construct hypothesis tests for.
kcould also be the name of a covariate included inXordata. If you don't know which coordinates k correspond to the covariate(s) that you would like to test, run the functionradEmu::make_design_matrix()in order to view the design matrix, and identify which column of the design matrix corresponds to each covariate in your model. This argument is required when running score tests.- null_fit_alg
Which null fitting algorithm to use for score tests:
"constraint_sandwich"or"augmented_lagrangian", or"discrete"when design matrix only includes categorical covariates. Default and recommended approach is"constraint_sandwich", or"discrete"for a design matrix with only categorical covariates andJ < 150. Augmented lagrangian is used whenJ < 20.- B_null_list
list of starting values of coefficient matrix (p x J) for null estimation for score testing. This should either be a list with the same length as
test_kj. If you only want to provide starting values for some tests, include the other elements of the list asNULL.- maxit_null
maximum number of outer iterations to perform before exiting optimization. Default is
1000. Used in estimation under null hypothesis for score tests.- tol_lik
tolerance for relative changes in likelihood for stopping criteria. Default is
1e-5. Used in estimation under null hypothesis for score tests with "constraint_sandwich" algorithm.- tol_test_stat
tolerance for relative changes in test statistic for stopping criteria. Default is
0.01. Used in estimation under null hypothesis for score tests with "constraint_sandwich" algorithm.- tol_discrete
tolerance for the root mean norm of the score vector, for stopping criteria. Default is
0.01. Used in estimation under null hypothesis for score tests with "discrete" algorithm (for discrete designs).- null_window
window to use for stopping criteria (this many iterations where stopping criteria is met). Default is
5. Used in estimation under null hypothesis for score tests with "constraint_sandwich" algorithm.- null_diagnostic_plots
logical: should diagnostic plots be made for estimation under the null hypothesis? Default is
FALSE.- remove_zero_comparison_pvals
Should score p-values be replaced with NA for zero-comparison parameters? These parameters occur for categorical covariates with three or more levels, and represent parameters that compare a covariate level to the reference level for a category in which the comparison level and reference level both have 0 counts in all samples. These parameters can have misleadingly small p-values and are not thought to have scientifically interesting signals. We recommend removing them before analyzing data further. If TRUE, all zero-comparison parameter p-values will be set to NA. If FALSE no zero-comparison parameter p-values will be set to NA. If a value between 0 and 1, all zero-comparison p-values below the value will be set to NA. Default is
0.01.- control
A list of control parameters, to have more control over estimation and hypothesis testing. See
control_fnfor details.- ...
Additional arguments. Arguments matching the names of
control_fn()options are forwarded to that function and override defaults. Unknown arguments are ignored with a warning.
Value
emuFit returns a list containing elements coef, B, penalized, Y_augmented,
z_hat, I, Dy, and score_test_hyperparams if score tests are run.
The coef table contains log fold-difference parameter estimates by covariate and outcome
category (e.g., taxon for microbiome data). A log fold-difference estimate of 1 for a treatment (versus control) and taxon X can be
interpreted to say that we expect taxon X is exp(1) = 2.72 times more abundant in someone who has received the treatment compared to
someone who has received the control (holding all other covariates equal), when compared to typical fold-differences in abundances of
taxa in this analysis.
coef also includes optionally-computed confidence intervals and robust Wald p-values.
Robust score statistics and score test p-values are also included in coef.
As explained in the associated manuscript, we recommend use of the robust score test values instead of the robust
Wald test p-values, due to better error rate control (i.e. fewer false positives).
If there are any zero-comparison parameters in the model, a column "zero_comparison"
is also included, which is TRUE for any parameters that compare the level of a categorical
covariate to a reference level for a category with only zero counts for both the comparison
level and the reference level. This check is currently implemented for an arbitrary design matrix
generated using the formula and data arguments, and for a design matrix with no more
than one categorical covariate if the design matrix X is input directly.
B contains parameter estimates in matrix format (rows indexing covariates and
columns indexing outcome category / taxon).
penalized is TRUE if Firth penalty is used in estimation (default) and FALSE otherwise.
z_hat returns the nuisance parameters (sample-specific sequencing effects).
I and Dy contain an information matrix and empirical score covariance matrix computed under the
full model.
score_test_hyperparams contains parameters and hyperparameters related to estimation under the null,
including whether or not the algorithm converged, which can be helpful for debugging.
Examples
# data frame example
data(wirbel_sample_small)
data(wirbel_otu_small)
emuRes <- emuFit(formula = ~ Group, data = wirbel_sample_small, Y = wirbel_otu_small,
test_kj = data.frame(k = 2, j = 1), tolerance = 0.01)
# here we set large tolerances for the example to run quickly,
# but we recommend smaller tolerances in practice
# TreeSummarizedExperiment example (only run this if you have TreeSummarizedExperiment installed)
if (FALSE) { # \dontrun{
library("TreeSummarizedExperiment")
example("TreeSummarizedExperiment")
assayNames(tse) <- "counts"
emuRes <- emuFit(Y = tse, formula = ~ condition, assay_name = "counts",
test_kj = data.frame(k = 2, j = 1), tolerance = 0.01)
# here we set large tolerances for the example to run quickly,
# but we recommend smaller tolerances in practice
} # }
