sparseR: Implement ranked sparsity for selecting interactions and polynomials
Description
The sparseR package implements various techniques for selecting from
a set of interaction and polynomial terms under ranked sparsity. Additional
tools for data pre-processing, post-selection inference, and visualization
are also included.
Author(s)
Maintainer: Ryan Andrew Peterson ryan.a.peterson@cuanschutz.edu (ORCID)
See Also
Useful links:
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs
A value or the magrittr placeholder.
rhs
A function call using the magrittr semantics.
Value
The result of calling 'rhs(lhs)'.
Custom IC functions for stepwise models
Description
Custom IC functions for stepwise models
Usage
EBIC(...)
## Default S3 method:
EBIC(fit, varnames, pen_info, gammafn = NULL, return_df = TRUE, ...)
RBIC(fit, ...)
## Default S3 method:
RBIC(fit, varnames, pen_info, gammafn = NULL, return_df = TRUE, ...)
RAIC(fit, ...)
## Default S3 method:
RAIC(fit, varnames, pen_info, gammafn = NULL, return_df = TRUE, ...)
Arguments
...
additional args
fit
a fitted object
varnames
names of variables
pen_info
penalty information
gammafn
What to use for gamma in formula
return_df
should the deg. freedom be returned
Value
A vector of values for the criterion requested, and the degrees of freedom (appended to front of vector) if return_df == TRUE.
Data sets
Description
Detrano data sets (cleveland, hungarian, switzerland, va); The Iowa Radon Lung Cancer Study (irlcs_radon_syn): Data simulated to resemble the IRLCS study; Sheddon survival data (Z: clinical covariates, S:survival outcome)
Usage
cleveland
hungarian
switzerland
va
irlcs_radon_syn
Z
S
Format
An object of class data.frame with 303 rows and 14 columns.
An object of class data.frame with 294 rows and 14 columns.
An object of class data.frame with 123 rows and 14 columns.
An object of class data.frame with 200 rows and 14 columns.
An object of class data.frame with 1027 rows and 16 columns.
An object of class data.frame with 442 rows and 6 columns.
An object of class Surv with 442 rows and 2 columns.
Source
- Detrano data
- IRLCS data sets
https://cheec.uiowa.edu/research/residential-radon-and-lung-cancer-case-control-study
- Sheddon
https://www.gsea-msigdb.org/gsea/msigdb/cards/SHEDDEN_LUNG_CANCER_POOR_SURVIVAL_A6
References
- Detrano
-
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64,304–310.
- IRLCS
-
FIELD, R., SMITH, B., STECK, D. et al. Residential radon exposure and lung cancer: Variation in risk estimates using alternative exposure scenarios. J Expo Sci Environ Epidemiol 12, 197–203 (2002). https://www.nature.com/articles/7500215
- Shedden
-
Director's Challenge Consortium for the Molecular Classification of Lung Adenocarcinoma, Shedden, K., Taylor, J. M., Enkemann, S. A., Tsao, M. S., Yeatman, T. J., Gerald, W. L., Eschrich, S., Jurisica, I., Giordano, T. J., Misek, D. E., Chang, A. C., Zhu, C. Q., Strumpf, D., Hanash, S., Shepherd, F. A., Ding, K., Seymour, L., Naoki, K., Pennell, N., ... Beer, D. G. (2008). Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature medicine, 14(8), 822–827. https://www.nature.com/articles/nm.1790
Plot relevant effects of a sparseR object
Description
Plot relevant effects of a sparseR object
Usage
effect_plot(fit, ...)
## S3 method for class 'sparseR'
effect_plot(
fit,
coef_name,
at = c("cvmin", "cv1se"),
by = NULL,
by_levels,
nn = 101,
plot.args = list(),
resids = TRUE,
legend_location = "bottomright",
...
)
## S3 method for class 'sparseRBIC'
effect_plot(
fit,
coef_name,
by = NULL,
by_levels,
nn = 101,
plot.args = list(),
resids = TRUE,
legend_location = "bottomright",
...
)
Arguments
fit
a 'sparseR' object
...
additional arguments
coef_name
The name of the coefficient to plot along the x-axis
at
value of lambda to use
by
the variable(s) involved in the (possible) interaction
by_levels
values to cut continuous by variable (defaults to 3 quantiles)
nn
number of points to plot along prediction line
plot.args
list of arguments passed to the plot itself
resids
should residuals be plotted or not?
legend_location
location for legend passed to 'legend'
Value
nothing returned
Nothing (invisible) returned
Helper function to help set up penalties
Description
Helper function to help set up penalties
Usage
get_penalties(
varnames,
poly,
poly_prefix = "poly_",
int_sep = "\\:",
pool = FALSE,
gamma = 0.5,
cumulative_k = FALSE,
cumulative_poly = TRUE
)
Arguments
varnames
names of the covariates in the model matrix
poly
max polynomial considered
poly_prefix
what comes before the polynomial specification in these varnames?
int_sep
What denotes the multiplication for interactions?
pool
Should polynomials and interactions be pooled?
gamma
How much should the penalty increase with group size (0.5 assumes equal contribution of prior information)
cumulative_k
Should penalties be increased cumulatively as order interaction increases? (only used if !pool)
cumulative_poly
Should penalties be increased cumulatively as order polynomial increases? (only used if !pool)
Details
This is primarily a helper function for sparseR, but it may be useful if doing the model matrix set up by hand.
Value
a list of relevant information for the variables, including:
penalties
the numeric value of the penalties
vartype
Variable type (main effect, order k interaction, etc)
varname
names of variables
Plot relevant properties of sparseR objects
Description
Plot relevant properties of sparseR objects
Usage
## S3 method for class 'sparseR'
plot(x, plot_type = c("both", "cv", "path"), cols = NULL, log.l = TRUE, ...)
Arguments
x
a 'sparseR' object
plot_type
should the solution path, CV results, or both be plotted?
cols
option to specify color of groups
log.l
should the x-axis (lambda) be logged?
...
extra plotting options
Value
nothing returned
Predict coefficients or responses for sparseR object
Description
Predict coefficients or responses for sparseR object
Usage
## S3 method for class 'sparseR'
predict(object, newdata, lambda, at = c("cvmin", "cv1se"), ...)
## S3 method for class 'sparseR'
coef(object, lambda, at = c("cvmin", "cv1se"), ...)
Arguments
object
sparseR object
newdata
new data on which to make predictions
lambda
a particular value of lambda to predict with
at
a "smart" guess to use for lambda
...
additional arguments passed to predict.ncvreg
Value
predicted outcomes for 'newdata' (or coefficients) at specified (or smart) lambda value
Print sparseR object
Description
Print sparseR object
Usage
## S3 method for class 'sparseR'
print(x, prep = FALSE, ...)
Arguments
x
a sparseR object
prep
Should the SR set-up information be printed as well?
...
additional arguments passed to print.ncvreg
Value
returns x invisibly
Fit a ranked-sparsity model with regularized regression
Description
Fit a ranked-sparsity model with regularized regression
Usage
sparseR(
formula,
data,
family = c("gaussian", "binomial", "poisson", "coxph"),
penalty = c("lasso", "MCP", "SCAD"),
alpha = 1,
ncvgamma = 3,
lambda.min = 0.005,
k = 1,
poly = 2,
gamma = 0.5,
cumulative_k = FALSE,
cumulative_poly = TRUE,
pool = FALSE,
ia_formula = NULL,
pre_process = TRUE,
model_matrix = NULL,
y = NULL,
poly_prefix = "_poly_",
int_sep = "\\:",
pre_proc_opts = c("knnImpute", "scale", "center", "otherbin", "none"),
filter = c("nzv", "zv"),
extra_opts = list(),
...
)
Arguments
formula
Names of the terms
data
Data
family
The family of the model
penalty
What penalty should be used (lasso, MCP, or SCAD)
alpha
The mix of L1 penalty (lower values introduce more L2 ridge penalty)
ncvgamma
The tuning parameter for ncvreg (for MCP or SCAD)
lambda.min
The minimum value to be used for lambda (as ratio of max, see ?ncvreg)
k
The maximum order of interactions to consider (default: 1; all pairwise)
poly
The maximum order of polynomials to consider (default: 2)
gamma
The degree of extremity of sparsity rankings (see details)
cumulative_k
Should penalties be increased cumulatively as order interaction increases?
cumulative_poly
Should penalties be increased cumulatively as order polynomial increases?
pool
Should interactions of order k and polynomials of order k+1 be pooled together for calculating the penalty?
ia_formula
formula to be passed to step_interact (for interactions, see details)
pre_process
Should the data be preprocessed (if FALSE, must provide model_matrix)
model_matrix
A data frame or matrix specifying the full model matrix (used if !pre_process)
y
A vector of responses (used if !pre_process)
poly_prefix
If model_matrix is specified, what is the prefix for polynomial terms?
int_sep
If model_matrix is specified, what is the separator for interaction terms?
pre_proc_opts
List of preprocessing steps (see details)
filter
The type of filter applied to main effects + interactions
extra_opts
A list of options for all preprocess steps (see details)
...
Additional arguments (passed to fitting function)
Details
Selecting gamma: higher values of gamma will penalize "group" size more. By
default, this is set to 0.5, which yields equal contribution of prior
information across orders of interactions/polynomials (this is a good
default for most settings).
Additionally, setting cumulative_poly or cumulative_k to TRUE increases
the penalty cumulatively based on the order of either polynomial or
interaction.
The options that can be passed to pre_proc_opts are: - knnImpute (should
missing data be imputed?) - scale (should data be standardized)? - center
(should data be centered to the mean or another value?) - otherbin (should
factors with low prevalence be combined?) - none (should no preprocessing be
done? can also specify a null object)
The options that can be passed to extra_opts are:
centers (named numeric vector which denotes where each covariate should be centered)
center_fn (alternatively, a function can be specified to calculate center such as
minormedian)freq_cut, unique_cut (see ?step_nzv; these get used by the filtering steps)
neighbors (the number of neighbors for knnImpute)
one_hot (see ?step_dummy), this defaults to cell-means coding which can be done in regularized regression (change at your own risk)
raw (should polynomials not be orthogonal? defaults to true because variables are centered and scaled already by this point by default)
ia_formula will by default interact all variables with each other up
to order k. If specified, ia_formula will be passed as the terms argument
to recipes::step_interact, so the help documentation for that function
can be investigated for further assistance in specifying specific
interactions.
Value
an object of class sparseR containing the following:
fit
the fit object returned by ncvreg
srprep
a recipes object used to prep the data
pen_factors
the factor multiple on penalties for ranked sparsity
results
all coefficients and penalty factors at minimum CV lambda
results_summary
a tibble of summary results at minimum CV lambda
results1se
all coefficients and penalty factors at lambda_1se
results1se_summary
a tibble of summary results at lambda_1se
data
the (unprocessed) data
family
the family argument (for non-normal, eg. poisson)
info
a list containing meta-info about the procedure
References
For fitting functionality, the ncvreg package is used; see
Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex
penalized regression, with applications to biological feature selection. Ann.
Appl. Statist., 5: 232-253.
Bootstrap procedure for stepwise regression
Description
Runs bootstrap on models selection procedure using RBIC to find bootstrapped standard error (smoothed, see Efron 2014) as well as selection percentage across candidate variables. (experimental)
Usage
sparseRBIC_bootstrap(srbic_fit, B = 100, quiet = FALSE)
Arguments
srbic_fit
An object fitted by sparseRBIC_step
B
Number of bootstrap samples
quiet
Should the display of a progress bar be silenced?
Value
a list containing:
results
a tibble containing coefficients, p-values, selection pct
bootstraps
a tibble of bootstrapped coefficients
Sample split procedure for stepwise regression
Description
Runs multiple on models selection procedures using RBIC to achieve valid inferential results post-selection
Usage
sparseRBIC_sampsplit(srbic_fit, S = 100, quiet = FALSE)
Arguments
srbic_fit
An object fitted by sparseRBIC_step
S
Number of splitting iterations
quiet
Should the display of a progress bar be silenced?
Value
a list containing:
results
a tibble containing coefficients, p-values, selection pct
splits
a tibble of different split-based coefficients
Fit a ranked-sparsity model with forward stepwise RBIC (experimental)
Description
Fit a ranked-sparsity model with forward stepwise RBIC (experimental)
Usage
sparseRBIC_step(
formula,
data,
family = c("gaussian", "binomial", "poisson"),
k = 1,
poly = 1,
ic = c("RBIC", "RAIC", "BIC", "AIC", "EBIC"),
hier = c("strong", "weak", "none"),
sequential = (hier[1] != "none"),
cumulative_k = FALSE,
cumulative_poly = TRUE,
pool = FALSE,
ia_formula = NULL,
pre_process = TRUE,
model_matrix = NULL,
y = NULL,
poly_prefix = "_poly_",
int_sep = "\\:",
pre_proc_opts = c("knnImpute", "scale", "center", "otherbin", "none"),
filter = c("nzv", "zv"),
extra_opts = list(),
trace = 0,
message = TRUE,
...
)
Arguments
formula
Names of the terms
data
Data
family
The family of the model
k
The maximum order of interactions to consider
poly
The maximum order of polynomials to consider
ic
The information criterion to use
hier
Should hierarchy be enforced (weak or strong)? Must be set with sequential == TRUE (see details)
sequential
Should the main effects be considered first, orders sequentially added/considered?
cumulative_k
Should penalties be increased cumulatively as order interaction increases?
cumulative_poly
Should penalties be increased cumulatively as order polynomial increases?
pool
Should interactions of order k and polynomials of order k+1 be pooled together for calculating the penalty?
ia_formula
formula to be passed to step_interact via terms argument
pre_process
Should the data be preprocessed (if FALSE, must provide model_matrix)
model_matrix
A data frame or matrix specifying the full model matrix (used if !pre_process)
y
A vector of responses (used if !pre_process)
poly_prefix
If model_matrix is specified, what is the prefix for polynomial terms?
int_sep
If model_matrix is specified, what is the separator for interaction terms?
pre_proc_opts
List of preprocessing steps (see details)
filter
The type of filter applied to main effects + interactions
extra_opts
A list of options for all preprocess steps (see details)
trace
Should intermediate results of model selection process be output
message
should experimental message be suppressed
...
additional arguments for running stepwise selection
Details
This function mirrors sparseR but uses stepwise selection guided by RBIC.
Additionally, setting cumulative_poly or cumulative_k to TRUE increases
the penalty cumulatively based on the order of either polynomial or
interaction.
The hier hierarchy enforcement will only work if sequential == TRUE, and
notably will only consider the "first gen" hierarchy, that is, that all
main effects which make up an interaction are already in the model. It
is therefore possible for a third order interaction (x1:x2:x3) to
enter a model without x1:x2 or x2:x3, so long as x1, x2, and x3 are all
in the model.
The options that can be passed to pre_proc_opts are:
knnImpute (should missing data be imputed?)
scale (should data be standardized)?
center (should data be centered to the mean or another value?)
otherbin (should factors with low prevalence be combined?)
none (should no preprocessing be done? can also specify a null object)
The options that can be passed to extra_opts are:
centers (named numeric vector which denotes where each covariate should be centered)
center_fn (alternatively, a function can be specified to calculate center such as
minormedian)freq_cut, unique_cut (see ?step_nzv - these get used by the filtering steps)
neighbors (the number of neighbors for knnImpute)
one_hot (see ?step_dummy), this defaults to cell-means coding which can be done in regularized regression (change at your own risk)
raw (should polynomials not be orthogonal? defaults to true because variables are centered and scaled already by this point by default)
Value
an object of class sparseRBIC containing the following:
fit
the final fit object
srprep
a recipes object used to prep the data
pen_info
coefficient-level variable counts, types + names
data
the (unprocessed) data
family
the family argument (for non-normal, eg. poisson)
info
a list containing meta-info about the procedure
stats
the IC for each fit and respective terms included
Preprocess & create a model matrix with interactions + polynomials
Description
Preprocess & create a model matrix with interactions + polynomials
Usage
sparseR_prep(
formula,
data,
k = 1,
poly = 1,
pre_proc_opts = c("knnImpute", "scale", "center", "otherbin", "none"),
ia_formula = NULL,
filter = c("nzv", "zv"),
extra_opts = list(),
family = "gaussian"
)
Arguments
formula
A formula of the main effects + outcome of the model
data
A required data frame or tibble containing the variables in
formula
k
Maximum order of interactions to numeric variables
poly
the maximum order of polynomials to consider
pre_proc_opts
A character vector specifying methods for preprocessing (see details)
ia_formula
formula to be passed to step_interact (for interactions, see details)
filter
which methods should be used to filter out variables with (near) zero variance? (see details)
extra_opts
extra options to be used for preprocessing
family
family passed from sparseR
Details
The pre_proc_opts acts as a wrapper for the corresponding procedures in the
recipes package. The currently supported options that can be passed to
pre_proc_opts are: knnImpute: Should k-nearest-neighbors be performed (if
necessary?) scale: Should variables be scaled prior to creating interactions
(does not scale factor variables or dummy variables) center: Should variables
be centered (will not center factor variables or dummy variables ) otherbin:
ia_formula will by default interact all variables with each other up
to order k. If specified, ia_formula will be passed as the terms argument
to recipes::step_interact, so the help documentation for that function
can be investigated for further assistance in specifying specific
interactions.
The methods specified in filter are important; filtering is necessary to cut down on extraneous polynomials and interactions (in cases where they really don't make sense). This is true, for instance, when using dummy variables in polynomials , or when using interactions of dummy variables that relate to the same categorical variable.
Value
an object of class recipe; see recipes::recipe()
Centering numeric data to a value besides their mean
Description
'step_center_to' generalizes 'step_center' to allow for a different function than the 'mean' function to calculate centers. It creates a *specification* of a recipe step that will normalize numeric data to have a 'center' of zero.
Usage
step_center_to(
recipe,
...,
role = NA,
trained = FALSE,
centers = NULL,
center_fn = mean,
na_rm = TRUE,
skip = FALSE,
id = rand_id("center_to")
)
## S3 method for class 'step_center_to'
tidy(x, ...)
Arguments
recipe
A recipe object. The step will be added to the sequence of operations for this recipe.
...
One or more selector functions to choose which variables are affected by the step. See [selections()] for more details. For the 'tidy' method, these are not currently used.
role
Not used by this step since no new variables are created.
trained
A logical to indicate if the quantities for preprocessing have been estimated.
centers
A named numeric vector of centers. This is 'NULL' until computed by [prep.recipe()] (or it can be specified as a named numeric vector as well?).
center_fn
a function to be used to calculate where the center should be
na_rm
A logical value indicating whether 'NA' values should be removed during computations.
skip
A logical. Should the step be skipped when the recipe is baked by [bake.recipe()]? While all operations are baked when [prep.recipe()] is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using 'skip = TRUE' as it may affect the computations for subsequent operations
id
A character string that is unique to this step to identify it.
x
A 'step_center_to' object.
Details
Centering data means that the average of a variable is subtracted from the data. 'step_center_to' estimates the variable centers from the data used in the 'training' argument of 'prep.recipe'. 'bake.recipe' then applies the centering to new data sets using these centers.
Value
An updated version of 'recipe' with the new step added to the sequence of existing steps (if any). For the 'tidy' method, a tibble with columns 'terms' (the selectors or variables selected) and 'value' (the centers).
See Also
[recipe()] [prep.recipe()] [bake.recipe()]
Examples
data(biomass, package = "modeldata")
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]
rec <- recipes::recipe(
HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
data = biomass_tr)
center_trans <- rec %>%
step_center_to(carbon, contains("gen"), -hydrogen)
center_obj <- recipes::prep(center_trans, training = biomass_tr)
transformed_te <- recipes::bake(center_obj, biomass_te)
biomass_te[1:10, names(transformed_te)]
transformed_te
recipes::tidy(center_trans)
recipes::tidy(center_obj)
Summary of sparseR model coefficients
Description
Summary of sparseR model coefficients
Usage
## S3 method for class 'sparseR'
summary(object, lambda, at = c("cvmin", "cv1se"), ...)
Arguments
object
a sparseR object
lambda
a particular value of lambda to predict with
at
a "smart" guess to use for lambda
...
additional arguments to be passed to summary.ncvreg
Value
an object of class 'summary.ncvreg' at specified or smart value of lambda.