parsnip

Introduction

The goal of parsnip is to provide a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.

Installation

 # The easiest way to get parsnip is to install all of tidymodels:
 install.packages("tidymodels")
 
 # Alternatively, install just parsnip:
 install.packages("parsnip")
 
 # Or the development version from GitHub:
 # install.packages("pak")
 pak::pak("tidymodels/parsnip")

Getting started

One challenge with different modeling functions available in R that do the same thing is that they can have different interfaces and arguments. For example, to fit a random forest regression model, we might have:

 # From randomForest
 rf_1 <- randomForest(
   y ~ .,
   data = dat,
   mtry = 10,
   ntree = 2000,
   importance = TRUE
 )

 # From ranger
 rf_2 <- ranger(
   y ~ .,
   data = dat,
   mtry = 10,
   num.trees = 2000,
   importance = "impurity"
 )

 # From sparklyr
 rf_3 <- ml_random_forest(
   dat,
   intercept = FALSE,
   response = "y",
   features = names(dat)[names(dat) != "y"],
   col.sample.rate = 10,
   num.trees = 2000
 )

Note that the model syntax can be very different and that the argument names (and formats) are also different. This is a pain if you switch between implementations.

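In this example, the type of model is "random forest", the mode of the model is "regression" (as opposed to classification), and the computational engine is the name of the R package.

The goals of parsnip are to separate the definition of a model from its evaluation, to decouple the model specification from the implementation (whether the implementation is in R, Spark, or something else), and to harmonize argument names (e.g. n.trees, ntree, trees) so that users only need to remember a single name.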

Using the example above, the parsnip approach would be:

 library(parsnip)
 
 rand_forest(mtry = 10, trees = 2000) |>
   set_engine("ranger", importance = "impurity") |>
   set_mode("regression")
 #> Random Forest Model Specification (regression)
 #> 
 #> Main Arguments:
 #> mtry = 10
 #> trees = 2000
 #> 
 #> Engine-Specific Arguments:
 #> importance = impurity
 #> 
 #> Computational engine: ranger
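
To see how the harmonized argument names map onto each engine's own arguments, a specification can be passed to parsnip's translate(). The following is a minimal sketch, assuming parsnip is installed; the object names spec, ranger_call and rf_call are chosen here for illustration, and the printed call templates are not reproduced because they can vary with the parsnip version.

```r
library(parsnip)

# One specification written with parsnip's harmonized argument names
spec <- rand_forest(mtry = 10, trees = 2000) |>
  set_mode("regression")

# translate() fills in the call template that parsnip would use for an engine
ranger_call <- spec |> set_engine("ranger") |> translate()
rf_call <- spec |> set_engine("randomForest") |> translate()

# Printing each object shows the engine-specific argument names
# (for example, how `trees` is renamed for each package)
ranger_call
rf_call
```

Comparing the two printed templates is a quick way to check which underlying argument a parsnip argument is mapped to before fitting anything.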

The engine can be easily changed. To use Spark, only the set_engine() call needs to change:

 rand_forest(mtry = 10, trees = 2000) |>
   set_engine("spark") |>
   set_mode("regression")
 #> Random Forest Model Specification (regression)
 #> 
 #> Main Arguments:
 #> mtry = 10
 #> trees = 2000
 #> 
 #> Computational engine: spark
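
Before switching engines, it can help to check which ones are available for a given model type. A short sketch using parsnip's show_engines() follows; it assumes parsnip is installed, and the variable name engines is illustrative. The exact rows returned depend on the parsnip version and on which extension packages are loaded, so no output is shown.

```r
library(parsnip)

# List the engine and mode combinations currently registered for rand_forest()
engines <- show_engines("rand_forest")

# A tibble with one row per engine/mode pair
engines
```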

Either one of these model specifications can be fit in the same way:

 set.seed(192)
 rand_forest(mtry = 10, trees = 2000) |>
   set_engine("ranger", importance = "impurity") |>
   set_mode("regression") |>
   fit(mpg ~ ., data = mtcars)
 #> parsnip model object
 #> 
 #> Ranger result
 #> 
 #> Call:
 #> ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~10, x), num.trees = ~2000, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) 
 #> 
 #> Type: Regression 
 #> Number of trees: 2000 
 #> Sample size: 32 
 #> Number of independent variables: 10 
 #> Mtry: 10 
 #> Target node size: 5 
 #> Variable importance mode: impurity 
 #> Splitrule: variance 
 #> OOB prediction error (MSE): 5.976917 
 #> R squared (OOB): 0.8354559
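
Once a model is fit, predictions follow the same pattern regardless of engine. The sketch below assumes both parsnip and ranger are installed; the names rf_fit and preds are introduced here for illustration, and the first rows of the training data are reused as new data only to keep the example self-contained.

```r
library(parsnip)

set.seed(192)

# Fit the same specification as above and keep the fitted object
rf_fit <- rand_forest(mtry = 10, trees = 2000) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)

# predict() takes the data through `new_data` and, for regression,
# returns a tibble with a `.pred` column and one row per input row
preds <- predict(rf_fit, new_data = head(mtcars))
preds
```

In practice the new data would be a held-out set rather than rows the model has already seen.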

A list of all parsnip models across different CRAN packages can be found at https://www.tidymodels.org/find/parsnip/.

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
