tidysynthesis

This repository contains the tidysynthesis R package for generating synthetic data. Complete documentation is available is available at the tidysynthesis documentation website.

Installation:

 # The easiest way to get tidysynthesis is from CRAN:
 install.packages("tidysynthesis")
 
 # Or the development version from GitHub:
 # install.packages("pak")
pak::pak("UrbanInstitute/tidysynthesis")

Overview

tidysynthesis is a "metapackage" for creating synthetic data sets for statistical disclosure limitation that shares the underlying design philosophy, grammar, and data structures of the tidyverse and tidymodels. tidysynthesis flexibly supports sequential synthesis modeling and sampling specifications with different formal and empirical privacy properties.

Note that the privacy and security properties of tidysynthesis’s outputs rely on many different technical assumptions. Our goal is to make the package agnostic to many of these assumptions, which places greater responsibility on users to evaluate synthetic data prior to dissemination. For more information, see our notes on security principles from our documentation.

presynth

Syntheses ultimately depend on a presynth object with two main components: a roadmap that outlines the macroscopic workflow (i.e., what order to synthesize variables, how variables are defined and relate to one another, etc.) and a synth_spec that details how individual variables are synthesized (i.e. how are specific output variable modeling workflows specified, how are new synthetic samples generated, etc.). Here is the general workflow, with required objects in blue and optional objects in magenta.

flowchart TD
 A[conf_data]:::required --> B[roadmap]:::required
 C[start_data]:::required --> B
 D[start_method]:::optional --> B
 E[schema]:::optional --> B 
 F[visit_sequence]:::optional --> B
 G[replicates]:::optional --> B
 H[constraints]:::optional --> B
 I[models]:::required --> J[synth_spec]:::required
 K[samplers]:::required --> J
 L[steps]:::optional --> J
 M[noise]:::optional --> J
 N[tuners]:::optional --> J
 O[extractors]:::optional --> J
 B --> P[presynth]:::required 
 J --> P
 P --> Q[postsynth]:::required
 classDef required fill:#1696d2,stroke:#1696d2;
 classDef optional fill:#ec008b,stroke:#ec008b;

roadmap Components

roadmap objects require data inputs. All other inputs can be optionally supplied as S3 objects or constructed by default. roadmaps also have a tidymodels-style API that lets you update objects, for example:

 # create an example roadmap
 roadmap(conf_data = example_conf_data, 
 start_data = example_start_data) |>
 # add an example visit_sequence using the API
 add_sequence_manual(var1, var2) |>
 # update the example schema using the API
 update_schema(col_schema = list("var1" = list("dtype" = "fct")))

See the documentation website for more API examples.

Required roadmap objects:

Optional roadmap objects:

synth_spec Components

synth_spec S3 objects allow you to specify different components using default versions for regression and classification models, or custom models mapping individual variables to components. Here is an example:

 synth_spec(
 default_regression_sampler = tidysynthesis::sample_lm,
 default_classification_sampler = tidysynthesis::sample_rpart,
 custom_samplers = list(
 list("vars" = c("var1", "var2"), "sampler" = sample_ranger)
 ),
 ...
)

These components also support updating via a tidymodels-style API; see the documentation website for more API examples.

Required synth_spec components:

Optional synth_spec components:

Examples

Code in the following set of examples synthesizes the palmerpenguins data set with missing values removed.

 library(palmerpenguins)
 library(tidyverse)
 library(tidysynthesis)
 
penguins_complete <- penguins |>
 select(-year) |>
 drop_na() |>
 mutate(
 flipper_length_mm = as.numeric(flipper_length_mm),
 body_mass_g = as.numeric(body_mass_g)
 )

All of the examples use the same starting data and visit sequence, as specified by the roadmap below.

 set.seed(20220218)
 
 # create "starting data"
starting_data <- penguins_complete |> 
 group_by(island) |>
 slice_sample(n = 5) |>
 select(species, island, sex) |>
 ungroup()
 
 # create roadmap
rm <- roadmap(
 conf_data = penguins_complete,
 start_data = starting_data
) |> 
 add_sequence_numeric(
 dplyr::where(is.numeric), 
 method = "correlation", 
 cor_var = "bill_length_mm"
 )

Example 1

Example 1 uses linear regression to synthesize the numeric data in the penguins data set. sample_lm() samples from normal distributions centered on the regression line with standard deviation equal to the residual standard error.

 # synth_spec
 
lm_mod <- parsnip::linear_reg() |> 
 parsnip::set_engine(engine = "lm") |>
 parsnip::set_mode(mode = "regression")
 
synth_spec1 <- synth_spec(
 default_regression_model = lm_mod,
 default_regression_sampler = tidysynthesis::sample_lm
)
 
 # create a presynth object
 # use defaults for noise, constraints, and replicates
presynth1 <- presynth(
 roadmap = rm,
 synth_spec = synth_spec1
)
 
 # synthesize!
 set.seed(1)
postsynth1 <- synthesize(presynth = presynth1)
#> Synthesizing 1/4 bill_length_mm...
#> Synthesizing 2/4 flipper_length_mm...
#> Synthesizing 3/4 body_mass_g...
#> Synthesizing 4/4 bill_depth_mm...
postsynth1$synthetic_data
#> # A tibble: 15 ×ばつ 7
#> species island sex bill_length_mm flipper_length_mm body_mass_g
#> <fct> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Gentoo Biscoe male 47.9 219. 5764.
#> 2 Gentoo Biscoe male 49.8 221. 5400.
#> 3 Gentoo Biscoe female 43.7 218. 4895.
#> 4 Gentoo Biscoe male 53.1 227. 5605.
#> 5 Gentoo Biscoe female 46.4 217. 4432.
#> 6 Chinstrap Dream female 45.1 196. 3308.
#> 7 Chinstrap Dream female 48.1 197. 3399.
#> 8 Adelie Dream male 42.1 195. 4075.
#> 9 Chinstrap Dream female 48.3 182. 3565.
#> 10 Adelie Dream female 35.9 189. 3624.
#> 11 Adelie Torgersen male 44.4 197. 4110.
#> 12 Adelie Torgersen female 38.1 188. 3322.
#> 13 Adelie Torgersen female 35.8 179. 3391.
#> 14 Adelie Torgersen male 35.8 189. 3997.
#> 15 Adelie Torgersen female 39.8 192. 3307.
#> # i 1 more variable: bill_depth_mm <dbl>

Example 2

synth_spec() can accept different model types. The example below is a regression tree model. Notice how all of the other objects from example 1 can be reused.

dt_mod <- parsnip::decision_tree() |>
 parsnip::set_engine(engine = "rpart") |>
 parsnip::set_mode(mode = "regression")
 
synth_spec2 <- synth_spec(
 default_regression_model = dt_mod,
 default_regression_sampler = tidysynthesis::sample_rpart
)
 
 # create a presynth object
presynth2 <- presynth(
 roadmap = rm,
 synth_spec = synth_spec2
)
 
 # synthesize!
 set.seed(1)
postsynth2 <- synthesize(presynth = presynth2)
#> Synthesizing 1/4 bill_length_mm...
#> Synthesizing 2/4 flipper_length_mm...
#> Synthesizing 3/4 body_mass_g...
#> Synthesizing 4/4 bill_depth_mm...
postsynth2$synthetic_data
#> # A tibble: 15 ×ばつ 7
#> species island sex bill_length_mm flipper_length_mm body_mass_g
#> <fct> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Gentoo Biscoe male 49.6 220 5350
#> 2 Gentoo Biscoe male 50 215 5550
#> 3 Gentoo Biscoe female 45.2 222 4700
#> 4 Gentoo Biscoe male 59.6 216 5500
#> 5 Gentoo Biscoe female 42.8 218 5000
#> 6 Chinstrap Dream female 46.6 195 3550
#> 7 Chinstrap Dream female 40.9 185 3700
#> 8 Adelie Dream male 39.2 186 3800
#> 9 Chinstrap Dream female 47.7 190 2900
#> 10 Adelie Dream female 36 176 3850
#> 11 Adelie Torgersen male 41.4 178 4050
#> 12 Adelie Torgersen female 37 174 3350
#> 13 Adelie Torgersen female 37 180 3775
#> 14 Adelie Torgersen male 41.6 195 3550
#> 15 Adelie Torgersen female 36.8 191 3200
#> # i 1 more variable: bill_depth_mm <dbl>

Example 3

Sometimes prediction error is not enough and additional noise is added to predictions. noise controls adding additional noise to predicted values. add_noise toggles on/off which variables will receive additional noise. Here, we use the function add_noise_kde() with additional arguments exclusions and n_ntiles passed to this function (see the documentation ?add_noise_kde for details.)

 # noise
 # this turns on noise for all variables and adds 0 as an exclusion for body_mass_g
noise_spec <- noise(
 add_noise = TRUE,
 noise_func = add_noise_kde,
 exclusions = 0,
 n_ntiles = 20
)
 
synth_spec3 <- synth_spec2 |>
 update_synth_spec(
 default_regression_noise = noise_spec
 )
 
presynth3 <- presynth(
 roadmap = rm,
 synth_spec = synth_spec3
)
 
 # synthesize!
 set.seed(1)
postsynth3 <- synthesize(presynth = presynth3)
#> Synthesizing 1/4 bill_length_mm...
#> Synthesizing 2/4 flipper_length_mm...
#> Synthesizing 3/4 body_mass_g...
#> Synthesizing 4/4 bill_depth_mm...
postsynth3$synthetic_data
#> # A tibble: 15 ×ばつ 7
#> species island sex bill_length_mm flipper_length_mm body_mass_g
#> <fct> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Gentoo Biscoe male 49.5 217. 5283.
#> 2 Gentoo Biscoe male 49.8 218. 5254.
#> 3 Gentoo Biscoe female 45.1 208. 4420.
#> 4 Gentoo Biscoe male 59.9 215. 5881.
#> 5 Gentoo Biscoe female 42.4 210. 4467.
#> 6 Chinstrap Dream female 47.4 195. 3321.
#> 7 Chinstrap Dream female 41.0 208. 3289.
#> 8 Adelie Dream male 39.9 195. 3914.
#> 9 Chinstrap Dream female 47.4 198. 3463.
#> 10 Adelie Dream female 36.5 184. 3072.
#> 11 Adelie Torgersen male 40.8 199. 3548.
#> 12 Adelie Torgersen female 37.6 196. 3841.
#> 13 Adelie Torgersen female 37.3 170. 3582.
#> 14 Adelie Torgersen male 41.4 189. 3461.
#> 15 Adelie Torgersen female 37.7 182. 3400.
#> # i 1 more variable: bill_depth_mm <dbl>

Example 4

tidysynthesis contains a system for specifying constraints during this synthesis process. This means constraints imposed on earlier variables are realized before later variables are synthesized. The constraints can be unconditional (e.g. penguin weight must be positive) or conditional (e.g. a Gentoo penguin must weigh at least 6,000 grams).

Constraints can be specified for numeric and/or categorical variables. Depending on the variable type, two different constraint_df_* can be specified, either constraints_df_num or constraints_df_cat (see the documentation ?constraints for examples.) max_z_num and max_z_cat controls the number of times a value should be resampled if it violates a constraint before enforcing the constraints by modifying synthesized values.

Below is an example using constraints_df_num and max_df_num.

 # create a tibble of constraints
 # skipped variables will have a minimum of -Inf and a maximum of Inf
constraints_df_num <- 
 tibble::tribble(
 ~var, ~min, ~max, ~conditions,
 "bill_length_mm", 0, Inf, "TRUE",
 "bill_length_mm", 0, Inf, "TRUE",
 "flipper_length_mm", 0, Inf, "TRUE",
 "body_mass_g", 0, Inf, "TRUE",
 "body_mass_g", 4000, 10000, "flipper_length_mm > 190",
 "body_mass_g", 6000, Inf, "species == 'Gentoo'"
 ) 
 
 # create a constraints object
constraints4 <- constraints(
 schema = rm$schema,
 constraints_df_num = constraints_df_num,
 max_z_num = list(0, 1, 2, 3)
)
 
presynth4 <- presynth(
 roadmap = rm |>
 add_constraints(constraints4),
 synth_spec = synth_spec3
)
 
 # synthesize!
 set.seed(1)
postsynth4 <- synthesize(presynth = presynth4)
#> Synthesizing 1/4 bill_length_mm...
#> Synthesizing 2/4 flipper_length_mm...
#> Synthesizing 3/4 body_mass_g...
#> Synthesizing 4/4 bill_depth_mm...
postsynth4$synthetic_data
#> # A tibble: 15 ×ばつ 7
#> species island sex bill_length_mm flipper_length_mm body_mass_g
#> <fct> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 Gentoo Biscoe male 49.5 217. 6001.
#> 2 Gentoo Biscoe male 49.8 218. 6309.
#> 3 Gentoo Biscoe female 45.1 208. 6000 
#> 4 Gentoo Biscoe male 59.9 215. 6000 
#> 5 Gentoo Biscoe female 42.4 210. 6000 
#> 6 Chinstrap Dream female 47.4 195. 4000 
#> 7 Chinstrap Dream female 41.0 208. 4000 
#> 8 Adelie Dream male 39.9 195. 4000 
#> 9 Chinstrap Dream female 47.4 198. 4000 
#> 10 Adelie Dream female 36.5 184. 3072.
#> 11 Adelie Torgersen male 40.8 199. 4000 
#> 12 Adelie Torgersen female 37.6 196. 4000 
#> 13 Adelie Torgersen female 37.3 170. 3582.
#> 14 Adelie Torgersen male 41.4 189. 3461.
#> 15 Adelie Torgersen female 37.7 182. 3400.
#> # i 1 more variable: bill_depth_mm <dbl>

Example 5

tidysynthesis can generate multiple replicates. This means that all input conditions are the same, but, due to random sampling, the syntheses themselves differ. The replicates() functionality allows for the creation of these replicates (see the documentation ?replicates() for different kinds of replicate specification).

replicates5 <- replicates(model_sample_replicates = 5)
 
presynth5 <- presynth(
 roadmap = rm |>
 add_replicates(replicates5),
 synth_spec = synth_spec2
)
 
 # synthesize!
 set.seed(1)
 suppressMessages(synth5 <- synthesize(presynth = presynth5))
 
 glimpse(synth5[[1]]$synthetic_data)
#> Rows: 15
#> Columns: 7
#> $ species <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Chinstrap, C...
#> $ island <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Dream, Dream...
#> $ sex <fct> male, male, female, male, female, female, female, ma...
#> $ bill_length_mm <dbl> 49.6, 50.0, 45.2, 59.6, 42.8, 46.6, 40.9, 39.2, 47.7...
#> $ flipper_length_mm <dbl> 220, 215, 222, 216, 218, 195, 185, 186, 190, 176, 17...
#> $ body_mass_g <dbl> 5350, 5550, 4700, 5500, 5000, 3550, 3700, 3800, 2900...
#> $ bill_depth_mm <dbl> 15.3, 15.9, 14.5, 15.9, 13.7, 18.0, 16.6, 17.2, 18.7...
 glimpse(synth5[[2]]$synthetic_data)
#> Rows: 15
#> Columns: 7
#> $ species <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Chinstrap, C...
#> $ island <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Dream, Dream...
#> $ sex <fct> male, male, female, male, female, female, female, ma...
#> $ bill_length_mm <dbl> 50.4, 51.4, 44.9, 51.9, 47.5, 46.5, 48.4, 37.6, 50.1...
#> $ flipper_length_mm <dbl> 228, 220, 208, 230, 209, 189, 198, 187, 202, 190, 18...
#> $ body_mass_g <dbl> 6000, 5550, 4850, 5700, 5200, 3200, 3500, 4500, 3700...
#> $ bill_depth_mm <dbl> 16.4, 15.7, 13.8, 15.3, 14.3, 19.4, 16.8, 19.5, 17.3...
 glimpse(synth5[[3]]$synthetic_data)
#> Rows: 15
#> Columns: 7
#> $ species <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Chinstrap, C...
#> $ island <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Dream, Dream...
#> $ sex <fct> male, male, female, male, female, female, female, ma...
#> $ bill_length_mm <dbl> 51.3, 48.5, 46.0, 50.0, 48.1, 45.2, 46.4, 41.4, 45.7...
#> $ flipper_length_mm <dbl> 220, 225, 210, 230, 220, 200, 192, 189, 200, 195, 17...
#> $ body_mass_g <dbl> 5800, 5050, 4950, 5600, 3950, 3000, 3500, 4100, 2700...
#> $ bill_depth_mm <dbl> 15.6, 16.3, 13.1, 17.0, 14.0, 18.9, 17.3, 17.5, 16.9...

Getting help

Please share bugs and feature requests on GitHub. Minimal reproducible examples are appreciated.

AltStyle によって変換されたページ (->オリジナル) /