create_synthetic_data

 library(spect)
 #> Loading required package: futile.logger
 #> Loading required package: dplyr
 #> 
 #> Attaching package: 'dplyr'
 #> The following objects are masked from 'package:stats':
 #> 
 #> filter, lag
 #> The following objects are masked from 'package:base':
 #> 
 #> intersect, setdiff, setequal, union

It can be useful to have a data set with a known distribution for testing modeling approaches. It’s also useful to be able to clearly conceptualize that data set. spect can generate synthetic time-to-event data for this purpose without relying on a potentially unknown external data set.

Creating synthetic data

The create_synthetic_data() function produces a single, relational data set in which each row represents a fictional subscriber to a theoretical streaming service. spect can then be used to model the time to cancellation of the service. If no parameters are passed, all defaults are invoked. The resulting data set contains two modeling variables:

- incomes - subscriber income, drawn from a normal distribution
- watchtimes - hours watched, drawn from a uniform distribution

It also contains the following columns:

- total_months - the observed number of months until cancellation or censoring
- cancel_event_detected - 1 if the cancellation event was observed within the study window, 0 if the observation was censored
- baseline_time_to_cancel - the time to cancellation implied by the generating formula
- perturbed_baseline - baseline_time_to_cancel with random noise applied

 
 set.seed(42)
 
data <- create_synthetic_data()
 #> INFO [2025-04-06 20:25:29] Creating 250 income samples from normal distribution of median 50000, variance 10000 
 #> and watchtimes samples from uniform distribution with min: 0 and max: 6
 head(data)
 #> incomes watchtimes total_months cancel_event_detected
 #> 1 63709.58 0.8190312 20.29985 1
 #> 2 44353.02 1.0628185 22.69428 1
 #> 3 53631.28 3.1173627 30.35482 1
 #> 4 56328.63 4.8667247 44.05215 1
 #> 5 54042.68 0.6921721 21.07483 1
 #> 6 48938.75 5.3605307 48.00000 0
 #> baseline_time_to_cancel perturbed_baseline
 #> 1 20.29985 20.29985
 #> 2 22.69428 22.69428
 #> 3 30.35482 30.35482
 #> 4 44.05215 44.05215
 #> 5 21.07483 21.07483
 #> 6 49.84141 49.84141
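As a quick sanity check on the default output (a sketch, assuming the columns shown above), the summary statistics should reflect the generating parameters reported in the log message:

```r
# Incomes should center near the logged median of 50000, and
# watchtimes should fall within the logged range of [0, 6].
summary(data$incomes)
range(data$watchtimes)

# Proportion of subscribers whose cancellation was actually observed
# (cancel_event_detected == 1) rather than censored at the study boundary.
mean(data$cancel_event_detected)
```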

Modifying the distribution

Since a distribution that exactly follows a formula may not be adequate for testing a model, optional parameters are provided to perturb the cancellation-event distribution in a structured way. In particular, the user can specify the minimum, median, and variance of the income distribution, and the minimum and maximum watchtimes.

Additionally, it’s possible to censor a given percentage of observations by a random amount within a specified minimum and maximum, and to adjust the length of the study (i.e., the maximum total months). Finally, the perturbation_shift argument adds random noise to the total_months column, which helps prevent the model from overfitting immediately.

 
data <- create_synthetic_data(sample_size = 500,
                              minimum_income = 10000,
                              median_income = 40000,
                              income_variance = 10000,
                              min_watchhours = 2,
                              max_watchhours = 10,
                              censor_percentage = .2,
                              min_censor_amount = 3,
                              max_censor_amount = 3,
                              study_time_in_months = 60,
                              perturbation_shift = 5)
 #> INFO [2025-04-06 20:25:29] Creating 500 income samples from normal distribution of median 40000, variance 10000 
 #> and watchtimes samples from uniform distribution with min: 2 and max: 10
 
 head(data)
 #> incomes watchtimes total_months cancel_event_detected
 #> 1 50291.41 9.919725 60.00000 0
 #> 2 49147.75 5.507949 55.75277 1
 #> 3 39975.44 7.599226 60.00000 0
 #> 4 41360.10 9.112616 60.00000 0
 #> 5 32798.46 8.673276 60.00000 0
 #> 6 38018.76 7.875372 60.00000 0
 #> baseline_time_to_cancel perturbed_baseline
 #> 1 119.37180 114.97561
 #> 2 51.42273 55.75277
 #> 3 79.75069 78.24010
 #> 4 104.90375 104.02173
 #> 5 97.94587 102.55733
 #> 6 84.21960 84.53033
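The censoring parameters can be checked empirically (a sketch, assuming the column meanings described above): with censor_percentage = .2, roughly a fifth of the rows should be censored, and no observation should run past the study window.

```r
# Count observed (1) versus censored (0) cancellation events;
# roughly 20% of the 500 rows should be censored.
table(data$cancel_event_detected)

# No row should exceed study_time_in_months = 60.
max(data$total_months)
```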

It may also be useful to visualize the data distributions. The plot_synthetic_data() function handles this straightforwardly. Here, it’s easy to see the impact of the perturbation and censorship by comparing the "cancel_months" graph to the "final_cancel_months" graph. Also, note that incomes are roughly normally distributed, while watchtimes are roughly uniformly distributed.

 plot_synthetic_data(data)
