modeling_with_binary_classifiers

 library(spect)

Modeling time-to-event data

The purpose of the spect package is to provide a simple, flexible interface for modeling time-to-event data using a discrete-time approach that makes use of existing binary classification tools. Below is an instructive example using the Mayo clinic pbc data set from the "survival" package.

Note that the spect_train function requires the name of the column in the modeling data containing the event indicator variable (and the values in that column must be either zero or one) and the name of the column containing the time to the event for each individual. There are no positional constraints for these columns. All other columns will be treated as inputs to the model.

Below, we take a subset of the pbc input columns and calculate the event indicator and survival time in years.

 set.seed(42)
 
 data(pbc, package = "survival")
 
event_column <- "event_indicator"
time_column <- "survival_time"
 
pbc_complete <- pbc[complete.cases(pbc),]
source_data <- pbc_complete[,c(7:20)]
source_data[[event_column]] <- ifelse(pbc_complete$status == 2, 1, 0)
source_data[[time_column]] <- pbc_complete$time / 365.25
 
 head(source_data)
 #> ascites hepato spiders edema bili chol albumin copper alk.phos ast trig
 #> 1 1 1 1 1.0 14.5 261 2.60 156 1718.0 137.95 172
 #> 2 0 1 1 0.0 1.1 302 4.14 54 7394.8 113.52 88
 #> 3 0 0 0 0.5 1.4 176 3.48 210 516.0 96.10 55
 #> 4 0 1 1 0.5 1.8 244 2.54 64 6121.8 60.63 92
 #> 5 0 1 1 0.0 3.4 279 3.53 143 671.0 113.15 72
 #> 7 0 1 0 0.0 1.0 322 4.09 52 824.0 60.45 213
 #> platelet protime stage event_indicator survival_time
 #> 1 190 12.2 4 1 1.095140
 #> 2 221 10.6 3 0 12.320329
 #> 3 151 12.0 4 1 2.770705
 #> 4 183 10.3 4 1 5.270363
 #> 5 136 10.9 3 0 4.117728
 #> 7 204 9.7 3 0 5.015743

Next, save a subset of the data to demonstrate predictions later.

 
predict_data <- source_data[1:10,]
train_data <- source_data[11:nrow(source_data),]

Then, call the spect_train() function. Note that the cores parameter is not set and use_parallel is set to FALSE, specifically to avoid excessive stress on testing systems, however, it’s generally best to choose an appropriate number of cores or left at zero to allow spect to dynamically choose based on system availability.

result <- spect_train(model_algorithm = "rf"
 , base_learner_list = c("glm", "svmLinear")
 , use_parallel = FALSE
 , modeling_data = train_data
 , event_indicator_var = event_column
 , survival_time_var = time_column
 , obs_window = 12)
 #> INFO [2025年04月06日 20:25:33] Splitting test/train data at 0.200000/0.800000...
 #> INFO [2025年04月06日 20:25:33] Creating person-period data set...
 #> INFO [2025年04月06日 20:25:33] Creating caret train control...
 #> INFO [2025年04月06日 20:25:33] Training base models glm, svmLinear using repeatedcv method with 10 resamples and 3 kfold repeats
 #> Warning in train.default(x, y, weights = w, ...): The metric "ROC" was not in
 #> the result set. Accuracy will be used instead.
 #> Loading required package: ggplot2
 #> Loading required package: lattice
 #> Warning in train.default(x, y, weights = w, ...): The metric "ROC" was not in
 #> the result set. Accuracy will be used instead.
 #> maximum number of iterations reached 0.002668032 0.002649569maximum number of iterations reached 0.001200352 0.001196387maximum number of iterations reached 0.004128868 0.004085047maximum number of iterations reached 0.005894409 0.005808958maximum number of iterations reached -0.0001578453 -0.0001578894maximum number of iterations reached 0.002781525 0.002766878maximum number of iterations reached 0.00382699 0.003795592maximum number of iterations reached 0.00423685 0.004202033maximum number of iterations reached 0.005475957 0.005409323maximum number of iterations reached 0.002354984 0.002342749maximum number of iterations reached 0.00305497 0.003008607maximum number of iterations reached 0.003272896 0.003243133maximum number of iterations reached 0.003677112 0.003654026maximum number of iterations reached 0.004014497 0.003970751maximum number of iterations reached 0.004131751 0.004049032maximum number of iterations reached 0.003069128 0.003055374maximum number of iterations reached 0.001908117 0.001902267maximum number of iterations reached 0.004899541 0.004792288maximum number of iterations reached 0.002402241 0.002386853maximum number of iterations reached 0.00264648 0.002631756maximum number of iterations reached 0.001807357 0.001800219maximum number of iterations reached 0.002053572 0.002043815maximum number of iterations reached 0.002518704 0.002503169maximum number of iterations reached 0.002345695 0.002334784maximum number of iterations reached 0.0008497007 0.0008480762maximum number of iterations reached 0.001148698 0.001143859maximum number of iterations reached 0.005440482 0.005331076maximum number of iterations reached 0.004314559 0.004262836maximum number of iterations reached 0.0007320378 0.000731329maximum number of iterations reached 0.0006210769 0.0006201139maximum number of iterations reached 0.001264579 0.001260133INFO [2025年04月06日 20:25:39] Training the metalearner rf using Kappa metric...
 #> note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
 #> 
 #> INFO [2025年04月06日 20:25:53] Calculating probabilities on 53 test individuals...
 #> INFO [2025年04月06日 20:25:53] Transforming 53 rows of data

The result object can be passed to other functions in spect to help visualize and evaluate the model. First, take a look at the survival curve for a given individual - let’s say number fifteen. The absolute survival probability is plotted in blue, and the conditional probability of surviving the given time slice, given that the individual survived to the beginning of that time slice is shown in orange.

 
 plot_survival_curve(result, individual_id=40)

To see only the conditional probability, set the curve_type to "conditional." For only the absolute probability plot, set curve_type to "absolute."

 
 plot_survival_curve(result, individual_id=40, curve_type="conditional")

 
 plot_survival_curve(result, individual_id=40, curve_type="absolute")

Next, build a population-based survival curve (the so-called Kaplan-Meier curve). Note that in order to create a curve, we need to turn the individual survival curves above into actual predicted survival times. To do that, we must choose a probability threshold, then project that onto the time axis at the point of intersection with the curve. The search granularity allows us to do this for a number of choices of the probability threshold and visualize the resulting population curves.

 
km_data <- plot_km(result, prediction_threshold_search_granularity = 0.1)
 #> INFO [2025年04月06日 20:25:53] Calculating plot increments...
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.100000
 #> INFO [2025年04月06日 20:25:53] Found 10 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.00 for threhold 0.10
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.200000
 #> INFO [2025年04月06日 20:25:53] Found 16 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.00 for threhold 0.20
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.300000
 #> INFO [2025年04月06日 20:25:53] Found 19 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.03 for threhold 0.30
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.400000
 #> INFO [2025年04月06日 20:25:53] Found 23 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.10 for threhold 0.40
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.500000
 #> INFO [2025年04月06日 20:25:53] Found 26 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.09 for threhold 0.50
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.600000
 #> INFO [2025年04月06日 20:25:53] Found 30 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.10 for threhold 0.60
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.700000
 #> INFO [2025年04月06日 20:25:53] Found 36 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.25 for threhold 0.70
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.800000
 #> INFO [2025年04月06日 20:25:53] Found 40 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.39 for threhold 0.80
 #> INFO [2025年04月06日 20:25:53] collecting the minimum interval which crosses the prediction threshold: 0.900000
 #> INFO [2025年04月06日 20:25:53] Found 49 individuals...
 #> INFO [2025年04月06日 20:25:53] Generate P-value...
 #> INFO [2025年04月06日 20:25:53] Pval is 0.58 for threhold 0.90
 #> INFO [2025年04月06日 20:25:53] Initializing Kaplan-Meier curve plot...
 #> INFO [2025年04月06日 20:25:53] Collecting data for each threshold value...

It is also often useful to see the ROC curve, however, because the modeling data contains one row per individual per time bin, we must choose a particular time at which to generate an ROC curve. Note that multiple times may be chose - a curve is generated for times 2 and 6 below.

 
prediction_times = c(2, 6)
eval <- evaluate_model(result, prediction_times)
 #> INFO [2025年04月06日 20:25:59] Collecting evaluation metric data...

Finally - it’s relatively easy to calculate quantities of interest such as the unconditional probability of an individual surviving to a particular time. Cumulative probability is already captured by the spect_predict() function in the abs_event_prob column. Note that the conditional probability of surviving an interval, given that the individual survived to the beginning of that interval is also available.

 
predictions <- spect_predict(result, new_data=predict_data) 
 #> INFO [2025年04月06日 20:26:01] Transforming 10 rows of data
 
 # Collect the absolute probability of individual 1 surviving to time 6.
individual = 1
survival_time_check = 6
 
 tail(predictions[predictions$individual_id == individual & predictions$upper_bound 
 < survival_time_check,], n = 1)$abs_event_prob
 #> [1] 0.9801006