Figure 1. The machine learning and evaluation scheme of rwTTD
prediction. a. Calculation of future time in a censored population. b.
Simulation of rwTTD data capturing a variety of factors potentially
affecting performance. c-e. Three evaluation schemes used in the study:
absolute error, cumulative error, and absolute date error when 50% of
the population is terminated.
We developed three metrics to evaluate model performance
(Fig. 1c-e). For the first metric, “absolute error”, we
accumulated the values of the predicted curve and the gold standard
curve from day 0 to a specific date (1,000 days, unless otherwise
specified in this paper), and then divided the difference between the
two totals by the total number of days. Thus, if the predicted curve is
higher than the gold standard curve in the first half but lower in the
second half, the errors can cancel out under this metric. For the
second metric, “cumulative error”, we accumulated the absolute error on
each day from day 0 to a specific date, and then divided the total
error by the total number of days. Positive and negative errors
therefore aggregate rather than cancel. For the third metric, “absolute
date error at 50% terminated”, we calculated the absolute difference in
days between the gold standard curve and the predicted curve at the
point where 50% of the patients are terminated (i.e., where the
termination curve reaches 0.5 on the y-axis). Together, the three
metrics capture important aspects of drug administration.
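The three metrics can be sketched as follows, assuming the predicted and gold standard termination curves are arrays of daily fractions of patients still on treatment (function and variable names here are ours, not from the study):

```python
import numpy as np

def absolute_error(pred, gold, max_days=1000):
    # Difference between the two curves' accumulated values, averaged
    # over days; deviations of opposite sign can cancel out.
    diff = pred[:max_days] - gold[:max_days]
    return abs(diff.sum()) / max_days

def cumulative_error(pred, gold, max_days=1000):
    # Per-day absolute deviations accumulated, then averaged over days;
    # positive and negative errors aggregate instead of canceling.
    return np.abs(pred[:max_days] - gold[:max_days]).sum() / max_days

def date_error_at_half(pred, gold):
    # Absolute difference in days between the two curves' first crossing
    # of 0.5 (50% of the population terminated); assumes both curves
    # actually reach 0.5 within the observed window.
    d_pred = int(np.argmax(pred <= 0.5))
    d_gold = int(np.argmax(gold <= 0.5))
    return abs(d_pred - d_gold)
```

For example, a predicted curve that sits 0.1 above the gold standard for the first half of the window and 0.1 below it for the second half yields an absolute error near zero but a cumulative error near 0.1, which is exactly the cancellation behavior described above.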
Of note, when trained with a standard machine learning model, models
can only generate a prediction of each individual’s expected future
time in the test set. When we aggregate these predictions, the
resulting curve is closely centered on the average expected future time
and deviates substantially from the true distribution (Fig. 2a-c).
This stems from an innate property of most machine learning algorithms:
when minimizing squared error or a similar loss function, predictions
tend to concentrate around the mean.
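The mean-centering behavior follows directly from the loss: under squared error, the constant prediction that minimizes the loss is the sample mean, which can be checked numerically (a toy illustration, not data from the study):

```python
import numpy as np

# Skewed toy future times: most individuals are short, one is long.
y = np.array([1.0, 2.0, 10.0])

# Scan candidate constant predictions c and score each by mean squared error.
candidates = np.linspace(0.0, 12.0, 1201)
losses = np.array([((y - c) ** 2).mean() for c in candidates])
best = candidates[int(np.argmin(losses))]
# `best` lands at (approximately) y.mean(), not at any typical individual
# value, so aggregated predictions pile up around the average expected
# future time rather than tracing the true distribution.
```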
To counteract this effect, we further divided the training set into a
train set, from which the model parameters are derived, and a
validation set, from which the distribution of prediction values is
obtained. The prediction values from the validation set and the
corresponding future times are used as a reference to interpolate the
prediction results of the test set. In this study, we used first-order
interpolation, with extrapolation when test-set prediction values fall
outside the range of the validation set. Through this interpolation, we
generated a distribution resembling the observed future-time
distribution of the test set. To further illustrate the three metrics
used in this study, we plotted the percentage of error under either the
absolute error or the cumulative error for ExtraTreeRegressor across
different maximal numbers of days considered, as well as the absolute
date error when 50% of the population is terminated (Fig. 2d-e).
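The validation-based interpolation described above can be sketched as follows; `calibrate`, its arguments, and the extrapolation details are our own illustrative choices, assuming at least two distinct validation prediction values:

```python
import numpy as np

def calibrate(test_pred, val_pred, val_future):
    # Sort validation (prediction, observed future time) pairs so they
    # can serve as interpolation knots.
    order = np.argsort(val_pred)
    x = np.asarray(val_pred, dtype=float)[order]
    y = np.asarray(val_future, dtype=float)[order]
    t = np.asarray(test_pred, dtype=float)
    # First-order interpolation inside the validation range.
    out = np.interp(t, x, y)
    # Linear extrapolation for test predictions beyond that range,
    # using the slope of the nearest validation segment.
    lo = t < x[0]
    out[lo] = y[0] + (t[lo] - x[0]) * (y[1] - y[0]) / (x[1] - x[0])
    hi = t > x[-1]
    out[hi] = y[-1] + (t[hi] - x[-1]) * (y[-1] - y[-2]) / (x[-1] - x[-2])
    return out
```

Mapping raw model outputs through the validation set's observed future times in this way spreads the aggregated predictions back out toward the true distribution rather than leaving them clustered around the mean.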