Figure 4. Performance of rwTTD prediction across heterogeneous populations. a. Performance at different test set termination rates when the training set termination rate is 0.0008. b. Performance with different numbers of training set examples when the number of test set examples is fixed at 5000. c. Performance at different test set noise levels when the training set noise level is 0.1. d. Performance at different test set feature scales when the training set feature scale is 1.
The other factors had little effect on performance. When the training and test sets were drawn from the same population, performance improved steadily as the number of training examples increased, whereas the number of test examples mainly affected the breadth of the performance (Fig. 4b, Fig. S8-9). The noise level on individual features did not affect overall performance on population-wise rwTTD (Fig. 4c, Fig. S10-11). We then altered the scaling factor of the features. This alteration produces feature values distributed at different scales and thus models record disparities across cohorts. As expected, when the training and test feature scales were similar, the model showed relatively low errors, and as the two distributions diverged, the percentage error increased. However, even when the training set feature scale was 1 and the test set feature scale was 1000, the overall population error remained moderate (0.13481 for both metrics) (Fig. 4d, Fig. S12-13). Together, these results indicate that the model performs stably across two distinct populations under variation in a range of factors.