Processing of feature data
We used the following Flatiron data tables for feature extraction, restricted to records before the cutoff date: ECOG, enhanced biomarkers, demographics, diagnosis code, visit code, telemedicine code, medication administration code, insurance, lab results, medication order, vitals, and practice.
Feature data fall into two broad categories. Static features, such as age, gender, and race, do not change over the observation period. Dynamic features (labs, medications, visits, vitals, diagnoses, etc.) accumulate over time and are collected up to the cutoff date. From the dynamic data we extracted a diverse set of meta-features. We first selected the 100 most frequent concept IDs in each of the Flatiron data tables listed above. For each selected concept ID, the last eight records were binarized (if not originally continuous), yielding 800 features per table, with 1 indicating the appearance of that concept ID at that time point and 0 otherwise. For concept IDs representing real-valued measurements, we additionally included the mean and standard deviation of all values recorded before the cutoff date, and used these statistics to normalize the initial 800 features of each table. We also recorded the time difference between each record and the previous one, and a binary indicator for each original feature (eight values per concept ID in each Flatiron data table) marking whether it came from a missing or an existing record. The resulting matrix was flattened into a single feature vector, concatenated with the static features, and used as input to LightGBM.
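The per-table extraction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the table schema (columns `concept_id`, `value`, `time`), the NaN padding for missing records, and the zero-fill of normalized values at missing positions are assumptions made for the sketch; the paper's settings correspond to `n_concepts=100` and `n_records=8`.

```python
import numpy as np
import pandas as pd

def extract_meta_features(table, n_concepts=100, n_records=8, cutoff=None):
    """Meta-feature extraction for one dynamic Flatiron-style table (sketch).

    `table` is assumed to have columns: concept_id, value, time.
    Rows at or after `cutoff` are discarded before extraction.
    """
    if cutoff is not None:
        table = table[table["time"] < cutoff]
    # Most frequent concept IDs in this table.
    top = table["concept_id"].value_counts().index[:n_concepts]
    feats = []
    for cid in top:
        sub = table[table["concept_id"] == cid].sort_values("time")
        vals = sub["value"].to_numpy(dtype=float)
        times = sub["time"].to_numpy(dtype=float)
        # Keep the last n_records points, left-padding absent ones with NaN.
        pad = max(0, n_records - len(vals))
        last_vals = np.concatenate([np.full(pad, np.nan), vals[-n_records:]])
        last_times = np.concatenate([np.full(pad, np.nan), times[-n_records:]])
        missing = np.isnan(last_vals).astype(float)   # 1 = missing record
        present = 1.0 - missing                       # binarized appearance
        # Mean / std over all pre-cutoff values of this concept ID.
        mean = float(np.mean(vals))
        std = float(np.std(vals))
        # Normalized values; missing positions are zero-filled here by assumption.
        normed = np.where(missing == 1, 0.0, (last_vals - mean) / (std + 1e-8))
        # Time difference between each record and the previous one.
        dt = np.nan_to_num(np.diff(last_times, prepend=last_times[0]))
        feats.append(np.concatenate([present, normed, [mean, std], dt, missing]))
    # Zero blocks if the table has fewer than n_concepts distinct IDs.
    width = 4 * n_records + 2
    while len(feats) < n_concepts:
        feats.append(np.zeros(width))
    return np.concatenate(feats)  # flattened vector for this table
```

The flattened vectors from all tables would then be concatenated with the static features and passed to a LightGBM model (e.g., `lightgbm.LGBMClassifier`) as a single fixed-length input.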