# Evaluating eligibility criteria of oncology trials using real-world data and AI

Apr 7, 2021

### Clinical trial curation

In this study, we focused on aNSCLC because it is a prevalent cancer type and has the largest number of patients in the Flatiron Health database. We systematically identified all of the aNSCLC trials that were available for our analysis. A total of 3,684 interventional clinical trials of NSCLC were retrieved from the ClinicalTrials.gov website of the National Library of Medicine (queried on 8 November 2019). Trials were then selected systematically using the following filters: (1) trials were interventional and had exactly two arms; (2) treatments consisted of drugs or biologicals only; (3) the drugs selected in each arm are recommended for aNSCLC as listed on the NIH website (https://www.cancer.gov/about-cancer/treatment/drugs/lung); (4) at least 250 patients who match the description of the patients in the trial were found in the Flatiron Health dataset for each arm; (5) the trial was conducted in phase III; and (6) protocols were available. The final list of selected aNSCLC trials included FLAURA (ref. 29), LUX8 (ref. 30), Checkmate017 (ref. 31), Checkmate057 (ref. 32), Checkmate078 (ref. 33), Keynote010 (ref. 34), Keynote189 (ref. 35), Keynote407 (ref. 36), BEYOND (ref. 37) and OAK (ref. 38). Detailed information on these trials can be found in Extended Data Table 1. To ensure the completeness of the trial criteria, we extracted all of the eligibility rules directly from the original trial protocol documents rather than from ClinicalTrials.gov, and the programmatic encoding of the criteria was verified by a team of experienced oncology data scientists and clinical trial specialists. Additional information about the encoding of the criteria is provided in the Supplementary Methods and Supplementary Discussion. Trial Pathfinder is a flexible framework that can be applied to other clinical trials.

### Flatiron Health dataset

The data that support the findings of this study were obtained from Flatiron Health, a nationwide EHR-derived de-identified database containing 219,312 patients with cancer with an average of 2.6 years of follow-up. The Flatiron data used in this study (the February 2020 data cut) come from a combination of EHR-derived data and external commercial and US Social Security Death Index data. The Flatiron Health database is considered one of the industry’s leading research databases in oncology owing to its rigorous data curation and abstraction processes, as well as publications that demonstrate efforts to validate outcomes. In previous validation studies that compared the Flatiron mortality data to data from the gold-standard National Death Index, the sensitivity of mortality capture in a population of patients with aNSCLC was shown to be 91%, and the effect of the remaining missing deaths on survival analyses was shown to be minimal (refs. 39, 40). In addition to curation accuracy, the Flatiron data are harmonized and aggregated across approximately 280 cancer clinics across the country, which makes the data more representative than the EHRs of a single healthcare centre. The majority of patients in the database originate from community oncology settings; relative community/academic proportions may vary depending on the study cohort. Data provided to investigators were de-identified and subject to obligations to prevent re-identification and to protect the confidentiality of the patients. These de-identified data may be made available upon request, and are subject to a licence agreement with Flatiron Health; interested researchers can contact DataAccess@flatiron.com to determine licensing terms. Institutional Review Board approval with a waiver of informed consent was obtained before the study was conducted.

Flatiron Health takes a comprehensive approach to data curation, which involves the collection of both structured and unstructured data from the EHRs. Structured data points, such as laboratory test results, are harmonized across different EHRs and mapped into common terminologies. Processing of unstructured data, such as clinician notes or biomarker reports, leverages technology-enabled abstraction. In this process, qualified abstractors extract key data points from unstructured documents, aided by software that organizes, searches and surfaces key documents. Flatiron’s network of abstractors includes certified tumour registrars, oncology nurses and oncology clinical researchers.

Patients in the Flatiron Health network were considered to be part of the aNSCLC real-world cohort if they were diagnosed with lung cancer (the ninth revision of the international classification of diseases (ICD-9) code 162.x; or the tenth revision of the international classification of diseases (ICD-10) code C34x or C39.9); had at least two documented clinical visits on or after 1 January 2011; had pathology consistent with NSCLC; and were diagnosed with stage IIIB, IIIC, IVA or IVB NSCLC on or after 1 January 2011, or diagnosed with early-stage NSCLC and subsequently developed recurrent or progressive disease on or after 1 January 2011. Patients were excluded if there was a lack of relevant unstructured documents in the Flatiron Health database for review by the abstraction team.
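The cohort definition above can be read as a conjunction of record-level checks. The following is a minimal sketch under stated assumptions: the record schema (`icd_codes`, `visit_dates`, `stage_at_diagnosis` and the other field names) is hypothetical and is not the Flatiron data model, and the regular expression is only one plausible reading of the "x" wildcard in the ICD codes.

```python
import re
from datetime import date

COHORT_START = date(2011, 1, 1)
ADVANCED_STAGES = {"IIIB", "IIIC", "IVA", "IVB"}
# ICD-9 162.x, or ICD-10 C34x / C39.9 (illustrative reading of the "x" wildcard).
LUNG_CODE = re.compile(r"162(\.\d+)?|C34(\.?\d+)?|C39\.9")


def in_ansclc_cohort(p: dict) -> bool:
    """Apply the cohort definition to one patient record (hypothetical schema)."""
    has_code = any(LUNG_CODE.fullmatch(c.strip().upper()) for c in p["icd_codes"])
    enough_visits = sum(v >= COHORT_START for v in p["visit_dates"]) >= 2
    # Route 1: advanced stage at diagnosis, on or after 1 January 2011.
    advanced_at_dx = (p["stage_at_diagnosis"] in ADVANCED_STAGES
                      and p["diagnosis_date"] >= COHORT_START)
    # Route 2: early stage at diagnosis, later recurrence or progression.
    prog = p.get("recurrence_or_progression_date")
    early_then_advanced = (p["stage_at_diagnosis"] not in ADVANCED_STAGES
                           and prog is not None and prog >= COHORT_START)
    return bool(has_code and enough_visits and p["nsclc_pathology"]
                and (advanced_at_dx or early_then_advanced)
                and p["has_unstructured_documents"])


# Made-up record used only to exercise the function.
example = {
    "icd_codes": ["C34.1"],
    "visit_dates": [date(2015, 3, 2), date(2015, 4, 1)],
    "nsclc_pathology": True,
    "stage_at_diagnosis": "IVA",
    "diagnosis_date": date(2015, 2, 20),
    "recurrence_or_progression_date": None,
    "has_unstructured_documents": True,
}
print(in_ansclc_cohort(example))
```

Keeping each condition as a named boolean makes it easy to report which part of the definition excluded a given patient.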

A catalogue of the criteria that could be emulated using the Flatiron Health database can be found in Supplementary Table 1. There are some criteria for which Flatiron Health does not currently abstract information from EHRs (for example, reproductive health, some prior co-morbidities, some previous treatments, imaging procedures and results), and these were not included in the present study. For those criteria that are available in the database, we also evaluated the percentage of missing ECOG and laboratory value information for each patient at the start of the first or second line of therapy (Supplementary Table 38). To closely mirror the actual trial screenings, we considered clinical measurements taken within a window from 30 days before to 7 days after the start of the line of therapy (ref. 40).
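The measurement window can be illustrated with a short sketch. The window bounds (30 days before to 7 days after the start of the line of therapy) come from the text; the function name, the input layout and the choice to take the measurement closest to the start date when several fall inside the window are assumptions made for illustration only.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple


def baseline_value(measurements: List[Tuple[date, float]],
                   line_start: date,
                   days_before: int = 30,
                   days_after: int = 7) -> Optional[float]:
    """Return one baseline value from the screening window, or None if missing.

    Assumption: when several measurements fall inside the window, the one
    closest to the start of the line of therapy is used.
    """
    lo = line_start - timedelta(days=days_before)
    hi = line_start + timedelta(days=days_after)
    in_window = [(abs((d - line_start).days), v)
                 for d, v in measurements if lo <= d <= hi]
    if not in_window:
        return None  # treated as missing for this patient
    return min(in_window, key=lambda pair: pair[0])[1]


# Made-up laboratory values for one patient.
labs = [(date(2020, 1, 1), 1.0), (date(2020, 2, 10), 2.0), (date(2020, 3, 1), 3.0)]
print(baseline_value(labs, date(2020, 2, 5)))
print(baseline_value(labs, date(2020, 4, 15)))
```

A `None` result corresponds to a missing baseline measurement, which matters downstream because patients with missing values are not filtered by the corresponding criterion.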

We further support our findings by analysing toxicity data for a real-world cohort of 1,000 patients with aNSCLC from the Flatiron database. These patients were randomly selected from the broader aNSCLC cohort based on receipt of anti-PD-1/PD-L1 therapy, and underwent additional data abstraction to determine the reasons for treatment discontinuation, including toxicity. In addition, we identified 22 Roche oncology trials with available clinical study reports, and extracted statistics from the study reports on the number of patients who withdrew from treatment owing to adverse events.

### The Trial Pathfinder workflow

In the first step of Trial Pathfinder—trial emulation—we identified individuals in the real-world dataset who met the available eligibility criteria as originally published in the clinical trial protocol. The eligibility criteria were encoded as logic statements and were automatically applied by our workflow. More information on how the semi-structured free-text criteria in the clinical trial protocols were encoded into programmatic statements is provided in the Supplementary Methods. Patients with missing data points (for example, ECOG or laboratory values) in the corresponding criteria were not filtered by those criteria. We then assigned the selected patients to the treatment groups that were consistent with their treatment records in the database (for example, atezolizumab versus docetaxel). To emulate the randomization and blind assignment in the trials, we used inverse probability of treatment weighting (IPTW) to adjust for baseline confounding factors. Time zero was set to be the start of the corresponding line of therapy. Finally, we performed survival analysis for the emulated trials using the hazard ratio of the overall survival as the outcome. Each individual was followed until the occurrence of death or censored at the latest reported activity. Outcomes that occur after 27 months in the Flatiron database are considered censored in our analysis to match the original trial settings. The results are robust to the specific window lengths discussed here (Supplementary Table 39). The Trial Pathfinder open source code was written in Python version 3.6.
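As a hedged illustration of the emulation step, the sketch below encodes criteria as predicates that pass a patient through when the relevant data point is missing, and censors outcomes after 27 months. It is not the Trial Pathfinder code: the helper names, the record fields and the thresholds (ECOG of at most 1, platelets of at least 100) are hypothetical examples.

```python
from typing import Callable, Dict, Optional, Tuple

Patient = Dict[str, Optional[float]]
Rule = Callable[[Patient], bool]


def make_rule(field: str, test: Callable[[float], bool]) -> Rule:
    """Build a criterion; a missing value (None) never filters the patient."""
    def rule(p: Patient) -> bool:
        value = p.get(field)
        return True if value is None else test(value)
    return rule


# Hypothetical criteria with made-up thresholds, for illustration only.
CRITERIA: Dict[str, Rule] = {
    "ecog_at_most_1": make_rule("ecog", lambda v: v <= 1),
    "platelets_at_least_100": make_rule("platelets", lambda v: v >= 100),
}


def eligible(p: Patient, criteria: Dict[str, Rule]) -> bool:
    """A patient is eligible if every encoded criterion is satisfied."""
    return all(rule(p) for rule in criteria.values())


def censor_at(time_months: float, died: bool,
              horizon: float = 27.0) -> Tuple[float, bool]:
    """Treat outcomes that occur after the horizon as censored at the horizon."""
    if time_months > horizon:
        return horizon, False
    return time_months, died


print(eligible({"ecog": 1, "platelets": None}, CRITERIA))
print(eligible({"ecog": 2, "platelets": 150}, CRITERIA))
print(censor_at(31.5, True))
```

Storing the criteria in a dictionary keyed by name makes it straightforward to apply any subset of them, which is what the later criterion-level analysis requires.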

### Trial Pathfinder trial emulation and survival analysis

To emulate the blind assignment and obtain unbiased estimates of treatment effects, we used IPTW to adjust for the baseline covariates. During the survival analysis, patient $i$ is given the weight defined in equation (1), in which $Z_i$ is the indicator variable representing whether patient $i$ is treated or not, with $Z_i = 1$ indicating a treated case. The propensity score $e_i$ is defined in equation (2), in which $X_i$ denotes the baseline covariates. We used a logistic regression model to estimate $e_i$. In our experiments of aNSCLC, the covariates $X$ were: age, gender, composite race or ethnicity, histology, smoking status, staging, ECOG and biomarker status, including ALK, EGFR, PDL1, ROS1, KRAS and BRAF. Adjustment by propensity score is effective in balancing all of the covariates between the synthetic treatment and control groups (Extended Data Fig. 3).

$${\omega}_{i}={Z}_{i}/{e}_{i}+(1-{Z}_{i})/(1-{e}_{i})$$

(1)

$${e}_{i}=\mathrm{Pr}({Z}_{i}=1|{X}_{i})$$

(2)
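Equations (1) and (2) can be made concrete with a small worked example. The sketch below fits a logistic regression by plain gradient ascent (an illustrative stand-in for whatever fitting routine was actually used) on a toy dataset with a single binary covariate, then computes the IPTW weights. The data, function names and optimizer settings are assumptions for illustration.

```python
import math
from typing import List, Sequence


def _sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(X: Sequence[Sequence[float]], z: Sequence[int],
                 lr: float = 0.5, n_iter: int = 5000) -> List[float]:
    """Fit Pr(Z=1|X) by gradient ascent on the mean log-likelihood.

    Returns [intercept, coefficient_1, ...]. Illustrative optimizer only.
    """
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(n_iter):
        grad = [0.0] * (d + 1)
        for xi, zi in zip(X, z):
            row = [1.0] + list(xi)
            p = _sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += (zi - p) * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity(beta: Sequence[float], xi: Sequence[float]) -> float:
    """Equation (2): e_i = Pr(Z_i = 1 | X_i) under the fitted model."""
    return _sigmoid(sum(b * v for b, v in zip(beta, [1.0] + list(xi))))


def iptw_weight(z: int, e: float) -> float:
    """Equation (1): w_i = Z_i / e_i + (1 - Z_i) / (1 - e_i)."""
    return z / e + (1 - z) / (1 - e)


# Toy data: the covariate is imbalanced between arms (6 of 8 treated when x=1,
# 2 of 8 treated when x=0).
X = [[1.0]] * 8 + [[0.0]] * 8
z = [1] * 6 + [0] * 2 + [1] * 2 + [0] * 6
beta = fit_logistic(X, z)
weights = [iptw_weight(zi, propensity(beta, xi)) for xi, zi in zip(X, z)]


def weighted_mean(arm: int) -> float:
    """Weighted mean of the covariate within one arm."""
    num = sum(w * xi[0] for xi, zi, w in zip(X, z, weights) if zi == arm)
    den = sum(w for zi, w in zip(z, weights) if zi == arm)
    return num / den


print(round(weighted_mean(1), 4), round(weighted_mean(0), 4))
```

Before weighting, the covariate mean is 0.75 in the treated arm and 0.25 in the control arm; after weighting, both arms have the same weighted mean, which is the balancing property that the propensity-score adjustment relies on.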

We further performed survival analysis on the emulated trials. For each patient, the index date or time zero, resembling the randomization point in a clinical trial, was chosen to be the start date of the line of therapy of that trial (either first or second). This choice of time zero ensures that there is no immortal time bias (ref. 41). Patients were followed until death; those without a death event were censored. The Cox proportional-hazards model was used to compute hazard ratios and confidence intervals of overall survival. Survival curves were estimated with the Kaplan–Meier method.
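To illustrate the survival-curve estimation, here is a minimal Kaplan–Meier sketch that accepts optional per-patient weights, so that IPTW weights can be plugged in. The function name and the toy follow-up data are assumptions; fitting the Cox model itself is left to a dedicated survival library.

```python
from typing import List, Optional, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[bool],
                 weights: Optional[Sequence[float]] = None
                 ) -> List[Tuple[float, float]]:
    """Return [(event_time, survival_probability), ...].

    `events[i]` is True for an observed death and False for censoring.
    With weights, at-risk and death counts become weighted sums.
    """
    if weights is None:
        weights = [1.0] * len(times)
    event_times = sorted({t for t, d in zip(times, events) if d})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(w for ti, w in zip(times, weights) if ti >= t)
        deaths = sum(w for ti, d, w in zip(times, events, weights)
                     if ti == t and d)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up follow-up times (months since the start of the line of therapy).
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

In this toy example the estimated survival drops to 0.8 at month 2, 0.6 at month 3 and 0.3 at month 5; the censored patients contribute to the at-risk counts until their last follow-up but never trigger a drop.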

### Eligibility criteria evaluation with Shapley values

To evaluate the influence of an individual criterion we used the Shapley value, which is the average expected marginal contribution of adding one criterion to the hazard ratio after all possible combinations of criteria have been considered. The Shapley value has recently been proposed in machine learning as a principled approach to quantify the contribution of individual features and data (ref. 28). The definition of the Shapley value of the ith criterion is given in equation (3), in which n is the total number of criteria and HR(S) indicates the hazard ratio computed when the criteria subset S is used to select patients. The sum in equation (3) is taken over all possible subsets S of the n original criteria (denoted as N for short) that did not contain i.

$$\mathrm{Shapley}\,\mathrm{value}\,\mathrm{of}\,\mathrm{the}\,i\mathrm{th}\,\mathrm{criterion}=\sum_{S\subseteq N\backslash \{i\}}(|S|!(n-|S|-1)!/n!)(\mathrm{HR}(S\cup \{i\})-\mathrm{HR}(S))$$

(3)

The Shapley value of the ith criterion is a weighted average of the effect of adding this criterion to different subsets of inclusion/exclusion criteria. The weights normalize for the number of possible sets that have the same cardinality and are required to satisfy the Shapley attribution properties.
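A small worked example may help make equation (3) concrete. The sketch below evaluates the sum exactly for three criteria using a made-up hazard-ratio function; in the actual workflow HR(S) would come from emulating the trial with criteria subset S. The criterion names and the numbers in `toy_hr` are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def exact_shapley(criteria: Sequence[str],
                  hr: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Evaluate equation (3) exactly by enumerating every subset S of N without i."""
    n = len(criteria)
    values = {}
    for i in criteria:
        others = [c for c in criteria if c != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (hr(s | {i}) - hr(s))
        values[i] = total
    return values


def toy_hr(s: FrozenSet[str]) -> float:
    """Made-up hazard ratio as a function of the applied criteria."""
    value = 0.90
    if "ecog" in s:
        value -= 0.05
    if "platelets" in s:
        value += 0.02
    if "ecog" in s and "platelets" in s:
        value -= 0.04  # interaction between the two criteria
    return value


criteria = ["ecog", "platelets", "age"]
shapley = exact_shapley(criteria, toy_hr)
for name in criteria:
    print(name, round(shapley[name], 4))
```

In this toy setting the interaction term is split evenly between the two interacting criteria, giving Shapley values of -0.07 for `ecog` and 0 for `platelets` and `age`. The values sum to HR(N) minus HR of the empty set, which is one of the attribution properties that the weights are designed to satisfy.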

Exhaustively computing the hazard ratios of overall survival for all possible subsets of criteria (order of n!) was computationally prohibitive. We therefore estimated the Shapley value by Monte Carlo sampling of criteria subsets S, which gives an unbiased estimate of the Shapley value. Following the previously proposed algorithm (ref. 42), we stopped sampling when the Shapley estimate had converged (that is, when the standard error of the Monte Carlo mean was less than 0.001). In practice, convergence happened after a hundred iterations for each criterion, so a few thousand Monte Carlo samples in total are sufficient for a trial with tens of criteria to evaluate. This makes Trial Pathfinder computationally efficient (Extended Data Fig. 4): it needs only around half an hour on a single CPU for one trial. For each trial, we averaged its results evaluating on a different criteria set from the trials in the same line of therapy (either first or second). A Shapley value larger than zero indicates that the contribution of that criterion is to increase the hazard ratio on average. Conversely, a negative Shapley value means that the contribution of that criterion is to decrease the hazard ratio on average. Finally, Shapley values that are close to zero correspond to a criterion that does not affect the hazard ratio.
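The Monte Carlo estimate with a standard-error stopping rule can be sketched as follows. The 0.001 threshold comes from the text; the sampling scheme (drawing the position of the criterion in a uniformly random ordering and taking its predecessors as S), the minimum sample count, the seed and the toy hazard-ratio function are assumptions, and the cited algorithm may differ in detail.

```python
import random
import statistics
from typing import Callable, FrozenSet, Sequence, Tuple


def mc_shapley(i: str, criteria: Sequence[str],
               hr: Callable[[FrozenSet[str]], float],
               tol: float = 0.001, min_samples: int = 30,
               max_samples: int = 100_000, seed: int = 0) -> Tuple[float, int]:
    """Estimate the Shapley value of criterion i; returns (estimate, n_samples)."""
    rng = random.Random(seed)
    others = [c for c in criteria if c != i]
    marginals = []
    while len(marginals) < max_samples:
        # In a uniformly random ordering of all criteria, the number of
        # criteria preceding i is uniform on 0..n-1 and the predecessors
        # form a uniformly random subset of that size.
        rng.shuffle(others)
        k = rng.randint(0, len(others))
        s = frozenset(others[:k])
        marginals.append(hr(s | {i}) - hr(s))
        if len(marginals) >= min_samples:
            std_err = statistics.stdev(marginals) / len(marginals) ** 0.5
            if std_err < tol:  # convergence rule described in the text
                break
    return statistics.mean(marginals), len(marginals)


def toy_hr(s: FrozenSet[str]) -> float:
    """Made-up hazard ratio; a real run would emulate the trial for each S."""
    value = 0.90
    if "ecog" in s:
        value -= 0.05
    if "platelets" in s:
        value += 0.02
    if "ecog" in s and "platelets" in s:
        value -= 0.04
    return value


criteria = ["ecog", "platelets", "age"]
estimate, n_used = mc_shapley("ecog", criteria, toy_hr)
print(round(estimate, 3), n_used)
```

For this toy function the exact Shapley value of `ecog` is -0.07, so the estimate should land close to that value. In a real run each call to `hr` is a full trial emulation, so caching hazard ratios per subset would be a natural optimization.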

Trial Pathfinder reports the subset of criteria used by the original trial that have a Shapley value smaller than 0 as data-driven criteria. Once the data-driven subset of criteria was selected, Trial Pathfinder computed the number of eligible patients and the hazard ratio of the overall survival between the synthetic treatment and control arms.

We stratified our 61,094 patients with aNSCLC from the Flatiron database by their geography of residence as in the US census: Northeast (n = 11,777), Midwest (n = 8,895), South (n = 23,895) and West (n = 9,061). We then evaluated the inclusion/exclusion criteria selected by Trial Pathfinder for each of the 10 aNSCLC trials for patients from each geographical region separately (Supplementary Tables 22–25). We also stratified our aNSCLC cohort by their insurance plan as an additional robustness analysis: commercial health plans (n = 22,423), Medicare (n = 10,841) and the remaining patients (n = 22,361). We evaluated our previously selected inclusion/exclusion criteria for each of the 10 aNSCLC trials for patients under the three types of insurance plans separately (Supplementary Tables 26–28).

We used the nationwide (US-based) de-identified Flatiron Health-Foundation Medicine aNSCLC clinicogenomic database (FH-FMI CGDB) for further validation (ref. 43). Genomic alterations were identified through comprehensive genomic profiling of more than 300 cancer-related genes on the next-generation sequencing-based FoundationOne panel of the FMI (ref. 44). Retrospective longitudinal clinical data were derived from EHR data from clinics in the Flatiron network, consisting of patient-level structured and unstructured data, curated by technology-enabled abstraction, and were linked to genomic data derived from comprehensive genomic profiling tests of the FMI in the FH-FMI CGDB by de-identified and deterministic matching (ref. 43). To leverage the rich genomics information of FH-FMI CGDB, we added to the covariate adjustment 17 additional genes that have alterations in at least 1,000 patients (Supplementary Table 31). For each of the 10 aNSCLC trials, we applied the inclusion/exclusion criteria that Trial Pathfinder selected on the Flatiron data and used them to emulate a trial using the FH-FMI CGDB cohort (Supplementary Table 30). Progression was used as the end point, and progression-free survival hazard ratios were computed.

### Statistical analysis

We bootstrapped the cohorts to estimate the standard deviations for the Shapley values. The confidence intervals for the hazard ratios were estimated from the variance matrix of the coefficients in fitting the Cox proportional-hazards model. For the safety impact analysis on 22 Roche oncology trials, we used two-sided P values from Fisher’s exact tests to measure the difference in the withdrawal ratio given two sets of trials (Supplementary Table 35). When analysing toxicity data, we used two-sided P values from Student’s t-tests to evaluate whether there is a significant difference in the baseline laboratory values between two toxicity groups (Extended Data Fig. 6).
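Two of these calculations are short enough to sketch directly. The first function computes a two-sided Fisher's exact P value for a 2×2 table by summing the probabilities of all tables that are no more likely than the observed one (a common convention, assumed here). The second turns a Cox coefficient and its standard error into a hazard ratio with a Wald-type 95% interval, which is one standard way of using the coefficient variance and is likewise an assumption. All inputs are made-up numbers.

```python
from math import comb, exp, log
from typing import Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


def hazard_ratio_ci(beta: float, std_err: float,
                    z: float = 1.959964) -> Tuple[float, float, float]:
    """Hazard ratio exp(beta) with a Wald interval exp(beta -/+ z * std_err)."""
    return exp(beta), exp(beta - z * std_err), exp(beta + z * std_err)


# Made-up counts: rows are two groups, columns are withdrew / did not withdraw.
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))
hr, lower, upper = hazard_ratio_ci(log(0.8), 0.1)
print(round(hr, 3), round(lower, 3), round(upper, 3))
```

For the toy table the P value is 34/70, about 0.486, reflecting how little evidence eight observations provide.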

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.