Identifying patients who are eligible for clinical trials is one of the fundamental challenges the cancer research community faces. Personal reasons may dissuade a patient from participating in research, but there are also many logistical barriers. A patient has to be identified at just the right time, when they are ready to start a therapy but have not yet begun one. That is hard when a practice may have dozens of trials open, each with a dozen or more inclusion/exclusion criteria, and hundreds of patients coming through the door each day.
This is where we think technology can help. However, people with software experience from other industries often assume the answer is one of two extremes: either a complex ensemble of ML approaches or a simple SQL statement comparing the patient data to the trial's requirements. Unfortunately, the data available is typically dirty, incomplete, unlabeled, and complex. There is inherently no 100% right answer, no matter how sophisticated your approach, because you're never given 100% of the data needed to solve the problem. The "right" answer is any approach that performs better than the current solution and provides visibility into its workings, and it could even be one you learned in CS 101.
When looking at a clinical trial, there are many different potential eligibility criteria: diagnosis, stage, age, prior therapies, and biomarkers are among the most common. As we've matched patients in the past, we've thought about this as a linear pathway:
1. Does the patient match on disease? If yes, go to step 2. If not, they are not eligible.
2. Does the patient have the correct biomarker mutations? If yes, go to step 3. If not, they are not eligible.
As we started to look at adding more criteria, and specifically biomarkers—measurable, biological indicators related to disease—we started to run into an issue: most trials were not so straightforward, and our current process for adding criteria wasn't going to hold up. Take the very simplified criteria for a real multi-tumor study (across several diseases) with a few different biomarkers (in this case, ER, PR and KRAS):
1. Is the patient over 18? If yes, go to step 2. If not, they are not eligible.
2. Does the patient have breast cancer? If yes, go to step 3. If not, go to step 4.
3. Is the patient ER- and PR-? If yes, they are eligible. If not, they are not.
4. Does the patient have colorectal cancer? If yes, go to step 5. If not, they are not eligible.
5. Does the patient have a KRAS mutation? If yes, they are eligible. If not, they are not.
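To make the branching concrete, here is a minimal Python sketch of those five steps written as hard-coded conditionals. The function name and the patient dictionary keys (`age`, `disease`, `ER`, `PR`, `KRAS`) are hypothetical, chosen only for illustration, and missing biomarker values are simply treated as non-matching.

```python
def is_eligible_stepwise(patient: dict) -> bool:
    """Walk the five steps in order. All field names are illustrative assumptions."""
    # Step 1: age gate applies to every patient.
    if patient["age"] <= 18:
        return False
    # Steps 2 and 3: breast cancer branch requires ER- and PR-.
    if patient["disease"] == "Breast":
        return patient.get("ER") == "Negative" and patient.get("PR") == "Negative"
    # Steps 4 and 5: colorectal branch requires a KRAS mutation.
    if patient["disease"] == "Colorectal":
        return patient.get("KRAS") == "Positive"
    # Any other disease is not covered by this trial.
    return False
```

Every new trial written this way needs its own hand-built control flow, which is the part that does not scale as criteria are added.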
As we started to run into this problem, we participated in the TOP Health sprint to improve our trial-to-patient matching capabilities. Through this, we had access to a mock dataset which had been hand-curated by the National Cancer Institute. In this dataset, they structured their trial criteria as boolean expressions.
The trial above, for example, might be represented as:
(Age > 18) AND ((ER = Negative AND PR = Negative AND Disease = Breast) OR (KRAS = Positive AND Disease = Colorectal))
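Written as a plain Python predicate, the grouping of the OR matters: `and` binds more tightly than `or`, so the age check only applies to both disease branches when the OR is wrapped in its own parentheses, matching the step list. The sketch below contrasts the two readings; the function names and patient keys are hypothetical.

```python
def matches_grouped(p: dict) -> bool:
    # Age applies to both branches: Age AND (Breast branch OR Colorectal branch).
    return p["age"] > 18 and (
        (p.get("ER") == "Negative" and p.get("PR") == "Negative" and p["disease"] == "Breast")
        or (p.get("KRAS") == "Positive" and p["disease"] == "Colorectal")
    )


def matches_ungrouped(p: dict) -> bool:
    # Without the extra grouping, age is only checked on the breast branch.
    return (
        p["age"] > 18
        and p.get("ER") == "Negative" and p.get("PR") == "Negative" and p["disease"] == "Breast"
        or p.get("KRAS") == "Positive" and p["disease"] == "Colorectal"
    )


# A 16-year-old colorectal patient with a KRAS mutation exposes the difference.
minor = {"age": 16, "disease": "Colorectal", "KRAS": "Positive"}
```

For `minor`, `matches_grouped` returns `False` while `matches_ungrouped` returns `True`, so the ungrouped reading would admit a patient the step list excludes.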
This led to a breakthrough: instead of storing each type of eligibility criterion in its own silo ("Here are the disease criteria, here are the biomarker criteria"), what if they could all be stored and evaluated in a common data structure? If we treat the boolean operators above as if they were mathematical operators, we can create a classic expression tree. (Internally, we've been calling this a decision tree, but strictly speaking a decision tree is a different structure.)
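One way such an expression tree could be sketched in Python is shown below. This is an illustration under stated assumptions, not the production implementation: the class names `Criterion`, `And`, and `Or`, the `evaluate` function, and the patient field names are all hypothetical, and missing data is treated as a non-match purely to keep the example short.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Tuple, Union


@dataclass(frozen=True)
class Criterion:
    """Leaf node: compares one patient field against a required value."""
    name: str
    op: Callable[[Any, Any], bool]
    value: Any


@dataclass(frozen=True)
class And:
    """Internal node: every child must match."""
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    """Internal node: at least one child must match."""
    children: Tuple["Node", ...]


Node = Union[Criterion, And, Or]


def evaluate(node: Node, patient: dict) -> bool:
    """Recursively evaluate the tree against one patient record."""
    if isinstance(node, Criterion):
        actual = patient.get(node.name)
        # Assumption: a missing value counts as "does not match".
        return actual is not None and bool(node.op(actual, node.value))
    if isinstance(node, And):
        return all(evaluate(child, patient) for child in node.children)
    if isinstance(node, Or):
        return any(evaluate(child, patient) for child in node.children)
    raise TypeError(f"Unknown node type: {type(node).__name__}")


# The multi-tumor example: Age AND (Breast branch OR Colorectal branch).
TRIAL = And((
    Criterion("age", operator.gt, 18),
    Or((
        And((
            Criterion("ER", operator.eq, "Negative"),
            Criterion("PR", operator.eq, "Negative"),
            Criterion("disease", operator.eq, "Breast"),
        )),
        And((
            Criterion("KRAS", operator.eq, "Positive"),
            Criterion("disease", operator.eq, "Colorectal"),
        )),
    )),
))
```

Calling `evaluate(TRIAL, patient)` walks the same structure for any trial, so adding a new criterion type means adding a leaf rather than rewriting control flow. Because the tree is plain data, it could also be stored alongside the trial and inspected node by node to see which criterion ruled a patient out.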
Taking the example above, we can translate that into the tree on the right below: