Every day, almost 5,000 people around the country will hear the dreaded news that they have cancer. Each diagnosis kicks off a series of hard conversations between doctors and their patients. Many of these conversations center on the likely prognosis for that particular patient. For example, an oncologist may explain to a patient that they have a choice between two treatments: one is more aggressive, but may allow the patient to live a few months longer; the other has less risk of miserable side effects but might contribute fewer weeks or months of life. This is the sort of context that helps patients and their caretakers determine what treatment course will be best for them.

Information that allows the doctor to provide those prognoses to the patient in front of them is computed by statisticians and quantitative scientists like me. Such information truly matters for patients’ lives, so it’s important that we get it right and base these calculations on complete, reliable and representative data. Collecting patient data, de-identifying it, and analyzing it may sound easy, but in practice it is hard. For such analyses, we need a large amount of patient data, including how each patient’s disease was treated and what the survival “outcomes” were. Importantly, to give doctors the information needed to make a prognosis, we need to know whether patients have passed away after a particular treatment and when that happened.

Such data is often collected as a part of clinical trials, following a defined protocol to ensure validity of the trial results. Patients are monitored across time, and events are observed including if and when they pass away. The resulting datasets are accurate because the fairly small number of patients in the trial are monitored quite closely.

However, clinical trial datasets can be fairly artificial. The patients included must meet constrained eligibility criteria, so not all people with a disease are represented; for example, people over age 70 or those with coexisting cancer and heart disease are usually missing. Ideally, the information the doctor has for the conversation with a new cancer patient should be based on the experiences of people similar to the patient sitting in front of them, and not just the restricted group of individuals who participated in clinical trials.

This can be solved with “real-world data.” Real-world datasets are generated by accumulating information that is normally collected as a routine part of clinical care, such as via the electronic health record or an insurance company’s reimbursement claims dataset. This is great in that it is more representative of the patients seen in everyday practice, including the new cancer patient sitting in front of the oncologist. However, data from “real-world” cancer patients may not be as complete as datasets in clinical trials.

Remember that to create the analysis for the doctor using real-world data, we need to know which patients have died, and importantly, when those patients died. This seems simple. All oncologists’ records should contain the date of death for patients who passed away, right?

Not necessarily. Often, the oncologist doesn’t know that their patient has passed away. They might have clues: maybe the patient was transferred to hospice, or moved away to be closer to family, or just stopped coming to appointments and can no longer be reached. In these cases, the patient may or may not have died, and if they did die, we don’t know the date.

This “missing data” problem led us on a year-long quest to fill in the gaps. We looked for supplemental sources of death data to combine with what was captured in the electronic health record data. Ultimately, we created a composite dataset by putting together mortality data from three sources: the electronic health records of oncology practices within our network, the Social Security Death Index and a commercial source. This composite dataset can then be used to conduct analyses that empower doctors with the information they need to help patients make important treatment decisions.
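
To make the idea of a composite concrete, here is a minimal sketch of how death dates from several sources might be merged into a single value per patient. It is an illustration only: the source labels, the priority order and the toy patients are all assumptions made for this example, and the rules we actually used for linking records and resolving conflicts are not described here.

```python
from datetime import date
from typing import Dict, Optional

# Hypothetical priority order, chosen only for illustration.
# It does not describe how conflicts between sources were actually resolved.
SOURCE_PRIORITY = ("ehr", "ssdi", "commercial")


def composite_death_date(records: Dict[str, Optional[date]]) -> Optional[date]:
    """Return the death date from the highest-priority source that has one."""
    for source in SOURCE_PRIORITY:
        found = records.get(source)
        if found is not None:
            return found
    # No source recorded a death for this patient.
    return None


# Toy, made-up patients: None means the source has no death record.
patients = {
    "p1": {"ehr": date(2016, 3, 4), "ssdi": None, "commercial": date(2016, 3, 5)},
    "p2": {"ehr": None, "ssdi": None, "commercial": date(2015, 11, 20)},
    "p3": {"ehr": None, "ssdi": None, "commercial": None},
}

composite = {pid: composite_death_date(rec) for pid, rec in patients.items()}
```

In this toy example, patient p2 has no death recorded in the health record, but the composite still picks up the date from another source. Patient p3 remains unknown, which is exactly the gap a composite cannot close on its own.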

But even after all of this, we still didn’t know how complete our composite dataset was. Did it now capture all the deaths that occurred? Most? Some? To find out, we needed to compare our dataset to the best available U.S. death data, the “gold standard”: the National Death Index (NDI). While we could not include the NDI data in our composite dataset due to usage restrictions, we were able to ask the U.S. Centers for Disease Control and Prevention (CDC), which maintains the NDI, for access to the data in order to make this validation exercise possible. We described our research to the CDC, explained what information we needed from them to assess the quality of the composite dataset we had created, and described how this would support cancer research. Finally, our inquiry was approved and the data arrived, allowing us to actually test the quality of our combined dataset.

I took this photo of a few of the talented team members who worked on this project with me. We joined a tech company, but here we are, celebrating the arrival of the NDI data on a CD (!!) that we got in the mail (!!!). Next step: figuring out whether anyone had an external CD drive so that we could actually access the data! [From left: Ben Holzman, Charlotte Rocker and Rachael Sorg]

The paper describing this process and the quality of the resulting data was published today in Health Services Research. Ultimately, we found that the completeness of the mortality information in our composite dataset ranged from 85 percent to 91 percent in the four cancer types that we studied, with high agreement on the date of death as compared to the NDI gold standard. This means that analyses conducted on this dataset can be expected to be highly accurate.
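
For readers who like to see the arithmetic, the sketch below shows one way that completeness and date agreement against a gold standard could be computed. The function names, the optional day window and the five toy patients are assumptions made for this illustration; they are not the definitions or the data from the paper.

```python
from datetime import date
from typing import Dict, Optional

DeathDates = Dict[str, Optional[date]]


def completeness(composite: DeathDates, gold: DeathDates) -> float:
    """Fraction of gold-standard deaths that the composite dataset also records."""
    gold_deaths = [pid for pid, d in gold.items() if d is not None]
    if not gold_deaths:
        raise ValueError("gold standard contains no deaths")
    captured = [pid for pid in gold_deaths if composite.get(pid) is not None]
    return len(captured) / len(gold_deaths)


def date_agreement(composite: DeathDates, gold: DeathDates, window_days: int = 0) -> float:
    """Among deaths found in both datasets, fraction whose dates differ by at most window_days."""
    both = [
        pid for pid, d in gold.items()
        if d is not None and composite.get(pid) is not None
    ]
    if not both:
        raise ValueError("no deaths are recorded in both datasets")
    agree = [
        pid for pid in both
        if abs((composite[pid] - gold[pid]).days) <= window_days
    ]
    return len(agree) / len(both)


# Toy, made-up data: None means no death is recorded for that patient.
gold = {
    "a": date(2016, 1, 10), "b": date(2016, 2, 1),
    "c": date(2016, 5, 15), "d": date(2016, 7, 30), "e": None,
}
composite = {
    "a": date(2016, 1, 10), "b": date(2016, 2, 3),
    "c": None, "d": date(2016, 7, 30), "e": None,
}

print(completeness(composite, gold))                   # 3 of the 4 gold-standard deaths are captured
print(date_agreement(composite, gold))                 # exact-date agreement among matched deaths
print(date_agreement(composite, gold, window_days=3))  # agreement within 3 days
```

In this toy data, one death is missed entirely (patient c) and one is captured with a date that is two days off (patient b), so the two measures answer different questions: how many deaths we know about, and how well we know when they happened.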

The high-quality nature of this dataset marks the first step toward oncologists being able to learn from every patient, not just the small number of individuals studied in clinical trials. Our goal is that when a doctor has to give a patient the dreaded news “you have xx many months to live,” it is based on the best, most reliable data available from patients just like themselves. Today, we’re proud to be one step closer to making that goal a reality.


Note: Institutional Review Board (IRB) approval of the study protocol was obtained. Informed consent was waived by the IRB as this was a noninterventional study using routinely collected data. This study was made possible by the National Center for Health Statistics, a division of the U.S. Centers for Disease Control and Prevention, which oversees the NDI. Flatiron Health standard methodology for data security and patient privacy was implemented for this work.

Senior Director, Quantitative Sciences