A hybrid approach to scalable real-world data curation by machine learning and human experts

March 8, 2023


Retrospective analysis of real-world data (RWD) obtained from sources such as tumor registries, insurance claims, and electronic health records (EHR) can help answer important scientific and policy questions. EHR data is especially valuable as it offers a comprehensive and long-term view of a patient’s journey, but structured data is often incomplete. Manual curation of unstructured data by trained abstractors has traditionally been used to capture important details from clinical notes or pathology reports to produce structured datasets for analysis. 

However, manual curation is time-consuming and limits the size of datasets. Machine learning (ML) algorithms can process large amounts of unstructured data quickly and with high reliability, but there is a risk of errors if the record is complex or requires clinical expertise.

To increase scale without compromising quality, researchers from Flatiron Health have outlined a hybrid curation method that combines both manual abstraction by clinical experts and automated extraction by machine learning models.
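
As a rough illustration only, one way a hybrid workflow could be wired together is to accept ML extractions automatically when the model is confident and to send the rest to clinical abstractors. The sketch below assumes this confidence-based routing; the summary above does not specify how the researchers divide work between models and experts, and the names `Prediction`, `route`, the record IDs, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    record_id: str     # hypothetical identifier for a patient document
    label: str         # value extracted by the ML model
    confidence: float  # model's estimated probability for `label`, in [0, 1]


def route(predictions, threshold=0.9):
    """Split predictions into an auto-accept list and a human-review list.

    Assumption: confidence-based routing. Predictions at or above
    `threshold` are accepted as-is; the rest go to expert abstractors.
    """
    auto, review = [], []
    for p in predictions:
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


# Made-up example data, for illustration only.
preds = [
    Prediction("rec-1", "metastatic", 0.98),
    Prediction("rec-2", "non-metastatic", 0.62),
    Prediction("rec-3", "metastatic", 0.91),
]
auto, review = route(preds, threshold=0.9)
print("auto-accepted:", [p.record_id for p in auto])
print("sent to abstractors:", [p.record_id for p in review])
```

In a setup like this, the threshold would control the trade-off between scale and quality: a higher value sends more records to human experts, while a lower value relies more heavily on the model.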

Why this matters

Unstructured data is crucial to understanding patient treatment and outcomes in oncology, but extracting it at scale remains difficult. Although ML and human abstraction have traditionally been treated as competing alternatives, hybrid curation offers a paradigm in which the two are used together. The approach can be adapted to a variety of data and ML models, making it useful for a range of curation tasks. Hybrid curation has significant potential for RWD because it allows researchers to study larger patient cohorts and more data points while keeping the data reliable and fit for use in research and policy applications.

Read the research