A hybrid approach to scalable real-world data curation by machine learning and human experts

Summary

Retrospective analysis of real-world data (RWD) obtained from sources such as tumor registries, insurance claims, and electronic health records (EHR) can help answer important scientific and policy questions. EHR data is especially valuable because it offers a comprehensive, long-term view of a patient’s journey, but its structured fields are often incomplete. Trained abstractors have traditionally curated the unstructured data by hand, capturing important details from clinical notes or pathology reports to produce structured datasets for analysis.

However, manual curation is time-consuming and limits the size of datasets. Machine learning (ML) algorithms can process large amounts of unstructured data quickly and with high reliability, but they risk errors when a record is complex or requires clinical expertise to interpret.

To increase scale without compromising quality, researchers from Flatiron Health have outlined a hybrid curation method that combines manual abstraction by clinical experts with automated extraction by machine learning models.

Why this matters

Unstructured data is crucial to understanding patient treatment and outcomes in oncology, but extracting it at scale remains difficult. Although ML and human abstraction have traditionally been considered competing alternatives, this curation method offers a new paradigm in which the two are used together. Because the approach can be adapted to a variety of data and ML models, it is useful for a range of curation tasks. Hybrid curation has significant potential for RWD because it allows researchers to study larger patient cohorts and more data points while keeping the data reliable and fit for use in research and policy applications.
