Considerations for the use of machine learning extracted real-world data to support evidence generation: A research-centric evaluation framework

June 22, 2022

Our summary

When working with real-world data (RWD), key information such as diagnosis dates, biomarker status, and therapies received is available only as unstructured text in electronic health records (EHRs). Machine learning (ML) can be used to extract these unstructured data elements, but using ML-produced data for research raises unique challenges: specifically, how best to assess the validity of the extracted data and its generalizability to different cohorts of interest.

This framework covers the fundamentals of evaluating RWD produced using ML methods, with the aim of maximizing the use of EHR data for research purposes.
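As a rough illustration of what assessing validity across cohorts can look like in practice, the sketch below compares an ML-extracted binary variable against a reference label (for example, one produced by manual chart abstraction) and reports sensitivity, positive predictive value (PPV), specificity, and negative predictive value (NPV) per subgroup. It is not taken from the framework itself: the field names (`cohort`, `ml_extracted`, `abstracted`), the function names, and the toy records are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def metrics(pairs):
    """Compute agreement metrics for (ml_value, reference_value) boolean pairs."""
    tp = sum(1 for ml, ref in pairs if ml and ref)
    fp = sum(1 for ml, ref in pairs if ml and not ref)
    fn = sum(1 for ml, ref in pairs if not ml and ref)
    tn = sum(1 for ml, ref in pairs if not ml and not ref)
    return {
        "n": len(pairs),
        "sensitivity": safe_ratio(tp, tp + fn),
        "ppv": safe_ratio(tp, tp + fp),
        "specificity": safe_ratio(tn, tn + fp),
        "npv": safe_ratio(tn, tn + fn),
    }


def metrics_by_subgroup(records, subgroup_key):
    """Group records by a subgroup field and compute metrics within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[subgroup_key]].append(
            (record["ml_extracted"], record["abstracted"])
        )
    return {name: metrics(pairs) for name, pairs in groups.items()}


# Hypothetical toy data: each record pairs an ML-extracted value with a
# reference value for the same patient. Not real patient data.
records = [
    {"cohort": "A", "ml_extracted": True, "abstracted": True},
    {"cohort": "A", "ml_extracted": True, "abstracted": True},
    {"cohort": "A", "ml_extracted": False, "abstracted": True},
    {"cohort": "A", "ml_extracted": False, "abstracted": False},
    {"cohort": "B", "ml_extracted": True, "abstracted": True},
    {"cohort": "B", "ml_extracted": True, "abstracted": False},
    {"cohort": "B", "ml_extracted": False, "abstracted": False},
    {"cohort": "B", "ml_extracted": False, "abstracted": False},
]

if __name__ == "__main__":
    for cohort, result in sorted(metrics_by_subgroup(records, "cohort").items()):
        print(cohort, result)
```

Reporting the metrics per subgroup rather than only overall is one simple way to see whether extraction performance holds up in the specific cohort a study will use; a real evaluation would also need to consider sample size, uncertainty, and the impact of errors on downstream analyses.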

Why this matters

Using ML to extract unstructured data elements from EHRs can unlock retrospective research at scale. This framework guides a multi-stakeholder evaluation that is transparent, goes beyond standard ML metrics, and focuses on RWD methodological fundamentals, to help determine whether ML-extracted variables are fit for research use.
