When working with real-world data (RWD), key information, such as diagnosis dates, biomarker status, and therapies received, is available only as unstructured text in electronic health records (EHRs). Machine learning (ML) can be used to extract these unstructured data elements, but unique challenges emerge when the data produced with ML techniques are used for research purposes. Chief among them is how best to assess the validity of the extracted data and their generalizability to different cohorts of interest.
This framework covers the fundamentals of evaluating ML-extracted RWD, with the aim of maximizing the use of EHR data for research purposes.
Why this matters
Using machine learning to extract unstructured data elements from EHRs can unlock retrospective research at scale. This framework guides a transparent, multi-stakeholder evaluation that goes beyond standard machine learning metrics and focuses on RWD methodologic fundamentals and considerations, to help determine whether ML-extracted variables are fit for research use.