Database Characterization Guide

An overview of Flatiron Health EHR-derived US data for journal reviewers and readers to better understand the data used in publications

About Flatiron Health US data

Flatiron Health collects longitudinal, patient-level, real-world data from electronic health records of patients receiving care at academic and community cancer centers across the US. The Flatiron Health network includes access to over 5M patient records from over 280 oncology practices at 800+ unique sites of care in the US.

Approximately 75% of Flatiron Health data come from community cancer centers, and 25% come from academic medical centers (relative community/academic proportions may vary depending on study cohort). All academic centers in the Flatiron Health network are National Cancer Institute (NCI)-designated comprehensive cancer centers.

Our network includes:

5M+ patients

4,200+ clinicians

280 oncology clinics

800+ sites of care


Data abstraction and extraction process

Flatiron Health data are derived from electronic health records and other sources (e.g., obituary data). They include both structured data (predefined data points, e.g., patient sex or birth date) and unstructured data (free text, e.g., clinician notes) [1-2].

Unstructured data are processed using both technology-enabled abstraction and artificial intelligence-based extraction methods, including natural language processing (NLP), machine learning (ML), and large language models (LLMs). Abstracted and extracted data are validated using Flatiron Health’s quality and performance assessment frameworks [3-7].


Patient privacy

Data are deidentified in accordance with the HIPAA Privacy Rule and may include patient demographics, tumor type, diagnosis date, cancer stage, treatment, and other characteristics. Deidentified data are subject to obligations to prevent reidentification and protect patient confidentiality. For example, patient cohorts of 5 or fewer patients are described as less than or equal to 5 (≤5) patients, and patients aged 85 years and older may have an adjusted birth year in the dataset or data reported as not available.
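
For readers who report results from such data, the two example rules above can be sketched in Python. This is a minimal illustration, not Flatiron Health's implementation: the function names and the `reference_year` parameter are hypothetical, and the birth-year function covers only the "reported as not available" option, because the text does not say how an adjusted birth year is computed.

```python
from typing import Optional

# Thresholds taken from the text above.
SMALL_COHORT_THRESHOLD = 5   # cohorts of 5 or fewer are reported as "≤5"
PROTECTED_AGE_YEARS = 85     # ages 85 and older receive extra protection


def format_cohort_size(n: int) -> str:
    """Return a display string for a cohort count, masking small cohorts."""
    if n < 0:
        raise ValueError("cohort size cannot be negative")
    if n <= SMALL_COHORT_THRESHOLD:
        return f"≤{SMALL_COHORT_THRESHOLD}"
    return str(n)


def reported_birth_year(birth_year: int, reference_year: int) -> Optional[int]:
    """Hypothetical helper: return None ("not available") when the
    approximate age at reference_year is 85 or older; otherwise return
    the birth year unchanged. The alternative of adjusting the birth
    year is not modeled, since the source does not describe it."""
    approximate_age = reference_year - birth_year
    if approximate_age >= PROTECTED_AGE_YEARS:
        return None
    return birth_year
```

For instance, a table cell for a subgroup of 3 patients would be written with `format_cohort_size(3)` rather than the raw count, so the exact number never appears in the published output.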


References

  1. Ma X, Long L, Moon S, Adamson BJS, Baxi SS. Comparison of population characteristics in real-world clinical oncology databases in the US: Flatiron Health, SEER, and NPCR. medRxiv. 2023. doi:10.1101/2020.03.16.20037143

  2. Zhang Q, Gossai A, Monroe S, Nussbaum NC, Parrinello CM. Validation analysis of a composite real-world mortality endpoint for patients with cancer in the United States. Health Serv Res. 2021;56(6):1281-1287. doi:10.1111/1475-6773.13669

  3. Birnbaum B, Nussbaum N, Seidl-Rathkopf K, et al. Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research. arXiv. 2020. doi:10.48550/arXiv.2001.09765

  4. Adamson B, Waskom M, Blarre A, et al. Approach to machine learning for extraction of real-world data variables from electronic health records. Front Pharmacol. 2023;14:1180962. doi:10.3389/fphar.2023.1180962

  5. Estevez M, Benedum CM, Jiang C, et al. Considerations for the use of machine learning extracted real-world data to support evidence generation: a research-centric evaluation framework. Cancers (Basel). 2022;14(13):3063. doi:10.3390/cancers14133063

  6. Castellanos EH, Wittmershaus BK, Chandwani S. Raising the bar for real-world data in oncology: approaches to quality across multiple dimensions. JCO Clin Cancer Inform. 2024;8:e2300046. doi:10.1200/CCI.23.00046

  7. Benedum CM, Sondhi A, Fidyk E, et al. Replication of real-world evidence in oncology using electronic health record data extracted by machine learning. Cancers (Basel). 2023;15(6):1853. doi:10.3390/cancers15061853