Tools for Reproducible Real-World Data Analysis

By Blythe Adamson, PhD, MPH | October 29, 2018


The oncologist struggled to find the right words. The scientific publication upon which she based her most recent treatment recommendation for the patient sitting in front of her had just been retracted from a prestigious journal. She reflected on a lengthy discussion with this patient six months prior, weighing the trade-offs between treatment options. Balancing the evidence of efficacy, value of hope, and impact on quality of life was difficult enough when based on accurate and reliable research. The retracted comparative-effectiveness study that had once embodied so much promise now brought bitterness and confusion.

The cost of bad clinical research often extends beyond these intimate conversations to the broader scientific field. Scientific advances are almost universally incremental: they build upon the foundation laid by the previous generation. If that foundation turns out to be unstable, entire research areas that were built on top of it can crumble.

For centuries, the responsibility for identifying mistakes in scientific research has fallen largely on the shoulders of peer reviewers, who are challenged to critically evaluate the integrity and accuracy of a manuscript. Peer reviewers can be ‘generous’ to the authors by giving them the benefit of the doubt and assuming the black box of methods described is full of the rigorous tools we expect. Unfortunately, manuscripts often lack the detailed methods, analysis code, and/or raw data necessary to critically check computationally intensive research. As fields like health economics and outcomes research embrace the enormous potential of "big data" and become increasingly reliant on modern scientific computing tools to answer important research questions, the gap between what is included in a written manuscript and what is needed to critically evaluate the research grows.

How do we know if the results of studies are accurate?

The first step is simple: reproducibility. But how do you define 'reproducible'? Does it simply mean that other people in your organization can run your analysis code on their machines? Or, if a stranger read one of your publications and was handed the raw data, should they arrive at exactly the same answer when they recreate the analysis? Years from now, when I want to update an old analysis with new data, will I be able to dust off my old code, understand it, and run the analysis again?

There are two main reasons why we need to ensure research is reproducible. First, we must show evidence that methods and results are accurate, which improves transparency and reduces uncertainty for decision-makers and peer reviewers. Second, we must enable others to make use of and build on the methods and results, which is needed to accelerate the development of new medicines.

Reproducibility is correlated with better science, but it is no guarantee. Recent discussions of the book "Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions" by NPR science correspondent Richard Harris sparked waves of realization and plans for reform in the research community. Discussions in the media and in the scientific literature have recently emphasized the importance of reproducible research, including a special issue of the journal Science.

The need for more reproducible tools in health economics and outcomes research is growing rapidly as analyses of real-world data become more frequent, involve larger datasets, and employ more complex computations. Flatiron data scientists demand and support the curation of high-quality data, aligning with regulatory agencies, health technology assessment authorities, clinicians, patients, and healthcare payers around the world who demand high-quality real-world evidence to make decisions.

Transforming messy data into meaningful evidence often requires teams of researchers from different disciplines working together with clear communication, documentation, and organized code. This is something Flatiron, and especially Flatiron's data scientists, cares deeply about.

Although such skills are commonplace in computer science programs, graduate training programs in biostatistics, health economics, and epidemiology often miss the opportunity to teach students how to structure and organize code for data analysis. Software engineers have developed mature solutions for building robust and reproducible analytic software, yet these are rarely mentioned in educational programs (though this is starting to change). Excellent tools for publishing and sharing reproducible documents are commonplace in data science organizations at technology companies, yet they are rarely used in academic research. Adopting these methods across the scientific research space and developing best practices for real-world data scientists is crucial for the next generation of reproducible research using real-world data.
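To give a small taste of what this looks like in practice, here is a hypothetical sketch (the column names and the analysis step are invented for illustration, not Flatiron code): an analysis step written as a plain, documented function can be checked by an automated test, the same way software engineers test their code.

```python
# Hypothetical example (invented column names): an analysis step written as a
# small, documented function so that it can be tested automatically.
import pandas as pd


def median_days_to_treatment(cohort: pd.DataFrame) -> float:
    """Median days from diagnosis to first treatment; untreated patients are excluded."""
    days = (cohort["first_treatment_date"] - cohort["diagnosis_date"]).dt.days
    return float(days.dropna().median())


def test_median_days_to_treatment():
    # A tiny, fully specified cohort: 30 days, 40 days, and one untreated patient.
    cohort = pd.DataFrame({
        "diagnosis_date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-02-01"]),
        "first_treatment_date": pd.to_datetime(["2018-01-31", "2018-02-10", pd.NaT]),
    })
    # The untreated patient is dropped, so the expected median is 35 days.
    assert median_days_to_treatment(cohort) == 35.0
```

A test like this can run automatically every time the code changes, so a silent change in the analysis logic is caught before it ever reaches a manuscript.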

Core value of teaching and learning

To provide educational opportunities about these tools for reproducible real-world data analysis applied in health, and to promote the standardization of approaches across organizations, we have developed a short course. Attendees of the Flatiron Research Summit will have the first opportunity to participate. We've also joined forces with the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) to make the course available to a wider audience. Get ready for the launch of a new short course at the ISPOR Europe meeting in Barcelona this fall!

Our course will cover the guiding principles of structuring and organizing a modern data analysis, literate statistical analysis tools, formal version control, software testing and debugging, and developing reproducible reports. We will showcase several real-world examples and include a hands-on code review exercise to reinforce the concepts and tools introduced.
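To give a flavor of the "reproducible reports" idea, here is another rough, hypothetical sketch (the file names are invented, and this is not the course material itself): a report that records its own provenance, including the data snapshot, code version, and package versions used, can be regenerated and audited long after it was first produced.

```python
# Hypothetical sketch: a report that carries its own provenance.
import json
import platform
import subprocess
from datetime import datetime, timezone

import pandas as pd

# Analyses run against a fixed, named data snapshot rather than a live table.
DATA_FILE = "cohort_snapshot_2018-10-01.csv"  # invented file name


def git_commit() -> str:
    """Return the hash of the code version that produced the report."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def main() -> None:
    cohort = pd.read_csv(DATA_FILE, parse_dates=["diagnosis_date", "first_treatment_date"])
    days = (cohort["first_treatment_date"] - cohort["diagnosis_date"]).dt.days
    result = {"median_days_to_treatment": float(days.dropna().median())}

    # Everything a reader would need to recreate the number above.
    provenance = {
        "data_file": DATA_FILE,
        "git_commit": git_commit(),
        "pandas_version": pd.__version__,
        "python_version": platform.python_version(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

    with open("report.md", "w") as fh:
        fh.write("# Time to treatment\n\n")
        fh.write(f"Result: `{json.dumps(result)}`\n\n")
        fh.write(f"Provenance: `{json.dumps(provenance)}`\n")


if __name__ == "__main__":
    main()
```

None of this requires exotic tooling; the habits, such as naming the data snapshot and recording the code version, are what make an analysis possible to rerun and check years later.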

We welcome students and professionals with diverse backgrounds. The course content should be valuable for analysts writing code, as well as managers and academic investigators who want to create a culture that promotes and facilitates reproducible research in their team. We also look forward to learning from attendees who will share their own experiences and solutions.

Please consider joining us on the cutting edge of evolving good research practices for the future of regulatory-grade real-world data analysis.

===

Course: "Tools for Reproducible Real-World Data Analysis"
Instructors: Dr. Carrie Bennette and Dr. Blythe Adamson
Date and Time: Nov 10, 2018 from 8:00 AM–12:00 PM
Location: Barcelona, Spain
More information and registration