Reproducibility in Research


Vicky Steeves | April 18, 2017


The Problem

Researchers work with a lot of data...

...but how should it be organized?

Example: Human Error

State of Reproducibility

The majority of federally funded research is NOT reproducible.

Why Reproducibility?

"If I have seen further, it is by standing on the shoulders of giants." - Sir Isaac Newton

To build on top of previous work – science is incremental!

To verify the correctness of results

To defeat self-deception [1]

To help newcomers

To increase impact, visibility [2], and research quality [3]

Challenges of Reproducibility

Human Error + Dependency Hell

  • People make mistakes, and those mistakes affect their research
  • It’s good to have other people check your data and analyses; it’s like having a copy editor for your data!

  • It’s hard to keep track of which version of each tool was used
  • Software gets updated, and these changes can disrupt reproducibility

Example: Dependency Hell

The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements | June 1, 2012

We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.
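One way to guard against this kind of version drift is to record the computing environment alongside each analysis. The following is a minimal sketch, assuming the analysis runs in Python; the function name `snapshot_environment` is made up for illustration and is not part of any tool mentioned in this talk.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the OS, Python version, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Save this output next to your results so the versions travel with them.
    print(json.dumps(snapshot_environment(), indent=2))
```

A snapshot like this does not make an analysis reproducible on its own, but it tells a later reader which versions produced the results.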

The Solution

Research Data Management


Managing the way data is collected, processed, analyzed, preserved, and published for greater reuse by the community and the original researcher.

You can't have any sort of reproducibility without good data and project management.

So let's start with RDM, and go into reproducibility!

High-Level View of RDM

  • Data Type: format of data to be generated
  • Group Roles: who is primarily responsible for carrying out RDM? Set group norms
  • Data Storage: where will you store your data and how will you back up your data?
  • Data Archiving: how will you preserve and make your data available to others?

Documentation

Storage Rules!

Long Term Storage

Choose what you want to preserve and get to in the long term, but no matter what, make sure you keep:

  • documentation (lab/field notebooks, etc.)
  • tools & analysis

Put your data into an archival format! It should be:

  • open + accessible
  • software agnostic
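As an illustration of an open, software-agnostic format, tabular data can be written out as plain UTF-8 CSV, which nearly any tool can read. This is only a sketch: the function name `archive_as_csv`, the file name, and the sample rows are hypothetical, and CSV is one reasonable choice rather than the only one.

```python
import csv
from pathlib import Path


def archive_as_csv(rows, fieldnames, path):
    """Write rows (a list of dicts) to a UTF-8 CSV file and return its path."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()  # column names document the data in the file itself
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    # Hypothetical measurements, used only to show the call.
    sample = [{"sample": "a", "value": 1}, {"sample": "b", "value": 2}]
    print(archive_as_csv(sample, ["sample", "value"], "measurements.csv"))
```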

The New Traditional Model

  • Publish a paper.
  • Publish the underlying code and data.
  • Link the paper + code/data.
  • Bump up your H-Index.

And I have slides on that model here and here.

But the article, code, and data are really just the tip of the iceberg.

But what if I told you that you could put code + data + analysis + environment in one small file?

This means ONE file to share with each paper to ensure full reproducibility, verifiability, and authenticity of your research!

Tool to Help: ReproZip!

Here comes ReproZip, the reproducibility packer! ReproZip is a tool aimed at simplifying the process of creating reproducible experiments.

2 Steps to Reproducibility

Step 1: Packing

Step 1a: Tracing

Step 1b: Packing
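The packing step can be sketched as two command lines: `reprozip trace` followed by the command that runs your experiment, then `reprozip pack` with a name for the package. The Python helpers below only build those command lines; the helper names, `analysis.py`, and `my_experiment.rpz` are placeholders for illustration, and the result would be passed to `subprocess.run` on a machine where ReproZip is installed.

```python
import shlex


def trace_command(experiment_argv):
    """Build `reprozip trace <experiment command>`."""
    return ["reprozip", "trace", *experiment_argv]


def pack_command(package_name):
    """Build `reprozip pack <package>.rpz`."""
    return ["reprozip", "pack", package_name]


if __name__ == "__main__":
    # Placeholder experiment and package names.
    for cmd in (trace_command(["python", "analysis.py"]),
                pack_command("my_experiment.rpz")):
        print(shlex.join(cmd))
```

Tracing runs the experiment once while ReproZip records what it touches; packing then bundles what was recorded into a single `.rpz` file.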

Step 2: Unpacking

Step 2a: Setting Up

Step 2b: Running
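Unpacking follows the same pattern with the `reprounzip` command: pick an unpacker, set the package up in a directory, then run it from that directory. The sketch below builds those two command lines in Python; the helper names and the `my_experiment.rpz` and `mydirectory` values are placeholders, and which unpackers are available depends on which ReproUnzip plugins are installed.

```python
import shlex

# Unpackers offered by ReproUnzip; docker and vagrant come from separate plugins.
UNPACKERS = {"directory", "chroot", "docker", "vagrant"}


def setup_command(unpacker, package, directory):
    """Build `reprounzip <unpacker> setup <package> <directory>`."""
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker: {unpacker}")
    return ["reprounzip", unpacker, "setup", package, directory]


def run_command(unpacker, directory):
    """Build `reprounzip <unpacker> run <directory>`."""
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker: {unpacker}")
    return ["reprounzip", unpacker, "run", directory]


if __name__ == "__main__":
    # Placeholder package and directory names.
    print(shlex.join(setup_command("docker", "my_experiment.rpz", "mydirectory")))
    print(shlex.join(run_command("docker", "mydirectory")))
```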

Optional Step: Downloading Outputs

Optional Step: Uploading Inputs
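Both optional steps use the unpacked directory from Step 2. Downloading takes an `<output-id>:<local-path>` pair, and uploading takes a `<local-path>:<input-id>` pair. The Python sketch below builds those command lines; the helper names and the identifiers `results`, `input_data`, and the file paths are made up for illustration, since the real input and output names depend on the package (`reprounzip showfiles <directory>` lists them).

```python
import shlex


def download_command(unpacker, directory, output_id, local_path):
    """Build `reprounzip <unpacker> download <directory> <output-id>:<local-path>`."""
    return ["reprounzip", unpacker, "download", directory,
            f"{output_id}:{local_path}"]


def upload_command(unpacker, directory, local_path, input_id):
    """Build `reprounzip <unpacker> upload <directory> <local-path>:<input-id>`."""
    return ["reprounzip", unpacker, "upload", directory,
            f"{local_path}:{input_id}"]


if __name__ == "__main__":
    # Placeholder identifiers and paths.
    print(shlex.join(download_command("docker", "mydirectory",
                                      "results", "results.txt")))
    print(shlex.join(upload_command("docker", "mydirectory",
                                    "new_input.csv", "input_data")))
```

After uploading a replacement input, running the experiment again from the same directory uses the new file.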

Bonus Demo Time: Jupyter Notebooks

Keepin' Up With Reproducibility

There are many resources you can take advantage of to stay informed on reproducibility initiatives, resources, and studies!

Questions?


Email me: vicky.steeves@nyu.edu

Learn more about RDM: guides.nyu.edu/data_management

Get this presentation: guides.nyu.edu/data_management/resources

Make an appointment: guides.nyu.edu/appointment