Reproducibility in Research
Vicky Steeves | April 18, 2017
The Problem
Researchers work with a lot of data...
...but how should it be organized?
Example: Human Error
This is another example, a more recent case in which a very famous economics paper drew the wrong conclusions. The paper showed that countries with debt over 90% of their gross domestic product (GDP) have a negative growth rate; it was published at the same time that Greece was having an economic crisis. But no one could actually reproduce the authors' conclusions; researchers could not replicate the results of the paper. Eventually, researchers from UMass asked the authors for their data spreadsheet, and it turned out that there was a mistake in one of their Excel formulas that erroneously excluded 5 countries from the study. Had the results been reproducible from the beginning, this mistake would have been discovered much earlier, maybe by reviewers in time for publication, which would have avoided the bad publicity.
State of Reproducibility
The majority of federally funded research is NOT reproducible.
Why Reproducibility?
"If I have seen further, it is by standing on the shoulders of giants." - Sir Isaac Newton
To build on top of previous work – science is incremental!
To verify the correctness of results
To defeat self-deception1
To help newcomers
To increase impact, visibility2 and research quality3
Challenges of Reproducibility
Human Error + Dependency Hell
People make mistakes, and those mistakes impact their research
It's good to have other people check your data and analyses; it's like having a copy editor for your data!
It's hard to keep track of which version of what software was used
Software gets updated, and these changes can break reproducibility
Example: Dependency Hell
The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements | June 1, 2012
We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.
Reproducibility has a lot of different problems, with many, many solutions being proffered -- hence this wonderful program of events. ReproZip is all about reproducibility at the computational level -- because even if your code or environment is runnable, that doesn't mean it is necessarily reproducible. This study from PLOS ONE evaluated a popular software package in neuroanatomical science. The authors investigated whether data processing variables such as software version, hardware, and OSX version affected the results of the same study. They found significant differences across these variables -- even OSX 10.5 and 10.6 produced measurably different results from each other.
The Solution
Research Data Management
Managing the way data is collected, processed, analyzed, preserved, and published for greater reuse by the community and the original researcher.
You can't have any sort of reproducibility without good data and project management.
So let's start with RDM, and go into reproducibility!
High-Level View of RDM
Data Type: what format of data will be generated?
Group Roles: who is primarily responsible for carrying out RDM? Set group norms.
Data Storage: where will you store your data, and how will you back it up?
Data Archiving: how will you preserve your data and make it available to others?
Documentation
Storage Rules!
Long Term Storage
Choose what you want to preserve and access in the long term, but no matter WHAT, make sure you keep:
documentation (lab/field notebooks, etc.)
tools & analysis
Put your data into an archival format!
this should be open + accessible
and software agnostic
The New Traditional Model
Publish a paper.
Publish the underlying code and data.
Link the paper + code/data.
Bump up your H-Index.
And I have slides on that model here and here.
But the article, code, and data are really just the tip of the iceberg.
But what if I told you that you could put code + data + analysis + environment in one small file?
This means ONE file to share with each paper to ensure full reproducibility, verifiability, and authenticity of your research!
Tool to Help: ReproZip!
Here comes ReproZip, the reproducibility packer! ReproZip is a tool aimed at simplifying the process of creating reproducible experiments.
It automatically captures the provenance of experiments and packs all the necessary files, library dependencies, and environment variables needed to reproduce the results. Reviewers can then unpack and run the experiments without having to install any additional software.
2 Steps to Reproducibility
This graphic just shows some of the explicit steps involved in making and unpacking a reproducible package. The left shows the first step, packing. Here the original researcher uses reprozip trace, runs their experiment, and at the conclusion simply packs it into the single package. They can then send that package to colleagues or reviewers, or store it as an archival snapshot.
In the unpacking step, the reviewer or collaborator would load in the .rpz file, optionally upload their own input files to the environment, and reproduce the experiment. They have the option to download the output files for further inspection, and then destroy the VM or container.
While ReproZip is written in Python, it packs experiments and environments regardless of language by systematically tracking and recording programs, libraries, and file accesses.
From there, it creates a single reproducible package (an .rpz file) from the captured information.
Others can then unpack the .rpz file using ReproUnzip on another machine, regardless of configuration or OS. During the unpacking step, users have easy access to various unpackers, including Vagrant and Docker, to make reproduction simpler.
I also wanted to mention that ReproZip can pack graphical tools, client-server environments, databases, and interactive experiments.
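To make this concrete, the whole round trip boils down to a handful of commands. A minimal sketch, with hypothetical experiment and file names:

  # on the original (Linux) machine
  reprozip trace ./my_experiment data.csv
  reprozip pack my_experiment.rpz

  # on any other machine (Linux, Mac OS X, or Windows)
  reprounzip vagrant setup my_experiment.rpz ./exp
  reprounzip vagrant run ./exp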
Automatically and systematically captures the provenance of an existing experiment (Linux only)
Language-independent approach and solution
Creates a self-contained reproducible package from captured provenance
Extracts package in another environment, independent of the operating system
Provides easy-to-use interfaces for replicating and varying the original configuration of the experiment
Packs graphical tools, interactive environments, databases, client-server set-ups
Step 1: Packing
I'll show this a bit more explicitly. Right now, ReproZip can only pack experiments and environments on Linux. So we start with this computational environment. Then the researcher runs the experiment under reprozip trace. ReproZip then traces and records the execution of the experiment.
We use some heuristics (a set of rules that normally work) for detecting files:
Input files: files that are only read, do not belong to any software package, and are not executable, plus files mentioned on the command line
Log files: files that are read and then written afterwards (logs, database files, etc.)
Output files: files that are only written to
ReproZip then creates a configuration file that lists the files to be packed, so the researcher can add or remove files that they know should or should not be included. They should be wary of privacy and copyright issues when going over the list.
All the files are packed in the same structure in which they were found in the original environment. The output is a single reproducible .rpz file.
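In command form, packing looks roughly like this. A sketch with a hypothetical script name; .reprozip-trace/config.yml is where ReproZip writes the trace configuration by default:

  reprozip trace ./run_analysis.sh   # records programs, libraries, and file accesses
  # review and edit .reprozip-trace/config.yml: add or remove files,
  # and check for private or copyrighted material before packing
  reprozip pack analysis.rpz         # bundles everything into one .rpz file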
Step 2: Unpacking
Now that we have an .rpz file, we can unpack it on any OS.
Then we run whichever unpacker we like; there are four unpackers currently available.
The first is the directory unpacker (reprounzip directory), which allows users to unpack the entire experiment (including library dependencies) into a single directory, and to reproduce the experiment directly from that directory. It does so by automatically altering environment variables (e.g., PATH, HOME, and LD_LIBRARY_PATH) and the command line to point the experiment execution to the created directory, which has the same structure as in the packing environment. This is unreliable if the application cannot be trusted, since it can still reach outside the unpacked directory; hardcoded absolute paths, for example, will still point outside it.
The next is the chroot unpacker (reprounzip chroot). As with reprounzip directory, a directory is created from the experiment package; however, a full system environment is also built, which can then be run with chroot(2), a Linux mechanism that changes the root directory / for the experiment to the experiment directory. This unpacker therefore addresses the limitation of the directory unpacker and does not fail in the presence of hardcoded absolute paths. Note as well that it does not interfere with the current environment, since the experiment is isolated in that single directory. Although chroot offers pretty good isolation, it is not considered completely safe: malicious experiments might still escape to the host environment.
Third, the vagrant unpacker (reprounzip vagrant) allows an experiment to be unpacked and reproduced using a virtual machine created through Vagrant. Therefore, the experiment can be reproduced in any environment supported by this tool, i.e., Linux, Mac OS X, and Windows.
Lastly, ReproUnzip can extract and reproduce experiments as Docker containers. The docker unpacker (reprounzip docker) is responsible for such integration. Docker implements a high-level API to provide lightweight containers that run processes in isolation. A Docker container, as opposed to a traditional virtual machine, does not require or include a separate operating system. Instead, it relies on the kernel's functionality and uses resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces to isolate the application's view of the operating system, and is thus lighter and faster.
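All four unpackers share the same setup/run/destroy pattern on the command line. A sketch using hypothetical package and directory names:

  reprounzip directory setup analysis.rpz ./exp-dir   # single directory (Linux only)
  reprounzip chroot setup analysis.rpz ./exp-chroot   # chroot environment (Linux only)
  reprounzip vagrant setup analysis.rpz ./exp-vm      # Vagrant virtual machine
  reprounzip docker setup analysis.rpz ./exp-docker   # Docker container

  reprounzip docker run ./exp-docker       # reproduce the experiment
  reprounzip docker destroy ./exp-docker   # tear everything down when finished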
ReproZip also allows users to generate a provenance graph related to the experiment execution by reading the metadata available in the .rpz package. This graph shows the experiment runs as well as the files and other dependencies they access during execution; this is particularly useful to visualize and understand the dataflow of the experiment.
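For example (file names are hypothetical, and rendering assumes Graphviz is installed):

  reprounzip graph workflow.dot analysis.rpz   # extract the provenance graph from the .rpz
  dot -Tpng workflow.dot -o workflow.png       # render it with Graphviz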
There is also a VisTrails plugin that creates a VisTrails workflow, which can be used to run the experiment in VisTrails.
Optional Step: Downloading Outputs
Optional Step: Uploading Inputs
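Both optional steps use a name:path mapping on the command line. A sketch assuming the docker unpacker and hypothetical file names:

  reprounzip showfiles analysis.rpz                         # list the package's inputs and outputs
  reprounzip docker upload ./exp-docker mydata.csv:data     # swap in your own input file
  reprounzip docker run ./exp-docker                        # re-run with the new input
  reprounzip docker download ./exp-docker results:out.csv   # fetch a named output file
  reprounzip docker download ./exp-docker --all             # or download all output files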
Bonus Demo Time: Jupyter Notebooks
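A sketch of the notebook workflow, assuming the reprozip-jupyter plugin is installed (pip install reprozip-jupyter) and a hypothetical notebook name; check the ReproZip docs for the exact flags:

  reprozip-jupyter trace mynotebook.ipynb   # trace the notebook's execution (Linux only)
  reprozip pack mynotebook.rpz              # pack it like any other experiment
  reprozip-jupyter run mynotebook.rpz       # replay the notebook from the package (uses Docker)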
Keepin' Up With Reproducibility
There are many resources you can take advantage of to stay informed on reproducibility initiatives, resources, and studies!