Reproducibility in Research
Vicky Steeves | April 18, 2017
The Problem
Researchers work with a lot of data...
...but how should it be organized?
Example: Human Error
This is another example, a more recent case in which a very famous economics paper drew the wrong conclusions. The paper showed that countries with debt over 90% of their gross domestic product (GDP) have a negative growth rate; it was published at the same time that Greece was having an economic crisis. But no one could actually reproduce the authors' conclusions; researchers could not replicate the results of the paper. Eventually, researchers from UMass asked the authors for their data spreadsheet, and it turned out that there was a mistake in one of their Excel formulas that erroneously excluded 5 countries from the study. Had the results been reproducible from the beginning, this mistake would have been discovered much earlier, maybe by reviewers in time for publication, which would have avoided the bad publicity.
State of Reproducibility
The majority of federally funded research is NOT reproducible.
Why Reproducibility?
"If I have seen further, it is by standing on the shoulders of giants." - Sir Isaac Newton
To build on top of previous work – science is incremental!
To verify the correctness of results
To defeat self-deception1
To help newcomers
To increase impact, visibility2 and research quality3
Challenges of Reproducibility
Human Error + Dependency Hell
People make mistakes, and those mistakes impact their research
It's good to have other people check your data and analyses; it's like having a copy editor for your data!
It's hard to keep track of which version of what software was used
Software gets updated, and these changes can break reproducibility
Example: Dependency Hell
The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements | June 1, 2012
We investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. [...] About a factor two smaller differences were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6.
Reproducibility has a lot of different problems, with many, many solutions being proffered -- hence this wonderful program of events. ReproZip is all about reproducibility at the computational level -- because even if your code or environment is runnable, that doesn't mean it is necessarily reproducible. This study from PLOS ONE evaluated a popular software package in neuroanatomical science. The authors investigated whether data processing variables such as software version, hardware, and OSX version affected the results of the same study. They found significant differences across these variables -- even OSX 10.5 and 10.6 produced measurably different results from each other.
The Solution
Research Data Management
Managing the way data is collected, processed, analyzed, preserved, and published for greater reuse by the community and the original researcher.
You can't have any sort of reproducibility without good data and project management.
So let's start with RDM, and go into reproducibility!
High-Level View of RDM
Data Type: what format of data will be generated?
Group Roles: who is primarily responsible for carrying out RDM? Set group norms.
Data Storage: where will you store your data, and how will you back it up?
Data Archiving: how will you preserve your data and make it available to others?
Documentation
Storage Rules!
Long Term Storage
Choose what you want to preserve and access in the long term, but no matter WHAT, make sure you keep:
documentation (lab/field notebooks, etc.)
tools & analysis
Put your data into an archival format!
this should be open + accessible
and software agnostic
The New Traditional Model
Publish a paper.
Publish the underlying code and data.
Link the paper + code/data.
Bump up your H-Index.
And I have slides on that model here and here.
But the article, code, and data are really just the tip of the iceberg.
But what if I told you that you could put code + data + analysis + environment in one small file?
This means ONE file to share with each paper to ensure full reproducibility, verifiability, and authenticity of your research!
Tool to Help: ReproZip!
Here comes ReproZip, the reproducibility packer! ReproZip is a tool aimed at simplifying the process of creating reproducible experiments.
It automatically captures the provenance of experiments and packs all the necessary files, library dependencies, and environment variables needed to reproduce the results. Reviewers can then unpack and run the experiments without having to install any additional software.
2 Steps to Reproducibility
This graphic just shows some of the explicit steps involved in making and unpacking a reproducible package. The left shows the first step, packing. Here the original researcher uses reprozip trace, runs their experiment, and at the conclusion simply packs it into the single package. They can then send that package to colleagues or reviewers, or store it as an archival snapshot.
In the unpacking step, the reviewer or collaborator would load in the .rpz file, optionally upload their own input files to the environment, and reproduce the experiment. They have the option to download the output files for further inspection, and then destroy the VM or container.
While ReproZip is written in Python, it packs experiments and environments regardless of language by systematically tracking and recording programs, libraries, and file accesses.
From there, it creates a single reproducible package (an .rpz file) from the captured information.
Others can then unpack the .rpz file using ReproUnzip on another machine, regardless of configuration or OS. During the unpacking step, users have easy access to various unpackers, including Vagrant and Docker, to make reproduction simpler.
I also wanted to mention that ReproZip can pack graphical tools, client-server environments, databases, and interactive experiments.
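To make this concrete, the whole round trip boils down to a handful of commands. A minimal sketch, with hypothetical experiment and file names:

  # on the original (Linux) machine
  reprozip trace ./my_experiment data.csv
  reprozip pack my_experiment.rpz

  # on any other machine (Linux, Mac OS X, or Windows)
  reprounzip vagrant setup my_experiment.rpz ./exp
  reprounzip vagrant run ./exp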
Automatically and systematically captures the provenance of an existing experiment (Linux only)
Language-independent approach and solution
Creates a self-contained reproducible package from captured provenance
Extracts package in another environment, independent of the operating system
Provides easy-to-use interfaces for replicating and varying the original configuration of the experiment
Packs graphical tools, interactive environments, databases, client-server set-ups
Step 1: Packing
I'll show this a bit more explicitly. Right now, ReproZip can only pack experiments and environments on Linux. So we start with this computational environment. Then the researcher runs the experiment under reprozip trace. ReproZip then traces and records the execution of the experiment.
We use some heuristics (a set of rules that normally work) for detecting files:
Input files: files that are only read, do not belong to any software package, and are not executable, plus files mentioned on the command line
Log files: files that are read and then written afterwards (logs, database files, etc.)
Output files: files that are only written to
ReproZip then creates a configuration file that lists the files to be packed, so the researcher can add or remove files that they know should or should not be included. They should be wary of privacy and copyright issues when going over the list.
All the files are packed in the same structure in which they were found in the original environment. The output is a single reproducible .rpz file.
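In command form, packing looks roughly like this. A sketch with a hypothetical script name; .reprozip-trace/config.yml is where ReproZip writes the trace configuration by default:

  reprozip trace ./run_analysis.sh   # records programs, libraries, and file accesses
  # review and edit .reprozip-trace/config.yml: add or remove files,
  # and check for private or copyrighted material before packing
  reprozip pack analysis.rpz         # bundles everything into one .rpz file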
Step 2: Unpacking
Now that we have an .rpz file, we can unpack it on any OS.
Then we run whichever unpacker we like; there are four unpackers currently available.
The first is the directory unpacker (reprounzip directory), which allows users to unpack the entire experiment (including library dependencies) into a single directory, and to reproduce the experiment directly from that directory. It does so by automatically altering environment variables (e.g., PATH, HOME, and LD_LIBRARY_PATH) and the command line to point the experiment execution to the created directory, which has the same structure as in the packing environment. This is unreliable if the application cannot be trusted, since it can still reach outside the unpacked directory; hardcoded absolute paths, for example, will still point outside it.
The next is the chroot unpacker (reprounzip chroot). As with reprounzip directory, a directory is created from the experiment package; however, a full system environment is also built, which can then be run with chroot(2), a Linux mechanism that changes the root directory / for the experiment to the experiment directory. This unpacker therefore addresses the limitation of the directory unpacker and does not fail in the presence of hardcoded absolute paths. Note as well that it does not interfere with the current environment, since the experiment is isolated in that single directory. Although chroot offers pretty good isolation, it is not considered completely safe: malicious experiments might still escape to the host environment.
Third, the vagrant unpacker (reprounzip vagrant) allows an experiment to be unpacked and reproduced using a virtual machine created through Vagrant. Therefore, the experiment can be reproduced in any environment supported by this tool, i.e., Linux, Mac OS X, and Windows.
Lastly, ReproUnzip can extract and reproduce experiments as Docker containers. The docker unpacker (reprounzip docker) is responsible for such integration. Docker implements a high-level API to provide lightweight containers that run processes in isolation. A Docker container, as opposed to a traditional virtual machine, does not require or include a separate operating system. Instead, it relies on the kernel's functionality and uses resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces to isolate the application's view of the operating system, and is thus lighter and faster.
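All four unpackers share the same setup/run/destroy pattern on the command line. A sketch using hypothetical package and directory names:

  reprounzip directory setup analysis.rpz ./exp-dir   # single directory (Linux only)
  reprounzip chroot setup analysis.rpz ./exp-chroot   # chroot environment (Linux only)
  reprounzip vagrant setup analysis.rpz ./exp-vm      # Vagrant virtual machine
  reprounzip docker setup analysis.rpz ./exp-docker   # Docker container

  reprounzip docker run ./exp-docker       # reproduce the experiment
  reprounzip docker destroy ./exp-docker   # tear everything down when finished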
ReproZip also allows users to generate a provenance graph related to the experiment execution by reading the metadata available in the .rpz package. This graph shows the experiment runs as well as the files and other dependencies they access during execution; this is particularly useful to visualize and understand the dataflow of the experiment.
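For example (file names are hypothetical, and rendering assumes Graphviz is installed):

  reprounzip graph workflow.dot analysis.rpz   # extract the provenance graph from the .rpz
  dot -Tpng workflow.dot -o workflow.png       # render it with Graphviz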
There is also a VisTrails plugin that creates a VisTrails workflow, which can be used to run the experiment in VisTrails.
Optional Step: Downloading Outputs
Optional Step: Uploading Inputs
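Both optional steps use a name:path mapping on the command line. A sketch assuming the docker unpacker and hypothetical file names:

  reprounzip showfiles analysis.rpz                         # list the package's inputs and outputs
  reprounzip docker upload ./exp-docker mydata.csv:data     # swap in your own input file
  reprounzip docker run ./exp-docker                        # re-run with the new input
  reprounzip docker download ./exp-docker results:out.csv   # fetch a named output file
  reprounzip docker download ./exp-docker --all             # or download all output files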
Bonus Demo Time: Jupyter Notebooks
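A sketch of the notebook workflow, assuming the reprozip-jupyter plugin is installed (pip install reprozip-jupyter) and a hypothetical notebook name; check the ReproZip docs for the exact flags:

  reprozip-jupyter trace mynotebook.ipynb   # trace the notebook's execution (Linux only)
  reprozip pack mynotebook.rpz              # pack it like any other experiment
  reprozip-jupyter run mynotebook.rpz       # replay the notebook from the package (uses Docker)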
Keepin' Up With Reproducibility
There are many resources you can take advantage of to stay informed on reproducibility initiatives, resources, and studies!