A (rough and incomplete) guide to making your data analysis (and a bunch of other work you do) more reproducible

Karthik Ram • rOpenSci // UC Berkeley

[Figure: assembly-instruction sheet for a "KÖMPENDIUM", with numbered steps 1 to 6 and part counts 1x, 14x, 4x]

Research compendia

Gentleman and Temple Lang, 2004
...We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data, ...), and as a means for distributing, managing and updating the collection.

 

Research compendium principles

Stick with the conventions of your peers

Keep data, methods and outputs separate

Specify your computational environment as clearly as you can
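One lightweight way to act on the third principle is to write the R session details to a file that travels with the analysis. The sketch below uses only base R; the file name `session-info.txt` and the temporary directory are illustrative assumptions, not something the talk prescribes.

```r
# Record the R version, platform, and loaded packages next to the analysis.
# The file name and location are placeholders; tempdir() keeps the sketch side-effect free.
env_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), env_file)

# A compact table of the namespaces loaded in this session and their versions.
pkgs <- sort(loadedNamespaces())
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))
pkg_table <- data.frame(package = pkgs, version = versions, row.names = NULL)
print(pkg_table)
```

In a real compendium the file would be written into the project directory and committed alongside the code.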

Key components you'll need for sharing a compendium

License
VCS
Metadata
Archive

The R package structure is a great way to organize and share a compendium!

Package: glue
Title: Interpreted String Literals
Version: 1.3.0.9000
Authors@R: person("Jim", "Hester", email = "james.f.hester@gmail.com", role = c("aut", "cre"))
Description: An implementation of interpreted string literals, inspired by
  Python's Literal String Interpolation <https://www.python.org/dev/peps/pep-0498/> and Docstrings
  <https://www.python.org/dev/peps/pep-0257/> and Julia's Triple-Quoted String Literals
  <https://docs.julialang.org/en/stable/manual/strings/#triple-quoted-string-literals>.
Depends:
    R (>= 3.1)
Imports:
    methods
Suggests:
    testthat,
    (and many more)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
Roxygen: list(markdown = TRUE)
URL: https://github.com/tidyverse/glue
BugReports: https://github.com/tidyverse/glue/issues
VignetteBuilder: knitr
ByteCompile: true

Package DESCRIPTION file

compendium DESCRIPTION file

Packaging your analysis as a compendium gives you access to powerful developer tools

COMPENDIUM/
  DESCRIPTION
  LICENSE
  Readme.md
  data/
    Mydata.csv
  analysis/
    Report.Rmd
Marwick et al 2017

Small compendia
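A small compendium like this can be scaffolded with base R alone. The following is a minimal sketch, assuming a hypothetical compendium called `mycompendium`; every DESCRIPTION field value and the toy data are placeholders, and the skeleton is built in a temporary directory.

```r
# Build the skeleton in a temporary directory so the working tree is untouched.
root <- file.path(tempdir(), "mycompendium")   # hypothetical compendium name
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "analysis"), showWarnings = FALSE)

# DESCRIPTION carries the metadata and dependencies; all values are placeholders.
desc <- data.frame(
  Package = "mycompendium",
  Title = "Research Compendium for an Example Analysis",
  Version = "0.0.1",
  Description = "Data, code, and report for an example analysis.",
  License = "MIT + file LICENSE",
  Imports = "ggplot2",
  stringsAsFactors = FALSE
)
write.dcf(desc, file = file.path(root, "DESCRIPTION"))

# Empty license and readme files, a toy data file, and a stub report.
file.create(file.path(root, c("LICENSE", "Readme.md")))
write.csv(data.frame(x = 1:3, y = c(2.1, 3.9, 6.2)),
          file.path(root, "data", "Mydata.csv"), row.names = FALSE)
writeLines(c("---", "title: \"Report\"", "---"),
           file.path(root, "analysis", "Report.Rmd"))

print(list.files(root, recursive = TRUE))
```

Listing dependencies under `Imports` in DESCRIPTION is what lets package tooling install everything the analysis needs.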

COMPENDIUM/
  DESCRIPTION
  LICENSE
  Readme.md
  NAMESPACE
  data/
    Mydata.csv
  analysis/
    Report.Rmd
  R/
    myfunctions.R
  man/
    my_functions.Rd
Marwick et al 2017

Medium compendia

COMPENDIUM
DESCRIPTION
LICENSE
Readme.md
NAMESPACE
R/
myfunctions.R
analysis/
man/
tests/
.travis.yml
Dockerfile
scripts/
my_report.Rmd
data/
Makefile
datasets.csv
Marwick et al 2017

Large/complex compendia

Data (small, medium)

Computing environment

Workflows

1. Data

How does one manage small to medium data in the context of a research compendium?

Small data

Put small data inside packages, especially if you ship a methods package with your analysis
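As a rough illustration, package data is conventionally stored as an `.rda` file under the package's `data/` directory. The sketch below does this with base R; the package directory `mymethods` and the data frame `field_counts` are made up for the example. Helpers such as `usethis::use_data()` wrap the same step.

```r
# Hypothetical package directory, created under tempdir() for the sketch.
pkg <- file.path(tempdir(), "mymethods")
dir.create(file.path(pkg, "data"), recursive = TRUE, showWarnings = FALSE)

# Made-up example data set.
field_counts <- data.frame(
  site = c("a", "b", "c"),
  count = c(12L, 7L, 30L),
  stringsAsFactors = FALSE
)

# Save as a compressed .rda file; the object name becomes the data set name.
rda <- file.path(pkg, "data", "field_counts.rda")
save(field_counts, file = rda, compress = "xz")

# Check the size on disk against a 5 MB budget.
size_mb <- file.size(rda) / 1024^2
stopifnot(size_mb < 5)
```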

 

 

On CRAN, keep package data under 5 MB.

 

37% of the 13K packages on CRAN have some form of data. 

piggyback

Attach large [data] files to GitHub repositories

github.com/ropensci/piggyback

Leveraging GitHub releases to share medium-sized files
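In practice this could look something like the sketch below, which wraps piggyback's `pb_new_release()`, `pb_upload()`, and `pb_download()` in two helpers. The helper names, the repository `user/compendium`, the tag, and the file names are all hypothetical. The calls need network access and a GitHub token (for example in the `GITHUB_TOKEN` environment variable), so nothing is executed here.

```r
# Hypothetical helper: attach a file to a GitHub release of the compendium repo.
share_medium_file <- function(path, repo, tag = "v0.0.1") {
  if (!requireNamespace("piggyback", quietly = TRUE)) {
    stop("The piggyback package is required; install it from CRAN first.")
  }
  # Create the release that will hold the file, then upload to it.
  piggyback::pb_new_release(repo = repo, tag = tag)
  piggyback::pb_upload(file = path, repo = repo, tag = tag)
}

# Hypothetical helper: pull the same file back down on another machine.
fetch_medium_file <- function(name, repo, tag = "v0.0.1", dest = "data") {
  if (!requireNamespace("piggyback", quietly = TRUE)) {
    stop("The piggyback package is required; install it from CRAN first.")
  }
  dir.create(dest, showWarnings = FALSE, recursive = TRUE)
  piggyback::pb_download(file = name, dest = dest, repo = repo, tag = tag)
}

# Example calls (placeholders, not run):
# share_medium_file("data/big_table.csv.gz", repo = "user/compendium")
# fetch_medium_file("big_table.csv.gz", repo = "user/compendium")
```

Keeping the data in a release rather than in the git history keeps the repository itself small.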

Medium data
github.com/ropensci/piggyback
github.com/ropensci/arkdb

2. Isolate your computing environment

It's important to isolate the computing environment so that changes in software dependencies don't break your analysis.

doi:10.1038/nbt.3780

Adding a Dockerfile to your compendium

Many ways to write a Dockerfile for your R project 

o2r/containerit

jupyter/repo2docker

Binder

Binder is an open-source project designed to make it easy to share notebook-based analyses.

mybinder.org

Git + Docker + RStudio

Setting up Binder

runtime.txt: r-2018-12-20

install.r: install.packages("ggplot2")

Binder

Basic (free): install.r, runtime.txt. Slow but easy to set up. Recommended for beginners.
Premium (free): apt.txt, Dockerfile. Faster launch.
Pro (free): install.r, Dockerfile, DESCRIPTION. Best for compendia.

A fast Binder setup

DESCRIPTION
Dockerfile
Pull a base image from Rocker (e.g. rocker/binder:latest)
rocker-project.org

3. Workflow

Include a workflow to manage the relationships between data, output, and code.

drake

A general-purpose workflow manager and pipeline toolkit for reproducibility and high-performance computing.

 

github.com/ropensci/drake

Drake: Data Frames in R for Make

No cumbersome Makefiles
Vast arsenal of parallel computing options
Visualize dependency graph and estimate run times
Convenient organization of output
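To make the idea concrete, here is a minimal sketch of a drake plan, assuming the drake package is installed (the block skips itself otherwise). The target names `obs`, `fit`, and `slope` and the toy data are made up for illustration, and the cache lives in a temporary directory.

```r
if (requireNamespace("drake", quietly = TRUE)) {
  library(drake)

  # A plan is a data frame of targets and the commands that build them.
  # drake infers that `fit` depends on `obs`, and `slope` on `fit`.
  plan <- drake_plan(
    obs   = data.frame(x = 1:10, y = 2 * (1:10) + 1),  # made-up data
    fit   = lm(y ~ x, data = obs),
    slope = coef(fit)[["x"]]
  )

  # Keep the cache out of the working directory for this sketch.
  cache <- new_cache(tempfile("drake-cache-"))

  make(plan, cache = cache)   # builds all three targets
  make(plan, cache = cache)   # second call finds the targets up to date

  # Retrieve a built target from the cache.
  print(readd(slope, cache = cache))
} else {
  message("drake is not installed; skipping the sketch")
}
```

Changing the command for `obs` and calling `make()` again would rebuild only the targets downstream of it. The dependency graph can be drawn with `vis_drake_graph()`; what it takes as its first argument has varied across drake versions, so check the documentation for the installed one.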

Drake: visualize dependency graph

Take home

Leverage the R package structure and support tools/services as much as possible

Take home

Use modern tools to make your compendia more accessible, but don't forget long-term archives and simpler formats

github.com/topics/research-compendium

              Near term                  Long term
data          piggyback, data packages   Zenodo and friends
environment   Binder and friends         Dockerfile
workflow      Drake                      Core R tools, Make

karthik/rstudio2019
The git repo has links to the slides and all resources mentioned in the talk.