Reproducible, Sustainable Data Science
and how the libraries can enable such a transformation

Karthik Ram
@berkeley.edu

Jon Claerbout

Electronic Documents Give Reproducible Research a New Meaning

Judgement of the reproducibility of computationally oriented research no longer requires an expert - a clerk can do it

(in 1992)

" The actual scholarship is in the full software environment, code, & data that produce the result"

David Donoho

The reproducibility crisis

is widespread

Baker 2015; Baker & Dolgin 2017; Aschwanden 2016; Casadevall & Fang 2010

Lack of reproducibility is quite widespread even in applied computational research

The extent to which code would actually build with reasonable effort is quite low

Collberg et al 2014
< 20%

Prof. Daniel Bolnick

Recently, Dr. Tony Wilson from CUNY Brooklyn tried to recreate my analysis, so that he could figure out how it worked and apply it to his own data ... he couldn’t quite recreate some of my core results.

So: how many results, negative or positive, that enter the published literature are tainted by a coding mistake as mine was. We just don’t know. Which raises an important question: why don’t we review code (or other custom software) as part of the peer-review process?

Without software, modern research would be impossible

Without research software, modern research would be impossible

Research software (as opposed to simply software) is software developed within academia and used for the purpose of research: generate, process and analyze results.

 

Hettrick et al 2016

90%

70%

63%

95%

Use

Can't continue without

Research Software

doi.org/10.5281/zenodo.843607

"Without data it’s difficult to validate results. But without code, we waste the opportunity to advance science."

"Data implies software: it's not much good gathering data if you don't have the ability to analyze it."

Neil Chue Hong

C. Titus Brown

1

Training

Training in computational skills is one of the largest unmet needs

Barone et al, 2017

What we train people for

What we expect after

2

Credit

We don't know how to cite software

Howison & Bullard 2016

Formal citations: 31% - 43% 

Informal mentions are the norm, even in high impact journals

Software is frequently inaccessible (15 - 29%)

 Lack of visibility means that incentives to produce high-quality, widely shared, and collaboratively developed software are lacking

3

Sustainability

Software sustainability describes the practices, both technical and non-technical, that allow software to continue to operate as expected in the future

Hettrick et al 2016

Software sustainability is strongly linked to reproducibility and transparency

> 18k awards totaling $9.6 billion related to research software

NSF funding 1996-2016

Career paths

Beyond the gaps in credit and training, we lack institutional support for the developers and maintainers who ensure long-term availability of software.

Academic data science

Open Science Takes Over

What will data science look like in 2065?

Reproducible computation is finally being recognized today by many scientific leaders as a central requirement for valid scientific publication.

To work reproducibly in today’s computational environment, one constructs automated workflows that generate all the computations and all the analyses in a project. 
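One way to picture such an automated workflow is a single entry point that runs every step, from raw data to reported numbers, in a fixed order. The sketch below is illustrative only: the step functions, the toy dataset, and the `run_workflow` helper are all hypothetical names, not part of any tool mentioned in this talk.

```python
"""A minimal sketch of an automated analysis workflow (all names hypothetical)."""
import csv
import json
import statistics
import tempfile
from pathlib import Path


def make_raw_data(raw_path: Path) -> None:
    """Step 1: stand-in for data collection; writes a tiny toy dataset."""
    rows = [("site", "value"), ("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]
    with raw_path.open("w", newline="") as fh:
        csv.writer(fh).writerows(rows)


def summarise(raw_path: Path, out_path: Path) -> None:
    """Step 2: compute per-site means and save them as JSON."""
    groups = {}
    with raw_path.open(newline="") as fh:
        for row in csv.DictReader(fh):
            groups.setdefault(row["site"], []).append(float(row["value"]))
    means = {site: statistics.mean(vals) for site, vals in groups.items()}
    out_path.write_text(json.dumps(means, indent=2, sort_keys=True))


def run_workflow(project_dir: Path) -> dict:
    """Run every step in order so all results can be regenerated from scratch."""
    data_dir = project_dir / "data"
    out_dir = project_dir / "output"
    data_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    raw, summary = data_dir / "raw.csv", out_dir / "summary.json"
    make_raw_data(raw)
    summarise(raw, summary)
    return json.loads(summary.read_text())


if __name__ == "__main__":
    # A throwaway directory stands in for a real project folder.
    with tempfile.TemporaryDirectory() as tmp:
        print(run_workflow(Path(tmp)))
```

Because no step depends on anything done by hand, deleting `output/` and rerunning the script is enough to regenerate the results.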

We are almost in the golden age that Claerbout talked about in 1992

Notebooks are finally becoming the lingua franca of computational science

1

Training

The Carpentries, hackweeks, project-based learning, university data science courses

[Figure: Training landscape. Axis labels: Pedagogy, Project based learning. Items shown: The Carpentries, summer schools, workshops, traditional scientific meetings, hackweeks, rOpenSci unconf, dotAstro]

Hackweeks

Hack Weeks fill the gaps between pedagogically focused and project focused models.

doi:10.1073/pnas.1717196115

2

Credit

We are developing new journals aimed at developers, and highlighting reproducibility as scholarship

Journal of Open Source Software

joss.theoj.org

JOSS is a developer-friendly, open access journal for research software packages

A mechanism for research software developers to get credit within the current merit system of science

500 published papers

APC: $0

$3.50 to publish

OSI compatible license

Complete documentation

High test coverage

Readable code

Usability

Software Review

A software review thread

Nov 2017

Feb 2018

“I don’t really see myself writing another serious package without having it go through code review.”

3

Sustainability

We need to address issues around creating sustainable software and career paths for researchers who engage in those activities

Conceptualizing a US Research Software Sustainability Institute (URSSI)

#1743188
urssi.us

 URSSI mission

To improve the quality, usefulness, and sustainability of research software by improving practices, and increasing diversity of practitioners

Where do the libraries fit in the academic data science tent?

Be the stewards of good practices

1

Stages of open community for research software

Libraries fit in here

Enable scientists to make their work reproducible

2

Research compendia

Gentleman and Temple Lang, 2004
...We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data, ...), and as a means for distributing, managing and updating the collection.

 

[Figure: KÖMPENDIUM]

Research compendium principles

Stick with the conventions of your peers

Keep data, methods and outputs separate

Specify your computational environment as clearly as you can

Key components you'll need for sharing a compendium

License
VCS
Metadata
Archive

Data

Computing Env

Workflow

COMPENDIUM
DESCRIPTION
LICENSE
Readme.md
NAMESPACE
R/
myfunctions.R
analysis/
man/
tests/
.travis.yml
Dockerfile
scripts/
my_report.Rmd
data/
Makefile
datasets.csv
Metadata and software dependencies
Computing environment
Workflow

Data
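The layout above can be created as an empty skeleton before any analysis is written. The sketch below is only one possible reading of it: the file and directory names come from the listing, but how they nest (for example `data/` under `analysis/`) is an assumption, and `scaffold_compendium` is a hypothetical helper rather than part of any existing package.

```python
"""Sketch: create an empty research compendium skeleton (nesting is assumed)."""
import tempfile
from pathlib import Path

# Names are taken from the layout listed above; the nesting is an assumption.
DIRECTORIES = ["R", "man", "tests", "analysis/scripts", "analysis/data"]
FILES = [
    "DESCRIPTION",                 # metadata and software dependencies
    "LICENSE",
    "Readme.md",
    "NAMESPACE",
    "R/myfunctions.R",
    ".travis.yml",
    "Dockerfile",                  # computing environment
    "Makefile",                    # workflow
    "analysis/my_report.Rmd",
    "analysis/data/datasets.csv",  # data
]


def scaffold_compendium(root: Path) -> list:
    """Create the skeleton under `root` and return the relative paths created."""
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for file_name in FILES:
        (root / file_name).touch(exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    # A throwaway directory stands in for a new project folder.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_compendium(Path(tmp)):
            print(path)
```

Starting from a fixed skeleton keeps data, methods, and outputs in predictable places, which is what lets collaborators and tools find them without asking.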

How does a researcher manage small to medium data in the context of a research compendium?

 Capturing the computing environment

It's important to isolate the computing environment so that changes in software dependencies don't break your analysis.

doi:10.1038/nbt.3780
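A container image is the fuller answer, but even a plain record of what was installed helps when an analysis stops working later. The sketch below assumes the analysis runs in Python and uses only the standard library; `describe_environment` is a hypothetical helper, and it records versions rather than isolating anything.

```python
"""Sketch: record the computing environment next to an analysis."""
import json
import platform
import sys
from importlib import metadata


def describe_environment() -> dict:
    """Collect interpreter, operating system, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


if __name__ == "__main__":
    # Redirect this output to a file and commit it alongside the results.
    print(json.dumps(describe_environment(), indent=2))
```

Saving this output with each set of results gives a reader something concrete to compare against when their own run behaves differently.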

Binder

Binder is an open source project that is designed to make it really easy to share analyses that are in notebooks.

mybinder.org
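For a public GitHub repository, a Binder launch link follows the pattern `mybinder.org/v2/gh/<owner>/<repo>/<ref>`. The sketch below only builds that link; `binder_url` is a hypothetical helper, and the repository name in the example is made up for illustration.

```python
"""Sketch: build a mybinder.org launch link for a public GitHub repository."""
from urllib.parse import quote


def binder_url(owner: str, repo: str, ref: str) -> str:
    """Return a launch URL of the form mybinder.org/v2/gh/<owner>/<repo>/<ref>."""
    # Percent-encode each part so a branch name containing "/" stays one segment.
    parts = [quote(part, safe="") for part in (owner, repo, ref)]
    return "https://mybinder.org/v2/gh/" + "/".join(parts)


if __name__ == "__main__":
    # Hypothetical repository, used only for illustration.
    print(binder_url("example-lab", "example-compendium", "main"))
```

A link like this can sit in a README or a manuscript so that readers open the notebooks in a running environment without installing anything.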

Executable manuscript

Archived on Zenodo

Easily accessible data

License + a Dockerfile

Live manuscript

Collaborate on GitHub

Help researchers share data in easy to use formats

3

Make it easier to ingest data as part of data analyses

Easily read data back into scripts

Package smaller data along with software
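As a rough illustration of reading data back into scripts, small CSV files can be kept in a directory shipped with the code and loaded through one helper, so scripts never hard-code file paths. The `load_dataset` function and the toy file below are hypothetical; the sketch uses only the Python standard library.

```python
"""Sketch: a small loader for CSV files shipped alongside analysis code."""
import csv
import tempfile
from pathlib import Path


def load_dataset(name: str, data_dir: Path) -> list:
    """Read `<data_dir>/<name>.csv` and return one dict per row."""
    path = data_dir / f"{name}.csv"
    if not path.is_file():
        available = sorted(p.stem for p in data_dir.glob("*.csv"))
        raise FileNotFoundError(f"no dataset {name!r}; available: {available}")
    with path.open(newline="") as fh:
        return list(csv.DictReader(fh))


if __name__ == "__main__":
    # A throwaway directory stands in for a compendium's data/ folder.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        (data_dir / "datasets.csv").write_text("species,count\nfinch,3\nwren,5\n")
        print(load_dataset("datasets", data_dir))
```

Values come back as strings, so a real loader would also convert types; the point here is only that every script asks for data by name.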

Become the hub of data science training on campuses

4

[Figure: Training landscape. Axis labels: Pedagogy, Project based learning. Items shown: The Carpentries, summer schools, workshops, traditional scientific meetings, hackweeks, rOpenSci unconf, dotAstro]

Continue helping researchers with data curation but explore ways to ease friction for data science use cases.

Help early stage research software projects with best practices for software sustainability.

Help researchers make their work more reproducible.

Train researchers to embrace data science practices. 

bit.ly/erl
19