Research Software & Open Reproducible Research

Karthik Ram

@_inundata

Jon Claerbout

A revolution in education... marriage of word processing and software command scripts

1992

In this marriage, an author attaches to every figure caption a push button to recalculate the figure from all its data, parameters, and programs

This provides a concrete definition of reproducibility in computationally oriented research

1. Issue the "burn illustrations" command

2. Activate a search program to hunt for any remaining images

3. Issue the "build illustrations" command

4. Verify the document
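
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Claerbout's actual tooling: the function names, the use of plain text files as stand-ins for figures, and the `mean_figure` recipe are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def burn_illustrations(fig_dir: Path) -> None:
    """Step 1: delete every generated figure (text files stand in for images)."""
    for path in fig_dir.glob("*.txt"):
        path.unlink()


def remaining_illustrations(fig_dir: Path) -> list:
    """Step 2: hunt for any figure files that survived the burn."""
    return sorted(fig_dir.glob("*.txt"))


def build_illustrations(fig_dir: Path, recipes: dict) -> None:
    """Step 3: regenerate each figure from its recipe (data, parameters, program)."""
    for name, recipe in recipes.items():
        (fig_dir / name).write_text(recipe())


def verify_document(fig_dir: Path, recipes: dict) -> bool:
    """Step 4: check that every figure the document needs exists again."""
    return all((fig_dir / name).exists() for name in recipes)


def reproduce(fig_dir: Path, recipes: dict) -> bool:
    """Run the full burn / search / build / verify cycle."""
    burn_illustrations(fig_dir)
    if remaining_illustrations(fig_dir):
        return False  # something was not burned, so the rebuild proves nothing
    build_illustrations(fig_dir, recipes)
    return verify_document(fig_dir, recipes)


def mean_figure() -> str:
    """Hypothetical recipe: the 'figure' is a summary computed from raw data."""
    data = [1.0, 2.0, 3.0, 4.0]
    return f"mean={sum(data) / len(data):.2f}\n"


with tempfile.TemporaryDirectory() as tmp:
    fig_dir = Path(tmp)
    recipes = {"fig1.txt": mean_figure}
    (fig_dir / "fig1.txt").write_text("stale figure\n")  # pre-existing output
    ok = reproduce(fig_dir, recipes)
    rebuilt = (fig_dir / "fig1.txt").read_text()
```

The point of the sketch is that no step requires understanding the science: each one is a mechanical check that could be delegated.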

Judgement of the reproducibility of computationally oriented research no longer requires an expert - a clerk can do it

Verification of reproducibility

"The actual scholarship is in the full software environment, code, & data that produce the result"

David Donoho

10.1007/978-1-4612-2544-7_5

The reproducibility crisis

is widespread

Baker 2015; Baker & Dolgin 2017; Aschwanden 2016; Casadevall & Fang 2010

Medicine

▪ study power and bias

Psychology

▪ p-hacking

Biomed

▪ Lack of access to full datasets and protocols

Lack of reproducibility is quite widespread even in applied computational research

Collberg et al 2014

The extent to which code would actually build with reasonable effort is quite low

Collberg et al 2014

< 20%

Software is critical for research but we don't value it as scholarship

Prof. Daniel Bolnick

bit.ly/2wB606Y

“Recently, Dr. Tony Wilson from CUNY Brooklyn tried to recreate my analysis, so that he could figure out how it worked and apply it to his own data ... he couldn’t quite recreate some of my core results.”

“I dug up my original code, sent it to him, and after a couple of back-and-forth emails we found my error.”

“I immediately sent a retraction email to the journal (Evolutionary Ecology Research), which will be appearing soon in the print version. So let me say this clearly, I was wrong.”

“So: how many results, negative or positive, that enter the published literature are tainted by a coding mistake as mine was? We just don’t know. Which raises an important question: why don’t we review code (or other custom software) as part of the peer-review process?”

“I suspect that I am not the only biologist out there to make a small mistake in my code that has a big impact.”

When software is not visible, it is often excluded from peer review

Computational science has culture problems

no verification

no transparency

no efficiency

1

Training

Training in computational skills is one of the largest unmet needs

Barone et al, 2017

[Figure: share of researchers who use research software, and who can't continue without it. Values shown: 90%, 70%, 63%, 95%]

19% of researcher-contributed packages have unit tests

Keyes et al 2015
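
For readers who have not written one, a unit test is a small, automated check that a function returns a known answer for a known input. The sketch below uses Python's standard `unittest` module; the `standard_error` helper is a hypothetical analysis function invented for this example, not code from any package in the study.

```python
import unittest


def standard_error(values):
    """Hypothetical analysis helper: standard error of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (variance / n) ** 0.5


class TestStandardError(unittest.TestCase):
    def test_known_value(self):
        # Sample variance of [1, 2, 3] is 1, so the standard error is sqrt(1/3).
        self.assertAlmostEqual(standard_error([1.0, 2.0, 3.0]), (1 / 3) ** 0.5)

    def test_constant_data_has_zero_error(self):
        self.assertEqual(standard_error([5.0, 5.0, 5.0]), 0.0)

    def test_rejects_single_observation(self):
        with self.assertRaises(ValueError):
            standard_error([1.0])


# Run the tests programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestStandardError)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these are cheap to write and would catch the kind of small coding mistake that can silently change a published result.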

2

Credit

We don't know how to cite software

Howison & Bullard 2016

Lack of visibility means that incentives to produce high-quality, widely shared, and collaboratively developed software are lacking

Formal citations: 31% - 43%

Informal mentions are the norm, even in high impact journals

Software is frequently inaccessible (15 - 29%)

3

Sustainability

Career paths

Besides credit and training, we don't have institutional support for developers and maintainers to ensure long-term availability of software.

Academic data science

David Donoho

50 years of data science

Journal of Computational and Graphical Statistics, 2017

Academic data science is the big tent

Open Science Takes Over

What will data science look like in 2065?

Reproducible computation is finally being recognized today by many scientific leaders as a central requirement for valid scientific publication.

To work reproducibly in today’s computational environment, one constructs automated workflows that generate all the computations and all the analyses in a project.

We are currently in the golden age that Claerbout talks about

Notebooks are finally becoming the lingua franca of computational science

The best technology is something you don't know you're using

Fluent interfaces

1

Training

The Carpentries, hackweeks, project-based learning, university data science courses

Data 8, UC Berkeley

Software Carpentry (SWC), Data Carpentry (DC)

[Figure: event formats arranged between the labels Pedagogy and Projects: summer schools, workshops, traditional scientific meetings, hackweeks, rOpenSci unconf, dotAstro]

Hackweeks

Hackweeks fill the gap between pedagogically focused and project-focused models.

arxiv.org/abs/1711.00028

Developing a community of research software engineers, and the next generation of data science mentors.

tidy text

Making text analysis easier and more reproducible

textworkshop17.ropensci.org

2

Credit

We are developing new journals aimed at developers, and highlighting reproducibility as scholarship

Journal of Open Source Software

joss.theoj.org

Arfon Smith

Head of the Data Science Mission Office (DSMO), STScI

A mechanism for research software developers to get credit within the current merit system of science

Submissions require only a GitHub repository URL, an ORCID, and a succinct high-level description of the software

Reviewing software without a publication

Even without software pubs, we need to create a culture around peer-reviewing our research software

rOpenSci

100+ software packages to support data science, e.g. spatial data, biodiversity informatics and climate change, and glue for workflows.

OSI-compatible license

Complete documentation

High test coverage

Readable code

Usability

Software Review

A typical software review thread

Nov 2017

Feb 2018

“I don’t really see myself writing another serious package without having it go through code review.”

The ReScience journal

Replication is the scholarship

peerj.com/articles/cs-142/

Original article

Successful replication

Unsuccessful replication

3

Sustainability

We need to address issues around creating sustainable software and career paths for researchers who engage in those activities

> 18k awards totaling $9.6 billion related to research software.

NSF funding 1996-2016

US Research Software Sustainability Institute (URSSI)

urssi.us

nsf.gov/awardsearch/showAward?AWD_ID=1743188

Help projects grow, become sustainable, develop a governance model

Track software impact and usage, which are difficult to measure and interpret.

Help grantees of major funders communicate and share resources with each other

Governance

Credit

Training

Impact

10.7554/eLife.16800.001

Thanks!

inundata.org/talks/ernz18