Reproducible research: Practice Tobin Magle, PhD Bioinformationist Health Science Library University of Colorado Anschutz Medical Campus


Reproducibility is the practice of distributing all data, software source code, and tools required to reproduce the results discussed in a research publication.

https://www.ctspedia.org/do/view/CTSpedia/ReproducibleResearchStandards

Replication vs. Reproducibility

• Replication: The confirmation of results and conclusions from one study obtained independently in another is considered the scientific gold standard.
  • “Again, and Again, and Again …” BR Jasny et al. Science, 2011. 334(6060) p. 1225 DOI: 10.1126/science.334.6060.1225
• Some studies can’t be replicated: too big, too costly, too time consuming, one-time event, rare samples
• Reproducibility: minimum standard for assessing the value of scientific claims, particularly when full independent replication of a study is not feasible
  • “Reproducible Research in Computational Science”. RD Peng. Science, 2011. 334(6060) pp. 1226-1227 DOI: 10.1126/science.1213847

Research Lifecycle:

[Cycle diagram. Stages shown: Form hypothesis, Design experiment, Collect data, Analyze data, Write manuscript, Publish research]

Complications

1. Technological advances:
• Huge, complex digital datasets
• Computational power
• Ability to share

2. Human Error:
• Poor Reporting
• Flawed analyses

Complicated Research Lifecycle

[Cycle diagram. Stages shown: Form hypothesis, Design experiment, Plan for data storage, Collect data, Clean data, Analyze data, Write manuscript, Publish research, Share data, Curate data]

Requires new expertise and infrastructure


Reproducible research tools

• Data Management Plans
• Version control
• Literate Statistical Computing

DMPTool

• Developed by the California Digital Library to help researchers write data management plans
• https://dmptool.org/user_sessions/institution
• Select University of Colorado Anschutz Medical Campus

Create an account* or sign in

*We’re working with OIT to allow us to log in with CU passport credentials. Stay tuned.

CU Anschutz-specific content

Data management exercise

• Create a DMPTool account

• Pick a template and create a DMP

• Take 5 minutes to click through the template and think about how these questions relate to your research

Version control

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

https://git-scm.com/doc
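As a rough illustration of "recalling specific versions later", the sketch below records two versions of a file with Git and then reads back the earlier one. The file name, its contents, and the user identity are made up for the example, and it assumes Git is installed and runs in a throwaway temporary directory.

```shell
# Work in a scratch directory so nothing real is touched
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity, stored only in this scratch repository
git config user.name "Example User"
git config user.email "user@example.com"

# Version 1
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft"

# Version 2 (-a stages the already tracked file before committing)
echo "second draft" > notes.txt
git commit -q -a -m "Revise draft"

# List the recorded versions, newest first
git log --oneline

# Recall the file as it was one commit ago; the working copy is left alone
old_version=$(git show HEAD~1:notes.txt)
echo "$old_version"
```

The working copy still holds the second draft afterwards; `git show` only prints the older snapshot.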

Intuitive version control

But what if you save a new file into the wrong version?

Original (V1) → V2 → V3

Local version control system

Figure 1-1. Local version control. https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control

• Keeps files in one place
• No copies
• Keeps track of changes
• Like Apple’s Time Machine

But what if you need to collaborate?

Centralized version control

Figure 1-2. Centralized version control. https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control

• Keeps files on a server
• No copies
• Keeps track of changes
• Can work simultaneously on different files

But what if the server goes down?

What if you can’t get online?

Distributed version control

Figure 1-3. Distributed version control. https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control

Git, Mercurial, Bazaar or Darcs

• Keeps files locally AND on a server
• Changes are synchronized among computers and the server
• Keeps track of changes
• Can work simultaneously

What is Git?

• Distributed version control system developed by the Linux community
• A stream of snapshots

Figure 1-5. Storing data as snapshots of the project over time.

https://git-scm.com/book/en/v2/Getting-Started-Git-Basics

3 states of repository files

• Modified – the file is altered but not committed
• Staged – the file is altered and marked to go to the next commit
• Committed – the file is altered and stored in your local DB
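The three states can be observed from the command line. The sketch below is only an illustration: it assumes Git is installed, uses a made-up file name and identity, and reads the state from `git status --porcelain`, where the first column describes the staging area and the second the working directory.

```shell
# Scratch repository with one committed file
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity
git config user.email "user@example.com"
echo "x,y" > data.csv
git add data.csv
git commit -q -m "Add data file"

# Modified: changed in the working directory, not yet staged
echo "1,2" >> data.csv
modified=$(git status --porcelain)
echo "modified:  [$modified]"

# Staged: marked to go into the next commit
git add data.csv
staged=$(git status --porcelain)
echo "staged:    [$staged]"

# Committed: stored in the local repository, nothing left to report
git commit -q -m "Add first row"
committed=$(git status --porcelain)
echo "committed: [$committed]"
```

An "M" in the second column means modified but unstaged, an "M" in the first column means staged, and empty output means everything is committed.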

3 Sections of your directory

Figure 1-6. Working directory, staging area, and Git directory. https://git-scm.com/book/en/v2/Getting-Started-Git-Basics

[Figure labels: Modified, Staged, Committed]

Important git commands

• Init (Initialize) – start a git repository
• Add – add files to the git repository (for the initial add and for staging); for files git already tracks, the staging step can be skipped by committing with the -a option
• Commit – safely store the files in your git repository
• Clone – make a copy of someone else’s git repository
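The four commands above might be used together roughly as follows. This is a sketch under stated assumptions: Git is installed, the directory names `project` and `project-copy`, the file `analysis.R`, and the identity are invented for illustration, and the clone is made from a local path rather than from a server.

```shell
work=$(mktemp -d)
cd "$work"

# init: start a new repository in the directory "project"
git init -q project
cd project
git config user.name "Example User"      # placeholder identity
git config user.email "user@example.com"

# add + commit: the initial add is required for a new file
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# commit -a: stages changes to tracked files and commits in one step
echo "y <- 2" >> analysis.R
git commit -q -a -m "Extend analysis script"

# clone: copy the repository, including its full history
cd "$work"
git clone -q project project-copy
copy_commits=$(git -C project-copy rev-list --count HEAD)
echo "commits in the clone: $copy_commits"
```

Note that `commit -a` does not pick up brand-new files; those still need an explicit `git add`.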

File statuses and how they change

Figure 2-1. The lifecycle of the status of your files. https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository

GitHub Desktop

[Screenshot labels: Repositories, Visual, Altered files, Commit notes]

Git your hands on git

• Create a GitHub account

• Go to the repository: https://github.com/maglet/hands-on-git

• Clone the repository

Log in to GitHub Desktop

• Hands-on-git should be in the left-hand panel under GitHub

When you change a file…

Automatically adds files/alterations

To commit

After commit

Added a “bubble”

Click there to revert

Reverting
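For readers working on the command line instead of GitHub Desktop, a comparable undo can be sketched with `git revert`, which adds a new commit that reverses an earlier one. Whether this matches exactly what the desktop button does is an assumption; the file name, values, and identity below are invented.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity
git config user.email "user@example.com"

# A good commit
echo "alpha = 0.05" > params.txt
git add params.txt
git commit -q -m "Set alpha"

# A commit we later regret
echo "alpha = 0.5" > params.txt
git commit -q -a -m "Change alpha by mistake"

# Undo the last commit by recording a new commit that reverses it
git revert --no-edit HEAD

restored=$(cat params.txt)
echo "$restored"
```

The mistaken commit stays in the history, so the record shows both the error and its correction.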

Cloning/Branching/Forking

• Cloning: make a local copy of a repository online or elsewhere
• Branching: creating a separate stream to test new features, so you don’t affect the “trunk”; branches depend on the trunk
  • Collaboration
• Forking: making a separate copy of a repository that is not dependent on the original
  • Using others’ work as a starting point; preserving for yourself things that the owner might delete

Branching
Splits off

Editing a branch

Pull request
Meets back up with “master”

Approve Pull request
Meets back up with “master”, can be reverted
Can delete unused branches

Pull request approved
Back on one track
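The branch-and-rejoin flow shown on these slides can be approximated locally. A pull request is a GitHub feature, so this sketch substitutes a plain `git merge` for the approval step; the branch name `add-name`, the file names, and the identity are hypothetical, and Git is assumed to be installed.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"      # placeholder identity
git config user.email "user@example.com"
echo "hands-on example" > README.txt
git add README.txt
git commit -q -m "Initial commit"

# Remember the trunk's name ("master" or "main", depending on Git settings)
trunk=$(git symbolic-ref --short HEAD)

# Branching: split off a separate line of work
git checkout -q -b add-name

# Editing a branch: the trunk is unaffected by this commit
mkdir name
echo "Example User" > name/example-user.txt
git add name
git commit -q -m "Add my name"

# Stand-in for an approved pull request: merge the branch back into the trunk
git checkout -q "$trunk"
git merge -q --no-ff -m "Merge branch add-name" add-name

# Unused branches can be deleted once merged
git branch -d add-name
```

Using `--no-ff` keeps a visible merge commit, which is roughly where the two tracks meet back up in the visual history.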

Exercise

• Go to the repository you cloned earlier

• Create a text file with your name on it

• Add it to the name folder

• Submit a pull request

• Look at what happens to the visual representation

Literate (statistical) programming

• Resulting report is a stream of text (human readable) and code (machine readable)
• Alternate text and code
  • Sweave
  • R Markdown

Install knitr and markdown packages

• Tools > Install Packages
• Enter the package name (will autocomplete)
  • knitr
  • markdown
• OR install.packages("knitr")

• If it fails, try again

Open/Create a markdown document

Write: useful syntax

• Plain text
• *italics* -> italics
• **bold** -> bold
• # Header -> Header (each additional # decreases the size)
• Can also add:
  • Pictures
  • Ordered and unordered lists
  • Tables

Embed code

• Inline – Use variables in the human readable text
  • `r 2 + 2`
• Code chunks – Include working code that generates output
  • ```{r}
  • #Code goes here
  • ```
• Display Options –
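To see both embedding styles in one place, the sketch below writes a minimal R Markdown file from the shell. The file name `example.Rmd`, the title, and the use of R's built-in `cars` dataset are assumptions made for illustration. The backtick is kept in a variable only so the heredoc can expand it.

```shell
# A literal backtick and a three-backtick chunk fence, built via printf
tick=$(printf '\140')
fence="$tick$tick$tick"

doc="$(mktemp -d)/example.Rmd"

# Write the document: a header, one inline expression, and one code chunk
cat > "$doc" <<EOF
---
title: "Example report"
output: html_document
---

Two plus two is ${tick}r 2 + 2${tick}.

${fence}{r}
# Code goes here
summary(cars)
${fence}
EOF

# Show the resulting file
cat "$doc"
```

When the file is knit, the inline expression should be replaced by its value (4) in the text, and the chunk should show both the code and its output.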

Render

• Won’t render unless the code runs with no errors
  • You know it should be reproducible
• Render using the knit function
• Output Formats
  • Knit HTML
  • Knit PDF – requires LaTeX
  • Knit Word

Exercise

• Edit the markdown document using the cheat sheet to see what you can do
• Try to knit it after creating a typo in the code
• Insert other pictures from the web
• Try to make a table
• Make some bulleted lists
• Insert a block quote
• Make the graph prettier

• Play around!