Tools for reproducible and accessible science
VMs, KnitR and OMERO
Rob Davidson
Cardiac Physiome Workshop
Auckland, April 8th 2015
All Your Research Objects
• Project proposal
• Project experimental SOPs
• Images of equipment, subjects, conditions
• RAW data
• Meta-data
• Analysis code, parameters, pipelines
• Analysis environment, VM or provisioning script
• Intermediate results
• Publication figures/images/tables: codify
• Publication text
Source: DOI: 10.6084/m9.figshare.1330219
GigaSolution: deconstructing the paper
Combines and integrates:
Open-access journal
Data Publishing Platform
Data Analysis Platform
Today’s message
• Tools that fit with GigaDB
– General purpose Research Object store
• Enhancing
– Accessibility
– Reproducibility
• Of some of your research objects
– Software
– Images
Problems with scientific software - reproducibility
Measuring software reproducibility
• Systematic study:
– 515 papers (429 conference, 86 journal)
– <30% reproducible
http://reproducibility.cs.arizona.edu
Reasons for failure
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”
http://reproducibility.cs.arizona.edu
Cost of failure
• Waste time
• Waste money
– Ioannidis 2014: 85% of resources wasted
• Frustrating
• Distrust
DOI: 10.1371/journal.pmed.1001747
Literate programming - KnitR
Literate programming
• “Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
– Donald E. Knuth, Literate Programming, 1984
Literate programming options
• See listing: http://www.gigasciencejournal.com/content/3/1/19
– R: KnitR, Sweave, R-Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy
KnitR is versatile
R
Python
Ruby
Haskell
Perl
SAS
Coffeescript
.txt
LaTeX
HTML
D3.js
R Markdown
HTML5 slides
Command line
Any text?
WordPress
KnitR – how does it work?
• Code chunks
– Basic text (or LaTeX or Markdown), interrupted by ‘chunks’ of code
• For LaTeX, similar to Sweave
…some text \Sexpr{rfunc(var)} more text…
…some text
<<language, chunk_name, chunk_options>>=
Some code
@
• Process this combined text/code with knit() in R
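To make the chunk idea concrete, here is a minimal Python sketch of how a mixed text/code file could be split into its code chunks. It is not how knit() is implemented; it only illustrates the Sweave-style delimiters shown above. The names CHUNK_RE, split_chunks and the example document are hypothetical and introduced purely for illustration.

```python
import re

# Hypothetical pattern for a Sweave-style chunk: a "<<header>>=" line,
# then the code lines, then a line containing only "@".
CHUNK_RE = re.compile(
    r"^<<([^\n]*?)>>=[ \t]*\n(.*?)^@[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def split_chunks(document):
    """Return a list of (header, code) pairs found in the document."""
    return [(m.group(1).strip(), m.group(2)) for m in CHUNK_RE.finditer(document)]


# Illustrative input only: plain text interrupted by one chunk of R code.
example = """Some text before the chunk.
<<summary_stats, echo=FALSE>>=
x <- c(1, 2, 3)
mean(x)
@
Some text after the chunk.
"""

chunks = split_chunks(example)
for header, code in chunks:
    print("chunk header:", header)
    print(code, end="")
```

A real literate programming tool would go on to run each chunk and weave its output back into the surrounding text; this sketch stops at locating the chunks.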
KnitR uses: easy to explain
http://reproducibility.cs.arizona.edu
KnitR uses: reproducible analysis
• Can string different tools/languages together
• Stores parameters
• Just like a pipeline/workflow system
– E.g. Galaxy, Taverna, KNIME
• But also: codifies your figures…
KnitR uses: codified figures
• Classic problems:
– No description of error bars
– No description of distributions
• Admittedly this could be fixed by ‘proper’ peer review
Source code: http://bit.ly/1NQZlHh
KnitR uses: codified figures
• Code can be found quickly
– Using text as markers
• Plot can be altered – 1 line of code
• New visualisation produced instantaneously
• Better evaluation of results
Source code: http://bit.ly/1NQZlHh
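As a rough illustration of what "codified" means here, the Python sketch below keeps the definition of the error bars in the code itself, so a reader can see what they represent and change it in one line. It is not the source code linked above; the names standard_error, ERROR_BAR and summarise, and the numbers in groups, are made up for illustration.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# The single line that defines what the error bars mean.
# Swap in statistics.stdev to show standard deviation instead.
ERROR_BAR = standard_error


def summarise(groups):
    """Return {group name: (mean, error bar size)} for each group."""
    return {
        name: (statistics.mean(values), ERROR_BAR(values))
        for name, values in groups.items()
    }


# Made-up measurements, for illustration only.
groups = {"control": [2.0, 4.0, 6.0], "treated": [5.0, 7.0, 9.0]}

summary = summarise(groups)
for name, (mean, err) in summary.items():
    print(f"{name}: mean={mean:.2f}, error bar={err:.2f}")
```

Because the choice of error bar lives in the document's own code, the "no description of error bars" problem cannot arise silently: the description is the code.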
GigaScience KnitR example
• “This article is an example of a literate programming document. It has been created in R using the knitr package. Figures and tables in this paper are generated dynamically as the document is compiled. Several R packages are required to run the analysis. Materials are archived in the Gigascience database”
DOI: 10.1186/2047-217X-3-3
Environment wrappers - VMs
Your environment
• How hard would it be to start from scratch?
• What if you move from Ubuntu to CentOS? Or just upgrade?
• Dependencies / Versions
• System settings
• Hard for you, horrendous for others!
Share your environment
• Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time
DOI: 10.1186/2047-217X-3-23
Share your environment
• Docker
– ‘Light’ VM
– Discrete unit of code+environment
– Can be called from command line
– Can be linked together
• New possibilities e.g. nucleotid.es
– Benchmarking -> “data-driven peer-review”?
http://nucleotid.es/
Share your environment
• Some concerns:
– http://ivory.idyll.org/blog/vms-considered-harmful.html
– VM = black box?
– Docker == black box!
Solution -> codify the environment
Codify your environment
• Provisioning scripts are ‘research objects’
• Improves adaptability (easier to recode for alternative OS etc.)
• Builds in extra documentation
• Easier to share – although GigaDB still wants a compiled snapshot (i.e. full machine)
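A provisioning script describes how to build an environment. A smaller first step in the same spirit is to record, as text, what the current environment contains. The Python sketch below does only that and is an assumption-laden illustration rather than a substitute for the provisioning systems listed next; the function name describe_environment and the manifest layout are hypothetical.

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Collect OS, Python version and installed Python packages as plain data."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip entries with missing metadata
    )
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": packages,
    }


manifest = describe_environment()

# A text manifest like this can be archived next to the code and data.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

This covers only the Python layer of the environment; system libraries and settings still need a provisioning script or a full machine snapshot.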
Short list of provisioning systems
• Vagrant
• Chef
• Salt
• Puppet
• Ansible
• Many more – see link for info
Source: http://bit.ly/1wrYiuI
Images: release ALL the images with OMERO
“And now for something completely different”
NO
Phenotyping with microCT
DOI: 10.1186/2047-217X-2-14
NO
Phenotyping with microCT
DOI: 10.1186/2047-217X-3-6
Hosting Images
• Image LIMS
• Links to GigaDB
• Can handle most formats
• Web embedding
• View online, no need for software
• Open Source
www.openmicroscopy.org/site/products/omero
OMERO: providing access to imaging data
View, filter, measure raw images with direct links from journal article.
See all image data, not just cherry picked examples.
Download and reprocess.
OMERO: Adding value
http://jcb-dataviewer.rupress.org/
The alternative...
...look but don't touch
Thanks for listening!
Acknowledgements
• GigaTeam
– Scott Edmunds
– Peter Li
– Chris Hunter
– Jesse Xiao
– Nicole Edmunds
– Laurie Goodman
Where to get these slides
• FigShare DOI: