
2013-10-30 SBC361: Reproducible Designs and Sustainable Software


Queen Mary University of London, SBC361: Experimental Design, Reproducible Research, Sustainable Software


Page 1:

Programming in R: Quick refresher

Page 2:

• creating a vector (three synonyms):

> myvector <- 5:11
> myvector <- seq(from=5, to=11, by=1)
> myvector <- c(5, 6, 7, 8, 9, 10, 11)
> myvector
[1] 5 6 7 8 9 10 11

• accessing a subset of a vector:

> bigvector <- 150:100
> bigvector
 [1] 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132
[20] 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113
[39] 112 111 110 109 108 107 106 105 104 103 102 101 100
> mysubset <- bigvector[myvector]
> mysubset
[1] 146 145 144 143 142 141 140

> subset(bigvector, bigvector > 120)
 [1] 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132
[20] 131 130 129 128 127 126 125 124 123 122 121
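For comparison, the same filtering can also be written with logical indexing, an equivalent base-R idiom (not shown on the original slide):

> bigvector[bigvector > 120]   # returns the same vector as subset(bigvector, bigvector > 120)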

Page 3:

Regular expressions: Text search on steroids.

Regular expression                   Finds
David                                David
Dav(e|id)                            David, Dave
Dav(e|id|ide|o)                      David, Dave, Davide, Davo
At{1,2}enborough                     Attenborough, Atenborough
Atte[nm]borough                      Attenborough, Attemborough
At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}   Atimbro, attenbrough, etc.

Regular expressions also make it easy to count matches, or to replace every match with “Sir David Attenborough”.

Page 4:

• for subsetting/counting: grep()

• for replacing: gsub()
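A minimal R sketch of both functions, using a pattern like those on the previous slide (the surnames vector is invented for illustration):

> surnames <- c("Attenborough", "Atenborough", "Attemborough", "Smith")
> grep("At{1,2}[ei][nm]borough", surnames)
[1] 1 2 3
> length(grep("At{1,2}[ei][nm]borough", surnames))    # counting matches
[1] 3
> gsub("At{1,2}[ei][nm]borough", "Sir David Attenborough", surnames)
[1] "Sir David Attenborough" "Sir David Attenborough" "Sir David Attenborough"
[4] "Smith"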

Page 5:

Functions

• R has many, e.g. plot(), t.test()

• Making your own:

tree_age_estimate <- function(diameter, species) {
    # [...do the magic... maybe something like:]
    growth.rate  <- growth.rates[species]
    age.estimate <- diameter / growth.rate
    return(age.estimate)
}

> tree_age_estimate(25, "White Oak")
[1] 66
> tree_age_estimate(60, "Carya ovata")
[1] 190
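A self-contained version of the same idea, with a made-up growth.rates lookup table (the slide does not give the real values, so the numbers below are purely illustrative):

# hypothetical growth rates (diameter units per year); values invented so the
# example roughly matches the slide's output
growth.rates <- c("White Oak" = 0.38, "Carya ovata" = 0.32)

tree_age_estimate <- function(diameter, species) {
    growth.rate  <- growth.rates[species]    # look up the rate for this species
    age.estimate <- diameter / growth.rate   # years needed to reach this diameter
    return(unname(age.estimate))
}

tree_age_estimate(25, "White Oak")   # roughly 66 years with the invented rate above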

Page 6:

“for” Loop

> possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue', 'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue', 'prussian blue', 'indigo', 'baby blue', 'electric blue')

> possible_colours
 [1] "blue"          "cyan"          "sky-blue"      "navy blue"
 [5] "steel blue"    "royal blue"    "slate blue"    "light blue"
 [9] "dark blue"     "prussian blue" "indigo"        "baby blue"
[13] "electric blue"

> for (colour in possible_colours) {
+     print(paste("The sky is so, oh so", colour))
+ }

[1] "The sky is so, oh so blue"[1] "The sky is so, oh so cyan"[1] "The sky is so, oh so sky-blue"[1] "The sky is so, oh so navy blue"[1] "The sky is so, oh so steel blue"[1] "The sky is so, oh so royal blue"[1] "The sky is so, oh so slate blue"[1] "The sky is so, oh so light blue"[1] "The sky is so, oh so dark blue"[1] "The sky is so, oh so prussian blue"[1] "The sky is so, oh so indigo"[1] "The sky is so, oh so baby blue"[1] "The sky is so, oh so electric blue"

Page 7:
Page 8:

Experimental design

Reproducible research & Scientific computing.

Page 9:

Why consider experimental design?

• If you’re performing experiments:
  • Cost
  • Time
    • for the experiment
    • for the analysis
  • Ethics
• If you’re deciding: to fund? to buy? to approve? to compete?
  • Are the results real?
  • Can you trust the data?

Page 10:

Main potential problems

• Insufficient data/power

• Inappropriate statistics

• Pseudoreplication

• Confounding factors

These problems lead to results that are wrong, inaccurate & misleading.

Page 11:

Example: deer parasites

• Do red deer that feed in woodland have more parasites than deer that feed on moorland?

• Find a woodland and a moorland; collect faecal samples from 20 deer in each.

• Conclusion?

• But:
  • pseudoreplication (n = 1, not 20!):
    • shared environment (the deer influence each other)
    • relatedness
  • many confounding factors (e.g. altitude...)

Page 12:

Your turn: small & big Pheidole workers

• Is there a genetic predisposition for becoming a larger worker?

• Design an experiment alone.

• Exchange ideas with your neighbor.

Page 13:

e.g.: John.

Page 14:

Your turn again: protein production

• Large amounts of potential superdrug takeItEasyProtein™ are required for Phase II trials.
• 10 cell lines can produce takeItEasyProtein™.
• You have 5 possible growth media.
• Optimization question: Which combination of temperature, cell line, and growth medium will perform best?
• Constraints:
  • each assay takes 4 days
  • access to 2 incubators (each can contain 1-100 growth tubes)
  • large-scale production starts in 2 weeks
• Design an experiment alone.
• Exchange ideas with your neighbor.

Page 15:
Page 16:

Reproducible Research & Scientific Computing

Page 17:

Why care?

Page 18:
Page 19:

Some sources of inspiration

Page 20:

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Best Practices for Scientific Computing
Greg Wilson, D.A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H.D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, Paul Wilson

Software Carpentry; University of Ontario Institute of Technology; Michigan State University; Software Sustainability Institute; Space Telescope Science Institute; University of Toronto; Monterey Bay Aquarium Research Institute; University of Wisconsin; University of British Columbia; Queen Mary University of London; University College London; Utah State University; University of Wisconsin

Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge. As a result, recent studies have found that scientists typically spend 30% or more of their time developing software [19, 52]. However, 90% or more of them are primarily self-taught [19, 52], and therefore lack exposure to basic software development practices such as writing maintainable code, using version control and issue trackers, code reviews, unit testing, and task automation.

We believe that software is just another kind of experimental apparatus [63] and should be built, checked, and used as carefully as any physical apparatus. However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is [21, 20]. This can lead to serious errors impacting the central conclusions of published research [43]: recent high-profile retractions, technical comments, and corrections because of errors in computational methods include papers in Science [6], PNAS [39], the Journal of Molecular Biology [5], Ecology Letters [37, 8], the Journal of Mammalogy [33], and Hypertension [26].

In addition, because software is often used for more than a single project, and is often reused by other scientists, computing errors can have disproportional impacts on the scientific process. This type of cascading impact caused several prominent retractions when an error from another group’s code was not discovered until after publication [43]. As with bench experiments, not everything must be done to the most exacting standards; however, scientists need to be aware of best practices both to improve their own approaches and for reviewing computational work by others.

This paper describes a set of practices that are easy to adopt and have proven effective in many research settings. Our recommendations are based on several decades of collective experience both building scientific software and teaching computing to scientists [1, 65], reports from many other groups [22, 29, 30, 35, 41, 50, 51], guidelines for commercial and open source software development [61, 14], and on empirical studies of scientific computing [4, 31, 59, 57] and software development in general (summarized in [48]). None of these practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can be used for focusing on the underlying scientific questions.

1. Write programs for people, not computers.

Scientists writing software need to write code that both executes correctly and can be easily read and understood by other programmers (especially the author’s future self). If software cannot be easily read and understood it is much more difficult to know that it is actually doing what it is intended to do. To be productive, software developers must therefore take several aspects of human cognition into account: in particular, that human working memory is limited, human pattern matching abilities are finely tuned, and human attention span is short [2, 23, 38, 3, 55].

First, a program should not require its readers to hold more than a handful of facts in memory at once (1.1). Human working memory can hold only a handful of items at a time, where each item is either a single fact or a “chunk” aggregating several facts [2, 23], so programs should limit the total number of items to be remembered to accomplish a task. The primary way to accomplish this is to break programs up into easily understood functions, each of which conducts a single, easily understood, task. This serves to make each piece of the program easier to understand in the same way that breaking up a scientific paper using sections and paragraphs makes it easier to read. For example, a function to calculate the area of a rectangle can be written to take four separate coordinates:

def rect_area(x1, y1, x2, y2):
    ...calculation...

or to take two points:

def rect_area(point1, point2):
    ...calculation...

The latter function is significantly easier for people to read and remember, while the former is likely to lead to errors, not


1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.

Page 21:
Page 22:

• even better: with Markdown.

Education

A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble (1,2)

1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America; 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress. I do not claim that the strategies I outline here are optimal. These are simply the principles and practices that I have developed over 12 years of bioinformatics research, augmented with various suggestions from other researchers with whom I have discussed these issues.

Principles

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. This “someone” could be any of a variety of people: someone who read your published article and wants to try to reproduce your work, a collaborator who wants to understand the details of your experiments, a future student working in your lab who wants to extend your work after you have moved on to a new job, your research advisor, who may be interested in understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.

To see how these two principles are applied in practice, let’s begin by considering the organization of directories and files with respect to a particular project.

File and Directory Organization

When you begin a new project, you will need to decide upon some organizational structure for the relevant directories. It is generally a good idea to store all of the files relevant to one project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final set of experiments may look drastically different from the form you initially designed. This is particularly true under the results directory, where you may not even know in advance what kinds of experiments you will need to perform. If you try to give your directories logical names, you may end up with a very long list of directories with names that, six months from now, you no longer know how to interpret.

Instead, I have found that organizing my data and results directories chronologically makes the most sense. Indeed, with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new subdirectory. Later, when you or someone else wants to know what you did, the chronological structure of your work will be self-evident.

Citation: Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. doi:10.1371/journal.pcbi.1000424

Copyright: 2009 William Stafford Noble. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Below a single experiment directory, the organization of files and directories is logical, and depends upon the structure of your experiment. In many simple experiments, you can keep all of your files in the current directory. If you start creating lots of files, then you should introduce some directory structure to store files of different types. This directory structure will typically be generated automatically from a driver script, as discussed below.

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments that you performed. In addition to describing precisely what you did, the notebook should record your observations, conclusions, and ideas for future work. Particularly when an experiment turns out badly, it is tempting simply to link the final plot or table of results and start a new experiment. Before doing that, it is important to document how you know the experiment failed, since the interpretation of your results may not be obvious to someone else reading your lab notebook.

In addition to the primary text describing your experiments, it is often valuable to transcribe notes from conversations as well as e-mail text into the lab notebook. These types of entries provide a complete picture of the development of the project over time.

In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collaborators to give them status updates on the project.

Note that if you would rather not create your own “home-brew” electronic notebook, several alternatives are available. For example, a variety of commercial software systems have been created to help scientists create and maintain electronic lab notebooks [1–3]. Furthermore, especially in the context of collaborations, storing the lab notebook on a wiki-based system or on a blog site may be appealing.

Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts. doi:10.1371/journal.pcbi.1000424.g001


In each results folder:
• a script: getResults.rb or WHATIDID.txt or MyAnalysis.Rnw
• intermediates
• output

Page 23:

Take notes in Markdown; “compile” to html, pdf, ...

Page 24:

### in R:
library(knitr)
knit("MyFile.Rnw")    # --> creates MyFile.tex

### in shell:
pdflatex MyFile.tex   # --> creates MyFile.pdf

knitr (Sweave)

\documentclass{article}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{url}

\begin{document}

<<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
# this is equivalent to \SweaveOpts{...}
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(replace.assign=TRUE, width=90)
@

\title{A Minimal Demo of knitr}

\author{Yihui Xie}

\maketitle

You can test if \textbf{knitr} works with this minimal demo. OK, let's
get started with some boring random numbers:

<<boring-random, echo=TRUE, cache=TRUE>>=
set.seed(1121)
(x=rnorm(20))
mean(x); var(x)
@

The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
and histograms recorded by the PDF device:

<<boring-plots, cache=TRUE, echo=TRUE>>=
## two plots side by side
par(mar=c(4,4,.1,.1), cex.lab=.95, cex.axis=.9, mgp=c(2,.7,0), tcl=-.3, las=1)
boxplot(x)
hist(x, main='')
@

Do the above chunks work? You should be able to compile the \TeX{} document
and get a PDF file like this one: \url{https://github.com/downloads/yihui/knitr/knitr-minimal.pdf}.
The Rnw source of this document is at \url{https://github.com/yihui/knitr/blob/master/inst/examples/knitr-minimal.Rnw}.

\end{document}

A Minimal Demo of knitr

Yihui Xie

February 26, 2012

You can test if knitr works with this minimal demo. OK, let’s get started with some boring random numbers:

set.seed(1121)

(x <- rnorm(20))

## [1] 0.14496 0.43832 0.15319 1.08494 1.99954 -0.81188 0.16027 0.58589 0.36009

## [10] -0.02531 0.15088 0.11008 1.35968 -0.32699 -0.71638 1.80977 0.50840 -0.52746

## [19] 0.13272 -0.15594

mean(x)

## [1] 0.3217

var(x)

## [1] 0.5715

The first element of x is 0.145. Boring boxplots and histograms recorded by the PDF device:

## two plots side by side (option fig.show='hold')
par(mar = c(4, 4, 0.1, 0.1), cex.lab = 0.95, cex.axis = 0.9,
    mgp = c(2, 0.7, 0), tcl = -0.3, las = 1)

boxplot(x)

hist(x, main = "")

[Figure: boxplot of x (left) and histogram of x with Frequency on the y-axis (right), produced by the chunk above]

Do the above chunks work? You should be able to compile the TeX document and get a PDF file like this one: https://github.com/downloads/yihui/knitr/knitr-minimal.pdf. The Rnw source of this document is at https://github.com/yihui/knitr/blob/master/inst/examples/knitr-minimal.Rnw.


Analyzing & Reporting in a single file.

MyFile.Rnw

Also works with Markdown instead of LaTeX!
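A minimal sketch of the Markdown route, assuming a hypothetical MyFile.Rmd with the same kind of R code chunks (the pandoc step is one common way to convert the result to html or pdf; it is not named on the slide):

### in R:
library(knitr)
knit("MyFile.Rmd")    # --> creates MyFile.md

### in shell:
pandoc MyFile.md -o MyFile.html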

Page 25:

Choosing a programming language

R

Excel

Python

Ruby

Java

Unix command-line (i.e., shell, i.e., bash)

Perl

Javascript

Page 26:

Ruby. “Friends don’t let friends do Perl” - reddit user

### in Perl:
open INFILE, "my_file.txt";
while (defined ($line = <INFILE>)) {
    chomp($line);
    @letters = split(//, $line);
    @reverse_letters = reverse(@letters);
    $reverse_string = join("", @reverse_letters);
    print $reverse_string, "\n";
}

### in Ruby:
File.open("my_file.txt").each do |line|
  puts line.chomp.reverse
end

example: reverse the contents of each line in a file

Page 27:

More Ruby examples.

5.times do
  puts "Hello world"
end

# Sorting people
people_sorted_by_age = people.sort_by { |person| person.age }

Page 28:

Getting help.

• In real life: Make friends with people. Talk to them.

• Online:
  • Specific discussion mailing lists (e.g.: R, Stacks, bioruby, MAKER...)
  • Programming: http://stackoverflow.com
  • Bioinformatics: http://www.biostars.org
  • Sequencing-related: http://seqanswers.com
  • Stats: http://stats.stackexchange.com

Page 29:
Page 30:
Page 31:

• Online reputation is good:
  • forums
  • “citizen science”