Making neurophysiological data analysis reproducible: Why and how?

Matthieu Delescluse, Romain Franconville¹, Sébastien Joucla², Tiffany Lieury, Christophe Pouzat⁎

Laboratoire de physiologie cérébrale, CNRS UMR 8118, UFR biomédicale, Université Paris-Descartes, 45, rue des Saints-Pères, 75006 Paris, France

Journal of Physiology - Paris 106 (2012) 159-170. doi:10.1016/j.jphysparis.2011.09.011

Article history: Available online 4 October 2011

Keywords: Software; R; Emacs; Matlab; Octave; LaTeX; Org-mode; Python

Abstract

Reproducible data analysis is an approach aiming at complementing classical printed scientific articles with everything required to independently reproduce the results they present. "Everything" covers here: the data, the computer code, and a precise description of how the code was applied to the data. A brief history of this approach is presented first, starting with what economists have been calling replication since the early eighties, to end with what is now called reproducible research in computational data analysis oriented fields like statistics and signal processing. Since efficient tools are instrumental for a routine implementation of these approaches, a description of some of the available ones is presented next. A toy example then demonstrates the use of two open source software programs for reproducible data analysis: the "Sweave family" and the org-mode of emacs. The former is bound to R, while the latter can be used with R, Matlab, Python and many more "generalist" data processing programs. Both solutions can be used with the Unix-like, Windows and Mac families of operating systems. It is argued that neuroscientists could communicate their results much more efficiently by adopting the reproducible research paradigm, from their lab books all the way to their articles, theses and books.
© 2011 Elsevier Ltd. All rights reserved.

⁎ Corresponding author. Tel.: +33 0 142863828; fax: +33 0 142863830. E-mail addresses: matthieu.delescluse@polytechnique.org (M. Delescluse), franconviller@janelia.hhmi.org (R. Franconville), sebastien.joucla@parisdescartes.fr (S. Joucla), tiffany.lieury@parisdescartes.fr (T. Lieury), christophe.pouzat@parisdescartes.fr (C. Pouzat).
¹ Present address: Janelia Farm Research Campus, 19700 Helix Drive, Ashburn, VA 20147, USA.
² Present address: Institut de Neurosciences Cognitives et Intégratives d'Aquitaine, CNRS UMR 5287, Université de Bordeaux, Bâtiment B2 Biologie Animale, 4ème étage, Avenue des Facultés, 33405 Talence cedex, France.

1. Introduction

    "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."
    Thoughts of Jon Claerbout distilled by Buckheit and Donoho (1995)

The preparation of manuscripts and reports in neuroscience often involves a lot of data analysis as well as a careful design and realization of figures and tables, in addition to the time spent on the bench doing experiments. The data analysis part can require the setting of parameters by the analyst, and it often leads to the development of dedicated scripts and routines. Before reaching the analysis stage per se, the data frequently undergo a preprocessing which is rarely documented extensively in the methods section of the paper. When the article includes numerical simulations, key elements of the analysis, like the time step used for conductance-based neuronal models, are often omitted from the description. As readers or referees of articles and manuscripts we are therefore often led to ask questions like:

• What would happen to the analysis (or simulation) results if a given parameter had another value?
• What would be the effect of applying my preprocessing to the data instead of the one used by the authors?
• What would a given figure look like with a log scale ordinate instead of the linear scale used by the authors?
• What would be the result of applying that same analysis to my own data set?

We can of course all think of a dozen similar questions. The problem is to find a way to address them. Clearly the classical journal article format cannot do the job: editors cannot publish two versions of each figure to satisfy different readers, and many intricate analysis and modeling methods would require too long a description to fit the usual bounds of the printed paper. This is reasonable, for we all have a lot of different things to do and cannot afford to look systematically at every piece of work as thoroughly as suggested above. Many people (Claerbout and Karrenbach, 1992; Buckheit and Donoho, 1995; Rossini and Leisch, 2003; Baggerly, 2010; Diggle and Zeger, 2010; Stein, 2010) nevertheless feel uncomfortable with the present way of diffusing scientific information as a canonical (printed) journal article. We suggest that what is needed are more systematic and more explicit ways to describe how the analysis (or modeling) was done.

These issues are not specific to published material. Any scientist, after a few years of activity, is very likely to have experienced a situation similar to the one we now sketch. A project ends after intensive work requiring repeated, long and sometimes tricky sequences of daily analysis. Six months or a year later we have to carry out a very similar analysis for a related project; the nightmare scenario then starts, since we forgot:

• The numerical filter settings we used.
• The detection threshold we used.
• The way to get initial guesses for our nonlinear fitting software to converge reliably.

In other words, given enough time, we often struggle to reproduce our own work exactly. The same mechanisms lead to know-how being lost from a laboratory when a student or a postdoc leaves: the few parameters that have to be carefully set for a successful analysis were never documented as such, and there is nowhere to find their typical range. This leads to a significant loss of time which can ultimately culminate in the abandonment of a project.
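One low-cost defense, which the tools discussed below systematize, is to make a commented script, rather than an interactive session, the primary record of an analysis, with every adjustable parameter named and documented. The following R sketch is ours, not the authors'; the data are simulated and all values are hypothetical, but it shows how the settings listed above stop being unrecorded folklore:

    ## A minimal sketch with simulated data: every "magic number" of the
    ## analysis is named and commented, so the script itself records the
    ## know-how that would otherwise live only in the analyst's memory.
    set.seed(20111004)                  # fixed seed: the simulation is repeatable
    trace <- rnorm(10000)               # stand-in for a filtered recording
    trace[c(2000, 5500, 8200)] <- 8     # three artificial "spikes"

    detection.factor <- 4               # detection threshold, in multiples of a
                                        # robust noise estimate (the MAD)
    noise.level <- mad(trace)           # median absolute deviation of the trace
    threshold <- detection.factor * noise.level

    spike.indices <- which(trace > threshold)
    spike.indices                       # should recover the three "spikes"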

We are afraid that similar considerations sound all too familiar to most of our readers. It turns out that the problems described above are not specific to our scientific domain; they seem instead to be rather common, at least in the following fields: economics (Dewald et al., 1986; Anderson and Dewald, 1994; McCullough et al., 2006; McCullough, 2006), geophysics (Claerbout and Karrenbach, 1992; Schwab et al., 2000), signal processing (Vandewalle et al., 2009; Donoho et al., 2009), statistics (Buckheit and Donoho, 1995; Rossini, 2001; Leisch, 2002a), biostatistics (Gentleman and Temple Lang, 2007; Diggle and Zeger, 2010), econometrics (Koenker and Zeileis, 2007), epidemiology (Peng and Dominici, 2008) and climatology (Stein, 2010; McShane and Wyner, 2010), where the debate on analysis reproducibility has reached a particularly acrimonious stage. The good news is that researchers in these fields have already worked out solutions to our mundane problem. The reader should not conclude from the above short list that our community is immune to the reproducibility concern: the computational neuroscience community is also now addressing the problem, by proposing standard ways to describe simulations (Nordlie et al., 2009) as well as by developing software like Sumatra³ to make simulations and analyses reproducible. The whole scientific community is also giving growing publicity to the problem and its solutions, as witnessed by two workshops taking place in 2011: "Verifiable, reproducible research and computational science", a mini-symposium at the SIAM Conference on Computational Science & Engineering in Reno, NV on March 4, 2011⁴, and "Reproducible Research: Tools and Strategies for Scientific Computing", a workshop in association with Applied Mathematics Perspectives 2011, University of British Columbia, July 13-16, 2011⁵, as well as by the Executable Paper Grand Challenge organized by Elsevier.⁶

³ http://neuralensemble.org/trac/sumatra/wiki.
⁴ http://jarrodmillman.com/events/siam2011.html.
⁵ http://www.stodden.net/AMP2011/.
⁶ http://www.executablepapers.com/index.html.

In the next section we review some of the already available tools for reproducible research, which include data sharing and software solutions for mixing code, text and figures.
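To give a first, deliberately minimal taste of such a solution (the file below is our own hypothetical sketch, not the paper's toy example): a Sweave source file is an ordinary LaTeX document containing R "code chunks". Compiling the file re-runs the chunks, so every number and figure of the final document is regenerated from the data instead of being pasted in by hand.

    \documentclass{article}
    \begin{document}
    We detect spikes crossing 4 times the median absolute
    deviation (MAD) of the trace.

    <<detection, fig=TRUE>>=
    set.seed(1)                     # repeatable simulation
    trace <- rnorm(1000)            # stand-in for recorded data
    threshold <- 4 * mad(trace)     # the threshold is documented, not hidden
    plot(trace, type = "l")
    abline(h = threshold, col = "red")
    @

    The threshold used above was \Sexpr{round(threshold, 2)}.
    \end{document}

Running R CMD Sweave on such a file inserts the chunk's code, its output and the generated figure into the resulting LaTeX file; a reader who wants the ordinate on a log scale, or another threshold, edits one line and recompiles.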

    2. Reproducible research tools

    2.1. Sharing research results

Reproducing a published analysis clearly depends, in the general case, on the availability of both the data and the code. It is perhaps worth reminding NIH and NSF grantees, at this stage, of the data sharing policies of these two institutions: "Data sharing should be timely and no later than the acceptance for publication of the main findings from the final data set. Data from large studies can be released in waves as data become available or as they are published" (NIH, 2003); and "Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants" (The National Science Foundation, 2011, Chap. VI, Sec. D 4 b). Publishers like Elsevier (Elsevier, 2011, Sec. "Duties of Authors") and Nature (ESF, 2007, p. 15) also require that authors make their data available upon request, even if that seems to be mere lip service in some cases (McCullough and McKitrick, 2009, pp. 16-17).

In short, many of us already work and publish in a context where we have to share data, and sometimes code, even if we are not aware of it. Th