Making neurophysiological data analysis reproducible: Why and how?


Matthieu Delescluse, Romain Franconville¹, Sébastien Joucla², Tiffany Lieury, Christophe Pouzat*

Laboratoire de physiologie cérébrale, CNRS UMR 8118, UFR biomédicale, Université Paris-Descartes, 45, rue des Saints-Pères, 75006 Paris, France

Journal of Physiology - Paris 106 (2012) 159-170. Available online 4 October 2011. doi:10.1016/j.jphysparis.2011.09.011

* Corresponding author. Tel.: +33 0 142863828; fax: +33 0 142863830. E-mail addresses: matthieu.delescluse@polytechnique.org (M. Delescluse), franconviller@janelia.hhmi.org (R. Franconville), sebastien.joucla@parisdescartes.fr (S. Joucla), tiffany.lieury@parisdescartes.fr (T. Lieury), christophe.pouzat@parisdescartes.fr (C. Pouzat).

¹ Present address: Janelia Farm Research Campus, 19700 Helix Drive, Ashburn, VA 20147, USA.

² Present address: Institut de Neurosciences Cognitives et Intégratives d'Aquitaine, CNRS UMR 5287, Université de Bordeaux, Bâtiment B2 Biologie Animale, 4ème étage, Avenue des Facultés, 33405 Talence cedex, France.

Keywords: Software; R; Emacs; Matlab; Octave; LaTeX; Org-mode; Python

Abstract

Reproducible data analysis is an approach aiming at complementing classical printed scientific articles with everything required to independently reproduce the results they present. "Everything" covers here: the data, the computer code and a precise description of how the code was applied to the data. A brief history of this approach is presented first, starting with what economists have been calling replication since the early eighties, to end with what is now called reproducible research in computational data analysis oriented fields like statistics and signal processing. Since efficient tools are instrumental for a routine implementation of these approaches, a description of some of the available ones is presented next. A toy example demonstrates the use of two open source software programs for reproducible data analysis: the "Sweave family" and the org-mode of emacs. The former is bound to R while the latter can be used with R, Matlab, Python and many more "generalist" data processing software. Both solutions can be used with the Unix-like, Windows and Mac families of operating systems. It is argued that neuroscientists could communicate their results much more efficiently by adopting the reproducible research paradigm, from their lab books all the way to their articles, theses and books. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction

"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."

Thoughts of Jon Claerbout distilled by Buckheit and Donoho (1995).

The preparation of manuscripts and reports in neuroscience often involves a lot of data analysis as well as a careful design and realization of figures and tables, in addition to the time spent on the bench doing experiments. The data analysis part can require the setting of parameters by the analyst, and it often leads to the development of dedicated scripts and routines. Before reaching the analysis stage per se, the data frequently undergo a preprocessing which is rarely documented extensively in the methods section of the paper. When the article includes numerical simulations, key analysis and modeling methods would require too long a description to fit the usual bounds of the printed paper. Many intricate elements of the analysis, like the time step used for conductance-based neuronal models, are often omitted from the description. As readers or referees of articles/manuscripts we are therefore often led to ask questions like:

  • What would happen to the analysis (or simulation) results if a given parameter had another value?

  • What would be the effect of applying my preprocessing to the data instead of the one used by the authors?

  • What would a given figure look like with a log scale ordinate instead of the linear scale used by the authors?

  • What would be the result of applying that same analysis to my own data set?

We can of course all think of a dozen similar questions. The problem is to find a way to address them. Clearly the classical journal article format cannot do the job: editors cannot publish two versions of each figure to satisfy different readers. This is reasonable, for we all have a lot of different things to do and we cannot afford to systematically look at every piece of work as thoroughly as suggested above. Many people (Claerbout and Karrenbach, 1992; Buckheit and Donoho, 1995; Rossini and Leisch, 2003; Baggerly, 2010; Diggle and Zeger, 2010; Stein, 2010) nevertheless feel uncomfortable with the present way of diffusing scientific information as a canonical (printed) journal article. We suggest that what is needed are more systematic and more explicit ways to describe how the analysis (or modeling) was done.
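
To make the third question above concrete: when the analysis code travels with the paper, producing the alternative figure reduces to changing a single argument. A minimal R sketch (the data are invented for illustration):

    stim <- 1:5                     # hypothetical stimulus levels
    rate <- c(5, 12, 30, 80, 200)   # hypothetical firing rates (Hz)
    plot(stim, rate)                # the published, linear-scale figure
    plot(stim, rate, log = "y")     # the reader's log-ordinate variant

With only the printed figure, the same comparison requires writing to the authors and hoping for an answer.
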
These issues are not specific to published material. Any scientist after a few years of activity is very likely to have experienced a situation similar to the one we now sketch. A project is ended after an intensive period of work requiring repeated, daily, long sequences of sometimes tricky analysis. After six months or a year we have to carry out a very similar analysis for a related project; but then the nightmare scenario starts, since we have forgotten:

  • The numerical filter settings we used.

  • The detection threshold we used.

  • The way to get the initial guesses for our nonlinear fitting software to converge reliably.

In other words, given enough time, we often struggle to reproduce our own work exactly. The same mechanisms lead to know-how being lost from a laboratory when a student or a postdoc leaves: the few parameters that have to be carefully set for a successful analysis were not documented as such, and there is nowhere to find their typical range. This leads to an important loss of time which can ultimately culminate in the abandonment of a project.
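
A low-tech remedy, which the tools discussed below automate, is to keep every tunable setting as a named and commented value at the top of the analysis script, instead of typing it at the console. A sketch in R (the names and values are ours, purely for illustration):

    ## Analysis parameters, recorded with the code rather than in memory.
    filter.low  <- 300                  # high-pass cutoff of the numerical filter (Hz)
    filter.high <- 5000                 # low-pass cutoff of the numerical filter (Hz)
    threshold   <- 4                    # detection threshold (multiples of noise SD)
    init.guess  <- c(a = 1, tau = 0.5)  # initial guesses for the nonlinear fit

Six months later, the script itself answers the three questions above.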

We are afraid that similar considerations sound all too familiar to most of our readers. It turns out that the problems described above are not specific to our scientific domain; they seem instead to be rather common at least in the following domains: economics (Dewald et al., 1986; Anderson and Dewald, 1994; McCullough et al., 2006; McCullough, 2006), geophysics (Claerbout and Karrenbach, 1992; Schwab et al., 2000), signal processing (Vandewalle et al., 2009; Donoho et al., 2009), statistics (Buckheit and Donoho, 1995; Rossini, 2001; Leisch, 2002a), biostatistics (Gentleman and Temple Lang, 2007; Diggle and Zeger, 2010), econometrics (Koenker and Zeileis, 2007), epidemiology (Peng and Dominici, 2008) and climatology (Stein, 2010; McShane and Wyner, 2010), where the debate on analysis reproducibility has reached a particularly acrimonious stage. The good news is that researchers have already worked out solutions to our mundane problem. The reader should not conclude from the above short list that our community is immune to the reproducibility concern: the computational neuroscience community is also now addressing the problem, by proposing standard ways to describe simulations (Nordlie et al., 2009) as well as by developing software like Sumatra³ to make simulations/analyses reproducible. The whole scientific community is also giving growing publicity to the problem and its solutions, as witnessed by two workshops taking place in 2011: "Verifiable, reproducible research and computational science", a mini-symposium at the SIAM Conference on Computational Science & Engineering in Reno, NV on March 4, 2011⁴; and "Reproducible Research: Tools and Strategies for Scientific Computing", a workshop in association with Applied Mathematics Perspectives 2011, University of British Columbia, July 13-16, 2011⁵; as well as by the Executable Paper Grand Challenge organized by Elsevier.⁶

In the next section we review some of the already available tools for reproducible research, which include data sharing platforms and software solutions for mixing code, text and figures.

    2. Reproducible research tools

    2.1. Sharing research results

Reproducing a published analysis clearly depends, in the general case, on the availability of both data and code. It is perhaps worth reminding NIH and NSF grantees at this stage of the data sharing policies of these two institutions: "Data sharing should be timely and no later than the acceptance for publication of the main findings from the final data set. Data from large studies can be released in waves as data become available or as they are published" (NIH, 2003), and "Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants" (The National Science Foundation, 2011, Chap. VI, Sec. D 4 b). Publishers like Elsevier (Elsevier, 2011, Sec. "Duties of Authors") and Nature (ESF, 2007, p. 15) also require that authors make their data available upon request, even if that seems to be mere lip service in some cases (McCullough and McKitrick, 2009, pp. 16-17).

³ http://neuralensemble.org/trac/sumatra/wiki

⁴ http://jarrodmillman.com/events/siam2011.html

⁵ http://www.stodden.net/AMP2011/

⁶ http://www.executablepapers.com/index.html

In short, many of us already work and publish in a context where we have to share data, and sometimes code, even if we are not aware of it. The data sharing issue is presently actively debated, but the trend set by the different committees and organizations seems clear: data will have to be shared sooner or later. We can expect, or hope, that the same will become true for publicly funded code developments.

    2.2. Early attempts: database deposits

Economists took on the problem of the reproducible analysis of published results, which they usually call replication, very early: "In 1982, the Journal of Money, Credit and Banking, with financial support from the National Science Foundation, embarked upon the JMCB Data Storage and Evaluation Project. As part of the Project, the JMCB adopted an editorial policy of requesting from authors the programs and data used in their articles and making these programs and data available to other researchers on request" (Dewald et al., 1986). The results turned out to be scary, with just 2 out of 54 (3.7%) papers having reproducible/replicable results (Dewald et al., 1986). This high level of non-reproducible results was largely due to authors not respecting the JMCB guidelines and not giving access to both data and code. A more recent study (McCullough et al., 2006) using the 1996-2003 archives of the JMCB found a better, but still small, rate of reproducible results: 14/62 (22%). We do not expect neurobiologists to behave much better in the same context. Policies are obviously not sufficient. This points to the need for dedicated tools making reproducibility as effortless as possible. Besides data sharing platforms, this is a call for reproducible research software solutions. The ideal software should:

  • Provide the complete analysis code along with its documentation (anyone working with data and code knows they are useless if the author did not take the time to document them properly) and its theoretical or practical description.

  • Allow any user/reader to easily re-run and modify the presented analysis.
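
The Sweave family, to which we return below, meets both requirements by keeping the explanatory text and the analysis code in a single source file. A minimal sketch of such a file (the vector trace is hypothetical and the threshold rule is ours):

    \documentclass{article}
    \begin{document}
    We set the detection threshold to five times the median
    absolute deviation of the trace:

    <<detection, fig=TRUE>>=
    threshold <- 5 * mad(trace)     # the documented, re-runnable setting
    plot(trace, type = "l")
    abline(h = threshold, lty = 2)  # mark the threshold on the figure
    @
    \end{document}

Processing this file with R's Sweave function re-executes the chunk and inserts the resulting figure; a reader who doubts the factor of 5 edits one line and recompiles.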

We will first present a brief survey of existing tools matching these criteria, before focusing on the two that appear to us the most promising and the most generally applicable to neuroscience: Sweave and org-mode. We will then illustrate their use with a toy example.

    2.3. Implementations of reproducible analysis approaches

A first comprehensive solution: The Stanford Exploration Project. Claerbout and Karrenbach (1992), geophysicists of the Stanford Exploration Project, state in their abstract: "A revolution in education and technology transfer follows from the marriage of word processing and software command scripts. In this marriage an author attaches to every figure caption a pushbutton or a name tag usable to recalculate the figure from all its data, parameters, and programs. This provides a concrete definition of reproducibility in computationally oriented research. Experience at the Stanford Exploration Project shows that preparing such electronic documents is little effort beyond our customary report writing; mainly, we need to file everything in a systematic way." Going beyond the "database deposit" concept for data and code, they decided to link some of the figures of their publications and reports organically with the data and the code which generated them. They achieved this goal by using TeX (Knuth, 1984b) and LaTeX (Lamport, 1986) to typeset their documents, a flavor of fortran called ratfor (for "rational fortran") and C to perform the computations, and cake, a flavor of make, to repeat automatically the succession of program calls required to reproduce any given figure. Colleagues were then given a CD-ROM with the data, the code, the final document, as well as the other required open source software. They could then reproduce the final document, or change it by altering parameter values, provided they had a computer running UNIX as well as the appropriate C and fortran compilers. The approach was comprehensive in the sense that it organically linked the results of a publication with the code used to generate them. It was nevertheless not very portable, since it required computers running UNIX. The Stanford Exploration Project has since made a considerable effort towards portability with their Madagascar project⁷ (Fomel and Hennenfent, 2007). Madagascar users can work with compiled code (C, C++, fortran) as well as with generalist packages like Matlab or Octave, or with interpreted scripting languages like Python. Using the package on Windows still requires motivation, since users have to install and use cygwin, a Linux-like environment for Windows, and this is, in our experience, a real barrier for users. In addition, although the link is organic, the figures and the code which generated them are not kept in the same file.
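
The "name tag attached to every figure" idea survives in the tools discussed in this paper, with the difference that text, code and figure can now live in a single plain text file. A sketch of the same pattern in org-mode (the file names are invented, and the header arguments vary somewhat across org-mode versions):

    * Figure 1: detected spikes
    #+begin_src R :file fig1.png :results graphics :exports both
      trace <- scan("raw-trace.dat")  # hypothetical raw data file
      threshold <- 5 * mad(trace)     # detection threshold
      plot(trace, type = "l")
      abline(h = threshold, lty = 2)
    #+end_src

Re-evaluating the block from within emacs regenerates fig1.png from the data and the code: Claerbout's pushbutton, in a plain text file.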

2.3.1. WaveLab: A more portable solution

Buckheit and Donoho (1995) and Donoho et al. (2009), statisticians from Stanford University inspired by their colleague Jon Claerbout, next proposed a reproducible research solution based entirely on Matlab in t...
