tracykteal
Computing Workflows for Biologists
Based on:
Shade & Teal, Computing Workflows for Biologists: A Roadmap, PLOS Biology
Data Carpentry data organization lessons
• How many people here plan to analyze data with a computer in their work?
• Are you working with other people on this analysis?
• Do other people need to understand your analysis?
• Do you need to remember and understand your analysis?
Elements of computing
• How data was generated (metadata)
• Data
• Data cleaning steps
• Data analysis steps
• Final plots and charts
Data!
• Keep raw data raw
• Use meaningful names
• Organize your data so computers can read it
Keep raw data raw
• What is raw data?
• Why should I leave it alone?
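One way to leave raw data alone in practice is to record a checksum of each raw file and then make the file read-only. The sketch below is only an illustration of that idea, not a method from the slides; the function names are hypothetical and it uses only the Python standard library.

```python
import hashlib
import os
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def protect_raw_file(path: Path) -> str:
    """Record a checksum, then remove write permission from the raw file."""
    digest = sha256_of(path)
    # Read-only for owner, group and others; analysis scripts can still read it.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return digest


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_digest
```

Storing the returned digest in your notes lets you confirm later, with `is_unchanged`, that the file you are analyzing is still the file you collected.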
Use meaningful names
Organize your data so computers can read it
(let's talk about spreadsheets)
http://www.datacarpentry.org/spreadsheet-ecology-lesson/00-intro.html
…also avoid formatting errors
Organizing data in spreadsheets
The cardinal rules of using spreadsheet programs for data:
• Put all your variables in columns - the thing you're measuring, like 'weight' or 'temperature'.
• Put each observation in its own row.
• Don't combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that's the only way you'll want to be able to use or sort that data.
• Leave the raw data raw - don't mess with it!
• Export the cleaned data to a text-based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
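The rules above can be sketched in a few lines of Python. The rows, column names and the "species-sex" code format below are made up for illustration; the point is only that a combined cell is split so each column holds one variable, and the cleaned table is written as CSV.

```python
import csv
import io

# Hypothetical messy rows: one cell packs species and sex together.
messy_rows = [
    {"plot": "1", "species_sex": "DM-M", "weight_g": "40"},
    {"plot": "2", "species_sex": "DO-F", "weight_g": "48"},
]


def tidy(rows):
    """Split the combined cell so each column holds a single variable."""
    tidy_rows = []
    for row in rows:
        species, sex = row["species_sex"].split("-")
        tidy_rows.append(
            {
                "plot": row["plot"],
                "species": species,
                "sex": sex,
                "weight_g": row["weight_g"],
            }
        )
    return tidy_rows


def to_csv_text(rows):
    """Serialize the cleaned rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["plot", "species", "sex", "weight_g"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv_text(tidy(messy_rows)))
```

Note that the messy rows are never modified; the cleaned version is a new table, which keeps the raw data raw.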
Formatting problems
http://www.datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes.html
A Roadmap for the Computing Biologist
• Consider the overarching goals of the analysis
• Adopt an Iterative, Branching Pattern to Systematically Explore Options
• Reproducibility Checkpoints
• Taking Notes for Computational Analysis
• Shared Responsibility: The Team Approach to Reproducibility and Data Management
Shade and Teal, Computing Workflows for Biologists: A Roadmap
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002303
Consider the Overarching Goals of the Analysis
• Working to address a given hypothesis will motivate different analysis strategies than conducting data exploration
Reproducibility Checkpoints
Reproducibility checkpoints are places in a workflow devoted to scrutinizing its integrity:
- the workflow (or step in the workflow) can be seamlessly used (it doesn't crash halfway or return error messages)
- the outcomes are consistent and validated across multiple, identical iterations
- results should make biological sense
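A checkpoint of this kind could look like the sketch below: run the same step several times with identical inputs and confirm that every run gives the same result, then apply a simple plausibility check. The `analysis_step` function is a hypothetical stand-in for a real workflow step, and the plausibility rule is an assumption chosen for the example.

```python
import hashlib
import json
import random


def analysis_step(values, seed=42):
    """Hypothetical workflow step: mean of a bootstrap resample, fixed seed."""
    rng = random.Random(seed)
    resample = [rng.choice(values) for _ in values]
    return {"n": len(resample), "mean": sum(resample) / len(resample)}


def fingerprint(result):
    """Hash a JSON rendering of the result so runs can be compared exactly."""
    text = json.dumps(result, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_consistent(step, values, runs=3):
    """Run the step several times; True if every run gives the same result."""
    fingerprints = {fingerprint(step(values)) for _ in range(runs)}
    return len(fingerprints) == 1


def is_plausible(result, values):
    """Sanity check: a mean must lie within the range of the input values."""
    return min(values) <= result["mean"] <= max(values)
```

A step that depends on an unseeded random number generator would fail `is_consistent`, which is exactly the kind of problem a checkpoint is meant to surface early.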
Adopt an Iterative, Branching Pattern to Systematically Explore Options
Taking Notes for Computational Analysis
• Take notes like you would for experimental work
• Comment code
• Use version control (GitHub/GitLab)
What needs to go in notes:
- Software versions used
- Description of what the software is doing / goal of that step
- Brief notes on deviations from default options
- Workflows can include different software (e.g., PANDAseq to QIIME to R), and should also include all "formatting steps" needed to move between tools (hopefully you don't need to manually format too much; avoid if possible)
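Part of this note-taking can be automated. The sketch below builds a plain-text notes entry with the goal of a step, the Python version, the versions of named packages, and any deviations from defaults. The function names and the layout of the entry are assumptions made for illustration; tools outside Python (such as PANDAseq or QIIME) would still need their versions recorded by hand or from their own version commands.

```python
import platform
import sys
from datetime import date
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def notes_entry(step_goal, packages, deviations=""):
    """Build a plain-text notes entry for one workflow step."""
    lines = [
        f"Date: {date.today().isoformat()}",
        f"Goal of this step: {step_goal}",
        f"Python: {sys.version.split()[0]} on {platform.system()}",
    ]
    # One line per package so version changes are easy to spot in a diff.
    lines += [f"{name}: {package_version(name)}" for name in packages]
    lines.append(f"Deviations from defaults: {deviations or 'none'}")
    return "\n".join(lines)
```

Appending the returned text to a notes file kept under version control, next to the code, keeps the record and the analysis together.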
Shared Responsibility: The Team Approach to Reproducibility and Data Management
We posit that integrity in computational analysis of biological data is enhanced if there is a sense of shared responsibility for ensuring reproducible workflows. Research teams that work together to develop and debug code, perform internal reproducibility checkpoints for each other, and generally hold one another accountable for high-quality results likely will enjoy a low manuscript retraction rate, high level of confidence in their results, and strong sense of collaboration.
You, your labmates and PI need to value the time it takes to do analyses reproducibly and correctly
Shared responsibility
• Shared storage and workspace can facilitate access to all group data
• Using version control repositories or shared folders can provide access to code and documentation (GitHub, Dropbox)
• Setting expectations for 'reproducibility checkpoints' (team "hackathons": open-computer group meetings dedicated to analysis)
• Paper reviews
• Looking for help/support outside the lab (bioinformatics or user groups, office hours, Stack Overflow)
Looking for help
https://github.com/mblmicdiv/course2016/blob/master/bioinfo-resources.md
You are not alone
Survey responses
Exercise
http://tinyurl.com/mbl-workflows