Roswell Park Cancer Institute Data Science Strategic Plan March 2017


v4.20170309


DATA SCIENCE TASK FORCE RECOMMENDATIONS

Restructure IT to include a distinct ‘research’ component
IT restructuring should extend to the top of the IT administration infrastructure, with an individual leader tasked with enabling research at RPCI. It should extend through the support and help desk infrastructure and staff to include individuals identified as 'research-enabled'. The research portion of IT should have its own strategic plan and other operational aspects that are on the same footing as the current administrative/clinical initiative to create a comprehensive strategic plan.

IT Security and the Legal department must develop a 'NONO' policy
The IT Security and Legal departments' response to researcher needs is to deliver the appropriate solution, rather than to ensnare the researcher in complex implementation and policy details. The primary point of contact for the researcher should be the IT help desk; negotiations between security and legal should primarily be 'behind the scenes' and the help desk should be empowered to escalate recalcitrant issues to the highest levels for prompt resolution.

IT infrastructure and support need to enable use of high-throughput and scientific computing (e.g., Linux workstation) resources
There is no alternative to this for modern informatics.

Build a distributed cancer informatics knowledge environment
A scalable ‘data grid’ architecture will provide integrated access to data across the RPCI enterprise, e.g. EHR, Tumor Registry, department-based clinical databases, research databases, specimen metadata, ‘omics data archives. Building this capability facilitates both development of cancer ontology and data sharing required for translational research and medicine.

Implement a five-year educational initiative to establish an informatics-capable community
The components of this initiative are comprehensive: technical training on use of computing resources; development of short courses focusing on informatics use in analysis and comprehension of 'omics data; close ties with relevant UB academic programs; investigator education on opportunities for introducing informatics into their research program and funding efforts; recruitment of informatics-trained post-docs and scientists.


I. Overview

Roswell Park Cancer Institute (RPCI) is a world-class NCI-designated Comprehensive Cancer Center dedicated to understanding, preventing and curing cancer. In addition to delivering high-quality medical care to more people effectively, one of the institute’s core strategic goals is excellence in research. The explosion of data in the area of cancer research as well as the large amount of clinical information collected as part of standard of care has elevated the importance of effectively managing the quality, consistency and handling of research and clinical data. Increasingly, access is needed to large-scale computing and storage devices, together with high-speed networks with global connectivity. Enabling the annotation, integration and sharing of a wide variety of data types and sizes supports higher order analytics and visualization, which positively impacts quality of research and patient care. RPCI must structure and implement cutting-edge data science capability to stay on the forefront of cancer research and treatment.

Data science requirements are diverse. Individual researchers have varying computing and information technology needs, including desktop and laptop computers for individual use; dedicated computers paired with specific instruments; software of varying degrees of sophistication and complexity; infrastructure, network and software to facilitate high-throughput pipelines; high performance computer power; and high bandwidth networks. Increasingly, cancer research and health delivery depend on database technology and analysis of clinical, epidemiologic, pathologic, biological and outcomes data, resulting in the need to capture, manage and analyze very large heterogeneous datasets from diverse, geographically distributed sources.

In fall 2016, Dr. Adekunle Odunsi formed a Data Science Task Force to review existing data science capabilities, identify gaps, and develop a comprehensive data science strategy for RPCI. The Task Force included representatives from research, clinical, ethics, data governance, computing and information technology areas across the institution, and input from groups not represented on the Task Force was solicited. Environmental scans informed the development of the strategic plan by cataloging data repositories of all sizes within the institution, evaluating RPCI policies that impact data science capability, assessing network and computing capacity, and gathering information from community members whose interests and governance impact RPCI’s data science environment. Summary Tables with information from the environmental scans may be found in the Appendices. This document serves as the RPCI data science strategic plan.


II. Vision

RPCI will provide a user friendly, comprehensive data science environment to support and extend cutting edge translational cancer research by establishing an informatics-capable community through educational initiatives, implementing an enabling research IT infrastructure, building data connectivity, and keeping pace with emerging technologies.

III. Strategic Goals

A. Leverage existing assets

RPCI has many data science capabilities and strengths, including:

1. Core Cancer Center Support Grant analytic resources
   a. Biostatistics
   b. Bioinformatics
   c. Clinical Research Services
2. Electronic Health Record (EHR)
3. Electronic Pathology systems
4. Electronic systems to support clinical administration (e.g. scheduling, billing)
5. Information Technology (IT) resources
6. Laboratory Information Management System (LIMS)
7. Clinical Data Network with expertise in data standards, cancer ontologies and data management
8. Cancer Registry with rich longitudinal data
9. Analytic software development (e.g. R Bioconductor, Bioinformatics Resource)
10. Clinical trials data management
11. Biorepositories (DBBR, PRN, Lymphoma, Leukemia, Ovary)
12. NRG and ALLIANCE Cancer Cooperative Groups
13. Partnership with the University at Buffalo’s Center for Computational Resources

While the organization has significant strengths, the current data science environment has inherent barriers that make it difficult to make the best use of these assets. Generally, data are siloed, difficult to extract and semantically heterogeneous. The approach of adding more vendor and/or internally-developed technical solutions ad hoc may exacerbate the problem; increasing complexity tends to result in siloed, uncoordinated data and capabilities and makes data integration and use across platforms more difficult. To facilitate strong data science, RPCI needs a financially sustainable solution (IT research infrastructure, technology, standards, processes, methods, and leadership) that seamlessly knits together these established high quality data assets and any new components that will be added in the future, breaking down silos, supporting interoperability and data access across systems and domains, and enabling RPCI to achieve its goals.

From a data science perspective, the institutional goals include:

• Integration of patient reported, clinical and research findings to enhance patient engagement, treatment and outcomes
• Secure and timely access to clinical data for researchers
• Access to and integration of different kinds of data on a patient or population to enable high level analytics
• Ability to manage, share and analyze high throughput, high volume and big data
• Integration of derivative data with biorepository and clinical information
• Strong data science technical infrastructure and expertise to facilitate the award of federal, foundation and industrial partnerships and grant dollars
• Enhanced ability to attract strong informatics, basic science, and clinical researchers to RPCI

A research IT infrastructure is the underpinning of a highly functional data science capability and critical for meeting the above aims. RPCI faces the challenge of managing increasing data volume and complexity. A significant capital investment in data capture and various technologies to manage that information has resulted in the emergence of data silos and isolated systems. Failing to make the necessary investment in a strong data science solution in the next few years will leave RPCI increasingly challenged to handle data productively and profitably and may preclude it from reaching its goals.

B. Develop an informatics-capable infrastructure and community

Developing high level data science capability requires, at its core, an enabling IT and software infrastructure that is financially sustainable, scalable, platform agnostic, data format agnostic, minimally disruptive to existing workflows, and inclusive of the institution’s significant investment in technology (meaning it must not require the discard or replacement of existing software systems that support established workflows). It also requires an institutional culture that empowers researchers with data science education, enabling policies, cross-departmental support for rapid and reasonable problem resolution, and research-appropriate risk tolerance. We are recommending both IT reorganization and culture shift.


IV. Strategic Initiatives: Actionable Steps

This Task Force recommends the support of several initiatives to drive this strategy.

A. Reorganize RPCI IT to include a distinct ‘research’ component.

Feedback from faculty and staff consistently points to an IT infrastructure that cripples research activities with overly risk-averse policies, a dearth of channels for timely resolution of research IT issues, inadequate storage space, inflexibility with research-specific software installation on machines, and inadequate infrastructure (e.g. workstations and data sharing-enabling infrastructure). Additionally, we noted inadequate onsite research IT support. Existing resources may be leveraged to develop a ‘research’ IT network; however, this network must be an IT arm with full leadership, decision-making authority and support. Support must include both help desk staff who are well educated in resolving research issues and effective procedures for resolving issues. Although the ‘research’ IT network will not handle clinical data exchanged in the delivery of health care, clinical and billing data may move onto this network in the context of research; therefore, privacy and/or de-identification procedures will be carefully addressed, and close collaboration with RPCI IT is critical. Importantly, a segregated network would allow investigators to have administrative rights to their own personal computers, enabling them to install software packages necessary for their research.

Making full use of a ‘research’ IT network requires researchers and IT staff to be capable of using scientific computing resources. To this end we recommend the development of an ambitious educational program focusing on informatics use in management, analysis and comprehension of diverse biomedical data, e.g. ‘omics and other high throughput data. The program would include technical training on use of computing resources, short informatics courses, and investigator education on opportunities for introducing informatics into their research program and funding efforts.

B. Construct a world-class data science technical infrastructure

RPCI needs an effective data science technical infrastructure where scientific data, clinical and billing data, study-specific data and biospecimen metadata may be captured, assessed and shared at an institutional level. To create a transparent, grid-type architecture, we recommend focusing on initially addressing six major informatics goals:

1. Define an information model for describing the RPCI research space (the focus of the DSCO) and promoting the data standards across RPCI
2. Define an information model for describing the RPCI clinical space (may be acquired from or developed in collaboration with the clinical informatics team)
3. Enable all components of the technical infrastructure to be distributed
4. Provide user friendly software interfaces for capture, discovery and authorized access of data resources across RPCI (interfaces that are complex may detract from best use of information)
5. Provide a secure transfer and distribution infrastructure to meet United States Federal and RPCI regulations for data sharing of health information
6. Provide an integrated portal environment for access to the distributed RPCI data and resources
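
Goals 3, 4 and 6 in the list above can be illustrated with a minimal sketch, assuming a portal that indexes resources which remain at their home systems and filters discovery by the role of the requester. The class names, the role labels and the access rules are hypothetical; they are not part of this plan and are not an Apache OODT API. The three resource names are taken from Appendix C.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataResource:
    """Catalog entry for a dataset that stays at its home system."""
    name: str
    home_system: str          # e.g. "LIMS" or "REDCap"
    allowed_roles: frozenset  # hypothetical role labels permitted to see the entry


@dataclass
class Catalog:
    """Hypothetical portal-side index over distributed resources."""
    resources: list = field(default_factory=list)

    def register(self, resource: DataResource) -> None:
        self.resources.append(resource)

    def discover(self, keyword: str, role: str) -> list:
        """Return names of matching resources that the given role may access."""
        keyword = keyword.lower()
        return [
            r.name
            for r in self.resources
            if keyword in r.name.lower() and role in r.allowed_roles
        ]


catalog = Catalog()
catalog.register(DataResource(
    "Ovarian Biobank", "LIMS", frozenset({"biobank_staff", "investigator"})))
catalog.register(DataResource(
    "Ovarian Familial Cancer Registry", "LIMS", frozenset({"biobank_staff"})))
catalog.register(DataResource(
    "Head and Neck Cancer Database", "REDCap", frozenset({"investigator"})))

# An investigator searching for "ovarian" sees only the entry open to that role.
print(catalog.discover("ovarian", role="investigator"))
```

The data never leaves its home system in this sketch; only descriptive entries are indexed, which is one way to read the "distributed" requirement in goal 3.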

The RPCI data science architecture, intended to leverage informatics technology to enable research and development of new approaches to scientific discovery, will take a very pragmatic approach by deconstructing the process of cancer research into a set of functions and providing a layered system with applications constructed on top of the infrastructure to enable RPCI research efforts to be an integrated network. The applications represent critical functions that are performed by the research community. Furthermore, by integrating existing and developed applications into an enterprise system, RPCI will provide the capability for managing the information assets at an institute-wide level. The architecture therefore is decomposed into a set of projects that make up the informatics portfolio. The projects may be implemented across the institution by a diverse, collaborative “informatics team” from multiple groups across RPCI. The projects include:

1. Development of Common Data Elements (CDEs) to explicitly capture and manage data attributes in a consistent manner
2. Development of a cancer ontology, organizing the CDEs into a set of objects and relationships that represents the information space of RPCI cancer research that will leverage and align as closely as possible with national and industry standards
3. Documentation of the EMR data model to facilitate use of EMR data
4. Development of an overarching integrated system for accessing and sharing information including biospecimen, epidemiological, biomarker, clinical, pathologic, ontologic, outcome, billing and study information (e.g. ‘omics data) from cancer research
5. Development of an information system resource for capture and management of complex biomedical research data. This includes building a team with expertise in systems architecture, database theory, relational database and web technology, and process engineering to collaborate with investigators whose projects are too complex for a simple tool like REDCap. This team will triage projects for appropriate informatics support and participate in grant writing to ensure appropriate resourcing of new research and strong technical grantsmanship. RPCI’s Clinical Data Network is well positioned to grow to fill this need.
6. Development of an infrastructure for capturing and warehousing results when centralized data storage is beneficial. These results may include the raw and processed data from RPCI research studies. The infrastructure may provide a common software component, “Catalog and Archive Service,” that can be configured to capture information across very different studies.
7. Development of capabilities in text and data mining and Natural Language Processing (NLP) by RPCI faculty and external collaborators, implementation of which will be facilitated by the data science architecture.
8. Development of a web enabled RPCI Data Science Portal for sharing resources and results with the research community (within and outside of RPCI).

We recommend specification and development of an Apache OODT-based research and analytics platform to enable 1) institute-wide data capture, organization, sharing, integration, archive, and dissemination; 2) integration of existing data repositories into a center-wide capability; and 3) collaboration among researchers and research groups. This proven overarching open source infrastructure allows both integration of existing programs and data resources and incorporation of new technologies and data repositories as they are developed or acquired. This data science capability brings information together and stages it for analytic and visualization operations; it is the integrating “middle piece” that makes possible the use of the increasing number of emerging analytical and data classification tools. Apache OODT is an Apache Software Foundation top-level project that has a strong community and features high quality interoperable tools. To avoid “reinventing the wheel,” we will explore the successful data science environments at several high-functioning academic institutions, e.g. Fred Hutchinson Cancer Center, Memorial Sloan Kettering Cancer Center, Dana Farber Cancer Institute/Harvard, and Vanderbilt, to model the positives and learn from the negatives.

C. Establish Data Science Working Groups to develop and execute detailed plans

We recommend the establishment of the four working groups described in the table below. The goal of these Working Groups is to support the strategic plan with development of recommendations and specific action items in the following critical areas:


Working Group: Data Science Research Network & High Performance Computing (DSRN)
Objectives:
1. Document requirements for a distinct ‘research’ IT division of RPCI IT
2. Estimate one-time and ongoing costs to be incurred
3. Develop implementation plan
4. Develop elements of educational program (communicate possibilities and access); engage RPCI education resources
5. Define RPCI needs and system requirements for high performance computing (HPC)
6. Identify and coordinate resources and efforts for HPC program across RPCI
7. Direct the development of user-friendly access to high performance computing, including the support needed to make HPC accessible to researchers
Membership: Martin Morgan (Lead), Chris Darlak, Song Liu, Alan Hutson, Kevin Eng, Sandra Gollnick, Scott Gould, Jianmin Wang

Working Group: Data Science Policies & Risk Management (DSPRM)
Objectives:
1. Amend or develop enabling, sustainable and enforceable data science policies
2. Engineer processes for approving, enforcing and updating policies
3. Develop pipeline for resolution of research informatics problems
4. Consult with Data Science Working Groups to assess risk of newly developing processes and technology
5. Facilitate data science initiatives by researching guidelines and requirements, consulting with appropriate RPCI resources and authorities, and meeting security and ethical requirements
Membership: Camille Wicher (Lead), Alan Hutson, Everett Weiss, Chris Darlak, Ami Coleman, Laurie Musial

Working Group: Data Science Cancer Ontology (DSCO)
Objectives:
1. Develop an institution-wide standard in the form of ontology for data and metadata to facilitate data consistency, sharing and integration
2. Develop Common Data Elements (CDEs) from the ontology
3. Establish a timeline and phases for ontology development
4. Promote the data standards across RPCI
Membership: William Duncan (Lead), Everett Weiss, Josh Killion, Chris Darlak

Working Group: Data Science Biospecimen (DSB)
Objectives:
1. Address development of metadata standards (as part of cancer ontology)
2. Explore centralization of standards and banking and propose recommendations, in collaboration with the 5 specimen banks: Data Bank and Biorepository Shared Resource (DBBR), PRN, Lymphoma, Leukemia and Ovary
3. Make recommendations to RPCI leadership and stakeholders
Membership: Kirsten Moysich (Lead), Sandra Gollnick, Barbara Foster, Brahm Segal
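
The DSCO objective of developing Common Data Elements can be made concrete with a small sketch. This is an illustrative example only, not a proposed standard: the CommonDataElement class, the specimen_bank element and the invalid_fields helper are hypothetical names, and the permissible values are simply the five specimen banks named in the DSB objectives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """One CDE: a named attribute with a fixed set of permissible values."""
    name: str
    definition: str
    permissible_values: frozenset

    def validate(self, value: object) -> bool:
        """True only when the value is one of the permissible values."""
        return value in self.permissible_values


# Hypothetical CDE built from the five specimen banks named in this plan.
SPECIMEN_BANK = CommonDataElement(
    name="specimen_bank",
    definition="Biorepository that holds the specimen",
    permissible_values=frozenset({"DBBR", "PRN", "Lymphoma", "Leukemia", "Ovary"}),
)


def invalid_fields(record: dict, cdes: list) -> list:
    """Return names of CDEs whose value in the record is missing or not permitted."""
    return [c.name for c in cdes if not c.validate(record.get(c.name))]


# A record that uses a non-standard spelling is flagged for correction.
print(invalid_fields({"specimen_bank": "ovary "}, [SPECIMEN_BANK]))
```

Checking records against a shared element definition at capture time is what lets data from separate banks be combined later without manual recoding.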


D. Reorganization

Successful implementation of this strategy will require reorganization in RPCI Information Technology as well as the CDN and other technical teams. Research IT and Data Science will dovetail, with the research network and research database system building shifting to the research data science teams. Strong collaboration between IT and Data Science is critical to success. This Task Force will develop a proposal for a new administrative structure that will advance proposed data science initiatives; the proposal will be presented to senior leadership for consideration.

E. Oversight and Guidance

The development of a data science program will be strengthened by the guidance of a Scientific and Technical Advisory Board (STAB) comprised of cancer area domain experts as well as ethics, data governance and clinical leadership. The establishment of the STAB both adds an outside voice to the data science priorities and broadens the outreach of the implemented data science capabilities. The specific goals of the STAB include guiding the Task Force on new and emerging scientific and clinical needs, institutional changes impacting (or impacted by) data science, and existence or development of other synergistic programs both within and external to RPCI. Importantly, the STAB will help to prevent instances of parallel siloed activities. The Task Force will look to the STAB for support resolving issues and overcoming roadblocks. In Year 1 the STAB will meet with the Task Force bi-annually; the meeting schedule after Year 1 will be determined as initial progress is assessed.

The Data Science Task Force will invite these faculty and staff to the STAB:

Christine Ambrosone
Stephen Edge
Andrei Gudkov
Kerry Kerlin
James Mohler
Adekunle Odunsi

The figure below describes the lines of accountability and guidance among the groups specifying and implementing this data science strategy. Dotted lines indicate key collaborations.


V. Business

The proposed organizational changes, informational modeling, and architecture and software development will require capital investment. The Task Force will continue to meet to identify costs (informed by the Working Groups), opportunities for revenue, and a process for assessing resources required to support and integrate new research projects, technology and data resources. One outcome of this effort will be enhancement of our ability to compete for larger grants and potentially more commercialization opportunities.

Pilot studies for research platform

An initial phase will develop the data science program and architecture to support two research projects that will be strengthened by the capabilities proposed in this strategy. Leveraging funds from these projects, in conjunction with RPCI resources, will enable rapid organization and implementation of first-phase data science capabilities. Early success will demonstrate the need for, and benefit of, improved data science.

Photodynamic Therapy (PDT) Registry
Principal Investigators Drs. Sandra Gollnick and Mary Reid

An existing partnership between RPCI and Concordia Laboratories Inc. will be expanded by the funding of a new comprehensive registry. The new registry will collect data on patients who have received PDT for lung cancer or esophageal cancer, with expansion to head and neck cancer, mesothelioma, pancreatic carcinoma, brain cancer and cholangiocarcinoma. This multi-center registry can be built within the new data science environment (network and architecture), strengthened by the opportunity for both central archive and dynamic integration of data stored at distributed sources, an interactive portal featuring both public and protected spaces, and clear data and metadata standards. The architecture will allow for integration of technologies already developed, e.g. a legacy system through which physicians share case information.

Ovarian SPORE and P50 Supplement
Principal Investigator Dr. Adekunle Odunsi

University of Pittsburgh Cancer Institute (UPCI) and RPCI have received the NCI’s Specialized Program of Research Excellence (SPORE) for Ovarian Cancer. The focus of the SPORE is to reduce the morbidity and mortality of ovarian cancer through innovative translational research and development of new immunotherapy approaches for treatment of, and risk assessment for, ovarian cancer. A component of a P50 supplement to the Ovarian SPORE addresses the establishment of a technical infrastructure to facilitate catalog, archive and integration of trial and other data across the SPORE, all based on the development of an information standard. The data science needs of this supplement are fulfilled with the architecture proposed in this strategic plan. Funds from this project may be contributed for the initial phase of data science infrastructure implementation.

VI. Timeline

Feedback from research faculty expressing strong need for the infrastructure, technology and services proposed in this strategy plan motivates initiation of activities as soon as possible. Administrative/leadership approval of this Strategic Plan will trigger the development of a formal 3–5 year project plan outlined in the following high-level timeline.

Group    Milestone    Q1    Q2    Q3    Q4

Data Science Working Groups

    Establish membership and initial meeting    X
    Develop goals, project plan, anticipated challenges, timeline, budgetary needs    X X
    Review plans with Data Science Task Force    X
    Confirm required resources    X X
    Implement project plan    X X X

Data Science Technical Architecture (Apache OODT)
    Develop objective statement, project plan, anticipated challenges, timeline, budgetary needs    X
    Confirm required resources    X X
    Implement project plan    X X X

Review/Interview data science at other institutions
    Make contact and organize video/teleconferences    X
    Gather information    X X

Data Science Task Force
    Meet, discuss plans and needs with Working Groups    X X X X
    Develop full budget    X X
    Meet with STAB    X X X X

Scientific and Technical Advisory Board (STAB)
    Invite members and organize quarterly meetings    X
    Meet, discuss progress and challenges; STAB provides feedback and guidance    X X X X


This data science strategy recommends a “best of” approach that leverages the strong expertise, technology and methods already in place at RPCI as the basis for a world-class institution-wide data science capability. Implementing an independent ‘research’ IT division, developing educational programs and building a scalable overarching architecture to securely tie existing technology, resources and data repositories together allows for growth and the addition of data types and technologies yet to be developed. We will enable more translational and collaborative science and provide an informatics environment required for attracting top-notch faculty to RPCI. Our goal is, through building a vibrant, capable research computing community, to become a world leader in data science for cancer research.

Data Science Task Force Participants

Kristen Anton
Chris Darlak
Sandy Gollnick
Alan Hutson
Song Liu
James Mohler
Martin Morgan
Kirsten Moysich
Laurie Musial
Anurag Singh
Everett Weiss
Camille Wicher


Appendix A: Policy Evaluation

The table below contains highlights from an existing summary of RPCI institutional policies that intersect with data science and research and notes some missing policies. The task force subcommittee reviewed policies in the 900 Information Technology series, 1100 General Scientific Research series, and 1200 General Education Staff series and extracted the ten policies below as examples of policies that significantly impact research. A large volume of policies is difficult to understand, update and enforce. The Task Force proposes the establishment of a Data Science Policies and Risk Management Working Group (DSPRM) to amend or develop enabling data science and research policies. These policies will be reasonable in number, accessible and enforceable. Effective change management and dissemination methods will be developed. DSPRM will confer with RPCI IT, data governance, ethics and administration in the course of policy development.

RPCI Policy #    Summary of policy focus and comments

905.1     The NO PHI/subject line requirement for email usage in terms of sending attachments is not an official policy
907.1     Officially, RPCI has three networks: Workforce Network (primary network used by anyone performing work in support of the mission of RPCI); Patient/Visitor/Vendor Network; and Demilitarized Zone Network (for IT approved external access to special RPCI servers and services)
921.1     IT has chosen to follow the industry best practices construct called “The Principle of Least Privileges” (restricts a researcher’s administrative rights to his/her machines)
925.1     All databases created for RPCI or HRI that contain PHI and/or PII data must be registered with the IT Department
937.1     The information security organization follows the HiTrust Information Security Framework for its governance, policies, procedures, and controls. IT Security shall have a distinct budget line. Budget planning and expenditures shall be approved via the committees noted previously and be based on risk.
1107.1    Clinical Data Network director approves the Honest Broker applications (There are many faculty and staff at RPCI serving as de facto honest brokers who are not on the officially designated list)
1208.1    The retention of data policy: may be an issue in terms of storage requirements
          No policies exist pertaining to our relationship with the University at Buffalo in terms of CCR and/or utilization of the UB internet at RPCI, e.g. downstream affiliation agreements, etc.
          No policies exist relative to LINUX-based issues
          Documents that are deemed searchable on i2 and that are supposed to be updated yearly as noted in the policy statements, such as the “IT Security Risk Management Plan”, are not currently found online


Appendix B. Research Computing Evaluation

Strengths, Weaknesses, Opportunities, Threats Analysis of: RPCI Research Computing Infrastructure

Strengths:

In-House Computing Cluster – RPCI fully owns a 1600 Core cluster High Performance Computing system comprised of 97 compute nodes and 2 head nodes, 100TB of fast production storage and 114TB of slower storage. The cluster, purchased in 2012 with funding from western New York regional economic developmental council, has been used primarily to support a number of sequencing projects at Center for Personalized Medicine and Genomics Shared Resource.

IT Technology Talent – During the past several years IT has acquired a small pool of IT talent capable of supporting this type of research computing environment.

Weaknesses:

Desktop & Infrastructure Support – Desktop & IT Infrastructure support of the research computing environment is currently provided via the corporate IT Service Desk & Infrastructure teams, which also support business and clinical operations (mainly Windows). Overall this support structure does not allow for deep-level support within a research computing environment (ex: Mac and Linux), as the Service Desk and Infrastructure teams are spread across the entire enterprise with minimal opportunity to specialize in some of the uniqueness of the research computing environment at RPCI.

Computing Cluster Utilization – Investigators at RPCI are not utilizing the cluster to its full computational potential in their research efforts, mainly due to 1) the lack of training/education on how to run their analysis within this environment, 2) the limitation of storage space in the era of big data (currently the cluster reaches ~90% storage quota every 3 months), and 3) the current desktop limitations of being primarily a “Windows” shop at RPCI.

HIPAA Covered Entity - Because all of the RPCI operation is considered a “covered entity” under HIPAA (including research computing), the research computing environment must comply with HIPAA requirements to protect the privacy and security of health information, thus making the IT environment less flexible to work under compared to other academic medical environments that have separate legal entities, such as UPMC (Medical Center vs. University) & Fred Hutch (Research Center vs. Seattle Cancer Care Alliance).

Opportunities:

Upgrading RPCI Bioinformatics Capabilities – Many of the ‘wet lab’ components of the RPCI research operation are highly productive and creative in their endeavors. Complementing those efforts with easier to use Bioinformatics resources & technology would upgrade the research efforts at RPCI overall, allowing researchers to focus on their research, and not the mechanics of ‘how to’ store and analyze their research data within the RPCI environment. An upgraded bioinformatics environment would include enhanced training, tools, technology and focused IT support, additionally making the cluster publicly available to the whole RPCI research community.

Investigator Trust – Many NCI funded grants have a small human subject component; however, RPCI lacks the trust that many of our basic scientists indeed do not work with protected health information (PHI) in any capacity, and requires them to work in the same IT environment as the clinical workforce, having appropriate IT security measures to protect regulated data. By recognizing that a significant amount of our scientific community does not handle regulated data within their research efforts, RPCI could trust scientific workforce members to perform their research in a significantly more flexible (less secure) IT environment, similar to their research colleagues at many other academic institutions.

UB COE - The NYS Center of Excellence in Bioinformatics and Life Sciences (CBLS) is a hub for life sciences innovation and technology-based economic development driving scientific discovery, facilitating collaboration among academia, industry and the public sector to create jobs that directly impact the region’s and state’s economies. Given the proximity of the UB COE & RPCI, an enhanced (more formal) relationship between the two entities may be advantageous for Roswell Park Cancer Institute.

Threats:

Use of Big Data - The ability to analyze “Big Data” is a crucial ability within today’s cancer research space. A panel set up by President Obama’s Moonshot program suggests cancer cures could lie within known big data, stating we must make better use of our existing research data. The current state of RPCI’s Research Computing Infrastructure could put the institute at a disadvantage in this regard compared to other organizations better suited to process & analyze various types of big data.

Appendix C. Data Repository Evaluation

The table below contains a sample of the data repositories in place at RPCI. The repositories range from small to large, may be organized in database or spreadsheet format, may run on the RPCI servers or on individual machines, and are organized by individual researchers or IT Research Computing/CDN staff. Our evaluation has allowed us only to estimate the number and kinds of data repositories at RPCI. The number of databases being generated at RPCI is growing steadily.

Database Name                                                     Supporting Software
Photodynamic Therapy (PDT) Database                               Access
EHR - Electronic Health Record (SCM)                              Allscripts EMR
EHR Analytics Database                                            Allscripts EMR
Laboratory Medicine Resulting -- Cerner                           Cerner - Oracle
Psychology Datasets                                               Excel
Pediatric Long Term Follow-up Database                            Excel
Clinical Genetics Database                                        Excel
Melanoma Lymph Node Metastasis Database                           Excel
NCCN Non-Hodgkin's Lymphoma Outcomes Database                     Excel and CD discs
NCCN Breast Cancer Outcomes Database                              Excel on CDN share drive and CD discs
NCCN Colorectal Outcomes Database                                 Excel on CDN share drive and CD discs
NCCN Non-Small Cell Lung Cancer Outcomes Database                 Excel on CDN share drive and CD discs
Bladder (Robot-assisted Radical Cystectomy) Database              EXPeRT -- Oracle and web
Prostate Cancer Database                                          EXPeRT -- Oracle and web
Ovarian SPORE P1                                                  EXPeRT -- Oracle and web
Stacey Scott Lung Cancer Registry                                 EXPeRT -- Oracle and web
Pancreas Database                                                 EXPeRT -- Oracle and web / REDCap
Clinical Trials Data Management System                            EXPeRT -- Oracle and Web
HemOnc Biobank Database                                           LIMS - Oracle
Ovarian Biobank                                                   LIMS - Oracle
Ovarian Familial Cancer Registry                                  LIMS - Oracle
DBBR biobank                                                      LIMS - Oracle
DBBR Annotation Data Set (DADS)                                   LIMS - Oracle
PRN Biobank Database                                              LIMS - Oracle
Invision Registration and Billing System                          Mainframe system
Cancer Registry                                                   Metriq
Gamma Knife Database                                              MS Access
TIES                                                              mySQL
Anatomic Pathology System (PowerPath)                             Oracle
Radiation Medicine Clinical Care Database                         REDCap
Head and Neck Cancer Database                                     REDCap
Breast Screening and High Risk Management Database                REDCap

Renal Tumor (Kidney) Database                                     REDCap
Center for Immunotherapy - Clinical Database (ovarian database)   REDCap
Autofluorescence Screening (High Risk Oral) Database              REDCap
Tobacco Cessation Database                                        REDCap
Melanoma Database                                                 REDCap
New York State Sepsis                                             REDCap
Ovarian SPORE P4                                                  REDCap
GU Medicine: Kidney                                               REDCap
GU Medicine: Urothelial                                           REDCap
GU: Medicine: Prostate                                            REDCap
H&N Melanoma                                                      REDCap
Liver Surgical                                                    REDCap
Esophageal Database                                               REDCap
Renal Tumor Patients on Active Surveillance                       REDCap
Spine Surgical                                                    REDCap
NYSTEM - multiple databases                                       REDCap
Pediatric Database                                                REDCap
Quality of Life Surveys                                           REDCap
Survivorship                                                      REDCap
HIPEC                                                             REDCap
Financial billing database - Pinpoint                             SQL
HemOnc Clinical Database                                          SQL and Access
BMT Database                                                      SQL back - Access front end
Breast Program Database                                           SQL back - Access front end
General Thoracic Surgery Database                                 SQL back web front