Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
HADOOPECOSYSTEM
APPLICATIONS
SpecialthankstoShonaMa<hewtheNIHRGlobalHealthResearchUnitManagerandHowardRogersfromHICcenterUoD.TheresearchwascommissionedbytheNaIonalInsItuteforHealthResearchusingOfficialDevelopmentAssistance(ODA)funding[INSPIRED16/136/102].Disclaimer:Theviewsexpressedarethoseoftheauthor(s)andnotnecessarilythoseoftheNHS,theNIHRortheDepartmentofHealthandSocialCare.
ACKNOWLEDGEMENT
SCOTLANDDATA:• ClinicalData:250Kpeople
o GoDARTSo GoSHAREo NHSTaysideandNHSFile
• GenotypeData13,000peopleX40MillionSNPs
INDIANDATA:• ClinicalData:400Kpeople• GenotypeData:25KpeopleUKBIOBANKDATA:• ClinicalData:490Kpeople• GenotypeData:
490Kpeopleand93MillionSNPs
INTRODUCTION
CHALLENGES
BigDataTechnologiesforStorageandQueryingofLarge-ScaleMedicalDatainClinicalGeneIcResearch
Gi$uGeorge1,PhilipAppleby1,AlexDoney1,V.Mohan2,RadhaV21SchoolofMedicine,UniversityofDundee,Dundee,UnitedKingdom
2MadrasDiabe=cResearchFounda=on(MDRF),Chennai,India
• Workinbigdataecosystem
• Movetoalgorithmsthatcanruninparallel
• Advancedfileformatsthatareindexedandreadyforparallelism
• Centralizedstorageforbothphenotypeandgenotypedata
• Providebestperformancetotheamountofhardwareavailable
SOLUTION
• IdenIfydifferentquesIonstoaskthedata
• ExpandtheHadoopclustertobecomemorepowerful
• ShowinggenomicsignificanceforeachchromosomeinFIGIWAS
• DynamicUIforclinicians/geneIciststointeractwiththedata.
FUTUREWORK
• Flattextfileswhicharehardtobreakornaturallynotindexed,makesitstrenuoustomanuallyseparateandparallelize
• Lackofcentralizedstorageforgenotypeandphenotypedata
• Singlenodetoolsandprogrammingmethodsthatdoesn’tscale
• DifficulIesinlearningacrossmulIpleanddeeperphenotypes&genotypes
• OpensourceframeworkforstoringdataandrunningapplicaIonsinclusters.
• Horizontalscalability
• Failureisnormalandexpected
• DataLocality-Computeshouldmovetothedata
WHYHADOOP?
GOAL
HADOOPDISTRIBUTIONFILESYSTEM(HDFS):• FilesystemforHadoopframework• Usescommodityhardware–lowcost• OpJmizedforMapReduceworkloads-deliver
dataintothecomputeinfrastructureatahugedatarate
• Supportofhighlyefficientdatatypes–Parquet,ORC
MAPREDUCE:• ProgrammingparadigmforHadoop• ConsistofMapperandReducer
Pig:AnengineforexecuIngdataflowsinparallelonHadoop.Hive:ADataWarehouseinfrastructureforHadoopOozie:WorkflowschedulersystemforMapReducejobs.Spark:Hadooponsteroids.RunsupfasterinmemoryandevenfastondiskthanHadoop.HBase:Acolumn-orienteddatabasemanagementsystemthatrunsontopofHDFS.Kama:Buildingreal-ImedatapipelinesandstreamingappsonHadoop.
• PipelinetoloadthedatatoHDFS
• EnrichthedatainHDFS
• MergethePhenotypeandGenotypedataintoasinglefileforfurtheranalysis
• CreatevisualizaIonUItobridgethegapbetweenthemedicalresearchersandgrowingbigdata
• LandscapevisualizaIonofManyGeneVariantandManyDisease
• HeatmaptovisualizethevariaIonsamongdifferentphenotypeoneachchromosome
• Abilitytozoom,pan,orbitalandturntablerotaIon
• Provideacutofthelandscapeforuserstovisualizethesignificantregions
FIGIWAS(OngoingWork)