Big Data Technologies for Storage and Querying of Large ...sites.dundee.ac.uk/go-darts/wp-content/uploads/... · Big Data Technologies for Storage and Querying of Large-Scale Medical

HADOOPECOSYSTEM

APPLICATIONS

SpecialthankstoShonaMa<hewtheNIHRGlobalHealthResearchUnitManagerandHowardRogersfromHICcenterUoD.TheresearchwascommissionedbytheNaIonalInsItuteforHealthResearchusingOfficialDevelopmentAssistance(ODA)funding[INSPIRED16/136/102].Disclaimer:Theviewsexpressedarethoseoftheauthor(s)andnotnecessarilythoseoftheNHS,theNIHRortheDepartmentofHealthandSocialCare.

ACKNOWLEDGEMENT

SCOTLANDDATA:•  ClinicalData:250Kpeople

o  GoDARTSo  GoSHAREo  NHSTaysideandNHSFile

•  GenotypeData13,000peopleX40MillionSNPs

INDIANDATA:•  ClinicalData:400Kpeople•  GenotypeData:25KpeopleUKBIOBANKDATA:•  ClinicalData:490Kpeople•  GenotypeData:

490Kpeopleand93MillionSNPs

INTRODUCTION

CHALLENGES

BigDataTechnologiesforStorageandQueryingofLarge-ScaleMedicalDatainClinicalGeneIcResearch

Gi$uGeorge1,PhilipAppleby1,AlexDoney1,V.Mohan2,RadhaV21SchoolofMedicine,UniversityofDundee,Dundee,UnitedKingdom

2MadrasDiabe=cResearchFounda=on(MDRF),Chennai,India

•  Workinbigdataecosystem

•  Movetoalgorithmsthatcanruninparallel

•  Advancedfileformatsthatareindexedandreadyforparallelism

•  Centralizedstorageforbothphenotypeandgenotypedata

•  Providebestperformancetotheamountofhardwareavailable

SOLUTION

•  IdenIfydifferentquesIonstoaskthedata

•  ExpandtheHadoopclustertobecomemorepowerful

•  ShowinggenomicsignificanceforeachchromosomeinFIGIWAS

•  DynamicUIforclinicians/geneIciststointeractwiththedata.

FUTUREWORK

•  Flattextfileswhicharehardtobreakornaturallynotindexed,makesitstrenuoustomanuallyseparateandparallelize

•  Lackofcentralizedstorageforgenotypeandphenotypedata

•  Singlenodetoolsandprogrammingmethodsthatdoesn’tscale

•  DifficulIesinlearningacrossmulIpleanddeeperphenotypes&genotypes

•  OpensourceframeworkforstoringdataandrunningapplicaIonsinclusters.

•  Horizontalscalability

•  Failureisnormalandexpected

•  DataLocality-Computeshouldmovetothedata

WHYHADOOP?

GOAL

HADOOPDISTRIBUTIONFILESYSTEM(HDFS):•  FilesystemforHadoopframework•  Usescommodityhardware–lowcost•  OpJmizedforMapReduceworkloads-deliver

dataintothecomputeinfrastructureatahugedatarate

•  Supportofhighlyefficientdatatypes–Parquet,ORC

MAPREDUCE:•  ProgrammingparadigmforHadoop•  ConsistofMapperandReducer

Pig:AnengineforexecuIngdataflowsinparallelonHadoop.Hive:ADataWarehouseinfrastructureforHadoopOozie:WorkflowschedulersystemforMapReducejobs.Spark:Hadooponsteroids.RunsupfasterinmemoryandevenfastondiskthanHadoop.HBase:Acolumn-orienteddatabasemanagementsystemthatrunsontopofHDFS.Kama:Buildingreal-ImedatapipelinesandstreamingappsonHadoop.

•  PipelinetoloadthedatatoHDFS

•  EnrichthedatainHDFS

•  MergethePhenotypeandGenotypedataintoasinglefileforfurtheranalysis

•  CreatevisualizaIonUItobridgethegapbetweenthemedicalresearchersandgrowingbigdata

•  LandscapevisualizaIonofManyGeneVariantandManyDisease

•  HeatmaptovisualizethevariaIonsamongdifferentphenotypeoneachchromosome

•  Abilitytozoom,pan,orbitalandturntablerotaIon

•  Provideacutofthelandscapeforuserstovisualizethesignificantregions

FIGIWAS(OngoingWork)

Documents

Big Data Technologies for Storage and Querying of Large ...sites.dundee.ac.uk/go-darts/wp-content/uploads/... · Big Data Technologies for Storage and Querying of Large-Scale Medical