Big Data Technologies for Storage and Querying of Large-Scale Medical Data in Clinical Genetic Research

Gittu George¹, Philip Appleby¹, Alex Doney¹, V. Mohan², Radha V²
¹ School of Medicine, University of Dundee, Dundee, United Kingdom
² Madras Diabetic Research Foundation (MDRF), Chennai, India

INTRODUCTION

SCOTLAND DATA:
•  Clinical data: 250K people
   o  GoDARTS
   o  GoSHARE
   o  NHS Tayside and NHS Fife
•  Genotype data: 13,000 people × 40 million SNPs

INDIAN DATA:
•  Clinical data: 400K people
•  Genotype data: 25K people

UK BIOBANK DATA:
•  Clinical data: 490K people
•  Genotype data: 490K people and 93 million SNPs

CHALLENGES

•  Flat text files are hard to split and are not naturally indexed, which makes it laborious to separate them manually and process them in parallel
•  Lack of centralized storage for genotype and phenotype data
•  Single-node tools and programming methods that don't scale
•  Difficulties in learning across multiple and deeper phenotypes and genotypes

SOLUTION

•  Work in a big data ecosystem
•  Move to algorithms that can run in parallel
•  Use advanced file formats that are indexed and ready for parallelism
•  Provide centralized storage for both phenotype and genotype data
•  Get the best performance from the hardware available

WHY HADOOP?

•  Open-source framework for storing data and running applications on clusters
•  Horizontal scalability
•  Failure is normal and expected
•  Data locality: compute should move to the data

HADOOP ECOSYSTEM

HADOOP DISTRIBUTED FILE SYSTEM (HDFS):
•  File system for the Hadoop framework
•  Uses commodity hardware, so cost is low
•  Optimized for MapReduce workloads: delivers data to the compute infrastructure at a very high rate
•  Supports highly efficient file formats such as Parquet and ORC

MAPREDUCE:
•  Programming paradigm for Hadoop
•  Consists of a Mapper and a Reducer

•  Pig: An engine for executing data flows in parallel on Hadoop.
•  Hive: A data warehouse infrastructure for Hadoop.
•  Oozie: A workflow scheduler system for MapReduce jobs.
•  Spark: "Hadoop on steroids"; runs faster than Hadoop MapReduce in memory and is also faster on disk.
•  HBase: A column-oriented database management system that runs on top of HDFS.
•  Kafka: Builds real-time data pipelines and streaming apps on Hadoop.

GOAL

•  Build a pipeline to load the data into HDFS
•  Enrich the data in HDFS
•  Merge the phenotype and genotype data into a single file for further analysis
•  Create a visualization UI to bridge the gap between medical researchers and the growing volume of big data

APPLICATIONS

FIGIWAS (Ongoing Work):
•  Landscape visualization of many gene variants against many diseases
•  Heatmap to visualize the variations among different phenotypes on each chromosome
•  Ability to zoom, pan, and rotate (orbital and turntable rotation)
•  Provides a cut through the landscape so users can view the significant regions

FUTURE WORK

•  Identify different questions to ask of the data
•  Expand the Hadoop cluster to make it more powerful
•  Show genomic significance for each chromosome in FIGIWAS
•  Build a dynamic UI for clinicians and geneticists to interact with the data

ACKNOWLEDGEMENT

Special thanks to Shona Matthew, the NIHR Global Health Research Unit Manager, and Howard Rogers from the HIC centre, UoD. The research was commissioned by the National Institute for Health Research using Official Development Assistance (ODA) funding [INSPIRED 16/136/102].

Disclaimer: The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
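To make the Mapper and Reducer split listed under MAPREDUCE more concrete, the following is a minimal single-process sketch of the paradigm, not the pipeline used in this work. It assumes a hypothetical tab-separated genotype record layout (sample ID, chromosome, SNP ID, genotype) and counts records per chromosome; the function names `mapper`, `shuffle`, `reducer` and `run_job` are illustrative only. On a real cluster the framework performs the shuffle step and runs many mappers and reducers in parallel, close to the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed (hypothetical) record layout, one record per line:
#   sample_id <TAB> chromosome <TAB> snp_id <TAB> genotype


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (chromosome, 1) for each well-formed record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:  # silently skip malformed lines
        yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key; Hadoop does this between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts emitted for one chromosome."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "s1\t1\trs1\tAA",
        "s1\t2\trs2\tAG",
        "s2\t1\trs1\tAG",
    ]
    print(run_job(sample))
```

Because each mapper call depends only on its own line, the input can be split across machines without coordination, which is the property that flat, unindexed text files make hard to exploit by hand.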

