The Human Variant Database
Mya Warren
Michael Smith Genome Sciences Centre
Vancouver, BC
Bioinformatics is Big Data
• Human genome has
  – 3 billion nucleotide bases
  – 60 thousand genes
  – 10-20 thousand proteins
• Bioinformatics takes advantage of
  – High performance computing
  – Sophisticated algorithms
  – Math/Statistics
  – Machine learning
Our mission
• Two parallel goals:
  – Personalized Oncogenomics Program
    Use patient genomics to diagnose and identify therapies for each patient’s unique disease
  – Cancer research
    Find new patterns in the genomics data to identify novel targets for therapy, learn fundamental truths about cancer
• The database supports these goals through:
  – Fast querying and exploration of patient genomics, clinical covariates
  – Data mining and analysis of patient cohorts
HAWQ (HAdoop With Queries)
A massively parallel processing (MPP) SQL engine in Hadoop
• Interface with the data using PostgreSQL
• Parallel, fault tolerant architecture for storing and processing big data
Our system
• 13 slave nodes
• 32-thread CPUs
• Total memory: 1.5 TB
• Total storage: 250 TB
• Current disk usage: 1.5 TB
• Largest table: ~10 billion rows
HAWQ Architecture
• Hadoop distributed file system (HDFS)
  – Data is chunked, replicated, distributed
• Data locality
  – Move the computation to the data
  – Data is not shared
  – HAWQ is very fast, linear scalability
• Can interface with the rest of the Hadoop ecosystem
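The data-locality idea on this slide (each node works only on its own chunk, and only small summaries are combined) can be illustrated with a minimal sketch. This is plain Python, not HAWQ or HDFS code; the records, the round-robin placement and the function names are hypothetical, and replication is not modeled.

```python
from collections import Counter

# Hypothetical variant records: (patient_id, gene). Not real data.
RECORDS = [
    ("P1", "TP53"), ("P2", "TP53"), ("P3", "KRAS"),
    ("P4", "TP53"), ("P5", "KRAS"), ("P6", "BRAF"),
]

def distribute(records, n_nodes):
    """Assign each record to exactly one node's local chunk (round-robin)."""
    chunks = [[] for _ in range(n_nodes)]
    for i, rec in enumerate(records):
        chunks[i % n_nodes].append(rec)
    return chunks

def local_count(chunk):
    """Runs 'on the node': counts genes using only that node's own rows."""
    return Counter(gene for _, gene in chunk)

def merge(partials):
    """Combine the small per-node summaries; the raw rows never move."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

chunks = distribute(RECORDS, n_nodes=3)
gene_counts = merge(local_count(chunk) for chunk in chunks)
print(dict(gene_counts))
```

Because every node touches only its own chunk, adding nodes shrinks each chunk, which is the intuition behind the linear scalability claimed on the slide.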
HAWQ vs. Relational Databases
• Append-only tables
• No primary keys
• No foreign keys
• Joins are more expensive
• Extract-transform-load (ETL) optimized for large data files
  – Import raw data
  – Transform data in database
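The "import raw data, then transform in the database" pattern might look roughly like the sketch below. It uses Python's built-in sqlite3 purely as a stand-in so that it runs anywhere; it is not HAWQ, and the table names, column names and coordinates are made up for illustration. With no primary keys to reject duplicates, the transform step deduplicates with SELECT DISTINCT, and rows are only ever appended, never updated.

```python
import sqlite3

# sqlite3 is only a stand-in engine; all names and values are hypothetical.
conn = sqlite3.connect(":memory:")

# 1. Import: load the raw file as-is into an all-text staging table.
conn.execute(
    "CREATE TABLE staging_snv (patient TEXT, chrom TEXT, pos TEXT, ref TEXT, alt TEXT)"
)
raw_rows = [
    ("P1", "chr17", "7674220", "c", "t"),
    ("P1", "chr17", "7674220", "c", "t"),   # duplicate line in the raw file
    ("P2", "chr12", "25245350", "C", "A"),
]
conn.executemany("INSERT INTO staging_snv VALUES (?, ?, ?, ?, ?)", raw_rows)

# 2. Transform: type-cast, normalize and deduplicate inside the database,
#    appending the cleaned rows to the final table (no UPDATE, no keys).
conn.execute(
    "CREATE TABLE snv (patient TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)
conn.execute("""
    INSERT INTO snv
    SELECT DISTINCT patient, chrom, CAST(pos AS INTEGER), UPPER(ref), UPPER(alt)
    FROM staging_snv
""")

rows = conn.execute(
    "SELECT patient, chrom, pos, ref, alt FROM snv ORDER BY patient"
).fetchall()
print(rows)
```

Keeping the transform in SQL means the cleaning work runs where the data already lives instead of in a client-side script.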
The Data
• Internally generated data + public cancer datasets (TCGA)
• 11,519 patients
• 21,591 libraries
• 31,067 analyses
• >10 billion rows
Variants
• Raw data for
  – Unpaired/somatic SNVs and indels
  – Germline/somatic CNVs
  – Somatic loss of heterozygosity
  – Gene expression
  – Homozygous deletions
• Post-processed and filtered variant data
Metadata
• Library construction and sequencing
• Analysis pipeline
• Patient data
  – Demographics
  – Biopsy diagnoses
  – Drug treatment
  – Radiation treatment
Annotations
• dbSNP
• COSMIC
• ClinVar
• SnpEff
• Gene models
Coming soon
• Other internal projects
• More external datasets!
• Structural variants, miRNA...
• Disease/Drug ontologies
• Knowledgebase
• More data = better analysis!
Accessing the data
• Custom queries and pipelines
• General purpose REST APIs
  – Python
  – SQLAlchemy Object Relational Model
  – Pyramid REST framework
• Web interface
  – Query
  – Filter
  – Analyze
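The query and filter steps behind such an API could be sketched as a function that turns request parameters into a parameterized SQL statement. This is an assumption-laden illustration using only the standard library: it does not use SQLAlchemy or Pyramid, sqlite3 stands in for the real database, and the table, columns and filter names are hypothetical.

```python
import sqlite3

# Hypothetical filterable columns; a fixed whitelist keeps column names
# out of user control, while values are always bound as parameters.
ALLOWED_FILTERS = {"gene", "patient", "effect"}

def build_query(filters):
    """Turn a dict of request parameters into (sql, params)."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    columns = sorted(filters)
    sql = "SELECT patient, gene, effect FROM variants"
    if columns:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in columns)
    sql += " ORDER BY patient, gene"
    return sql, [filters[col] for col in columns]

# Stand-in database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (patient TEXT, gene TEXT, effect TEXT)")
conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", [
    ("P1", "TP53", "missense"),
    ("P2", "TP53", "nonsense"),
    ("P2", "KRAS", "missense"),
])

sql, params = build_query({"gene": "TP53", "effect": "missense"})
hits = conn.execute(sql, params).fetchall()
print(hits)
```

A REST endpoint would call something like build_query with the request's query-string parameters and return the rows as JSON.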
Query selector
Results
The Future
Let the database do the work!
• Why give up your pipeline?
  – speed
  – flexibility
Tasks that could be done on the variant database
• Annotations
• Filtering
• Statistical analysis and analytics
• Correlations
• Machine Learning
scalable, in-database analytics
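As a small illustration of letting the database do the work, the sketch below asks the engine for a cohort-level summary (genes mutated in at least a given number of patients) so that only the aggregated result leaves the database. sqlite3 is again only a runnable stand-in, and the table, columns, rows and the recurrent_genes helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE somatic_snv (patient TEXT, gene TEXT)")
conn.executemany("INSERT INTO somatic_snv VALUES (?, ?)", [
    ("P1", "TP53"), ("P1", "TP53"),   # two variants in the same patient
    ("P2", "TP53"), ("P3", "TP53"),
    ("P2", "KRAS"), ("P3", "KRAS"),
    ("P3", "BRAF"),
])

def recurrent_genes(connection, min_patients):
    """Filtering and counting both run inside the database engine."""
    return connection.execute("""
        SELECT gene, COUNT(DISTINCT patient) AS n_patients
        FROM somatic_snv
        GROUP BY gene
        HAVING COUNT(DISTINCT patient) >= ?
        ORDER BY n_patients DESC, gene
    """, (min_patients,)).fetchall()

recurrent = recurrent_genes(conn, min_patients=2)
print(recurrent)
```

The client receives a handful of summary rows rather than the full variant table, which matters when the largest table holds billions of rows.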
Thanks!
VariantDB Developers: Marcel Bernard, Joshua Davies, Darryl D’Souza, Navjashan Singh, James Zhou, Simon Chan
PIPE/BioApps/LIMS: Morgan Bye, Karen Eddy, Patrick Plettner
Systems: Hansen Wong, Rudy Zhou, Lance Bailey
Brandon Pierce, Richard Corbett, Eric Chuah, Yussanne Ma