The Human Variant Database

Mya Warren
Michael Smith Genome Sciences Centre
Vancouver, BC

Bioinformatics is Big Data

•  Human genome has
– 3 billion nucleotide bases
– 60 thousand genes
– 10-20 thousand proteins

•  Bioinformatics takes advantage of
– High performance computing
– Sophisticated algorithms
– Math/Statistics
– Machine learning

Our mission

•  Two parallel goals:
– Personalized Oncogenomics Program: use patient genomics to diagnose and identify therapies for each patient’s unique disease
– Cancer research: find new patterns in the genomics data to identify novel targets for therapy, learn fundamental truths about cancer

•  The database supports these goals through:
– Fast querying and exploration of patient genomics, clinical covariates
– Data mining and analysis of patient cohorts

HAWQ (HAdoop With Queries)

A massively parallel processing (MPP) SQL engine in Hadoop

•  Interface with the data using PostgreSQL
•  Parallel, fault tolerant architecture for storing and processing big data

Our system

•  13 slave nodes
•  32-thread CPUs
•  Total memory: 1.5 TB
•  Total storage: 250 TB
•  Current disk usage: 1.5 TB
•  Largest table: ~10 billion rows

HAWQ Architecture

•  Hadoop distributed file system (HDFS)
– Data is chunked, replicated, distributed

•  Data locality
– Move the computation to the data
– Data is not shared
– HAWQ is very fast, linear scalability

•  Can interface with the rest of the Hadoop ecosystem
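
The chunk, replicate, distribute idea and the "move the computation to the data" principle can be illustrated with a small Python sketch. This is a toy model, not HAWQ or HDFS code: the function names, the round-robin placement, the block size and the replication factor of 3 are all assumptions made for illustration (real HDFS placement is rack-aware).

```python
from collections import defaultdict


def place_blocks(num_rows, block_size, nodes, replication=3):
    """Split row indices into blocks and assign each block to `replication` nodes.

    Hypothetical round-robin placement, chosen only to keep the sketch short.
    """
    placement = {}
    block_id = 0
    for start in range(0, num_rows, block_size):
        rows = range(start, min(start + block_size, num_rows))
        owners = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        placement[block_id] = (rows, owners)
        block_id += 1
    return placement


def count_local(placement, predicate):
    """Count matching rows block by block on the first owner of each block.

    Only the small per-node partial counts are combined at the end,
    which is the point of data locality: rows never leave their node.
    """
    partial = defaultdict(int)
    for rows, owners in placement.values():
        partial[owners[0]] += sum(1 for r in rows if predicate(r))
    return sum(partial.values()), dict(partial)


# 13 nodes as on the "Our system" slide; row count and block size are made up.
nodes = [f"node{i}" for i in range(13)]
placement = place_blocks(num_rows=1000, block_size=100, nodes=nodes)
total, per_node = count_local(placement, lambda r: r % 2 == 0)
```

Because each node works only on its own blocks, adding nodes adds both storage and compute, which is one way to read the "linear scalability" claim.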

HAWQ vs. Relational Databases

•  Append-only tables
•  No primary keys
•  No foreign keys
•  Joins are more expensive
•  Extract-transform-load (ETL) optimized for large data files
– Import raw data
– Transform data in database
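
The "import raw data, then transform in the database" pattern might look roughly like the sketch below. It uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere; HAWQ itself is accessed through a PostgreSQL interface. The table names (`raw_snv`, `snv`), the columns and the three rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a HAWQ connection

# Step 1: import raw data unchanged into a staging table (everything as text).
conn.execute(
    "CREATE TABLE raw_snv (library TEXT, chrom TEXT, pos TEXT, ref TEXT, alt TEXT)"
)
raw_rows = [  # made-up example rows
    ("LIB1", "chr1", "12345", "A", "G"),
    ("LIB1", "chr2", "500", "C", "T"),
    ("LIB2", "chr1", "12345", "A", "G"),
]
conn.executemany("INSERT INTO raw_snv VALUES (?, ?, ?, ?, ?)", raw_rows)

# Step 2: transform inside the database. Types are cast and chromosome names
# normalised with one INSERT ... SELECT, appending to the target table rather
# than updating rows in place (which suits append-only tables).
conn.execute(
    "CREATE TABLE snv (library TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
)
conn.execute(
    """
    INSERT INTO snv
    SELECT library, REPLACE(chrom, 'chr', ''), CAST(pos AS INTEGER), ref, alt
    FROM raw_snv
    """
)

rows = conn.execute(
    "SELECT library, chrom, pos FROM snv ORDER BY library, chrom"
).fetchall()
```

With no primary or foreign keys, nothing in the schema enforces uniqueness or referential integrity, so checks of that kind would have to live in the transform step.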

The Data

•  Internally generated data + public cancer datasets (TCGA)
•  11,519 patients
•  21,591 libraries
•  31,067 analyses
•  >10 billion rows

Variants

•  Raw data for
– Unpaired/somatic SNVs and Indels
– Germline/somatic CNVs
– Somatic loss of heterozygosity
– Gene expression
– Homozygous deletions

•  Post-processed and filtered variant data

Metadata

•  Library construction and sequencing
•  Analysis pipeline
•  Patient data
– Demographics
– Biopsy diagnoses
– Drug treatment
– Radiation treatment

Annotations

•  dbSNP
•  COSMIC
•  ClinVar
•  SnpEff
•  Gene models

Coming soon

•  Other internal projects
•  More external datasets!
•  Structural variants, miRNA...
•  Disease/Drug ontologies
•  Knowledgebase
•  More data = better analysis!

Accessing the data

•  Custom queries and pipelines
•  General purpose REST APIs
– Python
– SQLAlchemy Object Relational Model
– Pyramid REST framework
•  Web interface
– Query
– Filter
– Analyze
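
As a rough illustration of how a general purpose REST layer could turn request filters into database queries, the sketch below maps a URL query string onto a parameterised SELECT. It is not the project's API: it uses only the Python standard library (`sqlite3` in place of HAWQ, no SQLAlchemy or Pyramid), and the `variant` table, the filter whitelist, the `build_query` helper and the sample rows are all hypothetical.

```python
import sqlite3
from urllib.parse import parse_qs

ALLOWED_FILTERS = {"gene", "chrom", "library"}  # hypothetical column whitelist


def build_query(query_string):
    """Translate e.g. 'gene=GENE_A&chrom=1' into (sql, values).

    Column names come only from the whitelist and values are bound as
    parameters, so user input is never spliced into the SQL text.
    """
    params = parse_qs(query_string)
    clauses, values = [], []
    for key in sorted(params):
        if key not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {key}")
        clauses.append(f"{key} = ?")
        values.append(params[key][0])
    sql = "SELECT library, gene, chrom, pos FROM variant"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variant (library TEXT, gene TEXT, chrom TEXT, pos INTEGER)")
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?, ?)",
    [  # made-up example rows
        ("LIB1", "GENE_A", "1", 100),
        ("LIB2", "GENE_B", "2", 200),
        ("LIB3", "GENE_A", "1", 300),
    ],
)

sql, values = build_query("gene=GENE_A&chrom=1")
hits = conn.execute(sql + " ORDER BY library", values).fetchall()
```

A web interface offering Query, Filter and Analyze steps could sit on top of the same kind of endpoint, with the filter widgets producing the query string.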

Query selector

Results

The Future

Let the database do the work!

•  Why give up your pipeline?
– speed
– flexibility

Tasks that could be done on the variant database

•  Annotations
•  Filtering
•  Statistical analysis and analytics
•  Correlations
•  Machine Learning

Scalable, in-database analytics
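
To make "let the database do the work" concrete, here is a minimal sketch in which annotation, filtering and a simple summary statistic are expressed as one SQL statement, so only the summary leaves the database. Python's `sqlite3` again stands in for HAWQ, and the schema (`variant`, `annotation`, the `known` flag) and the data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a HAWQ connection
conn.executescript(
    """
    CREATE TABLE variant (patient TEXT, chrom TEXT, pos INTEGER, alt TEXT);
    CREATE TABLE annotation (chrom TEXT, pos INTEGER, alt TEXT, gene TEXT, known INTEGER);

    -- made-up example rows
    INSERT INTO variant VALUES
        ('P1', '1', 100, 'G'), ('P2', '1', 100, 'G'),
        ('P2', '2', 200, 'T'), ('P3', '2', 300, 'A');
    INSERT INTO annotation VALUES
        ('1', 100, 'G', 'GENE_A', 0),
        ('2', 200, 'T', 'GENE_B', 1),
        ('2', 300, 'A', 'GENE_B', 0);
    """
)

# Annotate (JOIN), filter (WHERE) and summarise (GROUP BY) in the database:
# the number of patients per gene carrying a variant not flagged as known.
summary = conn.execute(
    """
    SELECT a.gene, COUNT(DISTINCT v.patient) AS n_patients
    FROM variant AS v
    JOIN annotation AS a
      ON a.chrom = v.chrom AND a.pos = v.pos AND a.alt = v.alt
    WHERE a.known = 0
    GROUP BY a.gene
    ORDER BY a.gene
    """
).fetchall()
```

The alternative, pulling every variant row into a pipeline script and joining there, moves far more data; on a table with billions of rows that difference is the argument for in-database analytics.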

Thanks!

VariantDB Developers: Marcel Bernard, Joshua Davies, Darryl D’Souza, Navjashan Singh, James Zhou, Simon Chan

PIPE/BioApps/LIMS: Morgan Bye, Karen Eddy, Patrick Plettner

Systems: Hansen Wong, Rudy Zhou, Lance Bailey

Brandon Pierce, Richard Corbett, Eric Chuah, Yussanne Ma