The Human Variant Database
Mya Warren
Michael Smith Genome Sciences Centre
Vancouver, BC

Bioinformatics is Big Data

• Human genome has
  – 3 billion nucleotide bases
  – 60 thousand genes
  – 10-20 thousand proteins

• Bioinformatics takes advantage of
  – High performance computing
  – Sophisticated algorithms
  – Math/Statistics
  – Machine learning

Our mission

• Two parallel goals:
  – Personalized Oncogenomics Program
    Use patient genomics to diagnose and identify therapies for each patient’s unique disease
  – Cancer research
    Find new patterns in the genomics data to identify novel targets for therapy, learn fundamental truths about cancer

• The database supports these goals through:
  – Fast querying and exploration of patient genomics, clinical covariates
  – Data mining and analysis of patient cohorts

HAWQ (HAdoop With Queries)

A massively parallel processing (MPP) SQL engine in Hadoop

• Interface with the data using PostgreSQL
• Parallel, fault tolerant architecture for storing and processing big data

Our system

• 13 slave nodes
• 32 thread CPUs
• Total memory: 1.5 TB
• Total storage: 250 TB
• Current disk usage: 1.5 TB
• Largest table: ~10 billion rows

HAWQ Architecture

• Hadoop distributed file system (HDFS)
  – Data is chunked, replicated, distributed

• Data locality
  – Move the computation to the data
  – Data is not shared
  – HAWQ is very fast, linear scalability

• Can interface with the rest of the Hadoop ecosystem
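
The "chunked, replicated, distributed" and "move the computation to the data" ideas can be illustrated with a toy sketch in plain Python. This is not how HDFS or HAWQ actually place blocks or schedule work: the round-robin placement, the node names and the function names below are all assumptions made for illustration only.

```python
def chunked(rows, chunk_size):
    """Split a list of rows into fixed-size chunks (the last may be shorter)."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def place_chunks(n_chunks, nodes, replication=3):
    """Assign every chunk to `replication` distinct nodes, round-robin.

    A simplified stand-in for a real block placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(n_chunks)
    }


def distributed_count(rows, nodes, chunk_size, predicate, replication=3):
    """Count matching rows by letting a node that holds each chunk scan it.

    Only the small per-node partial counts are combined at the end; the
    chunks themselves are never moved between nodes.
    """
    chunks = chunked(rows, chunk_size)
    placement = place_chunks(len(chunks), nodes, replication)
    partial = {node: 0 for node in nodes}
    for chunk_id, chunk in enumerate(chunks):
        worker = placement[chunk_id][0]  # first replica holder does the scan
        partial[worker] += sum(1 for row in chunk if predicate(row))
    return sum(partial.values()), partial
```

Because each node works only on chunks it already holds, adding nodes spreads the scanning work across more workers, which is the intuition behind the scalability claim on the slide.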

HAWQ vs. Relational Databases

• Append-only tables
• No primary keys
• No foreign keys
• Joins are more expensive
• Extract-transform-load (ETL) optimized for large data files
  – Import raw data
  – Transform data in database
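
The "import raw data, then transform in the database" pattern can be sketched as follows. This is a minimal illustration rather than the presenters' pipeline: it uses Python's standard-library sqlite3 module as a stand-in for HAWQ, and the table names (raw_snv, snv), columns and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw rows, every field still text as it would be in a flat file:
# (library_id, chromosome, position, ref allele, alt allele, quality)
RAW_ROWS = [
    ("lib1", "chr1", "12345", "A", "T", "50.0"),
    ("lib1", "chr2", "67890", "G", "C", "12.5"),
    ("lib2", "chr1", "12345", "A", "T", "99.0"),
]


def load_and_transform(conn, raw_rows, min_quality=20.0):
    """Load raw rows untouched, then cast and filter them inside the database."""
    cur = conn.cursor()
    # Step 1: import the raw data as-is into an untyped staging table.
    cur.execute("CREATE TABLE raw_snv (library_id, chrom, pos, ref, alt, qual)")
    cur.executemany("INSERT INTO raw_snv VALUES (?, ?, ?, ?, ?, ?)", raw_rows)
    # Step 2: transform in the database with one set-based statement that
    # appends to a typed table; no row is ever updated in place.
    cur.execute(
        "CREATE TABLE snv (library_id TEXT, chrom TEXT, pos INTEGER, "
        "ref TEXT, alt TEXT, qual REAL)"
    )
    cur.execute(
        """
        INSERT INTO snv
        SELECT library_id, chrom, CAST(pos AS INTEGER), ref, alt,
               CAST(qual AS REAL)
        FROM raw_snv
        WHERE CAST(qual AS REAL) >= ?
        """,
        (min_quality,),
    )
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM snv").fetchone()[0]
```

Writing a new table from a SELECT, instead of updating rows, fits the append-only constraint listed above.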

The Data

• Internally generated data + public cancer datasets (TCGA)
• 11,519 patients
• 21,591 libraries
• 31,067 analyses
• >10 billion rows

Variants

• Raw data for
  – Unpaired/somatic SNVs and Indels
  – Germline/somatic CNVs
  – Somatic loss of heterozygosity
  – Gene expression
  – Homozygous deletions

• Post-processed and filtered variant data

Metadata

• Library construction and sequencing
• Analysis pipeline
• Patient data
  – Demographics
  – Biopsy diagnoses
  – Drug treatment
  – Radiation treatment

Annotations

• dbSNP
• COSMIC
• ClinVar
• SnpEff
• Gene models

Coming soon

• Other internal projects
• More external datasets!
• Structural variants, miRNA...
• Disease/Drug ontologies
• Knowledgebase
• More data = better analysis!

Accessing the data

• Custom queries and pipelines
• General purpose REST APIs
  – Python
  – SQLAlchemy Object Relational Model
  – Pyramid REST framework

• Web interface
  – Query
  – Filter
  – Analyze
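
The query-and-filter layer listed above might look something like this sketch. It does not use SQLAlchemy or Pyramid and is not the project's API: it relies only on the standard-library sqlite3 module, and the variant table, its columns, the filter whitelist and the demo rows (with made-up positions) are all hypothetical.

```python
import sqlite3

# Hypothetical whitelist of columns that callers may filter on.
ALLOWED_FILTERS = {"patient_id", "gene", "chrom"}


def build_variant_query(filters):
    """Translate a dict of equality filters into parameterized SQL."""
    clauses, params = [], []
    for column in sorted(filters):
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {column}")
        # Column names come from the whitelist; values are bound as parameters.
        clauses.append(f"{column} = ?")
        params.append(filters[column])
    sql = "SELECT patient_id, gene, chrom, pos FROM variant"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY patient_id, pos", params


def query_variants(conn, **filters):
    """Run a filtered query, as a REST endpoint handler might."""
    sql, params = build_variant_query(filters)
    return conn.execute(sql, params).fetchall()


def make_demo_db():
    """Build a tiny in-memory table with made-up rows for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant (patient_id TEXT, gene TEXT, chrom TEXT, pos INTEGER)"
    )
    conn.executemany(
        "INSERT INTO variant VALUES (?, ?, ?, ?)",
        [
            ("p1", "TP53", "chr17", 100),
            ("p2", "TP53", "chr17", 200),
            ("p2", "KRAS", "chr12", 300),
        ],
    )
    return conn
```

For example, `query_variants(make_demo_db(), gene="TP53")` returns the two TP53 rows. Restricting filters to a whitelist keeps caller input out of the SQL text, with values passed only as bound parameters.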

Query selector

Results

The Future

Let the database do the work!

• Why give up your pipeline?
  – speed
  – flexibility

Tasks that could be done on the variant database

• Annotations
• Filtering
• Statistical analysis and analytics
• Correlations
• Machine Learning
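
As one hedged example of letting the database do filtering and simple statistics, the sketch below counts how many distinct patients carry a variant in each gene, using a single aggregate query instead of pulling rows into a pipeline. It uses the standard-library sqlite3 module as a stand-in for HAWQ; the variant table, its columns and the demo rows are hypothetical.

```python
import sqlite3


def recurrently_mutated_genes(conn, min_patients=2):
    """Return (gene, n_patients) for genes mutated in at least `min_patients`.

    Grouping, de-duplication and thresholding all happen inside the
    database; only the small summary comes back to Python.
    """
    return conn.execute(
        """
        SELECT gene, COUNT(DISTINCT patient_id) AS n_patients
        FROM variant
        GROUP BY gene
        HAVING COUNT(DISTINCT patient_id) >= ?
        ORDER BY n_patients DESC, gene
        """,
        (min_patients,),
    ).fetchall()


def make_demo_db():
    """Build a tiny in-memory cohort with made-up rows for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE variant (patient_id TEXT, gene TEXT)")
    conn.executemany(
        "INSERT INTO variant VALUES (?, ?)",
        [
            ("p1", "TP53"), ("p1", "TP53"),  # two variants, same patient
            ("p2", "TP53"), ("p3", "TP53"),
            ("p1", "KRAS"), ("p2", "KRAS"),
            ("p3", "BRAF"),
        ],
    )
    return conn
```

On a table with billions of rows, the same idea means the scan runs where the data lives and only per-gene counts cross the network.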

Scalable, in-database analytics

Thanks!

VariantDB Developers: Marcel Bernard, Joshua Davies, Darryl D’Souza, Navjashan Singh, James Zhou, Simon Chan

PIPE/BioApps/LIMS: Morgan Bye, Karen Eddy, Patrick Plettner

Systems: Hansen Wong, Rudy Zhou, Lance Bailey

Brandon Pierce, Richard Corbett, Eric Chuah, Yussanne Ma