
Bioinformatics for the 100,000 Genomes Project

Augusto Rendón, augusto.rendon@genomicsengland.co.uk
Director of Bioinformatics | Genomics England
Principal Research Associate | University of Cambridge

Barcelona, 2016-11-02

Outline

• Introduction to the UK’s 100,000 Genomes Project
• Analyses in rare diseases
• Analyses in cancer
• Bioinformatics platform
• Data models and flows
• Databases
• Interpretation

Inception of the 100,000 Genomes Project (2012, 2014)

“If we get this right, we could transform how we diagnose and treat our most complex diseases not only here but across the world” (December 2012)

“I am determined to do all I can to support the health and scientific sector to unlock the power of DNA, turning an important scientific breakthrough into something that will help deliver better tests, better drugs and above all better care for patients.” (August 2014)

• Sequence 100,000 genomes
• Cancer and rare genetic disease
• Capture data delivered electronically, store it securely and analyse it within an English data centre (reading library)
• Combine genomes with extracted clinical information for analysis, interpretation, and aggregation
• Create capacity, capability and legacy in personalised medicine for the UK

Goals of the Genomics England project

1. To bring benefit to NHS patients
2. To enable new scientific discovery and medical insights
3. To kick-start the development of a UK genomics industry
4. To create an ethical and transparent programme based on consent

Genomics England project

http://www.genomicsengland.co.uk/library-and-resources/

Recruitment and clinical interface via 13 “GMCs”, Scotland and Northern Ireland

• Genomic Medicine Centres
  – Networks of NHS hospitals including genomics labs
  – 13 “Lead organisation” plus 71 “Local Delivery Partners”
  – Contracted by NHS England
  – Cover recruitment, data and return of results
• Scotland
  – Doing own sequencing
• Northern Ireland
  – Similar to a GMC
  – Contracted by NI payer


Feedback to participants

Additional finding genes

Requirements:

• A treatable or preventable condition.
• Reliably detected by next generation sequencing.
• Each gene will have a curated list of high confidence, high penetrance variants.

Other conditions may be added if clinically appropriate and technically feasible.

Participants recruited in RD
• About 400 RD participants currently recruited per week
• 5,000 participants recruited to the RD pilot

Figure: family size (*data from Main Programme)

Recruitment by tumour type

Tumour type, participants, share of total:
• Adult Glioma, 19, 2%
• Bladder, 28, 3%
• Breast, 321, 29%
• Childhood, 1, 0%
• Colorectal, 264, 24%
• Endometrial Carcinoma, 20, 2%
• Lung, 139, 13%
• Malignant Melanoma, 3, 0%
• Ovarian, 91, 8%
• Prostate, 95, 9%
• Renal, 65, 6%
• Sarcoma, 44, 4%
• Testicular Germ Cell Tumours, 1, 0%

>14,896 genomes sequenced (Nov 1)

Figure: number of bases (x10^9; Q30, no duplicates) and % autosomal coverage >= 15x (Q30, no duplicates). Germline data only.

• Median % autosomal coverage >= 15x = 97.4%
• About 1.4 PB of data

Analyses in Rare Diseases

Genetic based test

Checks of reported data vs genetics

• Sex checks
  – Coverage-based (WGS)
  – X chromosome heterozygosity and Y chromosome genotyping rate (array)
  – Predicted minor karyotypes include XO, XXY, XYY
• Relatedness checks
  – Mendelian inconsistency rate (where at least one parent sequenced)
  – Estimated identity by descent sharing for all pairs in cohort, and working on a family-only workflow - PLINK and PC-Relate
  – Can identify rare phenomena, e.g. large-scale uniparental isodisomy
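
The coverage-based sex check listed above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes mean coverage per chromosome group has already been computed, and the function names and the simple rounding rule are hypothetical.

```python
def infer_karyotype(autosomal_cov, x_cov, y_cov):
    """Infer the sex-chromosome karyotype from mean coverage.

    Autosomes are diploid, so coverage equal to the autosomal mean implies
    two copies and half of it implies one copy. (Assumed rule: round the
    estimated copy number to the nearest integer.)
    """
    if autosomal_cov <= 0:
        raise ValueError("autosomal coverage must be positive")
    x_copies = round(2 * x_cov / autosomal_cov)
    y_copies = round(2 * y_cov / autosomal_cov)
    if x_copies == 1 and y_copies == 0:
        return "XO"  # single X, no Y
    return ("X" * x_copies + "Y" * y_copies) or "undetermined"


def matches_reported_sex(reported_sex, karyotype):
    """Compare the inferred karyotype with the sex reported at recruitment."""
    expected = {"female": "XX", "male": "XY"}
    return expected.get(reported_sex) == karyotype


# Example with made-up coverage values
print(infer_karyotype(30.0, 30.2, 0.1))   # XX
print(infer_karyotype(30.0, 15.3, 14.1))  # XY
print(infer_karyotype(30.0, 30.0, 15.0))  # XXY
```

A sample whose inferred karyotype disagrees with the reported sex, or falls in a minor karyotype such as XO, XXY or XYY, would be flagged for manual review rather than corrected automatically.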

Coverage-based sex checks

Relatedness checking
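
The Mendelian inconsistency rate used in the relatedness checks can be illustrated with a short sketch. It assumes biallelic autosomal genotypes given as allele pairs and treats an unsequenced parent as compatible with any allele, which covers the "at least one parent sequenced" case. All names are hypothetical, and real pipelines would also handle missing calls and sex chromosomes.

```python
def is_mendelian_consistent(child, mother=None, father=None):
    """True if the child's two alleles can be explained by one allele from each parent.

    Genotypes are 2-tuples of alleles, e.g. ("A", "G"). A parent given as
    None (not sequenced) is assumed able to transmit any allele.
    """
    def can_transmit(parent, allele):
        return parent is None or allele in parent

    a, b = child
    return (can_transmit(mother, a) and can_transmit(father, b)) or (
        can_transmit(mother, b) and can_transmit(father, a)
    )


def mendelian_inconsistency_rate(sites):
    """Fraction of sites where the child's genotype violates Mendelian inheritance.

    `sites` is an iterable of (child, mother, father) genotype tuples.
    """
    total = 0
    errors = 0
    for child, mother, father in sites:
        total += 1
        if not is_mendelian_consistent(child, mother, father):
            errors += 1
    return errors / total if total else 0.0


sites = [
    (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    (("G", "G"), ("A", "A"), ("G", "G")),  # inconsistent: mother cannot give G
    (("A", "A"), ("A", "G"), None),        # consistent with the sequenced parent
    (("C", "T"), ("C", "C"), ("T", "T")),  # consistent
]
print(mendelian_inconsistency_rate(sites))  # 0.25
```

A rate well above the background genotyping error rate suggests that the reported pedigree and the samples do not match.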


Analyses in Cancer

Assessing the quality of sample preparation protocols

Fresh Frozen (FF) vs Formalin-Fixed Paraffin-Embedded (FFPE)

FF
• Costly and not widely available
• Difficult to capture tumour
• High quality DNA

FFPE
• Routinely used
• Digital pathology for tumour selection
• Low quality and quantity of DNA

             AT dropout   GC dropout
FF sample    0.00         0.06         low coverage for GC-rich regions
FFPE sample  0.16         -0.26        trend is reversed, with poor coverage of AT-rich regions

Figure axis: AT rich to GC rich

AT/GC dropout effect on copy number variant calling

Figure legend: FFPE GC dropout, FF GC dropout, FFPE AT dropout, FF AT dropout

FF                              FFPE
AT drop   Purity   RMSD cov     AT drop   Purity   RMSD cov
4.7       0.6      13.1         5.8       0.6      18.9
4.0       0.4      13.2         5.4       0.4      24.5
4.3       0.5      14.3         6.6       0.5      22.4
4.4       0.4      12.9         15.8      NA       50.7
3.1       0.4      14.8         5.4       0.4      23.1

Fresh frozen and FFPE paired samples: ability to call CNVs

Overlapping SNVs in paired FF and FFPE samples (VAF < 5% filtered out)

FFPE also affects small variant calling

Figure axis: proportion of variants

Figure legend: GMC 1, other GMCs

Comparing sequence quality metrics across labs

After standardising on optimised FFPE protocol

Bioinformatics platform

GEL bioinformatics platform

Design goals
• Scalability: able to operate on several hundred whole genomes per day
• Traceability: able to keep the provenance of every artefact produced in the process
• Knowledge accumulation: able to capture and aggregate the knowledge and decisions captured during interpretation in order to generate better knowledge bases
• Service oriented: components talk to each other via well-defined APIs and data formats

Architecture diagram
• Actors: Hospitals, Genomics England, Interpretation providers
• Services: Clinical data intake service, Genome intake service, Interpretation platform services, Workflow management
• Other components: Metadata, Variants, Reference knowledge, GxP associations, Interpretation, Tracking

Same data model, many manifestations

How to ensure that all the data is coherently stored and easily retrievable?

• Inspired by Model-Driven Architecture approaches
• Models (schemas) controlled in GitHub, including boilerplate functions to validate data against the model
• Documentation auto-generated out of the model
• Services communicate using JSON derived from the model
• Data written against the schema auto-generated from the model in the metadata store, using document stores

Data models in the platform

• Use of Avro for its interface definition language, JSON out of the box, and automatic code generation of classes to handle these data
• Models (and auxiliary libraries) available here: https://github.com/genomicsengland/GelReportModels/tree/releases/schemas/IDLs
• Documentation for the master branch here: https://genomicsengland.github.io/GelReportModels/index.html
• Bioinformatics models here: https://github.com/opencb/biodata
• For reads and variants we use protocol buffers compatible with GA4GH standards
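
To illustrate the idea of validating JSON messages against a shared model, here is a minimal sketch. It is hand-rolled on the standard library so that it stays self-contained and does not use the Avro library or the GelReportModels helpers. The record name and fields are hypothetical and only mimic the shape of an Avro record schema, and unlike Avro this sketch requires every field to be present.

```python
import json

# Hypothetical Avro-style record schema (illustrative only, not from GelReportModels)
SCHEMA = {
    "type": "record",
    "name": "ReportedVariantExample",
    "fields": [
        {"name": "chromosome", "type": "string"},
        {"name": "position", "type": "int"},
        {"name": "reference", "type": "string"},
        {"name": "alternate", "type": "string"},
        {"name": "tier", "type": ["null", "string"]},  # union: nullable string
    ],
}

PRIMITIVES = {
    "string": str,
    "int": int,
    "double": float,
    "boolean": bool,
    "null": type(None),
}


def matches(value, avro_type):
    """Check a value against a primitive type name or a union (list of names)."""
    if isinstance(avro_type, list):
        return any(matches(value, t) for t in avro_type)
    # bool is a subclass of int in Python, so reject it explicitly for "int"
    if avro_type == "int" and isinstance(value, bool):
        return False
    return isinstance(value, PRIMITIVES[avro_type])


def validate(record, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not matches(record[name], field["type"]):
            errors.append(f"wrong type for field: {name}")
    return errors


payload = '{"chromosome": "17", "position": 270550, "reference": "A", "alternate": "T", "tier": null}'
print(validate(json.loads(payload), SCHEMA))                       # []
print(validate({"chromosome": "17", "position": "270550"}, SCHEMA))
```

Keeping the schema as the single source of truth is what allows documentation, validation code and storage layouts to be generated from it rather than maintained by hand.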

InterpretedGenomeRD

Bertha: Distributed workflow management system (really an enterprise service bus for genomic data)

Message flow: Producer publishes to Exchange, Exchange routes to Queue, Consumer consumes from Queue

Components: Message Broker, Tracking DB, Job Scheduler, Dashboard, Delivery API, Auditor, Orchestrator, Grid Consumer

• Restarts
• Scatter-gather
• Single and group processes
• Multiple concurrent workflows

(work in progress)

https://github.com/genomicsengland/bertha
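
The producer, exchange, queue and consumer pattern named in the Bertha diagram can be sketched in-process with the standard library. This is not Bertha's code: the class, the routing key and the handler are hypothetical, and a production system would rely on an external message broker rather than in-memory queues.

```python
from collections import defaultdict
from queue import Empty, Queue


class Exchange:
    """Routes each published message to every queue bound to its routing key."""

    def __init__(self):
        self._bindings = defaultdict(list)

    def bind(self, routing_key, queue):
        self._bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        for queue in self._bindings[routing_key]:
            queue.put(message)


def consume_all(queue, handler):
    """Drain a queue, applying the handler to each message, and gather the results."""
    results = []
    while True:
        try:
            message = queue.get_nowait()
        except Empty:
            return results
        results.append(handler(message))


# Scatter: the producer publishes one job per sample
exchange = Exchange()
qc_queue = Queue()
exchange.bind("sample.qc", qc_queue)
for sample in ["S1", "S2", "S3"]:
    exchange.publish("sample.qc", {"sample": sample})

# Gather: the consumer processes every job and collects the outcomes
gathered = consume_all(qc_queue, lambda m: m["sample"] + ":qc_done")
print(gathered)  # ['S1:qc_done', 'S2:qc_done', 'S3:qc_done']
```

Decoupling producers from consumers through queues is what makes restarts and multiple concurrent workflows tractable: a failed step can be re-queued without the producer knowing which worker picks it up.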

bertha_default 1.1.0

Single Sample QC & Processing

Analysis

Intake QC

Multi Sample QC

Cross Sample Contamination

Single-Sample QC Check Point

Identity by Descent

Mendelian Inconsistency Rate

Sex Check

Somatic VCF re-headering

Tumour Cross Sample Contamination

Cross Species Contamination

Depth of Coverage

Concordance check

Intake QC Check Point

Merge Array Genotypes

Multi-Sample QC Check Point

Consent Check Point

Variant Calling

Variant Normalisation

Tumour Ploidy

Tumour Purity

Tumour Clonality

Mutation Signature

Viral Insertions

Actionable Mutation Coverage

SNV & Indel Refinement

Mutation Burden

Inbreeding Coefficient

Homozygosity Runs

Variant Annotation

Variant Tiering

Interpretation Dispatch

Exomiser

Delivery API

Integrity Check

MD5 Check

Validate BAM Picard

Filtered Bamstats

Unfiltered Bamstats

Q30 Bamstats

VCF QC

Fix Permissions

Plot Filtered Bamstats

Generate Filtered Metrics Bamstats

Plot Unfiltered Bamstats

Generate Q30 Metrics Bamstats

QC Stats Post-processing

Workflow diagram

Data intake

Single Sample QC & Processing

Multi-sample QC

Analysis

Sequence received, Intake API

Interpretation Request Dispatched

Interpretation approach

• Virtual gene panels
  – Initially assigned by a clinician
  – Working on automated panel suggestions
• Variant filtering
  – Allele frequency: variant is rare
  – Segregation: variant segregates with condition in family
  – Panel membership (including mode of inheritance)
  – Different for cancer
• Interpretation
  – Automated pathogenicity scoring
  – Manual review
• Several manual QC points
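
The three rare disease filters above (allele frequency, segregation, panel membership) compose naturally as a single predicate. The sketch below is illustrative only: the `Variant` fields, the gene names and the 0.1% frequency cut-off are assumptions, and mode of inheritance is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    allele_frequency: float  # frequency in a reference population
    segregates: bool         # True if the variant segregates with the condition


def filter_variants(variants, panel_genes, max_af=0.001):
    """Keep variants that are rare, segregate in the family and fall in the panel.

    max_af is an assumed threshold, not a value taken from the project.
    """
    return [
        v
        for v in variants
        if v.allele_frequency <= max_af and v.segregates and v.gene in panel_genes
    ]


panel = {"GENE_A", "GENE_B"}  # hypothetical virtual gene panel
candidates = [
    Variant("GENE_A", 0.0002, True),   # passes all three filters
    Variant("GENE_A", 0.05, True),     # too common
    Variant("GENE_B", 0.0001, False),  # does not segregate
    Variant("GENE_C", 0.0001, True),   # outside the panel
]
print(filter_variants(candidates, panel))
```

Variants that survive these filters would then go on to automated pathogenicity scoring and manual review.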

PanelApp: Crowdsourcing curation of gene-disease associations

https://bioinfo.extge.co.uk/crowdsourcing/PanelApp/

Status of panels

• 190 panels
• 97 >= v1 panels
• 3,512 genes
• 435 registered reviewers
• 15,149 gene-level reviews
• Recognised by the UK Genetic Testing Network
• Curation reaching a point of diminishing returns

Figure: number of reviews per reviewer

Automated panel suggestion

Figure: Diseases, core HP terms and panels. Example for Disease X with HP terms HP1 to HP5 and three panels: Panel X (G1, G2, G3), Panel Y (G1, G4, G5) and Panel Z (G6, G7, G8).

Figure: Diseases, HP annotation and genes. Example with genes G6 and G7 and HP terms HP2, HP3 and HP4.

Also get a QC score for how phenotypically similar the patient is to the recruited disease
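
One simple way to picture automated panel suggestion is to score each panel by the overlap between the patient's HPO terms and the terms associated with the panel. The sketch below uses plain Jaccard similarity as a stand-in. The slides do not state the scoring method, and the term-to-panel mapping and threshold here are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_panels(patient_terms, panel_terms, min_score=0.3):
    """Rank panels by phenotype overlap, dropping those below min_score (assumed cut-off)."""
    scored = [(jaccard(patient_terms, terms), name) for name, terms in panel_terms.items()]
    scored.sort(reverse=True)
    return [(name, round(score, 2)) for score, name in scored if score >= min_score]


# Hypothetical mapping from panels to core HP terms
panel_terms = {
    "Panel X": {"HP1", "HP2", "HP3"},
    "Panel Y": {"HP2", "HP3"},
    "Panel Z": {"HP4", "HP5"},
}
patient = {"HP1", "HP2", "HP3"}
print(suggest_panels(patient, panel_terms))  # [('Panel X', 1.0), ('Panel Y', 0.67)]
```

The same overlap score, computed against the disease the participant was recruited under, gives a rough phenotypic similarity check. An ontology-aware similarity measure would be a natural refinement, since Jaccard ignores the HPO hierarchy.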

RD pilot benchmarking
• 1,831 participants with HPO terms, assigned panels (2,674 total) and core disease
• 847/1,831 (46%) have exactly the same panels
• 728/1,831 (40%) have the same panels plus 1 or 2 extra
• 256/1,831 (14%) are missing some of the medical review panels
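
As a quick arithmetic check on the benchmarking figures, the three categories should partition the 1,831 participants and reproduce the quoted percentages. The category labels in the sketch are shorthand introduced here.

```python
TOTAL = 1831
counts = {
    "exactly the same panels": 847,
    "same panels plus 1 or 2 extra": 728,
    "missing some medical review panels": 256,
}


def as_percentages(counts, total):
    """Round each count to a whole percentage of the total."""
    return {label: round(100 * n / total) for label, n in counts.items()}


percentages = as_percentages(counts, TOTAL)
print(sum(counts.values()) == TOTAL)  # True: the categories cover every participant
print(percentages)
```

Read together, the first two categories mean that for about 86% of participants the automated suggestion covers every panel chosen at medical review.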

7 November 2016

Figure: histogram of gene gains or losses (x axis: bin, from -900 to 900 in steps of 100, plus "More"; y axis: frequency, 0 to 1,200)

Filtering in the rare diseases programme

Domain 1

Variants in a virtual panel of actionable genes (between 20 and 40). Actionable genes are defined as genes with short variants associated with therapeutic, prognostic or diagnostic actions by GenomOncology (My Cancer Genome).

Matching at the variant level

Domain 2

Variants in the genes from Cancer Gene Census - 534 genes.

Domain 3

Variants in all other genes

Frequency filters: exclude common variants (1000G, ExAC, GEL)
Consequence filters: exclude synonymous variants
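
The domain assignment above can be read as a small decision procedure. The following sketch makes two assumptions that the slides do not state: that the frequency and consequence filters are applied before any domain is assigned, and that "common" means an allele frequency above 1%. Variant identifiers and gene names are placeholders.

```python
def assign_domain(variant, actionable_variants, census_genes, common_af=0.01):
    """Return the domain (1, 2 or 3) for a somatic variant, or None if filtered out.

    variant: dict with "id", "gene", "allele_frequency" and "consequence" keys
    (a hypothetical structure used only for this illustration).
    """
    # Frequency and consequence filters (assumed to apply to every domain)
    if variant["allele_frequency"] > common_af:
        return None
    if variant["consequence"] == "synonymous_variant":
        return None
    # Domain 1: matching at the variant level against actionable variants
    if variant["id"] in actionable_variants:
        return 1
    # Domain 2: gene-level match against the Cancer Gene Census
    if variant["gene"] in census_genes:
        return 2
    # Domain 3: everything else
    return 3


actionable = {"7:100:A:T"}  # placeholder actionable variant
census = {"GENE_A", "GENE_B"}  # placeholder census genes

v1 = {"id": "7:100:A:T", "gene": "GENE_A", "allele_frequency": 0.0, "consequence": "missense_variant"}
v2 = {"id": "7:200:C:G", "gene": "GENE_A", "allele_frequency": 0.0, "consequence": "missense_variant"}
v3 = {"id": "9:300:G:A", "gene": "GENE_Z", "allele_frequency": 0.0, "consequence": "stop_gained"}
v4 = {"id": "9:400:G:A", "gene": "GENE_B", "allele_frequency": 0.2, "consequence": "missense_variant"}

print([assign_domain(v, actionable, census) for v in (v1, v2, v3, v4)])  # [1, 2, 3, None]
```

Note that a variant in an actionable gene that is not itself a listed actionable variant drops to Domain 2 or 3, which is the practical consequence of matching at the variant level.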

Filtering in the cancer programme

Two-part reports: Actionable and “Interesting”

Supplementary analysis

Structural variants
Mutational density
Coverage and copy number

Mutational signatures

Hypermutation rain plots
Mutation context

CellBase
• Reference data store / annotation engine
OpenCGA
• Catalog: metadata and clinical data store
• Storage: variant database
Interpretation Platform
• Interpretation service: manage various producers and consumers
• Interpretation warehouse (under construction): stores and serves interpretation data

Bioinformatics platform components

https://github.com/opencb/opencga
https://github.com/opencb/cellbase

OpenCB family of applications

Architecture diagram: an interface layer (Genome Browser, Variant Analysis, Data Discovery) over OpenCGA Catalog, OpenCGA Storage and CellBase, backed by MongoDB, HBase and a POSIX file system

CellBase

• Knowledge base management
• Uses Ensembl, UniProt, IntAct, ClinVar, etc.
• Current database engine is MongoDB
• JSON outputs against well-defined model
• Supports annotation against local DBs
• Annotates about 10,000 variants/second per instance
• Python and R APIs

http://nar.oxfordjournals.org/content/40/W1/W609.short

Annotation against CellBase

http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/
https://github.com/opencb/cellbase

CellBase 4.0 - VEP 82 consequence type benchmark (1kG phase 3, 83M variants)

● VEP annotations: 346M
● CellBase annotations: 346M
● Coincidence at SO term level (346M annotations)
– Annotations provided by VEP and not provided by CellBase: 3,364 (99.999% coincidence)
– Annotations provided by CellBase and not provided by VEP: 4,918 (99.999% coincidence)
● 60% due to differences in miRNA data sources
● 39% difficulties with VEP output format parsing
● Coincidence at variant level (83M variants)
– Variants with conflicting annotation: 4,990 (99.994% coincidence)
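
The coincidence percentages follow directly from the discordant counts. The short check below uses the rounded totals from the slide (346M annotations, 83M variants), so the results are approximate, but they agree with the quoted figures to three decimal places.

```python
def coincidence(total, discordant):
    """Percentage of items on which the two annotators agree."""
    return 100 * (1 - discordant / total)


ANNOTATIONS = 346_000_000  # rounded total from the slide
VARIANTS = 83_000_000      # rounded total from the slide

print(f"{coincidence(ANNOTATIONS, 3364):.3f}%")  # VEP only
print(f"{coincidence(ANNOTATIONS, 4918):.3f}%")  # CellBase only
print(f"{coincidence(VARIANTS, 4990):.3f}%")     # conflicting variants
```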

Annotation for phased MNVs and CNVs

• Support for CNVs new in CellBase 4.5 Beta
  – Main challenge: support imprecise calling - match against already reported CNVs (population frequencies, clinical variants)
  – Same annotation data as for the rest of variants: consequence type, population frequencies, etc.
  – Example CNV
• Support for MNVs and phased variants from CellBase 4.0
  – Consequence type depends on variants affecting the same codon
  – Variants are assigned a phase set (phased VCFs include the PS tag) - all variants on the same phase set shall be processed together

Annotation of MNVs

• Example MNV: 17:270550:AACAG:TGCAA

• Decompose into single phased variants, members of the same phase set:

{"id":"17:270550:A:T","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}

{"id":"17:270551:A:G","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}

{"id":"17:270554:G:A","result":[{"codon":"caG/caA","proteinVariantAnnotation":{"reference":"GLN","alternate":"GLN"},"sequenceOntologyTerms":[{"accession":"SO:0001819","name":"synonymous_variant"}
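
The decomposition step can be reproduced with a few lines of code. This is a sketch of the idea rather than CellBase's implementation: it handles only equal-length substitutions written as chromosome:position:reference:alternate, and the function name is hypothetical.

```python
def decompose_mnv(variant_id):
    """Split an MNV into the single-nucleotide variants it is made of.

    Positions where the reference and alternate bases are identical are
    skipped, so only the bases that actually change are returned.
    """
    chrom, pos, ref, alt = variant_id.split(":")
    if len(ref) != len(alt):
        raise ValueError("only equal-length substitutions are handled in this sketch")
    start = int(pos)
    return [
        f"{chrom}:{start + offset}:{r}:{a}"
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


print(decompose_mnv("17:270550:AACAG:TGCAA"))
# ['17:270550:A:T', '17:270551:A:G', '17:270554:G:A']
```

The three resulting variants match the annotated records above. Because the first two fall in the same codon, they must be annotated together as members of one phase set, which is why both report the combined codon change cAA/cTG rather than two independent single-base changes.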

OpenCGA - Catalog

Metadata store and A&A (authentication and authorisation) for OpenCGA
• Manages roles, groups, ACLs
• Audit log
• LDAP integration
• Arbitrary schemas (annotation sets)

6-node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response times for regional queries
• Whole-genome filtering queries for all individuals within seconds

OpenCGA - Storage

Extensive capabilities to query across genotype and phenotype relationships

Aspiration to be fully GA4GH compatible from v1.0

Platform for interpretation (under construction)

Key (personal) learnings

• There is great strength in multidisciplinary teams with specialisation, but those individuals who can span both biology/genetics and software engineering are pivotal: they connect the specialists
• Good software engineering practices also apply to bioinformatics, to name a few: designing, documenting, testing, support and service. Skipping them doesn't really save you time
• I have become a big fan of using well-established technologies with rich ecosystems (e.g. Hadoop) rather than inventing new formats, data structures, tool chains

Final thoughts

• The future in human genetics will be underpinned by academic/industrial partnerships; both the task and the benefits are too big to go at it alone
• Genomic medicine is just one of the pilots of a digital revolution in healthcare where artificial intelligence will complement/replace the diagnostic journey
• But genomics is the easy part, clinical data is the real challenge
