Bioinformatics for the 100,000 Genomes Project

Augusto Rendón
Director of Bioinformatics | Genomics England
Principal Research Associate | University of Cambridge

Barcelona, 2016-11-02


Outline

• Introduction to the UK’s 100,000 Genomes Project
• Analyses in rare diseases
• Analyses in cancer
• Bioinformatics platform
• Data models and flows
• Databases
• Interpretation

Inception of the 100,000 Genomes Project (2012, 2014)

“If we get this right, we could transform how we diagnose and treat our most complex diseases not only here but across the world” (December 2012)

“I am determined to do all I can to support the health and scientific sector to unlock the power of DNA, turning an important scientific breakthrough into something that will help deliver better tests, better drugs and above all better care for patients.” (August 2014)

• Sequence 100,000 genomes

• Cancer and rare genetic disease

• Capture data delivered electronically, store it securely and analyse it within an English data centre (reading library)

• Combine genomes with extracted clinical information for analysis, interpretation, and aggregation

• Create capacity, capability and legacy in personalised medicine for the UK

Goals of the Genomics England project

1. To bring benefit to NHS patients

2. To enable new scientific discovery and medical insights

3. To kickstart the development of a UK genomics industry

4. To create an ethical and transparent programme based on consent

Genomics England project

http://www.genomicsengland.co.uk/library-and-resources/

Recruitment and clinical interface via 13 “GMCs”, Scotland and Northern Ireland

• Genomic Medicine Centres
  • Networks of NHS hospitals including genomics labs
  • 13 “Lead organisations” plus 71 “Local Delivery Partners”
  • Contracted by NHS England
  • Cover recruitment, data and return of results

• Scotland
  • Doing own sequencing

• Northern Ireland
  • Similar to a GMC
  • Contracted by NI payer

Feedback to participants

Additional finding genes

Requirements:

• A treatable or preventable condition.

• Reliably detected by next generation sequencing.

• Each gene will have a curated list of high confidence, high penetrance variants.

Other conditions may be added if clinically appropriate and technically feasible.

Participants recruited in RD

• About 400 RD participants currently recruited per week
• 5,000 participants recruited to the RD pilot

[Figure: family size (*data from Main Programme)]

Recruitment by tumour type

Tumour type (participants, share of total):
• Adult Glioma: 19 (2%)
• Bladder: 28 (3%)
• Breast: 321 (29%)
• Childhood: 1 (0%)
• Colorectal: 264 (24%)
• Endometrial Carcinoma: 20 (2%)
• Lung: 139 (13%)
• Malignant Melanoma: 3 (0%)
• Ovarian: 91 (8%)
• Prostate: 95 (9%)
• Renal: 65 (6%)
• Sarcoma: 44 (4%)
• Testicular Germ Cell Tumours: 1 (0%)

>14,896 genomes sequenced (Nov 1)

[Figure axes: N bases x 10^9 (Q30, no dup); % autosomal coverage >= 15x (Q30, no dup). Germline data only.]

• Median % autosomal coverage >= 15x = 97.4%
• About 1.4 PB of data
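A minimal sketch of the coverage metric quoted on this slide, assuming per-position depths have already been restricted to autosomes, Q30 bases and non-duplicate reads upstream. The function name and the toy depths are illustrative, not taken from the Genomics England pipeline.

```python
def pct_positions_covered(depths, min_depth=15):
    """Percentage of positions whose depth is at least min_depth."""
    depths = list(depths)
    if not depths:
        raise ValueError("no positions supplied")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


# Toy per-position depths standing in for filtered autosomal coverage.
toy_depths = [32, 28, 14, 15, 40, 0, 31, 29, 16, 22]
print(pct_positions_covered(toy_depths))
```

In practice the depths would come from a depth-of-coverage tool run over the whole genome rather than from an in-memory list.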

Analyses in Rare Diseases
Genetic-based tests

Checks of reported data vs genetics

• Sex checks
  • Coverage-based (WGS)
  • X chromosome heterozygosity and Y chromosome genotyping rate (array)
  • Predicted minor karyotypes include XO, XXY, XYY

• Relatedness checks
  • Mendelian inconsistency rate (where at least one parent sequenced)
  • Estimated identity by descent sharing for all pairs in cohort, and working on a family-only workflow: PLINK and PC-Relate
  • Can identify rare phenomena, e.g. large-scale uniparental isodisomy
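The Mendelian inconsistency rate can be sketched as below. This is an illustration only: it assumes autosomal, diploid genotypes given as allele pairs with no missing calls, and the record layout (`child`, `mother`, `father` keys) is hypothetical. With a single sequenced parent, the check reduces to the child sharing at least one allele with that parent.

```python
def is_consistent(child, mother=None, father=None):
    """True if the child's genotype can be inherited from the given parent(s)."""
    a, b = child
    if mother is not None and father is not None:
        # One allele must come from each parent.
        return (a in mother and b in father) or (b in mother and a in father)
    parent = mother if mother is not None else father
    if parent is None:
        raise ValueError("at least one parent genotype is required")
    # Duo: the child must share at least one allele with the sequenced parent.
    return a in parent or b in parent


def mendelian_inconsistency_rate(sites):
    """Fraction of checked sites where the child's genotype breaks inheritance."""
    checked = errors = 0
    for site in sites:
        checked += 1
        if not is_consistent(site["child"], site.get("mother"), site.get("father")):
            errors += 1
    return errors / checked if checked else 0.0


toy_sites = [
    {"child": (0, 1), "mother": (0, 0), "father": (1, 1)},  # consistent trio
    {"child": (1, 1), "mother": (0, 0), "father": (0, 1)},  # inconsistent trio
    {"child": (0, 0), "mother": (1, 1)},                    # inconsistent duo
    {"child": (0, 1), "father": (0, 1)},                    # consistent duo
]
print(mendelian_inconsistency_rate(toy_sites))
```

A rate well above the genotyping error rate suggests a mislabelled relationship or a sample swap.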

Coverage-based sex checks
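One way to sketch a coverage-based sex check is to compare mean depth on X and Y with mean autosomal depth. The rounding rule and function names below are assumptions for illustration; a production check would also need to handle pseudo-autosomal and repetitive regions, which this sketch ignores.

```python
def infer_karyotype(autosomal_depth, x_depth, y_depth):
    """Estimate the sex-chromosome karyotype from mean sequencing depths."""
    if autosomal_depth <= 0:
        raise ValueError("autosomal depth must be positive")
    # Autosomes are diploid, so 2 * depth / autosomal depth estimates copy number.
    x_copies = round(2 * x_depth / autosomal_depth)
    y_copies = round(2 * y_depth / autosomal_depth)
    karyotype = "X" * x_copies + "Y" * y_copies
    return "XO" if karyotype == "X" else karyotype


def matches_reported_sex(reported_sex, karyotype):
    """Compare the inferred karyotype with the clinically reported sex."""
    expected = {"female": "XX", "male": "XY"}
    return expected.get(reported_sex.lower()) == karyotype


print(infer_karyotype(30.5, 15.9, 14.8))
print(matches_reported_sex("female", infer_karyotype(30.0, 15.0, 0.0)))
```

Samples whose inferred karyotype is XO, XXY or XYY never match a reported sex here, so they surface for manual review rather than being silently accepted.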

Relatedness checking

Analyses in Cancer
Assessing the quality of sample preparation protocols

Fresh Frozen (FF) vs Formalin Fixed Paraffin Embedded (FFPE)

FF
• Costly and not widely available
• Difficult to capture tumour
• High quality DNA

FFPE
• Routinely used
• Digital pathology for tumour selection
• Low quality and quantity of DNA

AT/GC dropout effect on copy number variant calling

Sample        AT dropout   GC dropout   Observation
FF sample     0.00         0.06         low coverage for GC-rich regions
FFPE sample   0.16         -0.26        trend is reversed, with poor coverage of AT-rich regions

[Figure panels: FFPE GC dropout, FF GC dropout, FFPE AT dropout, FF AT dropout; axis from AT rich to GC rich]

Fresh frozen and FFPE paired samples: ability to call CNVs

FF                              FFPE
AT drop   Purity   RMSD cov     AT drop   Purity   RMSD cov
4.7       0.6      13.1         5.8       0.6      18.9
4.0       0.4      13.2         5.4       0.4      24.5
4.3       0.5      14.3         6.6       0.5      22.4
4.4       0.4      12.9         15.8      NA       50.7
3.1       0.4      14.8         5.4       0.4      23.1

FFPE also affects small variant calling

[Figure: overlapping SNVs in paired FF and FFPE samples, VAF < 5% filtered out; axis: proportion of variants]

Comparing sequence quality metrics across labs

[Figure: GMC 1 vs other GMCs, after standardising on optimised FFPE protocol]

Bioinformatics platform

GEL bioinformatics platform

Design goals
• Scalability: able to operate on several hundred whole genomes per day
• Traceability: able to keep the provenance of every artefact produced in the process
• Knowledge accumulation: able to capture and aggregate the knowledge and decisions captured during interpretation, in order to generate better knowledge bases
• Service oriented: components talk to each other via well-defined APIs and data formats

[Architecture diagram. Lanes: Hospitals; Genomics England; Interp. providers. Components: Clinical data intake service; Genome intake service; Interpretation platform services; Workflow management; Metadata; Variants; Reference knowledge; GxP associations; Interpretation; Tracking.]

Same data model, many manifestations
How to ensure that all the data is coherently stored and easily retrievable?

• Inspired by Model-Driven Architecture approaches
• Models (schemas) controlled in GitHub, including boilerplate functions to validate data against the model
• Documentation auto-generated out of the model
• Services communicate using JSON derived from the model
• Data written against the schema auto-generated from the model in the metadata store, using document stores
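The model-driven idea above can be illustrated with a single model definition that drives validation, JSON exchange and documentation. This is a sketch using Python dataclasses as a stand-in for the real schema language; the `ReportedVariant` model and the helper names are hypothetical and are not part of GelReportModels.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class ReportedVariant:  # hypothetical model, for illustration only
    chromosome: str
    position: int
    reference: str
    alternate: str


def validate(model, payload):
    """Check a decoded JSON object against the model and build an instance."""
    expected = {f.name: f.type for f in fields(model)}
    missing = expected.keys() - payload.keys()
    unexpected = payload.keys() - expected.keys()
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
    for name, field_type in expected.items():
        if not isinstance(payload[name], field_type):
            raise TypeError(f"{name} should be {field_type.__name__}")
    return model(**payload)


def document(model):
    """Generate plain-text documentation from the same model definition."""
    lines = [model.__name__]
    lines += [f"  {f.name}: {f.type.__name__}" for f in fields(model)]
    return "\n".join(lines)


message = '{"chromosome": "17", "position": 270550, "reference": "A", "alternate": "T"}'
variant = validate(ReportedVariant, json.loads(message))
print(json.dumps(asdict(variant)))
print(document(ReportedVariant))
```

Because validation, serialisation and documentation all read the same field list, a change to the model propagates to every manifestation without separate edits.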

Data models in the platform

• Use of Avro for its interface definition language, JSON out of the box, automatic code generation of classes to handle these data
• Models (and auxiliary libraries) available here: https://github.com/genomicsengland/GelReportModels/tree/releases/schemas/IDLs
• Documentation for the master branch here: https://genomicsengland.github.io/GelReportModels/index.html
• Bioinformatics models here: https://github.com/opencb/biodata
• For reads and variants we use protocol buffers compatible with GA4GH standards


InterpretedGenomeRD

Bertha: Distributed workflow management system (really an enterprise service bus for genomic data)

[Diagram: a Producer publishes to an Exchange, which routes to a Queue, from which a Consumer consumes. Components: Message Broker; Tracking DB; Job Scheduler; Dashboard; Delivery API; Auditor; Orchestrator; Grid Consumer.]

• Restarts
• Scatter-gather
• Single and group processes
• Multiple concurrent workflows

(work in progress)

https://github.com/genomicsengland/bertha
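The producer, exchange, queue and consumer pattern in the diagram, together with scatter-gather, can be sketched in-process as follows. This is a toy illustration with hypothetical class and function names; it does not reproduce Bertha's code or any particular message broker's API.

```python
from collections import defaultdict, deque


class Exchange:
    """Routes each published message to every queue bound to its routing key."""

    def __init__(self):
        self._bindings = defaultdict(list)  # routing key -> bound queues

    def bind(self, routing_key, queue):
        self._bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        queues = self._bindings[routing_key]
        for queue in queues:
            queue.append(message)
        return len(queues)


def scatter_gather(exchange, queue, sample, chunks, worker):
    """Publish one job per chunk, consume them all, and collect the results."""
    for chunk in chunks:  # scatter
        exchange.publish("call", {"sample": sample, "chunk": chunk})
    results = []
    while queue:  # consume
        results.append(worker(queue.popleft()))
    return results  # gather


exchange = Exchange()
call_queue = deque()
exchange.bind("call", call_queue)
results = scatter_gather(
    exchange, call_queue, "sample-1", ["chr1", "chr2"],
    worker=lambda job: f"{job['sample']}:{job['chunk']}:done",
)
print(results)
```

Keeping producers and consumers decoupled through queues is what allows restarts and multiple concurrent workflows: an unfinished job stays queued until some consumer picks it up.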

Workflow diagram: bertha_default 1.1.0

Start: Sequence received (Intake API). End: Interpretation Request Dispatched.

Stages: Data intake; Single Sample QC & Processing; Multi-sample QC; Analysis.

Steps named in the diagram:
• Cross Sample Contamination, Single-Sample QC Check Point
• Identity by Descent, Mendelian Inconsistency Rate, Sex Check
• Somatic VCF re-headering
• Tumour Cross Sample Contamination, Cross Species Contamination, Depth of Coverage, Concordance check
• Intake QC Check Point, Merge Array Genotypes, Multi-Sample QC Check Point, Consent Check Point
• Variant Calling, Variant Normalisation
• Tumour Ploidy, Tumour Purity, Tumour Clonality, Mutation Signature, Viral Insertions, Actionable Mutation Coverage, SNV & Indel Refinement, Mutation Burden, Inbreeding Coefficient, Homozygosity Runs
• Variant Annotation, Variant Tiering, Interpretation Dispatch, Exomiser, Delivery API
• Integrity Check, MD5 Check, Validate BAM Picard
• Filtered Bamstats, Unfiltered Bamstats, Q30 Bamstats, VCF QC, Fix Permissions
• Plot Filtered Bamstats, Generate Filtered Metrics Bamstats, Plot Unfiltered Bamstats, Generate Q30 Metrics Bamstats, QC Stats Post-processing

Interpretation approach

• Virtual gene panels
  • Initially assigned by a clinician
  • Working on automated panel suggestions

• Variant filtering
  • Allele frequency: variant is rare
  • Segregation: variant segregates with condition in family
  • Panel membership (including mode of inheritance)
  • Different for cancer

• Interpretation
  • Automated pathogenicity scoring
  • Manual review

• Several manual QC points
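The three rare disease filters can be sketched as one predicate. Everything below is an assumption made for illustration: the 0.001 frequency threshold, the record fields, and a simplified segregation rule (all affected relatives carry the variant and no unaffected relative does) that ignores mode of inheritance and incomplete penetrance.

```python
def passes_rd_filters(variant, panel_genes, affected, unaffected, max_af=0.001):
    """Apply allele frequency, panel membership and segregation filters."""
    is_rare = variant["allele_frequency"] <= max_af
    in_panel = variant["gene"] in panel_genes
    carriers = set(variant["carriers"])
    # Simplified segregation: every affected member carries it, no unaffected one does.
    segregates = set(affected) <= carriers and not (set(unaffected) & carriers)
    return is_rare and in_panel and segregates


panel = {"G1", "G2", "G3"}
family = {"affected": {"proband"}, "unaffected": {"mother", "father"}}

candidate = {"gene": "G2", "allele_frequency": 0.0001, "carriers": {"proband"}}
inherited = {"gene": "G2", "allele_frequency": 0.0001, "carriers": {"proband", "mother"}}
common = {"gene": "G2", "allele_frequency": 0.05, "carriers": {"proband"}}

for v in (candidate, inherited, common):
    print(passes_rd_filters(v, panel, **family))
```

Variants that survive this automated step would still go through pathogenicity scoring and manual review, as listed above.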

PanelApp: Crowdsourcing curation of gene-disease associations

https://bioinfo.extge.co.uk/crowdsourcing/PanelApp/

Status of panels

• 190 panels
• 97 >= v1 panels
• 3,512 genes
• 435 registered reviewers
• 15,149 gene-level reviews
• Recognised by the UK Genetic Testing Network
• Curation reaching a point of diminishing returns

[Figure: number of reviews (log scale, 1 to 10,000) against reviewers (0 to 150)]

Automated panel suggestion

[Diagram: diseases, core HP terms and panels. Diseases X, Y and Z are linked to HP terms (HP1 to HP5) and to Panel X (G1, G2, G3), Panel Y (G1, G4, G5) and Panel Z (G6, G7, G8).]

[Diagram: diseases, HP annotation and genes.]

Also get a QC score for how phenotypically similar the patient is to the recruited disease
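A minimal sketch of panel suggestion from HP terms, using Jaccard similarity between a patient's terms and each disease's core terms. The similarity measure, the 0.5 threshold and the disease-to-term mapping are assumptions for illustration (the slide does not state them); only the panel gene lists follow the diagram.

```python
# Hypothetical core HP terms per disease; the real mapping is not given on the slide.
DISEASE_CORE_TERMS = {
    "X": {"HP1", "HP2", "HP3"},
    "Y": {"HP2", "HP3", "HP4"},
    "Z": {"HP4", "HP5"},
}
# Panel gene lists as shown in the diagram.
PANEL_GENES = {
    "X": {"G1", "G2", "G3"},
    "Y": {"G1", "G4", "G5"},
    "Z": {"G6", "G7", "G8"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_panels(patient_terms, min_score=0.5):
    """Return panels ranked by phenotype similarity, plus all similarity scores."""
    scores = {d: jaccard(patient_terms, terms) for d, terms in DISEASE_CORE_TERMS.items()}
    ranked = sorted((d for d in scores if scores[d] >= min_score), key=lambda d: -scores[d])
    return ranked, scores


patient = {"HP1", "HP2", "HP3"}
panels, scores = suggest_panels(patient)
genes = set().union(*(PANEL_GENES[p] for p in panels))
print(panels, sorted(genes))
print("QC score against recruited disease X:", scores["X"])
```

The score against the recruited disease doubles as the QC value mentioned above: a low score flags a patient whose phenotype does not resemble the disease they were recruited under.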

RD pilot benchmarking

• 1831 participants with HPO terms, assigned panels (2674 total) and core disease
• 847/1831 (46%) have exactly the same panels
• 728/1831 (40%) have the same panels plus 1 or 2 extra
• 256/1831 (14%) are missing some of the medical review panels

[Figure: histogram of gene gains or losses; bins from -900 to 900 plus "More", frequency from 0 to 1200]
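The benchmark categories can be expressed as a comparison between suggested and clinician-reviewed panel sets, and the quoted percentages follow from the counts on the slide. The function name and category labels are illustrative; note that the "extra" category here accepts any number of additional panels, whereas the slide reports 1 or 2.

```python
def compare_panels(suggested, reviewed):
    """Classify a participant by how suggested panels relate to reviewed ones."""
    suggested, reviewed = set(suggested), set(reviewed)
    if reviewed - suggested:
        return "missing"  # at least one reviewed panel was not suggested
    if suggested - reviewed:
        return "extra"    # all reviewed panels found, plus additional ones
    return "exact"


# Counts reported on the slide.
counts = {"exact": 847, "extra": 728, "missing": 256}
total = sum(counts.values())
percentages = {label: round(100 * n / total) for label, n in counts.items()}
print(total, percentages)
print(compare_panels({"Panel X", "Panel Y"}, {"Panel X"}))
```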

Filtering in the rare diseases programme

Filtering in the cancer programme

Domain 1
Variants in a virtual panel of actionable genes (between 20 and 40). Actionable genes are defined as genes with short variants associated with therapeutic, prognostic or diagnostic actions by GenomOncology (MyCancerGenome). Matching at the variant level.

Domain 2
Variants in the genes from the Cancer Gene Census (534 genes).

Domain 3
Variants in all other genes.

Frequency filters: exclude common variants (1000G, ExAC, GEL)
Consequence filters: exclude synonymous variants

Two-part reports: Actionable and “Interesting”
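The domain assignment and the two filters can be sketched as below. The gene and variant identifiers are placeholders, the 1% frequency cut-off is an assumption (the slide only says "common"), and the record fields are hypothetical.

```python
def keep_variant(variant, max_population_af=0.01):
    """Frequency and consequence filters: drop common and synonymous variants."""
    return (
        variant["population_af"] <= max_population_af
        and variant["consequence"] != "synonymous_variant"
    )


def assign_domain(variant, actionable_variants, census_genes):
    """Domain 1 matches at the variant level; Domain 2 matches at the gene level."""
    if variant["id"] in actionable_variants:
        return 1
    if variant["gene"] in census_genes:
        return 2
    return 3


actionable = {"1:100:A:T"}   # placeholder actionable variant
census = {"GENE_B"}          # placeholder Cancer Gene Census entry

variants = [
    {"id": "1:100:A:T", "gene": "GENE_A", "population_af": 0.0, "consequence": "missense_variant"},
    {"id": "2:200:C:G", "gene": "GENE_B", "population_af": 0.0001, "consequence": "stop_gained"},
    {"id": "3:300:G:A", "gene": "GENE_C", "population_af": 0.0, "consequence": "missense_variant"},
    {"id": "4:400:T:C", "gene": "GENE_B", "population_af": 0.2, "consequence": "missense_variant"},
]
report = {v["id"]: assign_domain(v, actionable, census) for v in variants if keep_variant(v)}
print(report)
```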

Supplementary analysis

• Structural variants
• Mutational density
• Coverage and copy number
• Mutational signatures
• Hypermutation rain plots
• Mutation context

Bioinformatics platform components

CellBase
• Reference data store / annotation engine

OpenCGA
• Catalog: metadata and clinical data store
• Storage: variant database

Interpretation Platform
• Interpretation service: manage various producers and consumers
• Interpretation warehouse (under construction): stores and serves interpretation data

https://github.com/opencb/opencga
https://github.com/opencb/cellbase

OpenCB family of applications

[Diagram components: Genome Browser, Variant Analysis, Data Discovery; Interface Layer; OpenCGA Catalog, OpenCGA Storage, CellBase; MongoDB, MongoDB, MongoDB, HBase, POSIX FS]

CellBase

• Knowledge base management
• Uses Ensembl, UniProt, IntAct, ClinVar, etc.
• Current database engine is MongoDB
• JSON outputs against well-defined model
• Supports annotation against local DBs
• Annotates about 10,000 variants/second per instance
• Python and R APIs

http://nar.oxfordjournals.org/content/40/W1/W609.short

Annotation against CellBase

http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/
https://github.com/opencb/cellbase

CellBase 4.0 - VEP 82 consequence type benchmark (1kG phase 3, 83M variants)

● VEP annotations: 346M
● CellBase annotations: 346M
● Coincidence at SO term level (346M annotations)
  – Annotations provided by VEP and not provided by CellBase: 3364 (99.999% coincidence)
  – Annotations provided by CellBase and not provided by VEP: 4918 (99.999% coincidence)
    ● 60% due to differences on miRNA data sources
    ● 39% difficulties with VEP output format parsing
● Coincidence at variant level (83M variants)
  – Variants with conflicting annotation: 4990 (99.994% coincidence)
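The coincidence percentages above follow directly from the discordant counts. A short check, assuming the rounded totals on the slide (346M annotations, 83M variants) are close enough to the exact ones:

```python
def coincidence_pct(discordant, total):
    """Percentage of items on which the two annotators agree."""
    return 100.0 * (1 - discordant / total)


ANNOTATIONS = 346e6  # rounded total from the slide
VARIANTS = 83e6      # rounded total from the slide

vep_only = f"{coincidence_pct(3364, ANNOTATIONS):.3f}"
cellbase_only = f"{coincidence_pct(4918, ANNOTATIONS):.3f}"
conflicting = f"{coincidence_pct(4990, VARIANTS):.3f}"
print(vep_only, cellbase_only, conflicting)
```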

Annotation for phased MNVs and CNVs

• Support for CNVs new in CellBase 4.5 Beta
  • Main challenge: support imprecise calling, matching against already reported CNVs (population frequencies, clinical variants)
  • Same annotation data as for the rest of variants: consequence type, population frequencies, etc.
  • Example CNV

• Support for MNVs and phased variants from CellBase 4.0
  • Consequence type depends on variants affecting the same codon
  • Variants are assigned a phase set (phased VCFs include the PS tag); all variants on the same phase set shall be processed together

Annotation of MNVs

• Example MNV: 17:270550:AACAG:TGCAA

• Decompose into single phased variants, members of the same phase set:

{"id":"17:270550:A:T","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}

{"id":"17:270551:A:G","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}

{"id":"17:270554:G:A","result":[{"codon":"caG/caA","proteinVariantAnnotation":{"reference":"GLN","alternate":"GLN"},"sequenceOntologyTerms":[{"accession":"SO:0001819","name":"synonymous_variant"}
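The decomposition step can be sketched as follows. This is not CellBase's implementation: the function name and the `phaseSet` key are hypothetical, only equal-length substitutions are handled, and using the MNV start position as the default phase set identifier is an assumption.

```python
def decompose_mnv(variant_id, phase_set=None):
    """Split chrom:pos:ref:alt into single-base variants sharing one phase set."""
    chrom, pos, ref, alt = variant_id.split(":")
    if len(ref) != len(alt):
        raise ValueError("only equal-length substitutions are handled")
    start = int(pos)
    if phase_set is None:
        phase_set = start  # assumption: reuse the MNV start as the phase set id
    return [
        {"id": f"{chrom}:{start + offset}:{r}:{a}", "phaseSet": phase_set}
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # positions where reference and alternate agree are skipped
    ]


singles = decompose_mnv("17:270550:AACAG:TGCAA")
print([v["id"] for v in singles])
```

The first two single-base changes fall in the same codon, which is why the output above reports the combined codon change cAA/cTG for both instead of annotating each in isolation.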

OpenCGA - Catalog

Metadata store and A&A for OpenCGA
• Manages roles, groups, ACLs
• Audit log
• LDAP integration
• Arbitrary schemas (annotation sets)

OpenCGA - Storage

Extensive capabilities to query across genotype and phenotype relationships

6 node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response times for regional queries
• Whole genome filtering queries for all individuals within seconds

Aspiration to be fully GA4GH compatible from v1.0

Platform for interpretation (under construction)

Key (personal) learnings

• There is great strength in multidisciplinary teams with specialisation, but those individuals who can span both biology/genetics and software engineering are pivotal: they connect the specialists
• Good software engineering practices also apply to bioinformatics, to name a few: designing, documenting, testing, support and service. Skipping them doesn’t really save you time
• I have become a big fan of using well-established technologies with rich ecosystems (e.g. Hadoop) rather than inventing new formats, data structures, toolchains

Final thoughts

• The future in human genetics will be underpinned by academic/industrial partnerships; both the task and the benefits are too big to go at it alone
• Genomic Medicine is just one of the pilots of a digital revolution in healthcare where artificial intelligence will complement/replace the diagnostic journey
• But genomics is the easy part, clinical data is the real challenge