G2P: Using machine learning to understand and predict ...2 Abstract To facilitate precision medicine...

G2P: Using machine learning tounderstandandpredictgenescausingrareneurologicaldisorders

JuanA.Botía1,2,SebastianGuelfi1,DavidZhang1,KarishmaD’Sa1,3,ReginaReynolds1,DanielOnah1, EllenM.McDonagh4, Antonio RuedaMartin4, Arianna Tucci4,5, Augusto Rendon4,6,HenryHoulden1*,JohnHardy1andMinaRyten1

1. Reta Lila Weston Research Laboratories, Department of Molecular Neuroscience,UniversityCollegeLondon(UCL)InstituteofNeurology,London,UK

2. DepartamentodeIngenieríadelaInformaciónylasComunicaciones.UniversidaddeMurcia,Murcia,Spain

3. Department of Medical & Molecular Genetics, School of Medical Sciences, King'sCollegeLondon,Guy'sHospital,London,UK

4. GenomicsEngland,GenomicsEngland,QueenMaryUniversityofLondon,London,UK5. DepartmentofMedicalBiotechnologyandTranslationalMedicine,Universitàdegli

Studi,Milan,viaFratelliCervi93,Segrate,20090,Milan,Italy6. DepartmentofHaematology,UniversityofCambridge,Cambridge,UK

*OnbehalfoftheNeurologyGenomicsEnglandClinicalInterpretationPartnership

.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint

Abstract

Tofacilitateprecisionmedicineandneuroscienceresearch,wedevelopedamachine-learning

technique that scores the likelihood that a gene,whenmutated,will causeaneurological

phenotype. We analysed 1126 genes relating to 25 subtypes of Mendelian neurological

diseasedefinedbyGenomicsEngland(March2017)togetherwith154gene-specificfeatures

capturinggeneticvariation,genestructureandtissue-specificexpressionandco-expression.

Werandomlyre-sampledgeneswithnoknowndiseaseassociationtodevelopbootstrapped

decision-treemodels,whichwereintegratedtogenerateadecisiontree-basedensemblefor

eachdiseasesubtype.Genesgeneratinglargernumbersofdistincttranscriptsandwithhigher

probabilityofhavingmissensemutationsinnormalindividualsweresignificantlymorelikely

tocauseneurologicaldiseases.Usingmouse-mutantphenotypicdatawetestedtheaccuracy

ofgene-phenotypepredictionsandfoundthat for88%ofalldiseasesubtypestherewasa

significant enrichment of relevant phenotypic abnormalities when predicted genes were

mutatedinmiceandinmanycasesmutationsproducedspecificandmatchingphenotypes.

Furthermore,usingonlynewlyidentifiedgenesincludedintheGenomicsEnglandNovember

2017 release, we assessed our gene-phenotype predictions and showed an 8.3 fold

enrichment relative to chance for correct predictions. Thus, we demonstrate both the

explanatoryandpredictivepowerofmachine-learning-basedmodelsinneurologicaldisease.

Introduction

The last10yearshas seenamazingprogress in the identificationofgenesassociatedwith

Mendelianformsofneurologicaldiseases1.Eachindividualgenediscoveryisvaluableforthe

patients affected, for the disease-based researchers and for the wider neuroscience

community. However, these gene discoveries are perhaps even more important when

consideredcollectively.Firstandforemostthegrowthofrarediseasegeneticshasmadeit

clearthatdysfunctionofthecentralandperipheralnervoussystemisafrequentoutcomeof

genetic disorders with approximately 50% of all rare diseases (3000-3500 conditions)

presentingwithsomeformofneurologicalabnormality2.Furthermore, ithasbecomeclear

that neurological disorders exhibit remarkable genetic heterogeneity. There are over 30

genes,whichwhenmutated give rise to hereditary spastic paraplegia3 and over 80 genes

associatedwithCharcotMarieToothdisease4.Finally,ithasbecomeapparentthatvariable

expressivityandatypicalpresentationsarenottheexception,buttherulewhenconsidering

neurogeneticdisorders2.

Theseobservationshaveledustohypothesisethatthereissomething“special”aboutgenes

associatedwithneurogeneticdisorders.Ormoreformallystated,wehypothesisethatgenes

withsignificanteffectsonthestructureandfunctionofthenervoussystemsharecommon

characteristicsnotpresentamongstgeneswhichwhenmutateddonotaffect thenervous

system.Clearlyidentifyingthesecommoncharacteristics(assumingtheyarepresent)would

beusefulfortwomainreasons.Firstly,understandingthecorecharacteristicsofgenesthat

give rise to neurological disorders would potentially help us to better understand the

pathophysiology of at least a subset of disorders. Secondly, the identification of these

commoncharacteristicswouldallowustoscanthegenomeforothergeneswiththesame

key properties, but which have not yet been associated with neurological disease simply

because the patient with this particular neurogenetic condition has not had appropriate

genetictestingorthesignificanceofthepathogenicvariantcouldnotberecognised.Since

these predicted gene-disease associations would efficiently encapsulate prior knowledge,

theycouldbeusedtoprioritisegeneticvariantsforfurtherinvestigationwhenmorestandard

analyseshavefailedtoyieldaresult,anincreasingclinicalchallenge.

While the identification of the core characteristics of a neuro-disease genemay seem an

impossible task, there is an, ever increasing, quantity of public omic data, which has the

potentialtoaddressthisquestion.Overthelast3yearsgenome-widedatasetshavebecome

everlarger,moreaccessible,moreconsistentandmorecomprehensivebothintermsofthe

depth of information offered and the quantity of data utilised. Public data sets include

information on rare variant frequency across all genes5, gene and isoform structure6, and

tissue-specificgeneexpression7.Hiddenwithintheserichdatasetscouldbekeyinformation

regardingtherelationshipbetweengenesandtheirfunctionaleffectsinhealthanddisease.

MachineLearning(ML)iswellsuitedtothetaskofidentifyingsuchhiddenphenomenaand

hasbeenusedsuccessfullytoaddresssimilarclassificationproblems,suchastheeffectsof

geneticvariantsonexon inclusion8andthepredictionofgenesassociatedwithautosomal

dominantdisorders9.Inthispaper,weaimtouseML-basedapproachestogeneratemodels

to predict new genes associated with rare neurogenetic disorders.We base our learning

experience on highly curated gene panels developed for the diagnosis of awide range of

neurological disorders. In this way, we aim to not only obtain novel insights into the

pathophysiological processes underlying some neurogenetic disorders, but also provide

predictiveinformation,whichcanberapidlytranslatedforthebenefitofpatients.

Results

Ageneticmapofhumanneurogeneticdisorders

Given that one of the major motivations of this study was the use of gene-phenotype

associationpredictionasameansofimprovingthediagnosticyieldofwholeexome/genome

sequencing for rare neurogenetic disorders,weused genepanels generatedbyGenomics

England (https://panelapp.genomicsengland.co.uk; March 31st 2017) for “Neurology and

neurodevelopmentaldisorders”.Thesegenepanelsarehighlycuratedwithover45clinicians

andscientistscontributing to thecurationprocess (SupplementaryTable1). Furthermore,

they are managed collectively to ensure quality across panels and aim to generate

conservative “diagnostic-grade” gene sets, chosen because variants within the genes are

capable of causing/explaining the specific neurological phenotype. Only gene panels with

morethan10“Green”(diagnostic-grade)geneswereusedinouranalysiswiththeexception

of the panel for intellectual disability, which contained 735 high confidence genes, but

encompassesaphenotypicallydiversediseaseset.Thisresultedintheinclusionof25panels

inadditiontoapanel includingallneuro-relatedgenestogether (SupplementaryTable2).

Collectively this equated to 1140 unique genes.We noted a high degree of gene overlap

amongstpanels,promptingustogenerateageneticmapoftheinter-relationshipsbetween

neurogenetic disorders based on “gene-sharing”. Interestingly, this “geneticmap” broadly

reflected recogniseddisease categories as exemplified by the clustering of neuromuscular

disorders(Figure1a).

Predictorsderivedfromgenomicandtranscriptomicdataarelargelyindependent

Eachofthe1140knownneurogeneticdisease-relatedgeneswereconsideredforMLmodel

developmentandweredescribedtogetherwithallotherproteincodinggenes(asdefinedby

Ensemblversion72)usingasetof“predictors”.Weleveragedfeaturesextractedfromlarge,

publicomicdatasetstoobtainthesepredictorsofgenestatus(diseaseversusnon-disease).

Wefocusedongenome-widedatasets,whichdonotincorporateanypriorknowledge,either

intheformofcurationorbiological/diseaseinformation,andwhichcouldprovidegene-based

metrics. This was a deliberate strategy in order to enable ML models to provide novel

explanatoryinformationnotcapturedbycurrentdisease-relatedpathwaysormodels.

Morespecifically,wecollatedthefollowingthreetypesofgene-specificinformation:i)gene-

specific variant frequency, ii) gene and transcript structure, and iii) tissue-specific

expression/co-expression of genes (Online Methods & Supplementary Table 3). Since

deleteriousvariantsareundernegativeselection,geneswithlessgeneticvariationthanwould

beexpectedunderarandommutationmodelmightbesupposedtobeofparticulardisease-

relevance.Thisinformationisalreadycapturedatthegene-levelforspecificmutationclasses

bytheExAC5databaseparametersExACpLI(theprobabilityofagenebeingintolerantofboth

heterozygousandhomozygousloss-of-functionvariants),ExACpRec(theprobabilityofbeing

intolerantofonlyhomozygousloss-of-functionvariants),andExACpMiss(aZ-scorereflecting

thelikelihoodofbeingintoleranttomissensemutations,seeOnlineMethods).Conversely,

tolerancetodeleteriousmutationsascapturedbyExACpNull(theprobabilityofbeingtolerant

ofbothheterozygousandhomozygousloss-of-functionvariants)mightbeexpectedtobean

attributeofnon-diseasegenes.AllfouroftheExACparameterswereincludedaspredictors

inourlearningset.

Wealsoexploredthepossibilitythatthecomplexityofgenestructureandtranscriptioncould

beusedtodistinguishbetweendiseaseandnon-diseasegenes.Weextractedorgenerated

measuresofstructuralcomplexityfromGENCODE(Ensemblversion72)andtheHEXEvent10

database.Thesemeasureofcomplexityincludedgenelength,introniclength,thenumberof

alternative transcripts generated, thenumberofpossibleuniqueexon-exon junctions, the

numberofalternative3’and5’sitesusedandthenumbersofcodingandnon-codinggenes

overlappingwiththegeneofinterest(OnlineMethods).

Giventhatdisease-associatedgenesareoftenhighlyexpressedinthetissuesofrelevancefor

thedisease11–16,weincludedmeasuresoftissue-specificexpressioninourlearningdataset.

Using transcriptomicdatageneratedby theGTEx25projectandcovering47human tissues

(including13brainregions,tibialnerveandskeletalmuscleamongstothers)wegenerated

measuresoftissue-specificgene-expressionandco-expression(OnlineMethods).Foreachof

the 47 human tissues we created a co-expression network using Weighted Gene Co-

expressionNetworkAnalysis17optimizedbyk-means18.Thisprovidedestimatesofeachgene’s

globalandlocalconnectivityinrelationtoallotherexpressedgenesinthetissue(ascaptured

bytheterms“adjacency”and“modulemembership”,OnlineMethods).Thisallowedusto

generatemeasures of the tissue-specific network properties of each gene and created an

additional141binarypredictors.

Interestingly,inspectionofthissetofpredictorsdemonstratedthattheabsolutenumberof

geneswith evidence of tissue-specific expression/co-expressionwas highly variable across

tissues (Figure 1b). While 10,483 genes showed testes-specific gene expression or co-

expression,only1699geneswereexpressedinatissue-specificmannerinliver.Furthermore,

the relative importance of tissue-specific gene expression, as distinct from co-expression

(Figure1b&1c)wasalsoveryvariableacrossthe47GTExtissues.Ifwefocusonbraintissue,

whereas49.7%oftissue-specificgenesforcerebellumweredefinedascerebellar-specificdue

tohighabsoluteexpressioninthetissue,inputamenthemajorityofspecificgeneexpression

wasdetectedthroughtheuniquelocalandglobalnetworkpropertiesofgenes(86.12%,Figure

Finally,weexploredall154predictorsacrossthemajorclasses,namelygenetic,structuraland

expression-basedpredictors,forevidenceofcorrelationamongstpredictors(Supplementary

Table4).Predictorswithinamajorclass(geneticvariation,genestructure,geneexpression,

geneco-expression)weremorecorrelatedtoeachotherthantopredictorsofadifferentclass

suggestingthatthemajorpredictorclassesusedwerecapturingorthogonaltypesofgene-

specificinformation(Figure2).

Measuresofgenecomplexityandintolerancetomissensevariantsdistinguishbetweendiseaseandnon-diseasegenesacrossmostdisorders

Asdescribedabovewegenerated154potentiallyusefulpredictorsofdisease-genestatefor

eachgenewithinthehumangenome. Inthefirst instancewewantedto findoutwhether

therewereanystatisticaldifferencesinpredictorvaluesbetweenestablisheddisease-related

andnon-diseasegenesusingsinglepredictorsalone.Thistestingwasperformedforeachof

the25diseasegenepanelsseparatelyandforallpredictorswhethernumericalorcategorical

innature(OnlineMethods).

Usingthisapproachwefoundsignificantdifferences(FDRcorrectedp-value<0.05)inatleast

asinglepredictorfor22ofthe25diseasegenepanels(88.0%,SupplementaryTable5).Of

the threemain typesofgene-specificpredictorsanalysed (gene-specificvariant frequency,

gene and transcript structure, and tissue-specific expression/co-expression of genes), we

foundthatgeneandtranscriptstructureappearedtobethemostuseful.Ofall25disease

genepanels72.0%showedasignificantdifferenceinpredictorvaluesforatleastoneofthe

measuresofgeneandtranscriptstructure.Consideringtranscriptcountalone(thenumberof

transcriptsproducedbyagene),wefoundthat56%ofallthediseasepanels(minimumFDR

correctedp-value=5.22x10-18for“Epilepsy-Plus”)hadasignificantlyhighertranscriptcount

fordisease-relatedascomparedtonon-diseasegenes(Figure3a,SupplementaryTable5).

Thesecondmost importantcategoryofpredictorwasgene-specificvariantfrequencywith

64.0%ofalldiseasepanelsshowingasignificantdifferenceinpredictorvaluesforatleastone

measure,withExACpMissaccountingforthelargestproportionofpanels(40.0%,Figure3b).

While tissue-specific expression/co-expression of genes could be highly discriminatory

predictors, this was restricted to small sets of related disorders. For example, specific

expressioninskeletalmusclewasahighlysignificantpredictorforarangeofprimarymuscle

disorders, including “Distalmyopathies” (p-value 2.78 x 10-32), “Congenitalmyopathy” (p-

value2.35x10-22)and“Rhabdomyolosisandmetabolicmuscledisease”(p-value2.94x10-15).

Similarly, specific expression in the frontal cortex could be used to distinguish between

epilepsy-related and non-disease genes (p-value 1.26 x 10-10 and p-value 3.53 x 10-7

respectively).

Thus,thisanalysisdemonstratedthatwhilegenesassociatedwithneurologicaldiseaseshave

commoncharacteristicsofbroadrelevancewithcomplexityoftranscriptstructurebeingthe

singlemost important discriminatory genic feature, there are also features operating in a

moredisease-specificmanner.

ML-basedclassifiersfordiseasegenepredictioncanbeoptimizedforaccuracyattheexpenseofgenomecoverage

Themajoraimofthisstudywastogeneratemachinelearningclassifiers,basedongenetic,

structural and expression-based predictors, to distinguish between genes which when

mutatedproduceaspecificneurologicalphenotypeversusthosethatdonot.Wewantedto

createsuchclassifiersforeachofthe25disordersasdefinedbyGenomicsEngland.Thiswas

challengingbecausenotonlydowerequireclassifierscapableofprovidingexplanatoryaswell

as predictive information, but the generation process had to be robust despite the large

disparityinthesizeofdisease(meannumberofgenesperpanel=45,range18-111)andnon-

diseasegenesets(4913genes,91.59%ofthetotalsetoflearningexamples).

Ourstrategywastoaddresstheseissuesintwoseparatesteps.Inthefirststepweevaluated

a range of learning paradigms to select one with a reasonably good trade-off between

predictiveandexplanatorypower.Inthesecondstep,wecreatedanensemblecomposedof

manysingleMLmodelsbasedonthe learningparadigmofchoice (as identified instep1).

Morespecifically,thefirststepinvolvedtestingarangeofMLapproachescoveringthemain

supervisedlearningparadigmsandavailablethroughtheCaretRpackage19(OnlineMethods).

WeobtainedROCestimates(asinglemeasure,whichconvenientlyintegratessensitivityand

specificity)foreachalgorithmappliedtoeachofthe25genepanels(SupplementaryFigures

1&2).Onthebasisof theseresults,weselectedJ4820 (OnlineMethods),adecisiontree-

basedalgorithm,duetoitsexplanatorycapabilitiesandhighpredictivepower.

Inthesecondstep,weaddressedtheimbalanceofpositiveandnegativelearningexamples

(averageratioofpositivetonegativelearningexamples=1:109)bycreatinganensembleof

J48modelsobtainedfromlearningdatasetswithequalnumbersofpositiveandnegativegene

examples.We created each learning dataset by randomly re-sampling non-disease genes

(Figure4a),whilekeepingalldiseasegenesunchanged.Inthiswayforeverydiseasecategory

wegenerated200separateJ48trees,eachtrainedusingthesamediseasegeneset,buta

different, randomly generated “non-disease” gene set. We integrated the 200 classifiers

generatedperpanelusingavotingapproachandoperatedasimplemajoritytodeterminethe

outcomeforagivengene.Thismeantthatforagivengene,wepredictedagene-phenotype

relationshipaspresent ifmostof theclassifiers “voteddisease”andabsent ifmostof the

classifiers“votednon-disease”.Weassessedwhethertheintegrationofdiseaseclassifiersto

generate a classifier ensemble improved performance. We found that for all 25 disease

categoriesusingaclassifierensembleimprovedtheROCvalues.Onaveragetherewasa4.1%

improvementinROCvalues(rangeof%improvement=0.6%-7.6%),relativetousingasingle

J48tree(SupplementaryFigure3).

Furthermore,theintegrationofdiseaseclassifiersusingasimplevotingapproachallowedus

to adjust the stringency of our classification system by changing the percentage of votes

required(termedthe“stringencyparameter,s”)toconfidentlyassignageneas“disease”or

“non-disease”. In this way, we converted a 2-output ensemble into a 3-output ensemble

adding “uncertain” as the third possible outcome, used when there was not enough

agreement amongst votes. To illustrate the effect of the stringency parameter (“s”) on

predictionof“disease”genes,wetestedarangeof“s”valuesacrossalldiseaseensembles

andallproteincodinggenes(Figure4b).Thisdemonstratedthatasthestringencyincreased,

the proportion of genes that could be classified either as “Disease” or “Non-disease”

decreased.Thus,increasingourconfidenceingene-phenotypepredictions,comesatthecost

of genome coveragewith the proportion of genes for whichwe generate an “uncertain”

outcomerisingrapidly.Therefore,dependingonthedownstreamuseofclassifiers(e.g.for

useinadiagnosticversusaresearchsetting)wewouldenvisagesituationswhereitwouldbe

sensible toadjust thestringency.For this reasonwehave releasedaweb interfacewhere

predictionscanbeobtainedatvariablelevelsofstringencyforalldiseases(www.XXXXXX).

Predictedgenesareenrichedforrelevantgeneontologytermsandcell-specificmarkers

Afterremovinggenesusedtotrainclassifiers,weappliedall26oftheclassifierensemblesto

theentiresetofhumanproteincodinggenesasdefinedwithinGENCODE(16,012).Usinga

stringencyof0.9foreachclassifier(i.e.predictDiseaseorNon-diseaseonlywhenthemajority

of identical predictions is equal or above 90%),we generated 4,834 new gene-phenotype

predictions(Figure5a,SupplementaryTable6).Onaverage,eachclassifierpredicted272new

diseasegenes(equatingto1.7%ofthegeneclassificationsattempted).Aswouldbeexpected

fromthehighnumbersofoverlappinggenesacrossdiseasepanels,therewasahighdegree

of sharing amongst gene predictions from different classifier ensembles. Nonetheless, all

classifierensemblesgeneratedpredictionswhichweredisease-specific(Figure5a).Wenoted

thatthenumberofnew“disease”predictionswaspositivelycorrelated(Pearsoncorrelation

=0.71,p-value<4.03x10-5)withthesizeoftheoriginaldiseasepanel(i.e.thenumberof

genesalreadyknowntobeassociatedwithdisease,Figure5a).Nosignificantcorrelationwas

detected when we considered non-disease predictions and this is consistent with our

expectation that the success ofML-based classifiers depends on the adequacy of positive

trainingexamples.

Giventhatnocellular,pathwayordisease-based informationwasusedwithinthe learning

data,wewantedtoinvestigatethebiologicalcharacteristicsofourpredictedgenes.Wedid

this firstbyassessingpredictedgenes forenrichmentofgeneontology terms21,aswellas

REACTOME22 and KEGG23 pathway annotation (Figure 5b, Online Methods). Using this

approachandconsideringdiseasegenepredictionsproducedbyeachensembleseparately,

we identified 2,345 unique significant terms (at an FDR corrected p-value of <0.05;

Supplementary Table 7). Interestingly genepredictionsmadeby the “all in one” classifier

(based on the entire set of disease genes) was highly enriched for terms relevant to the

nervous system with the top REACTOME pathway terms being “Neuronal System” (FDR-

correctedp-value=5.12x10-15),“TransmissionacrossChemicalSynapses”(FDR-correctedp-

value=5.12x10-9)and“Axonguidance”(FDR-correctedp-value=3.57x10-8,Supplementary

Table7).

Inspection of disease-specific enrichments also demonstrated the relevance of predicted

genes.Forexample,amongstgenespredictedbytheclassifierfor“Malformationsofcortical

development” there were significant enrichments for 966 terms alone. Focusing on

REACTOMEpathwayenrichments,weidentifiedalargenumberofsignificanttermsrelating

togrowthfactorsignallingandneuronalmigration(Figure5b,OnlineMethods).Inthecase

of the former, this included“Downstreamsignalingof activatedFGFR2” (FDR-correctedp-

value=2.21x10-9),asignallingsystemwhichhasbeenspecificallyimplicatedinthefunction

of outer radial glia24. This analysis also highlighted genes involved in axon guidance (FDR

corrected p-value = 3.93 x 10-16) and synapse formation (“Transmission across Chemical

Synapses”, FDR-corrected p-value = 2.49 x 10-16; “Glutamate Binding, Activation of AMPA

ReceptorsandSynapticPlasticity”,FDR-correctedp-value=4.77x10-12),consistentwiththe

known importance of neuronalmigration abnormalities in the pathophysiology of cortical

malformations25.Furthermore,acloserinvestigationofthesharingofsignificantREACTOME

termsacrosspredictedgenesets(generatedwithdifferentclassifiers)revealedthepossibility

of shared pathophysiological processes underlying disorders. For example, enrichment

analysesforgenespredictedbyboththe“Cerebrovasculardisease”and“Malformationsof

corticaldevelopment”classifiers,werehighlyenriched forgenes involved ingrowth factor

signalling(Figure5b),despitetherebeingnooverlapinthegenesusedaspositivelearning

examplesforclassifiergeneration.

We also examined predicted gene sets for evidence of enrichment of cell-specific gene

markers. This analysiswas performedusing gene expression signatures derived fromRNA

sequencing of purified cell types isolated from mouse cerebral cortex26 and covered

astrocytes, neurons, oligodendrocyte precursors, newly formed oligodendrocytes,

myelinating oligodendrocytes and endothelial cells. Using this approach and considering

diseasegenepredictionsproducedbyeachensembleseparately,weidentified22significant

cell-specific enrichments (FDR < 0.05, Supplementary Table 8, OnlineMethods). Inmany

cases,theseenrichmentswereconsistentwiththephenotypicfeaturesofthedisorder.For

example,genespredictedbytheclassifierfor“Cerebrovasculardisorders”weresignificantly

enriched for endothelial cell markers, while gene predictions for “Inherited whitematter

disorders”wereenrichedformarkersofoligodendrocyteprecursorcells.

Thus, these analyses provide evidence that the classifiers we generated capture some

importantelementsofthebiologyofneurogeneticdisordersandsoarecapableofgenerating

predictedgenes,whichsharethosekeybiologicalproperties.

Disruptingpredictedgenestendstoproduceneurologically-relevantphenotypesinmice

Inordertoexploretheaccuracyofgenepredictionsfurther,wemadeuseoftheMGIMouse-

PhenotypeDatabase27todeterminewhetherdisruptionofmouseorthologuesofpredicted

humangenesproducedphenotypesrelevanttotheneurologicaldiseasesstudied.Considering

the genes predicted by each of the 26 classifiers separately, we identified 1588 unique

significantly enrichedMammalian PhenotypeOntology (MPO) termsofwhich 28.7%were

directlyrelevanttothecentraland/ortheperipheralnervoussystem(definedassub-terms

of MP:0005386, behaviour/neurological phenotype; MP:0003631, nervous system;

MP:0005369,musclephenotype).Usinggenespredictedbythe“allinone”classifier(based

ontheentiresetofdiseasegenes),we identified173termswithsignificantenrichment in

mousemodelsystems(SupplementaryTable9),withthemostsignificantMPOtermsbeing

“abnormal synaptic transmission” (FDR-corrected p-value = 1.20 x 10-10) and “abnormal

nervoussystemphysiology”(FDR-correctedp-value=1.22x10-8).

WealsoexploredtheenrichmentofMPOtermsamongstmousemodelsrelevanttodisease-

specificclassifiers.Despitethechallengesinherentinthisanalysis,namelythatthisisacross-

species analysis using data frommice with both spontaneous and targeted mutations of

multiple types, we were able to identify examples where disruption of predicted genes

mirroredtheexpecteddiseasephenotypewithprecision.Forexample,usingtheGenomics

EnglandPanelfor“Malformationsofcorticaldevelopment”,whichconsistsof45genes,we

predictedafurther589genes(usingstringency0.9).Ofthe585relevantmouseorthologues,

mousephenotypicdatawasavailable for427genes.Using this informationwe foundthat

disruption of these genes in mouse models resulted in phenotypes highly enriched for

“abnormalcerebralhemispheremorphology”(FDR-correctedp-value=1.14x10-13),andmore

specifically“abnormalcerebralcortexmorphology”(FDR-correctedp-value=2.72x10-7).

GenespredictedusingML-basedclassifiersareenrichedfortruegene-phenotypeassociationsinhumans

Clearly the most important means of assessing the value of ML-based classifiers for the

predictionofgenescontributingtoneurogeneticconditions,istheidentificationofmutations

in predicted genes in patients with matching neurological phenotypes. We assessed this

systematicallybymakinguseoftheregularre-versioningofGenomicsEnglandpanelsandthe

continuousupdates available through theOnlineMendelian Inheritance inMan catalogue

(OMIM,www.OMIM.org).Inthecaseoftheformer,weidentifiedallhighconfidencegene-

phenotyperelationshipsaddedtoPanelAppafter theMarch31st2017versionreleaseand

untilNovember1st2017.Inthecaseofthelatter,wecollatedallgene-phenotypeassociations

included between 1st April 2017 and 5th January 2018. After excluding all OMIM gene-

phenotype associations classed as provisional, we manually curated the remaining

associations to ensure that the phenotypes could be mapped accurately to a Genomics

Englanddiseasepanel.Usingthisapproachweidentifiedatotalof32new,highconfidence

gene-phenotypeassociationsforwhichtherelevantML-basedclassifierprovideda“Disease”

or“Non-disease”prediction(applyingastringencyof0.9).Wecorrectlypredicted“Disease”

statusin17caseswiththeremaining15genespredictedas“Non-disease”(Supplementary

Table 10, ratio of “Disease”/”Non-disease” = 113.3%). Given that on average the ratio of

“Disease”to“Non-disease”predictionsamongsttherelevantML-basedclassifiersobtained

onthewholeproteincodinggenomeis13.9%,thisrepresentedan8.2foldenrichmentover

chance.

Furthermore,exploringindividualexamplesenabledustorecognisetheexplanatoryvalueof

classifierensembles.Forexample,usingtheML-basedensemblefor“Cerebellarhypoplasia”,

which was trained with 39 known disease genes, we correctly predict that mutations in

TBC1D23 are a cause of the disorder28. Interestingly, inspection of all the decision trees

demonstrated that the two most important predictors were cerebellum-specific gene

expressionandgenecomplexity,ascapturedbythenumberoftranscriptsproducedbyagene

(Figure6a,OnlineMethods).Infact,accountingforboththeusageofapredictor(Appearance

Index)anditsdepth(ameasureofitsaverageproximitytotheroot)acrossall200decision

trees,demonstratedthattheseweremoreimportantpredictorsthangene-specificmeasures

ofvariantfrequency,suchasExACpLiandExACpMissscores,whicharemorecommonlyused

toassessthepathogenicityofvariants.ConsistentwiththesefindingswenotedthatTBC1D23

hashighermRNAexpressionincerebellumascomparedtootherbrainregionsnotonlyinthe

adultbrainbutthroughoutmostofbraindevelopment(Figure6b).Furthermore,itproduces

ahigherthanexpectednumberoftranscriptcountwhenconsideringallproteincodinggenes

(Figure6c).

While these findings were encouraging, the limited availability of relevant novel gene-

phenotypeassociations,meantthatanymeasureofpredictionaccuracywouldbeunstable.

Withthisinmind,wealsoinvestigatedgene-phenotypeassociationsconsideredtohavelower

levelsofsupportingevidence,termed“Amber”(borderline)or“Red”(low)geneswithinthe

Genomics England PanelApp (PanelApp Handbook Version 5.7). For this list of 329 gene-

phenotypeassociationswecalculatedthepercentageof“Disease”votes(“Disease”quality

score)madeforeachgenebytherelevantclassifieranddemonstratedthatthesevalueswere

highly skewed towards higher values (Figure 6d, dark blue density plot). In contrast, the

distributionofcorrespondingvalues foracontrolgeneset,consistingofallproteincoding

geneswiththeexceptionofthoselistedintheNovember1st.2017GenomicsEnglandrelease

of neurogenetics panels, was negatively skewed (Figure 6d, light blue density plot). The

difference in these distributions of “Disease” quality scores was highly significant (Mann-

Whitneytwo-sampleunpairedtestp-value=2.2x10-16).Giventhatthelistoftestgeneswas

likely to be highly enriched for true gene-phenotype associations, this finding provides

additionalevidencethattheclassifierensembleswehavegeneratedcapturesomeifnotall

thekeycharacteristicsofdisease-associatedgenes.

Discussion

TheoverallaimofthisstudywastotestthehypothesisthatgenescontributingtoMendelian

formsofneurogeneticdisorderscanbeidentifiedusingarelativelylimitedsetofgene-based

predictors,whichdonotincorporatedirectlyorindirectlyanypriorbiologicalknowledge.We

achieve this aim through the generation of machine learning classifier ensembles for 25

neurogenetic disorders and their resulting predictions. We demonstrate that our set of

disease-specificgenepredictionsareenrichedforGOtermsandmolecularpathwaysalready

implicated in neurogenetic diseases. Furthermore,we show thatwhenmutated inmouse

modelsystemsthesepredictedgenesetsproducephenotypes,whicharehighlyenrichedfor

neurological abnormalities and which can mirror the expected human phenotype with

accuracy.Finallyandmostimportantlywedemonstratean8.2foldenrichmentoverchance

indiseasegenepredictionsforalimitedsetofrecentlyidentifiedneuro-diseasegenes.Given

thatourgenepredictionscouldpotentiallybeusedtoincreasetheyieldofdiagnosticwhole

exomesequencingbyhelping investigatorsprioritisegenesof interest,wehavemadeour

predictedgenesetsavailableandeasilysearchablethroughawebapplicationG2P(Gene2

Phenotype,accessibleatXXXXX).

Moving beyond the predictive value of our ML-based classifier ensembles, we provide

evidence of the explanatory power of this approach. In particular, the finding that genes

contributing toMendelian forms of neurogenetic tend to be complex in structure,with a

higherthanexpectednumberoftranscriptsanduniqueexon-exonjunctionsannotatedper

gene.Interestingly,thiswouldsuggestthatthesegenescouldbeparticularlypronetosplicing

mutations,whicharemoredifficulttorecogniseandassess.

Mostimportantly,anddespitethelimitationsofourapproachwedemonstratethevalueof

machine learning approaches in understanding neurogenetic disorders now that a critical

massofgenome-wideannotationandgenediscoverydataexists.Giventhatbothtypesof

dataareonlysettoincreaseinqualityandquantityacrossawiderangeofdisorderswewould

envisagethatML-basedapproachesofthiskindarelikelytoincreaseinimpactoverthenext

5years.

References

1. Boycott, K. M., Vanstone, M. R., Bulman, D. E. & MacKenzie, A. E. Rare-disease

genetics in the era of next-generation sequencing: discovery to translation. Nat. Rev.

Genet. 14, 681–691 (2013).

2. Warman Chardon, J., Beaulieu, C., Hartley, T., Boycott, K. M. & Dyment, D. A. Axons

to Exons: the Molecular Diagnosis of Rare Neurological Diseases by Next-Generation

Sequencing. Curr. Neurol. Neurosci. Rep. 15, (2015).

3. Klebe, S., Stevanin, G. & Depienne, C. Clinical and genetic heterogeneity in hereditary

spastic paraplegias: From SPG1 to SPG72 and still counting. Rev. Neurol. (Paris) 171,

505–530 (2015).

4. Fridman, V. & Reilly, M. Inherited Neuropathies. Semin. Neurol. 35, 407–423 (2015).

5. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature

536, 285–291 (2016).

6. Harrow, J. et al. GENCODE: The reference human genome annotation for The ENCODE

Project. Genome Res. 22, 1760–1774 (2012).

7. The GTEx Consortium et al. The Genotype-Tissue Expression (GTEx) pilot analysis:

Multitissue gene regulation in humans. Science 348, 648–660 (2015).

8. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic

determinants of disease. Science 347, 1254806–1254806 (2015).

9. Quinodoz, M. et al. DOMINO: Using Machine Learning to Predict Genes Associated

with Dominant Disorders. Am. J. Hum. Genet. 101, 623–629 (2017).

10. Busch, A. & Hertel, K. J. HEXEvent: a database of Human EXon splicing Events.

Nucleic Acids Res. 41, D118–D124 (2013).

11. Lage, K. et al. A large-scale analysis of tissue-specific pathology and gene expression of

human disease genes and complexes. Proc. Natl. Acad. Sci. 105, 20870–20875 (2008).

12. Kumar, V., Sanseau, P., Simola, D. F., Hurle, M. R. & Agarwal, P. Systematic Analysis

of Drug Targets Confirms Expression in Disease-Relevant Tissues. Sci. Rep. 6, (2016).

13. Kitsak, M. et al. Tissue Specificity of Human Disease Module. Sci. Rep. 6, (2016).

14. Wells, A. et al. The anatomical distribution of genetic associations. Nucleic Acids Res.

43, 10804–10820 (2015).

15. Cornish, A. J., Filippis, I., David, A. & Sternberg, M. J. E. Exploring the cellular basis of

human disease through a large-scale mapping of deleterious genes to cell types. Genome

Med. 7, (2015).

16. Ganegoda, G., Wang, J., Wu, F.-X. & Li, M. Prediction of disease genes using tissue-

specified gene-gene network. BMC Syst. Biol. 8, S3 (2014).

17. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network

analysis. BMC Bioinformatics 9, 559 (2008).

18. Botía, J. A. et al. An additional k-means clustering step improves the biological features

of WGCNA gene co-expression networks. BMC Syst. Biol. 11, (2017).

19. Max Kuhn. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 28,

20. Data mining: practical machine learning tools and techniques. (Elsevier, 2017).

21. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and

resources. Nucleic Acids Res. 45, D331–D338 (2017).

22. Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46,

D649–D655 (2018).

23. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a

reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462

(2016).

24. de la Torre-Ubieta, L. et al. The Dynamic Landscape of Open Chromatin during Human

Cortical Neurogenesis. Cell 172, 289–304.e18 (2018).

25. Barkovich, A. J., Dobyns, W. B. & Guerrini, R. Malformations of Cortical Development

and Epilepsy. Cold Spring Harb. Perspect. Med. 5, a022392–a022392 (2015).

26. Soreq, L. et al. Major Shifts in Glial Regional Identity Are a Transcriptional Hallmark of

Human Brain Aging. Cell Rep. 18, 557–570 (2017).

27. Blake, J. A. et al. Mouse Genome Database (MGD)-2017: community knowledge

resource for the laboratory mouse. Nucleic Acids Res. 45, D723–D729 (2017).

28. Vasli, N. et al. Recessive mutations in the kinase ZAK cause a congenital myopathy with

fibre type disproportion. Brain 140, 37–48 (2017).

29. Johnson, M. B. et al. Functional and Evolutionary Insights into Human Brain

Development through Global Transcriptome Analysis. Neuron 62, 494–509 (2009).

30. Kang, H. J. et al. Spatio-temporal transcriptome of the human brain. Nature 478, 483–

489 (2011).

31. Chakraborty, S., Panda, A. & Ghosh, T. C. Exploring the evolutionary rate differences

between human disease and non-disease genes. Genomics 108, 18–24 (2016).

32. Lawrence, M. et al. Software for Computing and Annotating Genomic Ranges. PLoS

Comput. Biol. 9, e1003118 (2013).

33. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression

data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

34. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and

genomics. Nat. Rev. Genet. 16, 321–332 (2015).

35. Reimand, J., Kull, M., Peterson, H., Hansen, J. & Vilo, J. g:Profiler--a web-based toolset

for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 35,

W193–W200 (2007).

36. Saito, R. et al. A travel guide to Cytoscape plugins. Nat. Methods 9, 1069–1076 (2012).

37. Zhang, Y. et al. An RNA-sequencing transcriptome and splicing database of glia,

neurons, and vascular cells of the cerebral cortex. J. Neurosci. 34, 11929–11947 (2014).

38. Durinck, S. et al. BioMart and Bioconductor: a powerful link between biological

databases and microarray data analysis. Bioinformatics. 21, 3439–3440 (2005).

39. Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical

and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol. 57, (1995).

FigureLegends

Figure1

A genetic "map" of human neurogenetic disorders: a) The genetic overlap betweenGenomicsEnglandpanelsisdisplayed,suchthateachnoderepresents1panelandaminimumof5genesisrequiredforanedge.Theradiusofnodesandwidthofedgesreflectsthenumberofgeneswithineachpanelandthenumberofsharedgenesbetweenpanelsrespectively.b)Barcharttoshowforeachofthe47GTExtissuesconsidered,howmanygenesexpressedbythetissueshowevidenceoftissue-specificityonthebasisofexpression,adjacencyormodulemembership values. We highlight in light grey tissues that appear as significant in theunivariate test for anyof these three features.c)Bar chart to show thatbrain tissuesarecharacterised by high variability in the relative proportion of tissue-specific expression,adjacencyandmodulemembership.Cerebellarhemisphereshowsthehighestproportionoftissue-specificsignalsduetoabsoluteexpression,whereasputamenischaracterisedbythehighest proportion of genes with tissue-specific co-expression. Genomics England panelnamesareshortenedasfollows:Amyotrophic lateralsclerosismotorneurondisease=ALSMND; Arthrogryposis = AMC; Brain channelopathy = BC; Cerebellar hypoplasia = CH;Cerebrovasculardisorders=CD;CharcotMarieToothdisease=CMT;Congenitalmusculardystrophy = CMD; Congenital myaesthenia = CMS; Congenital myopathy = CMP; Distalmyopathies=DM;Earlyonsetdementiaencompassingfrontotemporaldementiaandpriondisease = EODM FTD PrD; Early onset dystonia = EODS; Epilepsy Plus = EP; Epilepticencephalopathy=EE;Hereditaryataxia=HA;Hereditaryspasticparaplegia=HSP;Inheritedwhite matter disorders = IWMD; Intracerebral calcification disorders = ICD; Limb girdlemusculardystrophy=LGMD;Malformationsofcorticaldevelopment=MCD;Paediatricmotor

neuronopathies=PMN;ParkinsonDiseaseandComplexParkinsonism=PD;Rhabdomyolysisand metabolic muscle disorders = RMMD; Skeletal Muscle Channelopathies = SMC; andStructuralbasalgangliadisorders=SBGD.

Figure2

Thethreemajortypesofpredictors,namelygeneticvariation,geneandtranscriptstructure,andgeneexpression/co-expression,arelargelyindependent:CorrelationmatrixplottoshowthePearson’scorrelationcoefficientsbetweenthepredictorcategories,namelygeneticvariation(orange),geneandtranscriptstructure(green),tissue-specificgeneexpression(darkblue),tissue-specificmodulemembership(blue)andtissue-specificadjacency(lightblue).

Figure3

Transcript complexity and reduced tolerance tomissensemutations are a distinguishingfeatureofgenesassociatedwithmanytypesofrareneurogeneticdisorders:a)Plotshowingthatofthe25GenomicsEnglandpanelsconsidered56%hadasignificantp-value(redscale)oncomparingtranscriptcountvaluesbetweengenesinthepanelsandotherproteincodinggenes.b)Plotshowingthatofthe25GenomicsEnglandgenepanelsconsidered40%hadasignificantp-value(redscale)oncomparingExACpMissvaluesbetweengenesinthepanelsandotherproteincodinggenes.GenomicsEnglandpanelnamesareshortenedas follows:Amyotrophiclateralsclerosismotorneurondisease=ALSMND;Arthrogryposis=AMC;Brainchannelopathy = BC; Cerebellar hypoplasia = CH; Cerebrovascular disorders = CD; CharcotMarieToothdisease=CMT;Congenitalmusculardystrophy=CMD;Congenitalmyaesthenia= CMS; Congenital myopathy = CMP; Distal myopathies = DM; Early onset dementiaencompassing fronto temporal dementia and prion disease = EODM FTD PrD; Early onsetdystonia=EODS;EpilepsyPlus=EP;Epilepticencephalopathy=EE;Hereditaryataxia=HA;Hereditaryspasticparaplegia=HSP;Inheritedwhitematterdisorders=IWMD;Intracerebralcalcification disorders = ICD; Limb girdle muscular dystrophy = LGMD; Malformations ofcorticaldevelopment=MCD;Paediatricmotorneuronopathies=PMN;ParkinsonDiseaseandComplex Parkinsonism = PD; Rhabdomyolysis and metabolic muscle disorders = RMMD;SkeletalMuscleChannelopathies=SMC;andStructuralbasalgangliadisorders=SBGD.

Figure4

ConstructionandcalibrationofMLclassifierensemblesforgene-phenotypeprediction:a)DiagramtoshowtheworkflowusedtogenerateMLclassifierensembles.Theseensemblesare based on the creation of 200 learning data sets per a disease panel. Each data set isgeneratedbyresamplingthecontrolgenesettomaintaina1:1ratioofdiseaseandcontrolgenes.Thefinalmodelof200decisitiontreesareintegratedintoavotingschemethatcanbeused to prediction novel gene-phenotype associations with an estimate of confidence

(bottom right part) and to identify the genic features contributingmost commonly to thepredictionprocess(bottomleftpartofthefigure).b)Plottoshowtheeffectofapplyingastringency parameter on ML predictions generated from decision tree ensembles. Asstringencyvalues increase(x-axis)therelativeproportionsof"Disease","Non-disease"and"Uncertain"alsochange.Inparticular,theproportionofuncertainpredictionsincreases.

Figure5

Properties of gene-phenotype predictions generated by disease-specific ML classifierensembles: a) The number of number of genes containedwithin eachGenomics EnglanddiseasepanelandusedforthegenerationofMLclassifierensemblesisshowninthelowerpanel (positive example set). The total number of new genes predicted per panel using astringency of 0.9 is shown in the middle panel. The upper panel shows the number ofpredictedgenesperpanel,whichwereexclusivelypredictedbythedisease-specificclassifier.b) Plot to show that genes predicted to causemalformations of cortical development aresignificantlyenrichedforrelevantbiologicalprocesses.REACTOMEpathways(circles)withthe

mostsignificantenrichmentsamongstgenespredictedtocause"Malformationsofcorticaldevelopment" (labelled rounded square) are shown. REACTOME pathwayswhich are alsohighlighted by testing for enrichment of predicted genes for other disorders, namely "Cerebrovascular disorders" and "Brain channelopathies”, are also displayed (sharedenrichments=bluecircle;enrichmentsuniquetoeither“Cerebrovasculardisorders"or"Brainchannelopathies"=greycircles).

Figure6

Evidence for enrichment of true gene-phenotype associations amongst gene predictionsmadebydisease-specific classifiers:a)Cerebellum-specificgeneexpressionand transcriptcomplexity are the most important predictors used within the “Cerebellar hypoplasia”classifierensemble.Onthex-axisweploteachofthepredictorsused.Predictorimportanceisplottedonthey-axis,asdefinedasthesumoftheappearance indexofapredictor (thepercentage of times it is used to predict “Disease” across the 200 decision trees whichcomprisetheensembleclassifier,range=0.0-1.0)andthedepthofapredictor(ameasureofitsaverageproximitytotheroot, range=0.0–1.0).Thisplotdemonstratesthatthemostimportant predictors in the “Cerebellar Hypoplasia” ensemble classifier were cerebellar-specificgeneexpressionandtranscriptcount(thenumberoftranscriptsproducedbyageneas annotated in GENCODE version 72). b) Graph to show mRNA expression levels forTBC1D2329,30in6brainregionsduringthecourseofhumanbraindevelopment,basedonexonarrayexperimentsandplottedona log2 scale (yaxis). Thebrain regionsanalyzedare thestriatum(STR),amygdala(AMY),neocortex(NCX),hippocampus(HIP),mediodorsalnucleusofthethalamus(MD),andcerebellarcortex(CBC).ThisshowshigherexpressionofTBC1D23mRNAexpressionincerebellumascomparedtoallotherbrainregionsfromthelateprenatalperiodonwards.c)Densityplotshowingthedistributionoftranscriptcounts(thenumberoftranscriptsproducedbyageneasannotatedinEnsembleversion72)forallprotein-codinggenesandTBC1D23specifically(redline).d)Probabilitydensityplotstoshowthepercentageof"Disease"predictionsproducedbydecisiontreemodelsfora“test”setofgenesenrichedfortruegene-phenotypeassociations(darkblue)anda“control”setdefinedasthesetofallproteincodinggenes(withtheexceptionofthoselistedintheNovember1st.2017GenomicsEngland release of neurogenetics panels). The “test” set included all genes classified by

GenomicsEnglandashavingmoderate(amber)orlow(red)evidenceforassociationwithaneurogeneticdisorderaswellashighevidencegenesaddedtoapanelafterMarch31st2017anduntilNovember1st2017.Noneofthegenesusedinthis“test”setwereusedforlearningtheclassifierensembles.Therewasaclearandhighlysignificant(Mann-Whitneytwo-sampleunpaired testp-value=2.2x10-16)difference in thedistributionof “Disease”quality scorescomparing“test”and“control”genesets.

SupplementaryTables

SupplementaryTable1TableprovidingthenamesandaffiliationsofallindividualscontributingtothecurationofGenomicsEnglanddisease-associatedpanelsusedinthisstudy.

SupplementaryTable2Tabledescribingalldisease-associatedpanelsusedandtheircontents.

SupplementaryTable3Tableprovidingallpredictorsusedtodescribegeneswithomics.

SupplementaryTable4Tableprovidingr2andp-valuesfortherelationshipbetweenall154predictors.

SupplementaryTable5Tableshowingthepredictorswhichdifferedsignificantlybetweenandnon-diseasegenesforeachdisease-associatedpanel.

SupplementaryTable6TableprovidingalldiseasegenepredictionsmadeusingMLclassifierensembleswithastringencyof90%.

SupplementaryTable7Tableprovidingsignificantgeneontology,KEGGandREACTOMEpathwayenrichmentsforgenepredictionsmadeusingdisease-specificMLclassifierensembles.

SupplementaryTable8Tableprovidingsignificantcelltype-specificenrichmentsforgenepredictionsmadeusingdisease-specificMLclassifierensembles.

SupplementaryTable9TableprovidingsignificantMammalianPhenotypeOntologytermsforgenepredictionsmadeusingdisease-specificMLclassifierensembles.

SupplementaryTable10Tableshowingthepredictionsfor32new,highconfidencegene-phenotypeassociationsforwhichtherelevantML-basedclassifierprovideda“Disease”or“Non-disease”predictionatastringencyof0.9.

SupplementaryFigures

SupplementaryFigure1

ROC values for all Genomics England panels for each Caret Algorithm used for modellearning:BoxandwhiskerplotstoshowthedistributionofROCvalues(xaxis)forpredictionsgeneratedusingthetrainingdata(eachGEpanelgenescorrespondstoapointinthesample)perclassifier(yaxis).Whilemanyalgorithmsperformedwell,wechoseJ48becauseithasagoodtradeoffbetweenROCandexplanatorycapabilities.

ROC values for all Caret Algorithms used formodel learning on eachGenomics Englandpanel: In this box and whisker plot we show the distribution of ROC values (x axis) forpredictionsgeneratedusingthetrainingdataforasinglepanel(yaxis,eachGEpanelcomeswithitssizeingenes).Eachalgorithmcorrespondstoapointinthesampleset.TheplotshowsthatthelearningROCperformanceforallpanelsshowaperformancebetterthanrandombutwithroomforimprovement.

ImprovementofROCvaluesthroughtheuseofanensembleof200treesversusasingletree:Boxandwhiskerplotstodemonstratethatthereisanimprovement inROCvalues(xaxis,%ofimprovementinROC)ontestdataforallGEpanels(yaxis)whenusinganensembleofdecisiontreesascomparedtoasingletreetoidentifygene-phenotypeassociations.

Onlinemethods

IdentificationofgenescausingMendelianneurogeneticdisordersWeobtainedourlistofgenescausingMendelianneurogeneticdisordersfromtheGenomicsEngland Panel App (https://panelapp.genomicsengland.co.uk/; March 31st 2017). Weconsideredallpanelsincludedunderthelevel2heading“NeurologyandNeurodevelopmentaldisorders”.Wefurtheranalysedallgenepanelswithmorethan10“Green”(diagnostic-grade)genes with the exception of “Intellectual Disability” due to the very broad phenotypicspectrumassociatedwiththisgenepanel.Thisresultedintheuseof25genepanels,equatingto1140uniquegenes(SupplementaryTable2).

IdentificationofcontrolgenesetsGiven that we aim to use ML-based classifiers not only to predict new gene-phenotypeassociations,buttoprovideexplanatoryinformationaswell,werequirea“control”geneset,whichcanserveasnegativeexamplesthatbetterdelineatethroughcontrastthecorefeaturesofdiseasegenes.Wedefineacontrolgene,asanygenewithnocurrentlyknownassociationtoanydisease(whetherneurologicalinnatureorotherwise).Inthissettingweonlyconsiderdisease associations based onMendelian inheritance and accept that some of the genesdefinedas“control”maybeidentifiedinthefutureashavingadiseaseassociation.Thus,asinthecaseofChakrabortyandcolleagues31,wegeneratealistof“control”genesbyexcludingall genes currently contained within the OnlineMendelian Inheritance in Man database(https://www.omim.org/), Genetic Association Database(https://geneticassociationdb.nih.gov/) and the Human Gene Mutation Database(http://www.hgmd.cf.ac.uk/ac/index.php)fromtheEnsembl(version72)geneset.Usingthisapproachweidentify4260“control”genes.

Extractionofgene-basedmeasuresofgeneticconstraintThe Exome Aggregation Consortium (ExAC, http://exac.broadinstitute.org/) database hasbeenusedtomodelthedifferencebetweentheexpectedandobservedfrequencyofloss-of-functionandmissensemutationsinallgenes.Thisinformationhasbeencapturedthroughthegene-specificparameters,ExACpLi(theprobabilityofbeingloss-of-function,lof,intolerantofboth heterozygous and homozygous lof variants), ExACpRec (the probability of beingintolerantofhomozygous,butnotheterozygouslofvariants),ExACpNull(theprobabilityofbeingtolerantofbothheterozygousandhomozygouslofvariants)andExACpMiss(correctedmissenseZscore,takingintoaccountthathigherZscoresindicatethatthetranscriptismoreintolerantofvariation).ThesefourmeasuresofgeneticconstraintweredownloadedfromtheExACconsortiumsite(ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3/functional_gene_constraint/;releasedJanuary13th2017)foruseaspredictors.

ExtractionandgenerationofmeasuresofgenecomplexityAllmeasuresofgenecomplexitywereeitherextracteddirectlyorgeneratedfromEnsemblversion72.Theintroniclengthofeachgene(IntronicLength)wasdeterminedusinganad-hocscriptthatcalculatesintronlengthinbasepairsconsideringallpossibletranscriptspresentinthe gene. The total number of unique exon-exon junctions generated by each gene(numJunctions)wascalculatedusingtherefGenomeRpackagewithalltheknownjunctionssummed toobtain a valueper gene. Thenumberof genesoverlapping a geneof interest(countsOverlap)wasestimatedusingtheRpackageGRanges32,suchthatforeachgenethenumberofgenesoverlapping1bpormorewiththegeneofinterest,regardlessofstrand,wascounted. The number of overlapping protein coding genes (countsProtCodOverlap) was

generated in a similar manner except that only genes classed as “protein coding” withinEnsembl(version72)wereconsidered.

Generationofmeasuresoftissue-specificgeneexpression&co-expressionInordertogenerateasetofgene-basedmeasuresoftissue-specificgeneexpressionandco-expression, we used the GTEx7 V6 gene expression dataset (accessible fromhttps://www.gtexportal.org/home/)whichcomprises54tissuesandexpressiondatafor8557samplescovering56318Ensemblgenes.WefirstfilteredtheGTExdatasetbytissuesampleavailability using only those tissues with more than 60 samples available leading to theconsiderationof47tissues.Weinitiallyanalysedeachofthe47tissuesamplesetsseparatelybyfilteringgenesonthebasisofanRPKM>0.1(observedin>80%ofthesamplesforagiventissue).Wethencorrectedforbatcheffects,age,genderandRINusingComBat33.Finally,weusedtheresidualsoftheselinearregressionmodelstoconstructgeneco-expressionnetworksfor each tissue using WGCNA17 and post-processing with k-means18 to improve geneclustering.Thus,eachof the47tissueshadacorrespondingnetworkconsistingofasetofgene modules. For each gene in a network we obtained its module membership andadjacency,whereModuleMembership for a genegwasdefinedas the correlationof theresidual geneexpression for geneg and theeigengeneof themodule it belonged to, andadjacencywasdefined foragenegas thesumofallvalues for its row/columnwithin theTopologicalOverlapMatrixcreatedbyWGCNA.

Weusedtheresultinggeneexpressionandco-expressiondatatoobtain3measuresoftissue-specific expression. For each tissue any given gene was defined as having tissue-specificexpression if thegeneexpression inthattissuewas3.5foldhigherthantheaveragevalueacrossallother47GTExtissues.Similarly,agenewasdefinedashavingtissue-specificmodulemembershiporadjacency,ifitsmodulemembershiporadjacencywithintheco-expressionnetwork,was3.5foldhigherthanthatacrossallotherremainingGTExtissues.

UnivariateanalysisonsinglepredictorsWewantedto testwhethereachsinglepredictorhadanypredictivepoweronthetaskofidentifyingdiseasegenesamongstthewholeproteincodinggenome.Forsuchapurpose,wedefinedeachGenomicsEnglandgenepanel’s(GP)testdatasetas

𝐷#$ = {{𝑃(𝑔), 𝑦-}: 𝑔 ∈ 𝐺𝑃∗, 𝑃(𝑔) = (𝑃3(𝑔), 𝑃4(𝑔), … , 𝑃6(𝑔)), 𝑦- ∈ +, − },

suchthatGP*referstoasetofgenescompoundofthegenesintheGenomicsEnglandGPgenepanelplustherestofthewholeproteincodinggenome(seebelow),thePi(g)isthevalueofthei-thpredictorforgenegandygthelabelofgenegforthatgeneset, +,− ,where+means gene belonging to the Genomics England panel and – means otherwise. We usepredictorsofadifferentnature,namelycategoricalpredictors(includingallpredictorsrelatingto tissue-specific expression and co-expression) and numeric predictors (including allpredictorsrelatingtogeneticconstraintsandgenecomplexity).Accordingly,weappliedtwodifferentstatisticalteststolookforsignificantdifferencesbetweendiseaseandnon-diseasegenes.ForcategoricalpredictorsweusedFisher’sexacttests(FET)withacontrolgenesetconsistingofallgeneswithinEnsembl(version72)withtissue-specificdataavailableandwiththeexceptionofgeneswithin theGenomicsEngland“Neurologyandneurodevelopmentaldisorders” panels (17,315 genes). For numeric predictors we usedMann-Whitney U tests(MWU)withacontrolgenesetconsistingofallproteincodinggeneswithinEnsemblwithanon-nullvalueforthepredictorandwiththeexceptionofgeneswithintheGenomicsEngland

“Neurologyandneurodevelopmentaldisorders”panels(17,918genes).FETsanMWUtestswereperformedunderthenull thattherewasnosignificantdifference inpredictorvaluesfoundindiseasegenepanelsincomparisontothecontrolgeneset.WereportBonferroni-Hochbergcorrectedp-valuesinallcases.

Machineleaningtogeneratedisease-specificclassifiersWe used a supervised machine learning (ML) approach34 to inductively model gene-phenotypeassociationsandsogenerateclassifiersthatcandistinguishbetweengenesthathaveanassociationwithaneurologicalphenotypeandthosethatdonot.Formallystated,this meant that given a set of genes already causally linked to a specific neurologicalphenotypeanddefinedbyaGenomicsEnglandpanel,GP,wesoughtaclassifierCGP,suchthat,for anygene𝑔, it predictedwhether it toowouldhaveagene-phenotypeassociationanddenoteitintheformofafunction,asfollows:

𝐶#$: 𝑃 → {+,−},

and 𝑃 is a set of 𝑃; predictors such that 𝑃(𝑔) = (𝑃3(𝑔), 𝑃4(𝑔), … , 𝑃6(𝑔)) is a vector ofattributes of gene𝑔 and the output of the classifier is then 𝐶#$ 𝑔 = + if the classifierpredicts g to have a gene-phenotype association for phenotype GP and 𝐶#$ 𝑔 = − otherwise.

TheglobalMLanalysisisdividedintotwomainphases.Inthe1stphaseweseekforthebestpossible learning algorithm we have available given requirements on prediction accuracy(mainlyROCvalueson10-foldcross-validation)andmodelinterpretability.Inthe2ndphase,weuseahighnumberofsuchmodelsintoavotingbasedintegrationensembletoaccountfortheacuteimbalanceintheproportionofpositiveandnegativeexamplesinthelearningdata.

1stphase:identificationofthebestMLalgorithm

Using the Caret R package19 we constructed simple classifiers𝐶#$ for 14 ML techniquesincludingthemainsupervisedlearningparadigmsandallthe25GenomicsEnglandworkingpanels.ThelearningdataforeachpanelwascompoundbyalltheHighEvidencediseasegenesbelonging to thepanelandall theChakrabortycontrolgenes,4260. TheCaretalgorithmsusedwere: (1) trees/rules based algorithms (C5.0Tree, LMT, J48,ONeR, Rpart), (2) neuralnetwork based algorithms (NNet, GlmNet), (3) linear regression based algorithms (PLR,RRLDA, svmLinear), a bayes reasoning approachwithNaiveBayes and (5) ensemble basedalgorithms(AdaBoost,Boruta,Rborist).WeevaluatedeachCaretcandidateMLalgorithmoneach𝐶#$ dataset using repeated cross-validation (10 fold) and automatic hyperparameterevaluationasprovidedbyCaret.ROC,specificityandsensitivityforallCaretalgorithmsappliedtoallGenomicsEnglandpanelswereevaluated(SupplementaryFigure1,resultssegregatedbyalgorithm;SupplementaryFigure2,resultssegregatedbyGenomicsEnglandpanel).WhileRboristandBorutashowedsignificantlybetterROCvaluesthanJ48(P<0.0008andP<0.02respectivelywithapairedt-test),J48stillhadahighpredictivepower(only9-10%lowerthanRborist and Boruta). Given that this algorithm had much greater explanatory power, allsubsequentanalyseswereperformedwithJ48.

2ndphase:constructionofthebestpredictorbasedonJ48

WeusedtheJ48algorithmwithinawrapper(i.e.anensemble inMLterminology)thatwedeveloped inorder todealwith thedisproportionatenumberofnegativeexamples inourlearningset.ThiswrapperinvolvedthegenerationandintegrationofmanyMLmodelsfrom

multiple learning data sets, allwith the same set of positive examples, butwith differentnegativeexamplesgeneratedthroughrandomre-samplingoftheChakrabortycontrolgeneset.Formallystated,thismeantthatgivenapanel𝐺𝑃,anditslearningdatadefinedas:

𝐷#$ = 𝑃 𝑔 , 𝑦- : 𝑔 ∈ 𝐺𝑃∗, 𝑃 𝑔 = 𝑃3 𝑔 , 𝑃4 𝑔 , … , 𝑃6 𝑔 , 𝑦- ∈ +,− ,

suchthatGP*refersnowtoasetofgenescompoundofthegenesintheGenomicsEnglandGPgenepanelaspositiveexamplesplustheChakrabortygenesasnegativeones,wecreated (𝐷#$3, 𝐷#$4, … , 𝐷#$4<<) learningdatasets,withidenticalpredictors 𝑃 butnow,the {𝑦-} hada1:1 ratioofpositive (+) andnegativeexamples (-). Thesetofpositiveexamples ineachlearningdatasetwasidenticaltothatin 𝐷#$, butthenegativeexamplesweresampleswithreplacementfromtheChakrabortycontrolgenes.

We applied J48 to each of the learning datasets (𝐷#$3, 𝐷#$4, … , 𝐷#$4<<) to obtain (𝐶#$3, 𝐶#$4, … , 𝐶#$4<<) classifiers, which we aggregated into a single classifier, 𝐶#$ =Δ(𝐶#$3, 𝐶#$4, … , 𝐶#$4<<), whereΔ wasafunctiontointegrateresponsesfromall individualclassifiers.Δhastobedesignedtakingintoaccountthattheensemblecompoundofthe200MLmodelswithhavealimitedpredictivepowerforanumberofreasons.Firstly,thenumberofdiseasegenestolearnfromwillbelimited,andmostlikelyincomplete,forallpanels.Andsecondly,likelyamongstthegenesweconsiderascontrolstherewillbediseasegeneswedonotknowyet.Therefore,wemustcomeupwithastrategythatdealswiththat.Andwedefinea stringency parameter to control the eagerness with which the ensemble generates aprediction,i.e.DiseaseorNon-diseasebutalsotoincludeathirdoutcome:Uncertain.Andwedefine a parameter called stringency, let it be denoted with s in [0,1] that works in thefollowingway: given a gene g, and the 200models generated from a gene panel GP,wegenerate200predictions𝐶#$; 𝑔 , 𝑖 = 1, … ,200.EitherDiseaseorNondisease.LetpdandpntheproportionofDiseaseandNon-diseasepredictionsforg,suchthatpd+pn=1.Ifpd>pn,theoutcomeofΔ(𝐶#$; 𝑔 , 𝑖 = 1, … ,200)willbeDiseasewhenpd>s.Analogouslywhenpn>pd.TheoutcomewillbeUncertainotherwise.

TestingforannotationtermenrichmentamongstpredictedgenesWe used gProfileR35 to investigate enrichment of Gene Ontology, REACTOME and KEGGpathways annotation terms amongst predicted gene sets. We included IEA (InferredElectronic Annotations) and used the gSCS test developed by the authors to assess forannotation term enrichment. The graphical representation of the REACTOME termenrichmentwasbasedonthemostsignificanttermsasreportedbygProfileRandbyusingCytoscape3.536.TheenrichmentofMammalianPhenotypeOntologytermsamongstmousemodelswithmutationsinmouseorthologuesofpredicteddiseasegeneswasperformedusingtheToppFunfunctionwithinToppGene(https://toppgene.cchmc.org/,REF)andapplyingthedefaultsettings.

Testingforenrichmentofgenesexpressedinacell-specificmanneramongstpredictedgenesWeobtainedcell-typespecificgenelistsrelevanttobrainfromSoreqetal.26.Theselistswerebased on the analysis of RNA-sequencing data from purified cells isolated from mousecerebralcortex37andcoveredastrocytes,neurons,oligodendrocyteprecursors,newlyformedoligodendrocytes,myelinatingoligodendrocytes,microgliaandendothelialcells.Genesthatappearedinmorethanonecelltype-specificgenelistwereremovedfromtheanalysis,andtheremaininggeneswereconvertedtohumanorthologsusingthebiomaRtpackage38.We

usedtheFisher’sExactTesttocheckforenrichmentofcell-specificmarkerswitheachsetofpredicted genes. The false discovery rates (FDR) were calculated using the Benjamini-Hochbergprocedure39andanFDRthresholdof5%wasusedtoassessthesignificanceoftheenrichmentobserved.

Assessingtheimportanceofpredictorsindisease-specificclassifierensemblesGivena setof trees fromanensemble,wemeasure the importanceofeachMLpredictorwithintheensembleonthebasisof its relevancewhenclassifyinggenesas“Disease”.Webase the calculation of importance on two different properties of a predictorwhen usedwithinasetoftrees,namelythefrequencyofappearanceacrossthetreesandtheaveragedepthofthepredictorwithineachtree.Clearly,whenapredictorappearshighinatree(i.e.neartotherootoreventherootitself)anditssubtreeleadstoahigherproportionofgenesclassifiedas“Disease”,thepredictorhasrelevancefordiseaseclassification.Moreover,whentheattributeisusedmanytimesacrosstreesitisalsoimportant.Wecalculatetwomeasurestoreflecttheseproperties.Notethateachnon-leafnodeonatreeisatest(positivetestinthiscase)forapredictor.Leafnodesarelabels,either“Disease”or“Non-disease”.Inregardtodepth,givenapredictorpandtreet,withamaximumofnnodesbetweentherootandanyleave,depthisdefinedforthepredictorpinthetreetasthemeanofthep1,p2,…,pmvalueswhenpappearsmtimesatthetreetandpiisnminusthenumberofnodesfromtheroottothatnodeatthei-thappearanceofthepredictorwithinthetree.Depthisalwaysin[0,1].With respect toappearance, theappearanceof apredictorp in a treesensemble issimplyitsfrequencyofappearanceasrootnodeofasubtreewithhigherproportionofgenesclassifiedas“Disease”forthatsubtreegeneratingavaluein[0,1].Asasinglemeasure,wesimplyaddthesetwomeasures.Asasinglemeasurewesimplyaddthesetwovalues.

CodeavailabilitystatementThecodewehavedevelopedforthispaperisavailablerightnowuponrequest.

Weareworkingonmakingiteasiertouse.Itwillbereadybeforepublication.

DataavailabilitystatementAllthenewdataproducedinthispaperhasbeenmadeavailablethroughthesupplementarymaterialsaccessiblefromthesubmission.

AcknowledgementsMinaRyten,David ZhangandKarishmaD’Sawere supportedby theUKMedicalResearchCouncil(MRC)throughtheawardofTenure-trackClinicianScientistFellowshiptoMinaRyten(MR/N008324/1).SebastianGuelfiwassupportedbyAlzheimer’sResearchUKthroughtheawardofaPhDFellowship(ARUK-PhD2014-16).ReginaReynoldswassupportedthroughtheawardofaLeonardWolfsonDoctoralTrainingFellowshipinNeurodegeneration.

Henry Houlden was supported by the UKMedical Research Council, theWellcome Trust,MuscularDystrophyUK,MuscularDystrophyAssociationUSAandtheNationalInstituteforHealthResearch.

Thisresearchwasmadepossiblethroughaccesstothedataandfindingsgeneratedbythe100,000GenomesProject.The100,000GenomesProjectismanagedbyGenomicsEnglandLimited (a wholly owned company of the Department of Health). The 100,000 GenomesProject is funded by the National Institute for Health Research and NHS England. TheWellcome Trust, Cancer ResearchUK and theMedical Research Council have also fundedresearch infrastructure. The100,000GenomesProjectusesdataprovidedbypatientsandcollectedbytheNationalHealthServiceaspartoftheircareandsupport.

AuthorcontributionsJuanBotíageneratedthemachinelearningclassifierensembles.JuanBotía,SebastianGuelfi,David Zhang, Karishma D’Sa, Regina Reynolds andMina Ryten assessed and analysed theoutputofclassifierensembles.DanielOnahandJuanBotíadesignedandimplementedawebservertoaccesspredictions.

EllenMcDonaghandAntonioRuedaMartinconceived,designed,builtandcreatedPanelAppwhere 800 registered reviewers worldwide have supplied data or curated 167 panels forgenome analysis and helped establish 213 reviewable panels with 4104 genes. AugustoRendoncontributedtothedesignofPanelAppandcoordinatedthedevelopmentandcurationefforts.

Arianna Tucci and Henry Houlden (on behalf of the Neurology Genomics England ClinicalInterpretationPartnership)assistedandledthecurationofgenepanels.

JuanBotía,DavidZhang,JohnHardyandMinaRytenpreparedthemanuscript.JuanBotíaandMinaRytenconceivedanddesignedtheproject.

ConflictsofInterestTheauthorshavenoconflictsofinteresttodeclare.

G2P: Using machine learning to understand and predict ...2 Abstract To facilitate precision medicine...

Documents

Machine Learning-Deep Learning Fondement et biaisfiles.meetup.com/19203187/FondetBiaisMLDLonline.pdf · Machine Learning-Deep Learning Fondement et biais Machine learning Aix-Marseille

Machine Learning - apiacoa.orgapiacoa.org/publications/teaching/machine-learning/machine... · Machine Learning Fabrice Rossi ... I chess program ... Machine programming I climbing

Electronic G2P Payment Systems

Machine Learning Chapter 11. 2 Machine Learning What is learning?

Machine Learning in Logistics: Machine Learning Algorithmsltu.diva-portal.org/smash/get/diva2:1118764/FULLTEXT02.pdf · Machine Learning in Logistics: Machine Learning ... Machine

Introducing Machine Learning · 4 ntroducing Machine Learning How Machine Learning Works Machine learning uses two types of techniques: supervised learning, which trains a model on

Jixie Zhang For g2p Collaboration Hall A Collabortion Meeting, June 2012 Status Update G2P(E08-027)

Learning with large datasets Machine Learning Large scale machine learning

Machine Learning for NLP - Ethics and Machine Learning

HALLA G2P readiness review

Section 1Section 1 Machine Learning basic concepts · Machine Learning TutorialMachine Learning Tutorial ... Section 1Section 1 Machine Learning basic concepts ... Learning Tutorial

Machine Learning on Spark - UC Berkeley AMP Campampcamp.berkeley.edu/.../Machine-Learning-on-Spark... · Machine Learning on Spark Shivaram Venkataraman ... Machine learning algorithms

CS7267 Machine Learning introduction to machine learning

Machine Learning: Machine Learning: Introduction Introduction

Machine Learning with MATLAB · 2 Agenda Machine Learning Overview Machine Learning with MATLAB –Unsupervised Learning Clustering –Supervised Learning Classification Regression

The g2p Experiment

Machine Learning - 0. Overview€¦ · Machine Learning 1. What is Machine Learning? One area of research, many names (and aspects) machine learning historically, stresses learning

Machine Learning Probabilistic Machine Learning · Machine Learning Probabilistic Machine Learning learning as inference, Bayesian Kernel Ridge regression = Gaussian Processes, Bayesian

Machine Learning with MATLAB - … · 2 Agenda Machine Learning Overview Machine Learning with MATLAB –Unsupervised Learning Clustering –Supervised Learning Classification

Y. R. Roblin, G2P review, 1/17/2011 2011 HALLA G2P readiness review Yves Roblin, CASA