View
6
Download
0
Category
Preview:
Citation preview
1
G2P: Using machine learning tounderstandandpredictgenescausingrareneurologicaldisorders
JuanA.Botía1,2,SebastianGuelfi1,DavidZhang1,KarishmaD’Sa1,3,ReginaReynolds1,DanielOnah1, EllenM.McDonagh4, Antonio RuedaMartin4, Arianna Tucci4,5, Augusto Rendon4,6,HenryHoulden1*,JohnHardy1andMinaRyten1
1. Reta Lila Weston Research Laboratories, Department of Molecular Neuroscience,UniversityCollegeLondon(UCL)InstituteofNeurology,London,UK
2. DepartamentodeIngenieríadelaInformaciónylasComunicaciones.UniversidaddeMurcia,Murcia,Spain
3. Department of Medical & Molecular Genetics, School of Medical Sciences, King'sCollegeLondon,Guy'sHospital,London,UK
4. GenomicsEngland,GenomicsEngland,QueenMaryUniversityofLondon,London,UK5. DepartmentofMedicalBiotechnologyandTranslationalMedicine,Universitàdegli
Studi,Milan,viaFratelliCervi93,Segrate,20090,Milan,Italy6. DepartmentofHaematology,UniversityofCambridge,Cambridge,UK
*OnbehalfoftheNeurologyGenomicsEnglandClinicalInterpretationPartnership
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
2
Abstract
Tofacilitateprecisionmedicineandneuroscienceresearch,wedevelopedamachine-learning
technique that scores the likelihood that a gene,whenmutated,will causeaneurological
phenotype. We analysed 1126 genes relating to 25 subtypes of Mendelian neurological
diseasedefinedbyGenomicsEngland(March2017)togetherwith154gene-specificfeatures
capturinggeneticvariation,genestructureandtissue-specificexpressionandco-expression.
Werandomlyre-sampledgeneswithnoknowndiseaseassociationtodevelopbootstrapped
decision-treemodels,whichwereintegratedtogenerateadecisiontree-basedensemblefor
eachdiseasesubtype.Genesgeneratinglargernumbersofdistincttranscriptsandwithhigher
probabilityofhavingmissensemutationsinnormalindividualsweresignificantlymorelikely
tocauseneurologicaldiseases.Usingmouse-mutantphenotypicdatawetestedtheaccuracy
ofgene-phenotypepredictionsandfoundthat for88%ofalldiseasesubtypestherewasa
significant enrichment of relevant phenotypic abnormalities when predicted genes were
mutatedinmiceandinmanycasesmutationsproducedspecificandmatchingphenotypes.
Furthermore,usingonlynewlyidentifiedgenesincludedintheGenomicsEnglandNovember
2017 release, we assessed our gene-phenotype predictions and showed an 8.3 fold
enrichment relative to chance for correct predictions. Thus, we demonstrate both the
explanatoryandpredictivepowerofmachine-learning-basedmodelsinneurologicaldisease.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
3
Introduction
The last10yearshas seenamazingprogress in the identificationofgenesassociatedwith
Mendelianformsofneurologicaldiseases1.Eachindividualgenediscoveryisvaluableforthe
patients affected, for the disease-based researchers and for the wider neuroscience
community. However, these gene discoveries are perhaps even more important when
consideredcollectively.Firstandforemostthegrowthofrarediseasegeneticshasmadeit
clearthatdysfunctionofthecentralandperipheralnervoussystemisafrequentoutcomeof
genetic disorders with approximately 50% of all rare diseases (3000-3500 conditions)
presentingwithsomeformofneurologicalabnormality2.Furthermore, ithasbecomeclear
that neurological disorders exhibit remarkable genetic heterogeneity. There are over 30
genes,whichwhenmutated give rise to hereditary spastic paraplegia3 and over 80 genes
associatedwithCharcotMarieToothdisease4.Finally,ithasbecomeapparentthatvariable
expressivityandatypicalpresentationsarenottheexception,buttherulewhenconsidering
neurogeneticdisorders2.
Theseobservationshaveledustohypothesisethatthereissomething“special”aboutgenes
associatedwithneurogeneticdisorders.Ormoreformallystated,wehypothesisethatgenes
withsignificanteffectsonthestructureandfunctionofthenervoussystemsharecommon
characteristicsnotpresentamongstgeneswhichwhenmutateddonotaffect thenervous
system.Clearlyidentifyingthesecommoncharacteristics(assumingtheyarepresent)would
beusefulfortwomainreasons.Firstly,understandingthecorecharacteristicsofgenesthat
give rise to neurological disorders would potentially help us to better understand the
pathophysiology of at least a subset of disorders. Secondly, the identification of these
commoncharacteristicswouldallowustoscanthegenomeforothergeneswiththesame
key properties, but which have not yet been associated with neurological disease simply
because the patient with this particular neurogenetic condition has not had appropriate
genetictestingorthesignificanceofthepathogenicvariantcouldnotberecognised.Since
these predicted gene-disease associations would efficiently encapsulate prior knowledge,
theycouldbeusedtoprioritisegeneticvariantsforfurtherinvestigationwhenmorestandard
analyseshavefailedtoyieldaresult,anincreasingclinicalchallenge.
While the identification of the core characteristics of a neuro-disease genemay seem an
impossible task, there is an, ever increasing, quantity of public omic data, which has the
potentialtoaddressthisquestion.Overthelast3yearsgenome-widedatasetshavebecome
everlarger,moreaccessible,moreconsistentandmorecomprehensivebothintermsofthe
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
4
depth of information offered and the quantity of data utilised. Public data sets include
information on rare variant frequency across all genes5, gene and isoform structure6, and
tissue-specificgeneexpression7.Hiddenwithintheserichdatasetscouldbekeyinformation
regardingtherelationshipbetweengenesandtheirfunctionaleffectsinhealthanddisease.
MachineLearning(ML)iswellsuitedtothetaskofidentifyingsuchhiddenphenomenaand
hasbeenusedsuccessfullytoaddresssimilarclassificationproblems,suchastheeffectsof
geneticvariantsonexon inclusion8andthepredictionofgenesassociatedwithautosomal
dominantdisorders9.Inthispaper,weaimtouseML-basedapproachestogeneratemodels
to predict new genes associated with rare neurogenetic disorders.We base our learning
experience on highly curated gene panels developed for the diagnosis of awide range of
neurological disorders. In this way, we aim to not only obtain novel insights into the
pathophysiological processes underlying some neurogenetic disorders, but also provide
predictiveinformation,whichcanberapidlytranslatedforthebenefitofpatients.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
5
Results
Ageneticmapofhumanneurogeneticdisorders
Given that one of the major motivations of this study was the use of gene-phenotype
associationpredictionasameansofimprovingthediagnosticyieldofwholeexome/genome
sequencing for rare neurogenetic disorders,weused genepanels generatedbyGenomics
England (https://panelapp.genomicsengland.co.uk; March 31st 2017) for “Neurology and
neurodevelopmentaldisorders”.Thesegenepanelsarehighlycuratedwithover45clinicians
andscientistscontributing to thecurationprocess (SupplementaryTable1). Furthermore,
they are managed collectively to ensure quality across panels and aim to generate
conservative “diagnostic-grade” gene sets, chosen because variants within the genes are
capable of causing/explaining the specific neurological phenotype. Only gene panels with
morethan10“Green”(diagnostic-grade)geneswereusedinouranalysiswiththeexception
of the panel for intellectual disability, which contained 735 high confidence genes, but
encompassesaphenotypicallydiversediseaseset.Thisresultedintheinclusionof25panels
inadditiontoapanel includingallneuro-relatedgenestogether (SupplementaryTable2).
Collectively this equated to 1140 unique genes.We noted a high degree of gene overlap
amongstpanels,promptingustogenerateageneticmapoftheinter-relationshipsbetween
neurogenetic disorders based on “gene-sharing”. Interestingly, this “geneticmap” broadly
reflected recogniseddisease categories as exemplified by the clustering of neuromuscular
disorders(Figure1a).
Predictorsderivedfromgenomicandtranscriptomicdataarelargelyindependent
Eachofthe1140knownneurogeneticdisease-relatedgeneswereconsideredforMLmodel
developmentandweredescribedtogetherwithallotherproteincodinggenes(asdefinedby
Ensemblversion72)usingasetof“predictors”.Weleveragedfeaturesextractedfromlarge,
publicomicdatasetstoobtainthesepredictorsofgenestatus(diseaseversusnon-disease).
Wefocusedongenome-widedatasets,whichdonotincorporateanypriorknowledge,either
intheformofcurationorbiological/diseaseinformation,andwhichcouldprovidegene-based
metrics. This was a deliberate strategy in order to enable ML models to provide novel
explanatoryinformationnotcapturedbycurrentdisease-relatedpathwaysormodels.
Morespecifically,wecollatedthefollowingthreetypesofgene-specificinformation:i)gene-
specific variant frequency, ii) gene and transcript structure, and iii) tissue-specific
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
6
expression/co-expression of genes (Online Methods & Supplementary Table 3). Since
deleteriousvariantsareundernegativeselection,geneswithlessgeneticvariationthanwould
beexpectedunderarandommutationmodelmightbesupposedtobeofparticulardisease-
relevance.Thisinformationisalreadycapturedatthegene-levelforspecificmutationclasses
bytheExAC5databaseparametersExACpLI(theprobabilityofagenebeingintolerantofboth
heterozygousandhomozygousloss-of-functionvariants),ExACpRec(theprobabilityofbeing
intolerantofonlyhomozygousloss-of-functionvariants),andExACpMiss(aZ-scorereflecting
thelikelihoodofbeingintoleranttomissensemutations,seeOnlineMethods).Conversely,
tolerancetodeleteriousmutationsascapturedbyExACpNull(theprobabilityofbeingtolerant
ofbothheterozygousandhomozygousloss-of-functionvariants)mightbeexpectedtobean
attributeofnon-diseasegenes.AllfouroftheExACparameterswereincludedaspredictors
inourlearningset.
Wealsoexploredthepossibilitythatthecomplexityofgenestructureandtranscriptioncould
beusedtodistinguishbetweendiseaseandnon-diseasegenes.Weextractedorgenerated
measuresofstructuralcomplexityfromGENCODE(Ensemblversion72)andtheHEXEvent10
database.Thesemeasureofcomplexityincludedgenelength,introniclength,thenumberof
alternative transcripts generated, thenumberofpossibleuniqueexon-exon junctions, the
numberofalternative3’and5’sitesusedandthenumbersofcodingandnon-codinggenes
overlappingwiththegeneofinterest(OnlineMethods).
Giventhatdisease-associatedgenesareoftenhighlyexpressedinthetissuesofrelevancefor
thedisease11–16,weincludedmeasuresoftissue-specificexpressioninourlearningdataset.
Using transcriptomicdatageneratedby theGTEx25projectandcovering47human tissues
(including13brainregions,tibialnerveandskeletalmuscleamongstothers)wegenerated
measuresoftissue-specificgene-expressionandco-expression(OnlineMethods).Foreachof
the 47 human tissues we created a co-expression network using Weighted Gene Co-
expressionNetworkAnalysis17optimizedbyk-means18.Thisprovidedestimatesofeachgene’s
globalandlocalconnectivityinrelationtoallotherexpressedgenesinthetissue(ascaptured
bytheterms“adjacency”and“modulemembership”,OnlineMethods).Thisallowedusto
generatemeasures of the tissue-specific network properties of each gene and created an
additional141binarypredictors.
Interestingly,inspectionofthissetofpredictorsdemonstratedthattheabsolutenumberof
geneswith evidence of tissue-specific expression/co-expressionwas highly variable across
tissues (Figure 1b). While 10,483 genes showed testes-specific gene expression or co-
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
7
expression,only1699geneswereexpressedinatissue-specificmannerinliver.Furthermore,
the relative importance of tissue-specific gene expression, as distinct from co-expression
(Figure1b&1c)wasalsoveryvariableacrossthe47GTExtissues.Ifwefocusonbraintissue,
whereas49.7%oftissue-specificgenesforcerebellumweredefinedascerebellar-specificdue
tohighabsoluteexpressioninthetissue,inputamenthemajorityofspecificgeneexpression
wasdetectedthroughtheuniquelocalandglobalnetworkpropertiesofgenes(86.12%,Figure
1c).
Finally,weexploredall154predictorsacrossthemajorclasses,namelygenetic,structuraland
expression-basedpredictors,forevidenceofcorrelationamongstpredictors(Supplementary
Table4).Predictorswithinamajorclass(geneticvariation,genestructure,geneexpression,
geneco-expression)weremorecorrelatedtoeachotherthantopredictorsofadifferentclass
suggestingthatthemajorpredictorclassesusedwerecapturingorthogonaltypesofgene-
specificinformation(Figure2).
Measuresofgenecomplexityandintolerancetomissensevariantsdistinguishbetweendiseaseandnon-diseasegenesacrossmostdisorders
Asdescribedabovewegenerated154potentiallyusefulpredictorsofdisease-genestatefor
eachgenewithinthehumangenome. Inthefirst instancewewantedto findoutwhether
therewereanystatisticaldifferencesinpredictorvaluesbetweenestablisheddisease-related
andnon-diseasegenesusingsinglepredictorsalone.Thistestingwasperformedforeachof
the25diseasegenepanelsseparatelyandforallpredictorswhethernumericalorcategorical
innature(OnlineMethods).
Usingthisapproachwefoundsignificantdifferences(FDRcorrectedp-value<0.05)inatleast
asinglepredictorfor22ofthe25diseasegenepanels(88.0%,SupplementaryTable5).Of
the threemain typesofgene-specificpredictorsanalysed (gene-specificvariant frequency,
gene and transcript structure, and tissue-specific expression/co-expression of genes), we
foundthatgeneandtranscriptstructureappearedtobethemostuseful.Ofall25disease
genepanels72.0%showedasignificantdifferenceinpredictorvaluesforatleastoneofthe
measuresofgeneandtranscriptstructure.Consideringtranscriptcountalone(thenumberof
transcriptsproducedbyagene),wefoundthat56%ofallthediseasepanels(minimumFDR
correctedp-value=5.22x10-18for“Epilepsy-Plus”)hadasignificantlyhighertranscriptcount
fordisease-relatedascomparedtonon-diseasegenes(Figure3a,SupplementaryTable5).
Thesecondmost importantcategoryofpredictorwasgene-specificvariantfrequencywith
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
8
64.0%ofalldiseasepanelsshowingasignificantdifferenceinpredictorvaluesforatleastone
measure,withExACpMissaccountingforthelargestproportionofpanels(40.0%,Figure3b).
While tissue-specific expression/co-expression of genes could be highly discriminatory
predictors, this was restricted to small sets of related disorders. For example, specific
expressioninskeletalmusclewasahighlysignificantpredictorforarangeofprimarymuscle
disorders, including “Distalmyopathies” (p-value 2.78 x 10-32), “Congenitalmyopathy” (p-
value2.35x10-22)and“Rhabdomyolosisandmetabolicmuscledisease”(p-value2.94x10-15).
Similarly, specific expression in the frontal cortex could be used to distinguish between
epilepsy-related and non-disease genes (p-value 1.26 x 10-10 and p-value 3.53 x 10-7
respectively).
Thus,thisanalysisdemonstratedthatwhilegenesassociatedwithneurologicaldiseaseshave
commoncharacteristicsofbroadrelevancewithcomplexityoftranscriptstructurebeingthe
singlemost important discriminatory genic feature, there are also features operating in a
moredisease-specificmanner.
ML-basedclassifiersfordiseasegenepredictioncanbeoptimizedforaccuracyattheexpenseofgenomecoverage
Themajoraimofthisstudywastogeneratemachinelearningclassifiers,basedongenetic,
structural and expression-based predictors, to distinguish between genes which when
mutatedproduceaspecificneurologicalphenotypeversusthosethatdonot.Wewantedto
createsuchclassifiersforeachofthe25disordersasdefinedbyGenomicsEngland.Thiswas
challengingbecausenotonlydowerequireclassifierscapableofprovidingexplanatoryaswell
as predictive information, but the generation process had to be robust despite the large
disparityinthesizeofdisease(meannumberofgenesperpanel=45,range18-111)andnon-
diseasegenesets(4913genes,91.59%ofthetotalsetoflearningexamples).
Ourstrategywastoaddresstheseissuesintwoseparatesteps.Inthefirststepweevaluated
a range of learning paradigms to select one with a reasonably good trade-off between
predictiveandexplanatorypower.Inthesecondstep,wecreatedanensemblecomposedof
manysingleMLmodelsbasedonthe learningparadigmofchoice (as identified instep1).
Morespecifically,thefirststepinvolvedtestingarangeofMLapproachescoveringthemain
supervisedlearningparadigmsandavailablethroughtheCaretRpackage19(OnlineMethods).
WeobtainedROCestimates(asinglemeasure,whichconvenientlyintegratessensitivityand
specificity)foreachalgorithmappliedtoeachofthe25genepanels(SupplementaryFigures
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
9
1&2).Onthebasisof theseresults,weselectedJ4820 (OnlineMethods),adecisiontree-
basedalgorithm,duetoitsexplanatorycapabilitiesandhighpredictivepower.
Inthesecondstep,weaddressedtheimbalanceofpositiveandnegativelearningexamples
(averageratioofpositivetonegativelearningexamples=1:109)bycreatinganensembleof
J48modelsobtainedfromlearningdatasetswithequalnumbersofpositiveandnegativegene
examples.We created each learning dataset by randomly re-sampling non-disease genes
(Figure4a),whilekeepingalldiseasegenesunchanged.Inthiswayforeverydiseasecategory
wegenerated200separateJ48trees,eachtrainedusingthesamediseasegeneset,buta
different, randomly generated “non-disease” gene set. We integrated the 200 classifiers
generatedperpanelusingavotingapproachandoperatedasimplemajoritytodeterminethe
outcomeforagivengene.Thismeantthatforagivengene,wepredictedagene-phenotype
relationshipaspresent ifmostof theclassifiers “voteddisease”andabsent ifmostof the
classifiers“votednon-disease”.Weassessedwhethertheintegrationofdiseaseclassifiersto
generate a classifier ensemble improved performance. We found that for all 25 disease
categoriesusingaclassifierensembleimprovedtheROCvalues.Onaveragetherewasa4.1%
improvementinROCvalues(rangeof%improvement=0.6%-7.6%),relativetousingasingle
J48tree(SupplementaryFigure3).
Furthermore,theintegrationofdiseaseclassifiersusingasimplevotingapproachallowedus
to adjust the stringency of our classification system by changing the percentage of votes
required(termedthe“stringencyparameter,s”)toconfidentlyassignageneas“disease”or
“non-disease”. In this way, we converted a 2-output ensemble into a 3-output ensemble
adding “uncertain” as the third possible outcome, used when there was not enough
agreement amongst votes. To illustrate the effect of the stringency parameter (“s”) on
predictionof“disease”genes,wetestedarangeof“s”valuesacrossalldiseaseensembles
andallproteincodinggenes(Figure4b).Thisdemonstratedthatasthestringencyincreased,
the proportion of genes that could be classified either as “Disease” or “Non-disease”
decreased.Thus,increasingourconfidenceingene-phenotypepredictions,comesatthecost
of genome coveragewith the proportion of genes for whichwe generate an “uncertain”
outcomerisingrapidly.Therefore,dependingonthedownstreamuseofclassifiers(e.g.for
useinadiagnosticversusaresearchsetting)wewouldenvisagesituationswhereitwouldbe
sensible toadjust thestringency.For this reasonwehave releasedaweb interfacewhere
predictionscanbeobtainedatvariablelevelsofstringencyforalldiseases(www.XXXXXX).
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
10
Predictedgenesareenrichedforrelevantgeneontologytermsandcell-specificmarkers
Afterremovinggenesusedtotrainclassifiers,weappliedall26oftheclassifierensemblesto
theentiresetofhumanproteincodinggenesasdefinedwithinGENCODE(16,012).Usinga
stringencyof0.9foreachclassifier(i.e.predictDiseaseorNon-diseaseonlywhenthemajority
of identical predictions is equal or above 90%),we generated 4,834 new gene-phenotype
predictions(Figure5a,SupplementaryTable6).Onaverage,eachclassifierpredicted272new
diseasegenes(equatingto1.7%ofthegeneclassificationsattempted).Aswouldbeexpected
fromthehighnumbersofoverlappinggenesacrossdiseasepanels,therewasahighdegree
of sharing amongst gene predictions from different classifier ensembles. Nonetheless, all
classifierensemblesgeneratedpredictionswhichweredisease-specific(Figure5a).Wenoted
thatthenumberofnew“disease”predictionswaspositivelycorrelated(Pearsoncorrelation
=0.71,p-value<4.03x10-5)withthesizeoftheoriginaldiseasepanel(i.e.thenumberof
genesalreadyknowntobeassociatedwithdisease,Figure5a).Nosignificantcorrelationwas
detected when we considered non-disease predictions and this is consistent with our
expectation that the success ofML-based classifiers depends on the adequacy of positive
trainingexamples.
Giventhatnocellular,pathwayordisease-based informationwasusedwithinthe learning
data,wewantedtoinvestigatethebiologicalcharacteristicsofourpredictedgenes.Wedid
this firstbyassessingpredictedgenes forenrichmentofgeneontology terms21,aswellas
REACTOME22 and KEGG23 pathway annotation (Figure 5b, Online Methods). Using this
approachandconsideringdiseasegenepredictionsproducedbyeachensembleseparately,
we identified 2,345 unique significant terms (at an FDR corrected p-value of <0.05;
Supplementary Table 7). Interestingly genepredictionsmadeby the “all in one” classifier
(based on the entire set of disease genes) was highly enriched for terms relevant to the
nervous system with the top REACTOME pathway terms being “Neuronal System” (FDR-
correctedp-value=5.12x10-15),“TransmissionacrossChemicalSynapses”(FDR-correctedp-
value=5.12x10-9)and“Axonguidance”(FDR-correctedp-value=3.57x10-8,Supplementary
Table7).
Inspection of disease-specific enrichments also demonstrated the relevance of predicted
genes.Forexample,amongstgenespredictedbytheclassifierfor“Malformationsofcortical
development” there were significant enrichments for 966 terms alone. Focusing on
REACTOMEpathwayenrichments,weidentifiedalargenumberofsignificanttermsrelating
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
11
togrowthfactorsignallingandneuronalmigration(Figure5b,OnlineMethods).Inthecase
of the former, this included“Downstreamsignalingof activatedFGFR2” (FDR-correctedp-
value=2.21x10-9),asignallingsystemwhichhasbeenspecificallyimplicatedinthefunction
of outer radial glia24. This analysis also highlighted genes involved in axon guidance (FDR
corrected p-value = 3.93 x 10-16) and synapse formation (“Transmission across Chemical
Synapses”, FDR-corrected p-value = 2.49 x 10-16; “Glutamate Binding, Activation of AMPA
ReceptorsandSynapticPlasticity”,FDR-correctedp-value=4.77x10-12),consistentwiththe
known importance of neuronalmigration abnormalities in the pathophysiology of cortical
malformations25.Furthermore,acloserinvestigationofthesharingofsignificantREACTOME
termsacrosspredictedgenesets(generatedwithdifferentclassifiers)revealedthepossibility
of shared pathophysiological processes underlying disorders. For example, enrichment
analysesforgenespredictedbyboththe“Cerebrovasculardisease”and“Malformationsof
corticaldevelopment”classifiers,werehighlyenriched forgenes involved ingrowth factor
signalling(Figure5b),despitetherebeingnooverlapinthegenesusedaspositivelearning
examplesforclassifiergeneration.
We also examined predicted gene sets for evidence of enrichment of cell-specific gene
markers. This analysiswas performedusing gene expression signatures derived fromRNA
sequencing of purified cell types isolated from mouse cerebral cortex26 and covered
astrocytes, neurons, oligodendrocyte precursors, newly formed oligodendrocytes,
myelinating oligodendrocytes and endothelial cells. Using this approach and considering
diseasegenepredictionsproducedbyeachensembleseparately,weidentified22significant
cell-specific enrichments (FDR < 0.05, Supplementary Table 8, OnlineMethods). Inmany
cases,theseenrichmentswereconsistentwiththephenotypicfeaturesofthedisorder.For
example,genespredictedbytheclassifierfor“Cerebrovasculardisorders”weresignificantly
enriched for endothelial cell markers, while gene predictions for “Inherited whitematter
disorders”wereenrichedformarkersofoligodendrocyteprecursorcells.
Thus, these analyses provide evidence that the classifiers we generated capture some
importantelementsofthebiologyofneurogeneticdisordersandsoarecapableofgenerating
predictedgenes,whichsharethosekeybiologicalproperties.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
12
Disruptingpredictedgenestendstoproduceneurologically-relevantphenotypesinmice
Inordertoexploretheaccuracyofgenepredictionsfurther,wemadeuseoftheMGIMouse-
PhenotypeDatabase27todeterminewhetherdisruptionofmouseorthologuesofpredicted
humangenesproducedphenotypesrelevanttotheneurologicaldiseasesstudied.Considering
the genes predicted by each of the 26 classifiers separately, we identified 1588 unique
significantly enrichedMammalian PhenotypeOntology (MPO) termsofwhich 28.7%were
directlyrelevanttothecentraland/ortheperipheralnervoussystem(definedassub-terms
of MP:0005386, behaviour/neurological phenotype; MP:0003631, nervous system;
MP:0005369,musclephenotype).Usinggenespredictedbythe“allinone”classifier(based
ontheentiresetofdiseasegenes),we identified173termswithsignificantenrichment in
mousemodelsystems(SupplementaryTable9),withthemostsignificantMPOtermsbeing
“abnormal synaptic transmission” (FDR-corrected p-value = 1.20 x 10-10) and “abnormal
nervoussystemphysiology”(FDR-correctedp-value=1.22x10-8).
WealsoexploredtheenrichmentofMPOtermsamongstmousemodelsrelevanttodisease-
specificclassifiers.Despitethechallengesinherentinthisanalysis,namelythatthisisacross-
species analysis using data frommice with both spontaneous and targeted mutations of
multiple types, we were able to identify examples where disruption of predicted genes
mirroredtheexpecteddiseasephenotypewithprecision.Forexample,usingtheGenomics
EnglandPanelfor“Malformationsofcorticaldevelopment”,whichconsistsof45genes,we
predictedafurther589genes(usingstringency0.9).Ofthe585relevantmouseorthologues,
mousephenotypicdatawasavailable for427genes.Using this informationwe foundthat
disruption of these genes in mouse models resulted in phenotypes highly enriched for
“abnormalcerebralhemispheremorphology”(FDR-correctedp-value=1.14x10-13),andmore
specifically“abnormalcerebralcortexmorphology”(FDR-correctedp-value=2.72x10-7).
GenespredictedusingML-basedclassifiersareenrichedfortruegene-phenotypeassociationsinhumans
Clearly the most important means of assessing the value of ML-based classifiers for the
predictionofgenescontributingtoneurogeneticconditions,istheidentificationofmutations
in predicted genes in patients with matching neurological phenotypes. We assessed this
systematicallybymakinguseoftheregularre-versioningofGenomicsEnglandpanelsandthe
continuousupdates available through theOnlineMendelian Inheritance inMan catalogue
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
13
(OMIM,www.OMIM.org).Inthecaseoftheformer,weidentifiedallhighconfidencegene-
phenotyperelationshipsaddedtoPanelAppafter theMarch31st2017versionreleaseand
untilNovember1st2017.Inthecaseofthelatter,wecollatedallgene-phenotypeassociations
included between 1st April 2017 and 5th January 2018. After excluding all OMIM gene-
phenotype associations classed as provisional, we manually curated the remaining
associations to ensure that the phenotypes could be mapped accurately to a Genomics
Englanddiseasepanel.Usingthisapproachweidentifiedatotalof32new,highconfidence
gene-phenotypeassociationsforwhichtherelevantML-basedclassifierprovideda“Disease”
or“Non-disease”prediction(applyingastringencyof0.9).Wecorrectlypredicted“Disease”
statusin17caseswiththeremaining15genespredictedas“Non-disease”(Supplementary
Table 10, ratio of “Disease”/”Non-disease” = 113.3%). Given that on average the ratio of
“Disease”to“Non-disease”predictionsamongsttherelevantML-basedclassifiersobtained
onthewholeproteincodinggenomeis13.9%,thisrepresentedan8.2foldenrichmentover
chance.
Furthermore,exploringindividualexamplesenabledustorecognisetheexplanatoryvalueof
classifierensembles.Forexample,usingtheML-basedensemblefor“Cerebellarhypoplasia”,
which was trained with 39 known disease genes, we correctly predict that mutations in
TBC1D23 are a cause of the disorder28. Interestingly, inspection of all the decision trees
demonstrated that the two most important predictors were cerebellum-specific gene
expressionandgenecomplexity,ascapturedbythenumberoftranscriptsproducedbyagene
(Figure6a,OnlineMethods).Infact,accountingforboththeusageofapredictor(Appearance
Index)anditsdepth(ameasureofitsaverageproximitytotheroot)acrossall200decision
trees,demonstratedthattheseweremoreimportantpredictorsthangene-specificmeasures
ofvariantfrequency,suchasExACpLiandExACpMissscores,whicharemorecommonlyused
toassessthepathogenicityofvariants.ConsistentwiththesefindingswenotedthatTBC1D23
hashighermRNAexpressionincerebellumascomparedtootherbrainregionsnotonlyinthe
adultbrainbutthroughoutmostofbraindevelopment(Figure6b).Furthermore,itproduces
ahigherthanexpectednumberoftranscriptcountwhenconsideringallproteincodinggenes
(Figure6c).
While these findings were encouraging, the limited availability of relevant novel gene-
phenotypeassociations,meantthatanymeasureofpredictionaccuracywouldbeunstable.
Withthisinmind,wealsoinvestigatedgene-phenotypeassociationsconsideredtohavelower
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
14
levelsofsupportingevidence,termed“Amber”(borderline)or“Red”(low)geneswithinthe
Genomics England PanelApp (PanelApp Handbook Version 5.7). For this list of 329 gene-
phenotypeassociationswecalculatedthepercentageof“Disease”votes(“Disease”quality
score)madeforeachgenebytherelevantclassifieranddemonstratedthatthesevalueswere
highly skewed towards higher values (Figure 6d, dark blue density plot). In contrast, the
distributionofcorrespondingvalues foracontrolgeneset,consistingofallproteincoding
geneswiththeexceptionofthoselistedintheNovember1st.2017GenomicsEnglandrelease
of neurogenetics panels, was negatively skewed (Figure 6d, light blue density plot). The
difference in these distributions of “Disease” quality scores was highly significant (Mann-
Whitneytwo-sampleunpairedtestp-value=2.2x10-16).Giventhatthelistoftestgeneswas
likely to be highly enriched for true gene-phenotype associations, this finding provides
additionalevidencethattheclassifierensembleswehavegeneratedcapturesomeifnotall
thekeycharacteristicsofdisease-associatedgenes.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
15
Discussion
TheoverallaimofthisstudywastotestthehypothesisthatgenescontributingtoMendelian
formsofneurogeneticdisorderscanbeidentifiedusingarelativelylimitedsetofgene-based
predictors,whichdonotincorporatedirectlyorindirectlyanypriorbiologicalknowledge.We
achieve this aim through the generation of machine learning classifier ensembles for 25
neurogenetic disorders and their resulting predictions. We demonstrate that our set of
disease-specificgenepredictionsareenrichedforGOtermsandmolecularpathwaysalready
implicated in neurogenetic diseases. Furthermore,we show thatwhenmutated inmouse
modelsystemsthesepredictedgenesetsproducephenotypes,whicharehighlyenrichedfor
neurological abnormalities and which can mirror the expected human phenotype with
accuracy.Finallyandmostimportantlywedemonstratean8.2foldenrichmentoverchance
indiseasegenepredictionsforalimitedsetofrecentlyidentifiedneuro-diseasegenes.Given
thatourgenepredictionscouldpotentiallybeusedtoincreasetheyieldofdiagnosticwhole
exomesequencingbyhelping investigatorsprioritisegenesof interest,wehavemadeour
predictedgenesetsavailableandeasilysearchablethroughawebapplicationG2P(Gene2
Phenotype,accessibleatXXXXX).
Moving beyond the predictive value of our ML-based classifier ensembles, we provide
evidence of the explanatory power of this approach. In particular, the finding that genes
contributing toMendelian forms of neurogenetic tend to be complex in structure,with a
higherthanexpectednumberoftranscriptsanduniqueexon-exonjunctionsannotatedper
gene.Interestingly,thiswouldsuggestthatthesegenescouldbeparticularlypronetosplicing
mutations,whicharemoredifficulttorecogniseandassess.
Mostimportantly,anddespitethelimitationsofourapproachwedemonstratethevalueof
machine learning approaches in understanding neurogenetic disorders now that a critical
massofgenome-wideannotationandgenediscoverydataexists.Giventhatbothtypesof
dataareonlysettoincreaseinqualityandquantityacrossawiderangeofdisorderswewould
envisagethatML-basedapproachesofthiskindarelikelytoincreaseinimpactoverthenext
5years.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
16
References
1. Boycott, K. M., Vanstone, M. R., Bulman, D. E. & MacKenzie, A. E. Rare-disease
genetics in the era of next-generation sequencing: discovery to translation. Nat. Rev.
Genet. 14, 681–691 (2013).
2. Warman Chardon, J., Beaulieu, C., Hartley, T., Boycott, K. M. & Dyment, D. A. Axons
to Exons: the Molecular Diagnosis of Rare Neurological Diseases by Next-Generation
Sequencing. Curr. Neurol. Neurosci. Rep. 15, (2015).
3. Klebe, S., Stevanin, G. & Depienne, C. Clinical and genetic heterogeneity in hereditary
spastic paraplegias: From SPG1 to SPG72 and still counting. Rev. Neurol. (Paris) 171,
505–530 (2015).
4. Fridman, V. & Reilly, M. Inherited Neuropathies. Semin. Neurol. 35, 407–423 (2015).
5. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature
536, 285–291 (2016).
6. Harrow, J. et al. GENCODE: The reference human genome annotation for The ENCODE
Project. Genome Res. 22, 1760–1774 (2012).
7. The GTEx Consortium et al. The Genotype-Tissue Expression (GTEx) pilot analysis:
Multitissue gene regulation in humans. Science 348, 648–660 (2015).
8. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic
determinants of disease. Science 347, 1254806–1254806 (2015).
9. Quinodoz, M. et al. DOMINO: Using Machine Learning to Predict Genes Associated
with Dominant Disorders. Am. J. Hum. Genet. 101, 623–629 (2017).
10. Busch, A. & Hertel, K. J. HEXEvent: a database of Human EXon splicing Events.
Nucleic Acids Res. 41, D118–D124 (2013).
11. Lage, K. et al. A large-scale analysis of tissue-specific pathology and gene expression of
human disease genes and complexes. Proc. Natl. Acad. Sci. 105, 20870–20875 (2008).
12. Kumar, V., Sanseau, P., Simola, D. F., Hurle, M. R. & Agarwal, P. Systematic Analysis
of Drug Targets Confirms Expression in Disease-Relevant Tissues. Sci. Rep. 6, (2016).
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
17
13. Kitsak, M. et al. Tissue Specificity of Human Disease Module. Sci. Rep. 6, (2016).
14. Wells, A. et al. The anatomical distribution of genetic associations. Nucleic Acids Res.
43, 10804–10820 (2015).
15. Cornish, A. J., Filippis, I., David, A. & Sternberg, M. J. E. Exploring the cellular basis of
human disease through a large-scale mapping of deleterious genes to cell types. Genome
Med. 7, (2015).
16. Ganegoda, G., Wang, J., Wu, F.-X. & Li, M. Prediction of disease genes using tissue-
specified gene-gene network. BMC Syst. Biol. 8, S3 (2014).
17. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network
analysis. BMC Bioinformatics 9, 559 (2008).
18. Botía, J. A. et al. An additional k-means clustering step improves the biological features
of WGCNA gene co-expression networks. BMC Syst. Biol. 11, (2017).
19. Max Kuhn. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 28,
20. Data mining: practical machine learning tools and techniques. (Elsevier, 2017).
21. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and
resources. Nucleic Acids Res. 45, D331–D338 (2017).
22. Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46,
D649–D655 (2018).
23. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a
reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462
(2016).
24. de la Torre-Ubieta, L. et al. The Dynamic Landscape of Open Chromatin during Human
Cortical Neurogenesis. Cell 172, 289–304.e18 (2018).
25. Barkovich, A. J., Dobyns, W. B. & Guerrini, R. Malformations of Cortical Development
and Epilepsy. Cold Spring Harb. Perspect. Med. 5, a022392–a022392 (2015).
26. Soreq, L. et al. Major Shifts in Glial Regional Identity Are a Transcriptional Hallmark of
Human Brain Aging. Cell Rep. 18, 557–570 (2017).
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
18
27. Blake, J. A. et al. Mouse Genome Database (MGD)-2017: community knowledge
resource for the laboratory mouse. Nucleic Acids Res. 45, D723–D729 (2017).
28. Vasli, N. et al. Recessive mutations in the kinase ZAK cause a congenital myopathy with
fibre type disproportion. Brain 140, 37–48 (2017).
29. Johnson, M. B. et al. Functional and Evolutionary Insights into Human Brain
Development through Global Transcriptome Analysis. Neuron 62, 494–509 (2009).
30. Kang, H. J. et al. Spatio-temporal transcriptome of the human brain. Nature 478, 483–
489 (2011).
31. Chakraborty, S., Panda, A. & Ghosh, T. C. Exploring the evolutionary rate differences
between human disease and non-disease genes. Genomics 108, 18–24 (2016).
32. Lawrence, M. et al. Software for Computing and Annotating Genomic Ranges. PLoS
Comput. Biol. 9, e1003118 (2013).
33. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression
data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
34. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and
genomics. Nat. Rev. Genet. 16, 321–332 (2015).
35. Reimand, J., Kull, M., Peterson, H., Hansen, J. & Vilo, J. g:Profiler--a web-based toolset
for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 35,
W193–W200 (2007).
36. Saito, R. et al. A travel guide to Cytoscape plugins. Nat. Methods 9, 1069–1076 (2012).
37. Zhang, Y. et al. An RNA-sequencing transcriptome and splicing database of glia,
neurons, and vascular cells of the cerebral cortex. J. Neurosci. 34, 11929–11947 (2014).
38. Durinck, S. et al. BioMart and Bioconductor: a powerful link between biological
databases and microarray data analysis. Bioinformatics. 21, 3439–3440 (2005).
39. Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical
and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol. 57, (1995).
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
19
FigureLegends
Figure1
A genetic "map" of human neurogenetic disorders: a) The genetic overlap betweenGenomicsEnglandpanelsisdisplayed,suchthateachnoderepresents1panelandaminimumof5genesisrequiredforanedge.Theradiusofnodesandwidthofedgesreflectsthenumberofgeneswithineachpanelandthenumberofsharedgenesbetweenpanelsrespectively.b)Barcharttoshowforeachofthe47GTExtissuesconsidered,howmanygenesexpressedbythetissueshowevidenceoftissue-specificityonthebasisofexpression,adjacencyormodulemembership values. We highlight in light grey tissues that appear as significant in theunivariate test for anyof these three features.c)Bar chart to show thatbrain tissuesarecharacterised by high variability in the relative proportion of tissue-specific expression,adjacencyandmodulemembership.Cerebellarhemisphereshowsthehighestproportionoftissue-specificsignalsduetoabsoluteexpression,whereasputamenischaracterisedbythehighest proportion of genes with tissue-specific co-expression. Genomics England panelnamesareshortenedasfollows:Amyotrophic lateralsclerosismotorneurondisease=ALSMND; Arthrogryposis = AMC; Brain channelopathy = BC; Cerebellar hypoplasia = CH;Cerebrovasculardisorders=CD;CharcotMarieToothdisease=CMT;Congenitalmusculardystrophy = CMD; Congenital myaesthenia = CMS; Congenital myopathy = CMP; Distalmyopathies=DM;Earlyonsetdementiaencompassingfrontotemporaldementiaandpriondisease = EODM FTD PrD; Early onset dystonia = EODS; Epilepsy Plus = EP; Epilepticencephalopathy=EE;Hereditaryataxia=HA;Hereditaryspasticparaplegia=HSP;Inheritedwhite matter disorders = IWMD; Intracerebral calcification disorders = ICD; Limb girdlemusculardystrophy=LGMD;Malformationsofcorticaldevelopment=MCD;Paediatricmotor
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
20
neuronopathies=PMN;ParkinsonDiseaseandComplexParkinsonism=PD;Rhabdomyolysisand metabolic muscle disorders = RMMD; Skeletal Muscle Channelopathies = SMC; andStructuralbasalgangliadisorders=SBGD.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
21
Figure2
Thethreemajortypesofpredictors,namelygeneticvariation,geneandtranscriptstructure,andgeneexpression/co-expression,arelargelyindependent:CorrelationmatrixplottoshowthePearson’scorrelationcoefficientsbetweenthepredictorcategories,namelygeneticvariation(orange),geneandtranscriptstructure(green),tissue-specificgeneexpression(darkblue),tissue-specificmodulemembership(blue)andtissue-specificadjacency(lightblue).
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
22
Figure3
Transcript complexity and reduced tolerance tomissensemutations are a distinguishingfeatureofgenesassociatedwithmanytypesofrareneurogeneticdisorders:a)Plotshowingthatofthe25GenomicsEnglandpanelsconsidered56%hadasignificantp-value(redscale)oncomparingtranscriptcountvaluesbetweengenesinthepanelsandotherproteincodinggenes.b)Plotshowingthatofthe25GenomicsEnglandgenepanelsconsidered40%hadasignificantp-value(redscale)oncomparingExACpMissvaluesbetweengenesinthepanelsandotherproteincodinggenes.GenomicsEnglandpanelnamesareshortenedas follows:Amyotrophiclateralsclerosismotorneurondisease=ALSMND;Arthrogryposis=AMC;Brainchannelopathy = BC; Cerebellar hypoplasia = CH; Cerebrovascular disorders = CD; CharcotMarieToothdisease=CMT;Congenitalmusculardystrophy=CMD;Congenitalmyaesthenia= CMS; Congenital myopathy = CMP; Distal myopathies = DM; Early onset dementiaencompassing fronto temporal dementia and prion disease = EODM FTD PrD; Early onsetdystonia=EODS;EpilepsyPlus=EP;Epilepticencephalopathy=EE;Hereditaryataxia=HA;Hereditaryspasticparaplegia=HSP;Inheritedwhitematterdisorders=IWMD;Intracerebralcalcification disorders = ICD; Limb girdle muscular dystrophy = LGMD; Malformations ofcorticaldevelopment=MCD;Paediatricmotorneuronopathies=PMN;ParkinsonDiseaseandComplex Parkinsonism = PD; Rhabdomyolysis and metabolic muscle disorders = RMMD;SkeletalMuscleChannelopathies=SMC;andStructuralbasalgangliadisorders=SBGD.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
23
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
24
Figure4
ConstructionandcalibrationofMLclassifierensemblesforgene-phenotypeprediction:a)DiagramtoshowtheworkflowusedtogenerateMLclassifierensembles.Theseensemblesare based on the creation of 200 learning data sets per a disease panel. Each data set isgeneratedbyresamplingthecontrolgenesettomaintaina1:1ratioofdiseaseandcontrolgenes.Thefinalmodelof200decisitiontreesareintegratedintoavotingschemethatcanbeused to prediction novel gene-phenotype associations with an estimate of confidence
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
25
(bottom right part) and to identify the genic features contributingmost commonly to thepredictionprocess(bottomleftpartofthefigure).b)Plottoshowtheeffectofapplyingastringency parameter on ML predictions generated from decision tree ensembles. Asstringencyvalues increase(x-axis)therelativeproportionsof"Disease","Non-disease"and"Uncertain"alsochange.Inparticular,theproportionofuncertainpredictionsincreases.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
26
Figure5
Properties of gene-phenotype predictions generated by disease-specific ML classifierensembles: a) The number of number of genes containedwithin eachGenomics EnglanddiseasepanelandusedforthegenerationofMLclassifierensemblesisshowninthelowerpanel (positive example set). The total number of new genes predicted per panel using astringency of 0.9 is shown in the middle panel. The upper panel shows the number ofpredictedgenesperpanel,whichwereexclusivelypredictedbythedisease-specificclassifier.b) Plot to show that genes predicted to causemalformations of cortical development aresignificantlyenrichedforrelevantbiologicalprocesses.REACTOMEpathways(circles)withthe
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
27
mostsignificantenrichmentsamongstgenespredictedtocause"Malformationsofcorticaldevelopment" (labelled rounded square) are shown. REACTOME pathwayswhich are alsohighlighted by testing for enrichment of predicted genes for other disorders, namely "Cerebrovascular disorders" and "Brain channelopathies”, are also displayed (sharedenrichments=bluecircle;enrichmentsuniquetoeither“Cerebrovasculardisorders"or"Brainchannelopathies"=greycircles).
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
28
Figure6
Evidence for enrichment of true gene-phenotype associations amongst gene predictionsmadebydisease-specific classifiers:a)Cerebellum-specificgeneexpressionand transcriptcomplexity are the most important predictors used within the “Cerebellar hypoplasia”classifierensemble.Onthex-axisweploteachofthepredictorsused.Predictorimportanceisplottedonthey-axis,asdefinedasthesumoftheappearance indexofapredictor (thepercentage of times it is used to predict “Disease” across the 200 decision trees whichcomprisetheensembleclassifier,range=0.0-1.0)andthedepthofapredictor(ameasureofitsaverageproximitytotheroot, range=0.0–1.0).Thisplotdemonstratesthatthemostimportant predictors in the “Cerebellar Hypoplasia” ensemble classifier were cerebellar-specificgeneexpressionandtranscriptcount(thenumberoftranscriptsproducedbyageneas annotated in GENCODE version 72). b) Graph to show mRNA expression levels forTBC1D2329,30in6brainregionsduringthecourseofhumanbraindevelopment,basedonexonarrayexperimentsandplottedona log2 scale (yaxis). Thebrain regionsanalyzedare thestriatum(STR),amygdala(AMY),neocortex(NCX),hippocampus(HIP),mediodorsalnucleusofthethalamus(MD),andcerebellarcortex(CBC).ThisshowshigherexpressionofTBC1D23mRNAexpressionincerebellumascomparedtoallotherbrainregionsfromthelateprenatalperiodonwards.c)Densityplotshowingthedistributionoftranscriptcounts(thenumberoftranscriptsproducedbyageneasannotatedinEnsembleversion72)forallprotein-codinggenesandTBC1D23specifically(redline).d)Probabilitydensityplotstoshowthepercentageof"Disease"predictionsproducedbydecisiontreemodelsfora“test”setofgenesenrichedfortruegene-phenotypeassociations(darkblue)anda“control”setdefinedasthesetofallproteincodinggenes(withtheexceptionofthoselistedintheNovember1st.2017GenomicsEngland release of neurogenetics panels). The “test” set included all genes classified by
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
29
GenomicsEnglandashavingmoderate(amber)orlow(red)evidenceforassociationwithaneurogeneticdisorderaswellashighevidencegenesaddedtoapanelafterMarch31st2017anduntilNovember1st2017.Noneofthegenesusedinthis“test”setwereusedforlearningtheclassifierensembles.Therewasaclearandhighlysignificant(Mann-Whitneytwo-sampleunpaired testp-value=2.2x10-16)difference in thedistributionof “Disease”quality scorescomparing“test”and“control”genesets.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
30
SupplementaryTables
SupplementaryTable1TableprovidingthenamesandaffiliationsofallindividualscontributingtothecurationofGenomicsEnglanddisease-associatedpanelsusedinthisstudy.
SupplementaryTable2Tabledescribingalldisease-associatedpanelsusedandtheircontents.
SupplementaryTable3Tableprovidingallpredictorsusedtodescribegeneswithomics.
SupplementaryTable4Tableprovidingr2andp-valuesfortherelationshipbetweenall154predictors.
SupplementaryTable5Tableshowingthepredictorswhichdifferedsignificantlybetweenandnon-diseasegenesforeachdisease-associatedpanel.
SupplementaryTable6TableprovidingalldiseasegenepredictionsmadeusingMLclassifierensembleswithastringencyof90%.
SupplementaryTable7Tableprovidingsignificantgeneontology,KEGGandREACTOMEpathwayenrichmentsforgenepredictionsmadeusingdisease-specificMLclassifierensembles.
SupplementaryTable8Tableprovidingsignificantcelltype-specificenrichmentsforgenepredictionsmadeusingdisease-specificMLclassifierensembles.
SupplementaryTable9TableprovidingsignificantMammalianPhenotypeOntologytermsforgenepredictionsmadeusingdisease-specificMLclassifierensembles.
SupplementaryTable10Tableshowingthepredictionsfor32new,highconfidencegene-phenotypeassociationsforwhichtherelevantML-basedclassifierprovideda“Disease”or“Non-disease”predictionatastringencyof0.9.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
31
SupplementaryFigures
SupplementaryFigure1
ROC values for all Genomics England panels for each Caret Algorithm used for modellearning:BoxandwhiskerplotstoshowthedistributionofROCvalues(xaxis)forpredictionsgeneratedusingthetrainingdata(eachGEpanelgenescorrespondstoapointinthesample)perclassifier(yaxis).Whilemanyalgorithmsperformedwell,wechoseJ48becauseithasagoodtradeoffbetweenROCandexplanatorycapabilities.
SupplementaryFigure2
ROC values for all Caret Algorithms used formodel learning on eachGenomics Englandpanel: In this box and whisker plot we show the distribution of ROC values (x axis) forpredictionsgeneratedusingthetrainingdataforasinglepanel(yaxis,eachGEpanelcomeswithitssizeingenes).Eachalgorithmcorrespondstoapointinthesampleset.TheplotshowsthatthelearningROCperformanceforallpanelsshowaperformancebetterthanrandombutwithroomforimprovement.
SupplementaryFigure3
ImprovementofROCvaluesthroughtheuseofanensembleof200treesversusasingletree:Boxandwhiskerplotstodemonstratethatthereisanimprovement inROCvalues(xaxis,%ofimprovementinROC)ontestdataforallGEpanels(yaxis)whenusinganensembleofdecisiontreesascomparedtoasingletreetoidentifygene-phenotypeassociations.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
32
Onlinemethods
IdentificationofgenescausingMendelianneurogeneticdisordersWeobtainedourlistofgenescausingMendelianneurogeneticdisordersfromtheGenomicsEngland Panel App (https://panelapp.genomicsengland.co.uk/; March 31st 2017). Weconsideredallpanelsincludedunderthelevel2heading“NeurologyandNeurodevelopmentaldisorders”.Wefurtheranalysedallgenepanelswithmorethan10“Green”(diagnostic-grade)genes with the exception of “Intellectual Disability” due to the very broad phenotypicspectrumassociatedwiththisgenepanel.Thisresultedintheuseof25genepanels,equatingto1140uniquegenes(SupplementaryTable2).
IdentificationofcontrolgenesetsGiven that we aim to use ML-based classifiers not only to predict new gene-phenotypeassociations,buttoprovideexplanatoryinformationaswell,werequirea“control”geneset,whichcanserveasnegativeexamplesthatbetterdelineatethroughcontrastthecorefeaturesofdiseasegenes.Wedefineacontrolgene,asanygenewithnocurrentlyknownassociationtoanydisease(whetherneurologicalinnatureorotherwise).Inthissettingweonlyconsiderdisease associations based onMendelian inheritance and accept that some of the genesdefinedas“control”maybeidentifiedinthefutureashavingadiseaseassociation.Thus,asinthecaseofChakrabortyandcolleagues31,wegeneratealistof“control”genesbyexcludingall genes currently contained within the OnlineMendelian Inheritance in Man database(https://www.omim.org/), Genetic Association Database(https://geneticassociationdb.nih.gov/) and the Human Gene Mutation Database(http://www.hgmd.cf.ac.uk/ac/index.php)fromtheEnsembl(version72)geneset.Usingthisapproachweidentify4260“control”genes.
Extractionofgene-basedmeasuresofgeneticconstraintThe Exome Aggregation Consortium (ExAC, http://exac.broadinstitute.org/) database hasbeenusedtomodelthedifferencebetweentheexpectedandobservedfrequencyofloss-of-functionandmissensemutationsinallgenes.Thisinformationhasbeencapturedthroughthegene-specificparameters,ExACpLi(theprobabilityofbeingloss-of-function,lof,intolerantofboth heterozygous and homozygous lof variants), ExACpRec (the probability of beingintolerantofhomozygous,butnotheterozygouslofvariants),ExACpNull(theprobabilityofbeingtolerantofbothheterozygousandhomozygouslofvariants)andExACpMiss(correctedmissenseZscore,takingintoaccountthathigherZscoresindicatethatthetranscriptismoreintolerantofvariation).ThesefourmeasuresofgeneticconstraintweredownloadedfromtheExACconsortiumsite(ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3/functional_gene_constraint/;releasedJanuary13th2017)foruseaspredictors.
ExtractionandgenerationofmeasuresofgenecomplexityAllmeasuresofgenecomplexitywereeitherextracteddirectlyorgeneratedfromEnsemblversion72.Theintroniclengthofeachgene(IntronicLength)wasdeterminedusinganad-hocscriptthatcalculatesintronlengthinbasepairsconsideringallpossibletranscriptspresentinthe gene. The total number of unique exon-exon junctions generated by each gene(numJunctions)wascalculatedusingtherefGenomeRpackagewithalltheknownjunctionssummed toobtain a valueper gene. Thenumberof genesoverlapping a geneof interest(countsOverlap)wasestimatedusingtheRpackageGRanges32,suchthatforeachgenethenumberofgenesoverlapping1bpormorewiththegeneofinterest,regardlessofstrand,wascounted. The number of overlapping protein coding genes (countsProtCodOverlap) was
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
33
generated in a similar manner except that only genes classed as “protein coding” withinEnsembl(version72)wereconsidered.
Generationofmeasuresoftissue-specificgeneexpression&co-expressionInordertogenerateasetofgene-basedmeasuresoftissue-specificgeneexpressionandco-expression, we used the GTEx7 V6 gene expression dataset (accessible fromhttps://www.gtexportal.org/home/)whichcomprises54tissuesandexpressiondatafor8557samplescovering56318Ensemblgenes.WefirstfilteredtheGTExdatasetbytissuesampleavailability using only those tissues with more than 60 samples available leading to theconsiderationof47tissues.Weinitiallyanalysedeachofthe47tissuesamplesetsseparatelybyfilteringgenesonthebasisofanRPKM>0.1(observedin>80%ofthesamplesforagiventissue).Wethencorrectedforbatcheffects,age,genderandRINusingComBat33.Finally,weusedtheresidualsoftheselinearregressionmodelstoconstructgeneco-expressionnetworksfor each tissue using WGCNA17 and post-processing with k-means18 to improve geneclustering.Thus,eachof the47tissueshadacorrespondingnetworkconsistingofasetofgene modules. For each gene in a network we obtained its module membership andadjacency,whereModuleMembership for a genegwasdefinedas the correlationof theresidual geneexpression for geneg and theeigengeneof themodule it belonged to, andadjacencywasdefined foragenegas thesumofallvalues for its row/columnwithin theTopologicalOverlapMatrixcreatedbyWGCNA.
Weusedtheresultinggeneexpressionandco-expressiondatatoobtain3measuresoftissue-specific expression. For each tissue any given gene was defined as having tissue-specificexpression if thegeneexpression inthattissuewas3.5foldhigherthantheaveragevalueacrossallother47GTExtissues.Similarly,agenewasdefinedashavingtissue-specificmodulemembershiporadjacency,ifitsmodulemembershiporadjacencywithintheco-expressionnetwork,was3.5foldhigherthanthatacrossallotherremainingGTExtissues.
UnivariateanalysisonsinglepredictorsWewantedto testwhethereachsinglepredictorhadanypredictivepoweronthetaskofidentifyingdiseasegenesamongstthewholeproteincodinggenome.Forsuchapurpose,wedefinedeachGenomicsEnglandgenepanel’s(GP)testdatasetas
𝐷#$ = {{𝑃(𝑔), 𝑦-}: 𝑔 ∈ 𝐺𝑃∗, 𝑃(𝑔) = (𝑃3(𝑔), 𝑃4(𝑔), … , 𝑃6(𝑔)), 𝑦- ∈ +, − },
suchthatGP*referstoasetofgenescompoundofthegenesintheGenomicsEnglandGPgenepanelplustherestofthewholeproteincodinggenome(seebelow),thePi(g)isthevalueofthei-thpredictorforgenegandygthelabelofgenegforthatgeneset, +,− ,where+means gene belonging to the Genomics England panel and – means otherwise. We usepredictorsofadifferentnature,namelycategoricalpredictors(includingallpredictorsrelatingto tissue-specific expression and co-expression) and numeric predictors (including allpredictorsrelatingtogeneticconstraintsandgenecomplexity).Accordingly,weappliedtwodifferentstatisticalteststolookforsignificantdifferencesbetweendiseaseandnon-diseasegenes.ForcategoricalpredictorsweusedFisher’sexacttests(FET)withacontrolgenesetconsistingofallgeneswithinEnsembl(version72)withtissue-specificdataavailableandwiththeexceptionofgeneswithin theGenomicsEngland“Neurologyandneurodevelopmentaldisorders” panels (17,315 genes). For numeric predictors we usedMann-Whitney U tests(MWU)withacontrolgenesetconsistingofallproteincodinggeneswithinEnsemblwithanon-nullvalueforthepredictorandwiththeexceptionofgeneswithintheGenomicsEngland
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
34
“Neurologyandneurodevelopmentaldisorders”panels(17,918genes).FETsanMWUtestswereperformedunderthenull thattherewasnosignificantdifference inpredictorvaluesfoundindiseasegenepanelsincomparisontothecontrolgeneset.WereportBonferroni-Hochbergcorrectedp-valuesinallcases.
Machineleaningtogeneratedisease-specificclassifiersWe used a supervised machine learning (ML) approach34 to inductively model gene-phenotypeassociationsandsogenerateclassifiersthatcandistinguishbetweengenesthathaveanassociationwithaneurologicalphenotypeandthosethatdonot.Formallystated,this meant that given a set of genes already causally linked to a specific neurologicalphenotypeanddefinedbyaGenomicsEnglandpanel,GP,wesoughtaclassifierCGP,suchthat,for anygene𝑔, it predictedwhether it toowouldhaveagene-phenotypeassociationanddenoteitintheformofafunction,asfollows:
𝐶#$: 𝑃 → {+,−},
and 𝑃 is a set of 𝑃; predictors such that 𝑃(𝑔) = (𝑃3(𝑔), 𝑃4(𝑔), … , 𝑃6(𝑔)) is a vector ofattributes of gene𝑔 and the output of the classifier is then 𝐶#$ 𝑔 = + if the classifierpredicts g to have a gene-phenotype association for phenotype GP and 𝐶#$ 𝑔 = − otherwise.
TheglobalMLanalysisisdividedintotwomainphases.Inthe1stphaseweseekforthebestpossible learning algorithm we have available given requirements on prediction accuracy(mainlyROCvalueson10-foldcross-validation)andmodelinterpretability.Inthe2ndphase,weuseahighnumberofsuchmodelsintoavotingbasedintegrationensembletoaccountfortheacuteimbalanceintheproportionofpositiveandnegativeexamplesinthelearningdata.
1stphase:identificationofthebestMLalgorithm
Using the Caret R package19 we constructed simple classifiers𝐶#$ for 14 ML techniquesincludingthemainsupervisedlearningparadigmsandallthe25GenomicsEnglandworkingpanels.ThelearningdataforeachpanelwascompoundbyalltheHighEvidencediseasegenesbelonging to thepanelandall theChakrabortycontrolgenes,4260. TheCaretalgorithmsusedwere: (1) trees/rules based algorithms (C5.0Tree, LMT, J48,ONeR, Rpart), (2) neuralnetwork based algorithms (NNet, GlmNet), (3) linear regression based algorithms (PLR,RRLDA, svmLinear), a bayes reasoning approachwithNaiveBayes and (5) ensemble basedalgorithms(AdaBoost,Boruta,Rborist).WeevaluatedeachCaretcandidateMLalgorithmoneach𝐶#$ dataset using repeated cross-validation (10 fold) and automatic hyperparameterevaluationasprovidedbyCaret.ROC,specificityandsensitivityforallCaretalgorithmsappliedtoallGenomicsEnglandpanelswereevaluated(SupplementaryFigure1,resultssegregatedbyalgorithm;SupplementaryFigure2,resultssegregatedbyGenomicsEnglandpanel).WhileRboristandBorutashowedsignificantlybetterROCvaluesthanJ48(P<0.0008andP<0.02respectivelywithapairedt-test),J48stillhadahighpredictivepower(only9-10%lowerthanRborist and Boruta). Given that this algorithm had much greater explanatory power, allsubsequentanalyseswereperformedwithJ48.
2ndphase:constructionofthebestpredictorbasedonJ48
WeusedtheJ48algorithmwithinawrapper(i.e.anensemble inMLterminology)thatwedeveloped inorder todealwith thedisproportionatenumberofnegativeexamples inourlearningset.ThiswrapperinvolvedthegenerationandintegrationofmanyMLmodelsfrom
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
35
multiple learning data sets, allwith the same set of positive examples, butwith differentnegativeexamplesgeneratedthroughrandomre-samplingoftheChakrabortycontrolgeneset.Formallystated,thismeantthatgivenapanel𝐺𝑃,anditslearningdatadefinedas:
𝐷#$ = 𝑃 𝑔 , 𝑦- : 𝑔 ∈ 𝐺𝑃∗, 𝑃 𝑔 = 𝑃3 𝑔 , 𝑃4 𝑔 , … , 𝑃6 𝑔 , 𝑦- ∈ +,− ,
suchthatGP*refersnowtoasetofgenescompoundofthegenesintheGenomicsEnglandGPgenepanelaspositiveexamplesplustheChakrabortygenesasnegativeones,wecreated (𝐷#$3, 𝐷#$4, … , 𝐷#$4<<) learningdatasets,withidenticalpredictors 𝑃 butnow,the {𝑦-} hada1:1 ratioofpositive (+) andnegativeexamples (-). Thesetofpositiveexamples ineachlearningdatasetwasidenticaltothatin 𝐷#$, butthenegativeexamplesweresampleswithreplacementfromtheChakrabortycontrolgenes.
We applied J48 to each of the learning datasets (𝐷#$3, 𝐷#$4, … , 𝐷#$4<<) to obtain (𝐶#$3, 𝐶#$4, … , 𝐶#$4<<) classifiers, which we aggregated into a single classifier, 𝐶#$ =Δ(𝐶#$3, 𝐶#$4, … , 𝐶#$4<<), whereΔ wasafunctiontointegrateresponsesfromall individualclassifiers.Δhastobedesignedtakingintoaccountthattheensemblecompoundofthe200MLmodelswithhavealimitedpredictivepowerforanumberofreasons.Firstly,thenumberofdiseasegenestolearnfromwillbelimited,andmostlikelyincomplete,forallpanels.Andsecondly,likelyamongstthegenesweconsiderascontrolstherewillbediseasegeneswedonotknowyet.Therefore,wemustcomeupwithastrategythatdealswiththat.Andwedefinea stringency parameter to control the eagerness with which the ensemble generates aprediction,i.e.DiseaseorNon-diseasebutalsotoincludeathirdoutcome:Uncertain.Andwedefine a parameter called stringency, let it be denoted with s in [0,1] that works in thefollowingway: given a gene g, and the 200models generated from a gene panel GP,wegenerate200predictions𝐶#$; 𝑔 , 𝑖 = 1, … ,200.EitherDiseaseorNondisease.LetpdandpntheproportionofDiseaseandNon-diseasepredictionsforg,suchthatpd+pn=1.Ifpd>pn,theoutcomeofΔ(𝐶#$; 𝑔 , 𝑖 = 1, … ,200)willbeDiseasewhenpd>s.Analogouslywhenpn>pd.TheoutcomewillbeUncertainotherwise.
TestingforannotationtermenrichmentamongstpredictedgenesWe used gProfileR35 to investigate enrichment of Gene Ontology, REACTOME and KEGGpathways annotation terms amongst predicted gene sets. We included IEA (InferredElectronic Annotations) and used the gSCS test developed by the authors to assess forannotation term enrichment. The graphical representation of the REACTOME termenrichmentwasbasedonthemostsignificanttermsasreportedbygProfileRandbyusingCytoscape3.536.TheenrichmentofMammalianPhenotypeOntologytermsamongstmousemodelswithmutationsinmouseorthologuesofpredicteddiseasegeneswasperformedusingtheToppFunfunctionwithinToppGene(https://toppgene.cchmc.org/,REF)andapplyingthedefaultsettings.
Testingforenrichmentofgenesexpressedinacell-specificmanneramongstpredictedgenesWeobtainedcell-typespecificgenelistsrelevanttobrainfromSoreqetal.26.Theselistswerebased on the analysis of RNA-sequencing data from purified cells isolated from mousecerebralcortex37andcoveredastrocytes,neurons,oligodendrocyteprecursors,newlyformedoligodendrocytes,myelinatingoligodendrocytes,microgliaandendothelialcells.Genesthatappearedinmorethanonecelltype-specificgenelistwereremovedfromtheanalysis,andtheremaininggeneswereconvertedtohumanorthologsusingthebiomaRtpackage38.We
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
36
usedtheFisher’sExactTesttocheckforenrichmentofcell-specificmarkerswitheachsetofpredicted genes. The false discovery rates (FDR) were calculated using the Benjamini-Hochbergprocedure39andanFDRthresholdof5%wasusedtoassessthesignificanceoftheenrichmentobserved.
Assessingtheimportanceofpredictorsindisease-specificclassifierensemblesGivena setof trees fromanensemble,wemeasure the importanceofeachMLpredictorwithintheensembleonthebasisof its relevancewhenclassifyinggenesas“Disease”.Webase the calculation of importance on two different properties of a predictorwhen usedwithinasetoftrees,namelythefrequencyofappearanceacrossthetreesandtheaveragedepthofthepredictorwithineachtree.Clearly,whenapredictorappearshighinatree(i.e.neartotherootoreventherootitself)anditssubtreeleadstoahigherproportionofgenesclassifiedas“Disease”,thepredictorhasrelevancefordiseaseclassification.Moreover,whentheattributeisusedmanytimesacrosstreesitisalsoimportant.Wecalculatetwomeasurestoreflecttheseproperties.Notethateachnon-leafnodeonatreeisatest(positivetestinthiscase)forapredictor.Leafnodesarelabels,either“Disease”or“Non-disease”.Inregardtodepth,givenapredictorpandtreet,withamaximumofnnodesbetweentherootandanyleave,depthisdefinedforthepredictorpinthetreetasthemeanofthep1,p2,…,pmvalueswhenpappearsmtimesatthetreetandpiisnminusthenumberofnodesfromtheroottothatnodeatthei-thappearanceofthepredictorwithinthetree.Depthisalwaysin[0,1].With respect toappearance, theappearanceof apredictorp in a treesensemble issimplyitsfrequencyofappearanceasrootnodeofasubtreewithhigherproportionofgenesclassifiedas“Disease”forthatsubtreegeneratingavaluein[0,1].Asasinglemeasure,wesimplyaddthesetwomeasures.Asasinglemeasurewesimplyaddthesetwovalues.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
37
CodeavailabilitystatementThecodewehavedevelopedforthispaperisavailablerightnowuponrequest.
Weareworkingonmakingiteasiertouse.Itwillbereadybeforepublication.
DataavailabilitystatementAllthenewdataproducedinthispaperhasbeenmadeavailablethroughthesupplementarymaterialsaccessiblefromthesubmission.
AcknowledgementsMinaRyten,David ZhangandKarishmaD’Sawere supportedby theUKMedicalResearchCouncil(MRC)throughtheawardofTenure-trackClinicianScientistFellowshiptoMinaRyten(MR/N008324/1).SebastianGuelfiwassupportedbyAlzheimer’sResearchUKthroughtheawardofaPhDFellowship(ARUK-PhD2014-16).ReginaReynoldswassupportedthroughtheawardofaLeonardWolfsonDoctoralTrainingFellowshipinNeurodegeneration.
Henry Houlden was supported by the UKMedical Research Council, theWellcome Trust,MuscularDystrophyUK,MuscularDystrophyAssociationUSAandtheNationalInstituteforHealthResearch.
Thisresearchwasmadepossiblethroughaccesstothedataandfindingsgeneratedbythe100,000GenomesProject.The100,000GenomesProjectismanagedbyGenomicsEnglandLimited (a wholly owned company of the Department of Health). The 100,000 GenomesProject is funded by the National Institute for Health Research and NHS England. TheWellcome Trust, Cancer ResearchUK and theMedical Research Council have also fundedresearch infrastructure. The100,000GenomesProjectusesdataprovidedbypatientsandcollectedbytheNationalHealthServiceaspartoftheircareandsupport.
AuthorcontributionsJuanBotíageneratedthemachinelearningclassifierensembles.JuanBotía,SebastianGuelfi,David Zhang, Karishma D’Sa, Regina Reynolds andMina Ryten assessed and analysed theoutputofclassifierensembles.DanielOnahandJuanBotíadesignedandimplementedawebservertoaccesspredictions.
EllenMcDonaghandAntonioRuedaMartinconceived,designed,builtandcreatedPanelAppwhere 800 registered reviewers worldwide have supplied data or curated 167 panels forgenome analysis and helped establish 213 reviewable panels with 4104 genes. AugustoRendoncontributedtothedesignofPanelAppandcoordinatedthedevelopmentandcurationefforts.
Arianna Tucci and Henry Houlden (on behalf of the Neurology Genomics England ClinicalInterpretationPartnership)assistedandledthecurationofgenepanels.
JuanBotía,DavidZhang,JohnHardyandMinaRytenpreparedthemanuscript.JuanBotíaandMinaRytenconceivedanddesignedtheproject.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
38
ConflictsofInterestTheauthorshavenoconflictsofinteresttodeclare.
.CC-BY-NC-ND 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/288845doi: bioRxiv preprint
Recommended