Comparative genomics
Lucy Skrabanek, ICB, WMC
15 April 2010
What does it encompass?
• Genome conservation
  – Transfer knowledge gained from model organisms to non-model organisms
• Genome evolution
  – Understand how genomes change over time in order to identify evolutionary processes and constraints
• Genome variation
  – Understand how genomes vary within a species to identify genes central to particular processes
Main uses
• Whole genome comparisons
  – Genome evolution
• Coding region comparisons
  – Gene prediction
  – Gene structure (exon-intron) prediction
  – Function prediction
• Non-coding region comparisons
  – Regulatory region discovery
• Protein-protein interaction prediction
iTOL; Letunic and Bork, Bioinformatics 2007
Rearrangement rate
• Can estimate the number of rearrangements from cytological comparisons
• Two very different rates:
  – Very slow rate of rearrangement (1 or 2 exchanges per 10 MYR)
    • ~7 rearrangements between the human genome and the hypothetical primate ancestor (60 MYA)
    • 13 rearrangements between cat and human
Rearrangement rate
• Punctuated by abrupt global genome rearrangement in some lineages
  – Gibbons and siamangs rearranged 3 or 4 times more extensively than human or other great apes
  – Dogs have highly rearranged genomes compared to the ancestral Carnivora genome
  – Rodent species exhibit very rapid patterns of chromosome change
    • ~180 conserved segments between mouse and human
    • ~100 conserved segments between rat and human
Human, cat and mouse X chromosome comparison
O'Brien et al., Science 1999
Genome comparison across species: gross changes in chromosome number
O'Brien et al., Science 1999
Genome alignment
• Very different from single gene or protein alignments
• Standard dynamic programming algorithms (DPAs) are too expensive
• Made complicated by extensive rearrangements of large homologous segments
Problems
• Looking for syntenic regions
• Rearrangements disrupting syntenic regions
  – Insertions
  – Deletions
  – Inversions
  – Translocations
  – Duplications
Assumptions
• The two genomes to be aligned are derived from a common ancestor
• There remains sufficient similarity between the genomes to enable easy identification of homologous regions
• For the alignment to be informative, there has to have been time for the genomes to diverge and for selection to have occurred
Algorithm requirements
• Genome alignment algorithms must
  – Scale linearly (computationally)
  – Be robust (not too many parameters)
  – Be memory-efficient
  – Be able to handle rearrangements, gene duplications, repetitive elements
• Smith-Waterman, Needleman-Wunsch
  – Calculation time is of the order O(n^2)
    • Not feasible for sequence length > 10,000 bp
  – Cannot handle rearrangements or inversions
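To make the quadratic cost concrete, here is a minimal sketch of Needleman-Wunsch scoring that also counts the dynamic programming cells it fills. The scoring values (match = 1, mismatch = -1, linear gap = -2) and the function name are illustrative assumptions, not values from the slides. Two 10,000 bp sequences already require 10^8 cells.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap cost.

    Only two rows of the DP matrix are kept, so memory is O(len(b)),
    but every one of the len(a) * len(b) cells is still computed.
    Returns (score, number_of_cells_filled).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    cells = 0
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
            cells += 1
        prev = curr
    return prev[-1], cells


score, cells = needleman_wunsch_score("ACGT", "AGT")
print(score, cells)
```

Doubling both sequence lengths quadruples the cell count, which is why whole-genome aligners rely on seeds or anchors rather than full dynamic programming.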
Alignment methods
• Seeding methods (e.g., BLASTZ, BLAT, EXONERATE)
  – Produce “local” alignments
  – All matches found (including all paralogs)
    • Very sensitive, not very specific
  – Very fast
• Anchor-based methods (e.g., MUMmer, AVID, WABA)
  – Produce “global” alignments
  – Specific
  – Difficulty with rearrangements
• Multiple sequence aligners (e.g., Mauve, GRIMM-Synteny, Shuffle-LAGAN, Mercator)
  – Designed to deal with rearrangements
Local aligner: BLASTZ
• As in BLAST, find seeds
  – Seeds are determined by a 12-of-19 weighted spaced-seed strategy, where the positions for strict matches are specified
  – The pattern 1110100110010101111 was shown to be the most sensitive (and more sensitive than the 11-consecutive-match seed strategy used by BLAST)
  – Also allows a transition at any one of the 12 strict-match positions
  – Seeds with many matches are masked out (assumed to be repetitive regions)
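The spaced-seed rule above can be sketched as a check on two 19-base words. This is an illustrative sketch, not BLASTZ's actual code: the function name `seed_hit` and the `allow_transition` flag are assumptions, while the pattern and the one-transition allowance come from the slide.

```python
SEED = "1110100110010101111"  # 12 strict-match positions out of 19
STRICT = [i for i, c in enumerate(SEED) if c == "1"]
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hit(x, y, allow_transition=True):
    """Return True if two 19-mers match at every strict ('1') position.

    Positions marked '0' are ignored. If allow_transition is set, a single
    transition (A<->G or C<->T) is tolerated at one strict position.
    """
    if len(x) != len(SEED) or len(y) != len(SEED):
        raise ValueError("both words must be 19 bases long")
    transitions = 0
    for i in STRICT:
        if x[i] == y[i]:
            continue
        if allow_transition and (x[i], y[i]) in TRANSITIONS:
            transitions += 1
            if transitions > 1:
                return False
        else:
            return False
    return True


word = "ACGTACGTACGTACGTACG"
print(seed_hit(word, word[:3] + "A" + word[4:]))  # mismatch at a '0' position
```

Because mismatches at the seven "don't care" positions are free, a spaced seed tolerates the scattered substitutions typical of diverged homologous sequence better than a contiguous word of the same weight.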
BLASTZ
• Gap-free extension
• Further extend seeds by DPA, allowing gaps
  – Low-complexity regions have their scores specifically down-weighted
• Repeat the above steps (using a more sensitive seed, e.g., a 7-mer) for regions that lie between matches that share the same order and orientation and are separated by < 50 kb
• Post-processing to remove multiple hits to the same region
• When aligning human and mouse genomes, can achieve 98% alignment of known coding regions
Local aligner: BLAT
• Two modes: untranslated and translated
  – Untranslated mode performs poorly when conservation is < 90%, so translated mode is usually used when aligning genomes
    • Will more efficiently identify regions that are conserved for their coding ability rather than for their regulatory functions
• Seeds created by building an index of non-overlapping 5-mers from one genome
  – Frequent 5-mers and ambiguous sequences (repeats and low-complexity regions) excluded
• The other genome is chopped into < 200 kb chunks
  – Comparisons are made between the indexed 5-mers and the chunks
• DPA applied both upstream and downstream of matches
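The indexing step can be illustrated with a small sketch: one sequence is indexed by non-overlapping k-mers, and every overlapping k-mer of a query chunk is looked up in that index. The function names and the `max_occurrences` filter are hypothetical and stand in for BLAT's own data structures and repeat filtering.

```python
from collections import defaultdict


def build_index(db, k=5, max_occurrences=None):
    """Index the non-overlapping k-mers of db as {k-mer: [start positions]}."""
    index = defaultdict(list)
    for start in range(0, len(db) - k + 1, k):  # step of k: words do not overlap
        index[db[start:start + k]].append(start)
    if max_occurrences is not None:
        # Drop over-represented words, which are likely to be repetitive.
        return {w: p for w, p in index.items() if len(p) <= max_occurrences}
    return dict(index)


def find_seeds(index, query, k=5):
    """Look up every overlapping k-mer of the query; return (query_pos, db_pos) hits."""
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], ()):
            hits.append((q, d))
    return hits


index = build_index("ACGTAGGCTTAACGTA")
print(find_seeds(index, "TTGGCTTC"))
```

Indexing only non-overlapping words shrinks the index by a factor of k, while scanning the query with overlapping words still catches any exact match long enough to contain an indexed word.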
Global aligner: Shuffle-LAGAN
• CHAOS: local aligner
  – Finds matching seeds, and chains them together (within a cutoff distance)
  – Extends chains with ungapped BLAST
• Create a map where the anchors have to be collinear in one sequence only (1-monotonic conservation map)
• Align consistent subsegments
  – Sort all local alignments by their start coordinates, identify alignable subsegments, expand to borders of adjacent subsegments (but trim if any characters are overlapping)
Brudno et al., Bioinformatics, 2003
Global aligner: Mauve
1. Find local alignments (multi-MUMs)
   – Seed-and-extend hashing method
2. Create a phylogenetic guide tree (NJ)
3. Select a subset to use as anchors
   a) Partition into LCBs (locally collinear blocks)
4. Recursively identify additional anchors around and within each LCB
5. Progressively align each LCB using the guide tree
Darling et al., Genome Research, 2004
Breakpoint identification: GR-Aligner
• Identify and merge conserved sequence pairs
  – Bl2Seq used to generate local non-overlapping matches
  – Matches are sorted by start position
  – Merge adjacent matches if they are either both direct or both inverse and if the combined score is over some threshold (interval aligned with NW)
• Identify breakpoints in inversions
  – Identify all inverse matches flanked by direct matches
  – Align homologous sequence between the inverse and direct match and find the optimal score
Chu et al., Bioinformatics, 2009
GR-Aligner
• Identify breakpoints in transpositions
  – Identify all direct matches that cross only one other direct match, where the crossed matches are flanked by direct matches that do not cross any other matches
  – Align sequence between the transposed match with the sequences outside the transposed match and find the optimal score
Chu et al., Bioinformatics, 2009
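The first inversion step (inverse matches flanked by direct matches) can be sketched as a scan over matches sorted by start position. This is a simplified illustration, not GR-Aligner's implementation: the `Match` record and function name are assumptions, and the later alignment step that refines the breakpoints is omitted.

```python
from typing import NamedTuple


class Match(NamedTuple):
    start: int   # start coordinate on the reference sequence
    end: int     # end coordinate on the reference sequence
    strand: str  # "+" for a direct match, "-" for an inverse match


def inversion_candidates(matches):
    """Return inverse matches whose immediate neighbours (by start) are both direct."""
    ordered = sorted(matches, key=lambda m: m.start)
    return [
        mid
        for left, mid, right in zip(ordered, ordered[1:], ordered[2:])
        if mid.strand == "-" and left.strand == "+" and right.strand == "+"
    ]


matches = [Match(300, 400, "+"), Match(0, 100, "+"), Match(150, 250, "-")]
print(inversion_candidates(matches))
```

The breakpoints themselves would then be sought in the unaligned intervals between each candidate and its two flanking direct matches.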
Genome alignment parameters
• Repeat-masking, E-value
  – Estimated by aligning a reversed genome
  – TRF was the only repeat finder that eliminated many spurious alignments (non-standard parameters, hard-masking)
• Scoring matrix
• Gap costs
  – 495 combinations of the above 3 parameters (using hard-masking with TRF and an E-value of 1)
  – For protein family alignments, 2:1:2:16:1 [match score : transition cost : transversion cost : gap-open : gap-extend] balanced sensitivity and specificity
  – For structural RNA alignments, 3:3:4:24:1 and 4:4:5:24:1
• X-drop parameter
  – Used to terminate extension of gapped alignments (terminates if the score drops by more than X below the previously seen maximum score)
  – Seems to find fewer false positives when X-drop is not greater than the alignment score cutoff
Frith et al., BMC Bioinformatics, 2010
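The X-drop termination rule can be shown with a small sketch. For brevity it extends without gaps, whereas the slide describes the rule for gapped extension; the stopping criterion is the same. The function name and the match/mismatch scores are illustrative assumptions.

```python
def xdrop_extend(a, b, i, j, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment rightwards from a[i] and b[j].

    Extension stops as soon as the running score falls more than x_drop
    below the best score seen so far. Returns (best_score, best_length).
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        score += match if a[i + k] == b[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


print(xdrop_extend("ACGTTCGT", "ACGTACGT", 0, 0, x_drop=2))
print(xdrop_extend("ACGTTCGT", "ACGTACGT", 0, 0, x_drop=0))
```

With a larger X the extension can cross a short mismatched stretch and reach further similarity; with a small X it stops at the first dip, which is the trade-off behind the observation about X-drop and the score cutoff.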
Visualization tools: VISTA vs. PipMaker
• VISTA (VISualization Tools for Alignments)
  – Uses AVID to generate alignments
  – Uses a sliding window approach
    • Plots percent identity within a fixed window size, at regular intervals
• PipMaker (Percent Identity Plot)
  – Uses BLASTZ
  – X-axis is the reference sequence; horizontal lines represent gap-free alignments
http://pipmaker.bx.psu.edu/pipmaker/
http://genome.lbl.gov/vista/index.shtml
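The sliding-window idea behind a VISTA-style plot can be sketched as follows. The input is assumed to be two rows of a pairwise alignment of equal length with "-" for gaps; the window and step sizes and the function name are illustrative, not VISTA defaults.

```python
def percent_identity_windows(ref_aln, other_aln, window=100, step=20):
    """Percent identity in fixed-size windows over two aligned rows.

    Returns a list of (window_start_column, percent_identity). A column
    counts as identical only if both rows carry the same non-gap character.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned rows must have equal length")
    points = []
    for start in range(0, len(ref_aln) - window + 1, step):
        same = sum(
            1
            for c in range(start, start + window)
            if ref_aln[c] == other_aln[c] and ref_aln[c] != "-"
        )
        points.append((start, 100.0 * same / window))
    return points


print(percent_identity_windows("ACGTACGTAC", "ACGTTCGTAC", window=5, step=5))
```

Plotting the second element of each pair against the first gives the familiar conservation curve; peaks above a chosen identity threshold are the candidate conserved elements.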
2-D view: VISTA-dot
http://genome.jgi-psf.org/synteny/
Circos
http://mkweb.bcgsc.ca/circos/
• Once we have a macro-alignment, we can study the evolution of genomes between species, and can also trace the evolutionary history of the structure of each genome itself
• Genome structure rearrangement
• Gene duplication, chromosomal duplication and polyploidization (whole genome duplication)
  – New genes
Phylogenetic tree: shared gene content
http://www.bork.embl-heidelberg.de/~korbel/SHOT_v2/
Phylogenetic tree: gene order
Boore et al., Nature, 1998
Unravelling the history of the genome
• Plants
  – Wheat
    • Allohexaploid (AABBDD)
  – Maize
    • Diploidized allotetraploid
    • Has grown 12x in the past 5 MY due to increased numbers of transposable elements
  – Arabidopsis
    • Haploid chromosome number n = 5
• Yeast
  – Saccharomyces cerevisiae vs. Kluyveromyces lactis
• Human
  – Genome duplication?
Decipher history of lineages
• Genome size changes
  – Compaction
    • Large-scale deletions (Fugu)
    • Intron loss
  – Expansion
    • Duplication
    • Transposable element insertions
• Transposons (Alus, LINEs)
  – Ancient insertions prior to eutherian radiation
  – More recent insertions (maize)
Baxendale et al., Nature Genet, 1995
Why gen(om)e duplication?
• Duplicated genes provide a source for genetic novelty during evolution
  – Either member of a duplicated gene pair can diverge to
    • Acquire a new function which may be positively selected for
    • Subfunctionalize
    • Be differently regulated (e.g., tissue-specific)
    • Become a pseudogene
    • Be deleted
• Whole genome duplication allows for the duplication (and subsequent divergence) of whole pathways at a time
Duplication events
• Resolution of polyploidy
  – Non-disjunction of chromosomes (form multivalents instead of bivalents)
    • Sterility
  – Duplicated genes do not start to diverge (or get deleted) until disomic inheritance is resolved
Rearrangements
• Are genome rearrangements the cause or consequence of diploidization?
  – Most widely accepted hypothesis is that diploidization proceeds by structural divergence of chromosomes
  – Some loci appear disomic, others tetrasomic
    • Stage 1: pairing between similar chromosomes allowed
      – Loci near the centromeres can display residual tetrasomy
    • Stage 2: non-homologous chromosome pairing resolved
Effects of polyploidization
Wolfe, Nature Rev Genet 2001
Saccharomyces cerevisiae (baker's yeast)
Whole Genome Duplication (WGD)
WGD in S. cerevisiae: Wolfe & Shields, Nature 1997
Yeast
• Saccharomyces cerevisiae
  – Degenerate tetraploid
  – Polyploidy followed by extensive deletion and (70-100) reciprocal translocations
  – 8% of genes duplicated in 55 blocks (plus many missed smaller blocks)
  – Relative orientation of genes in blocks conserved with respect to the centromere
Seoighe and Wolfe, PNAS, 1998
Duplicated blocks in yeast
Wolfe and Shields, Nature 1997
Estimation of time of polyploidy event
• Diverged from Kluyveromyces lactis (unduplicated) ~150 MYA
  – Comparison of gene sequences and gene order reveals conservation
    • 59% of adjacent gene pairs in K. lactis or K. marxianus are also adjacent in S. cerevisiae
    • 16% of Kluyveromyces neighbors can be explained in terms of inferred ancestral gene order
  – Phylogenetic analyses of duplicated genes, where both the Kluyveromyces orthologue and an outgroup orthologue were available, deduced that the polyploidization event in S. cerevisiae occurred around 100 MYA
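A statistic such as "percent of adjacent gene pairs that are also adjacent in the other species" can be computed with a sketch like the one below. Gene orders are assumed to be simple lists along a chromosome and `orthologs` a one-to-one mapping; all names are hypothetical, and real analyses must also handle multiple chromosomes and missing genes.

```python
def adjacent_pairs(order):
    """Unordered pairs of neighbouring genes in a gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def fraction_conserved_adjacency(order_a, order_b, orthologs):
    """Fraction of adjacent pairs in genome A whose orthologs are adjacent in genome B.

    Pairs in which either gene lacks an ortholog are skipped.
    """
    pairs_b = adjacent_pairs(order_b)
    total = conserved = 0
    for g1, g2 in zip(order_a, order_a[1:]):
        if g1 in orthologs and g2 in orthologs:
            total += 1
            if frozenset((orthologs[g1], orthologs[g2])) in pairs_b:
                conserved += 1
    return conserved / total if total else 0.0


k_order = ["k1", "k2", "k3", "k4"]
s_order = ["s1", "s2", "s4", "s3"]
orthologs = {"k1": "s1", "k2": "s2", "k3": "s3", "k4": "s4"}
print(fraction_conserved_adjacency(k_order, s_order, orthologs))
```

Using unordered pairs means a local inversion of two neighbours still counts as conserved adjacency, which suits a measure of gene order rather than orientation.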
Kellis et al., Nature 428:617-624 (2004)
[Figure labels: 10% retention; 5,000 genes; 10,000 genes; 5,500 genes; 5,000 genes; different subsets retained]
Evidence from conserved order of a very few genes
Evidence from interleaving genes from sister segments
Each region of K. waltii matches two regions of S. cerevisiae
We don’t even need any remaining two-copy genes to infer the ancestral order
Human
• 1970: Susumu Ohno proposed that vertebrate genomes had originated via genome duplication
• 2R (two Rounds [of genome duplication]) hypothesis
  – One before, and one after, the divergence of agnathans (lamprey) from tetrapods (~430 MYA and ~500 MYA)
  – Popular and controversial
    • Split between the map-based people and the tree-based people
• Duplicated regions are evident (covering ≈44% of the genome), but it is difficult to tell whether this is due to genome duplication(s) or chromosomal duplications
Evidence?
• There are regions in the genome which are quadruplicated
  – HSA 1, 6, 9 and 19 (MHC region, 10 gene families)
  – HSA 4, 5, 8 and 10
  – HSA 2, 7, 12 and 17 (Hox clusters)
• Expect to see an (A,B)(C,D) tree, where the time of divergence of A from B and of C from D is approximately the same
  – However, this is not consistently the case
Possible routes to 4-gene families
Hokamp et al., J Struct Func Genomics 2003
Hughes, Mol Biol Evol 1998
Estimates of divergence times for genes in the MHC region on chromosomes 1, 6, 9 and 19
Hughes et al., Genome Res, 2001
Divergence times of genes with members on at least two of the Hox cluster bearing chromosomes (2, 7, 12, 17)
Hughes et al., Genome Res, 2001
Phylogenetic relationships of four- and three-membered gene families on Hox cluster bearing chromosomes
Conclusions
• Duplicated regions seen in the human genome are most likely vertebrate-specific
• Significant amount of duplication occurring ~350-650 MYA
  – Possible to explain large margin by alloploidy?
• Probable support for at least one whole genome duplication event
• With the availability of more genomes, such as Ciona intestinalis or amphioxus, it may be easier to decipher the history of duplication
  – However, long time spans are under consideration, and diploidization requires extensive genomic changes
PLOS Biology 3:e314 (2005)
Panopoulou and Poustka, TIG 21:559, 2005
Fish-specific genome duplication event
Proposed evolution of the Hox clusters
Skrabanek, PhD thesis, 1999
Lander et al. (Intl. Human Genome Sequencing Consortium paper), Nature 409:860 (2001)
Find most similar vertebrate gene (here M1) to the Ciona gene. Other vertebrate genes are added to the cluster if they are more similar to M1 than M1 is to the Ciona gene.
[Figure labels: S1; < S1; > S1]
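The clustering rule stated above (seed with the vertebrate gene most similar to the Ciona gene, then add vertebrate genes that are more similar to that seed than the seed is to the Ciona gene) can be sketched as below. The `similarity` callable, the function name, and the reading of the threshold as "s1" are assumptions for illustration; any symmetric score where higher means more similar would do.

```python
def build_cluster(ciona_gene, vertebrate_genes, similarity):
    """Cluster vertebrate genes around the best match to a Ciona gene.

    similarity(x, y) must return a score where higher means more similar.
    Returns (seed_gene, threshold, cluster_members).
    """
    # Seed: the vertebrate gene most similar to the Ciona gene ("M1").
    m1 = max(vertebrate_genes, key=lambda g: similarity(ciona_gene, g))
    s1 = similarity(ciona_gene, m1)  # threshold: similarity of M1 to the Ciona gene
    # Add genes that are closer to M1 than M1 is to the Ciona gene.
    members = [m1] + [
        g for g in vertebrate_genes if g != m1 and similarity(m1, g) > s1
    ]
    return m1, s1, members


toy = {("c", "m1"): 60, ("c", "m2"): 50, ("c", "m3"): 40,
       ("m1", "m2"): 80, ("m1", "m3"): 55}


def toy_similarity(x, y):
    return toy.get((x, y), toy.get((y, x), 0))


print(build_cluster("c", ["m2", "m1", "m3"], toy_similarity))
```

The threshold ties cluster membership to the outgroup: genes that diverged from the seed after the split from Ciona should be more similar to it than the Ciona gene is.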
Duplicates may have arisen by speciation (lineage-splitting) or by gene duplication events specific to one or more vertebrates
Fugu-specific duplication
Human-specific duplication
Finding all homologs, and only homologs
Number of gene duplications
46.6% of ancestral chordate genes are duplicated in ≥1 lineage
34.5% with at least one duplication before Fugu-tetrapod split
23.5% with at least one duplication after Fugu-tetrapod split
No evidence of 2R hypothesis from gene family membership alone (no peak at four genes/family), nor from phylogenetics (since duplications abundant on every branch)
[Figure: hypothetical genomic region; labels: 1R, 2R, decay, decay]
• Paralogs generated by a gene duplication before the Fugu-tetrapod split count as matches.
• N-fold redundancy is calculated by identifying all cases where ≥ 2 different duplicates are within a 100-gene window, and then counting the number of times that their paralogs occur within a 100-gene window elsewhere in the genome.
2R
4-fold redundancy most common - accounts for 25% of the genome
• 1,912 genes duplicated prior to fish-tetrapod split
• 2,953 paralogous gene pairs
• 32.4% are found in 386 detectable paralogous segments, comprising 772 individual genomic segments
• 454 are tetra-paralogons, where overlapping sets fall into 4-fold groups
• Of the genes that duplicated after the fish-tetrapod split, only 11% are found in paralogous regions (i.e., duplications after the split did not include large segments of the genome)
Could be of recent origin, or could have undergone multiple rearrangement events that destroyed the tetra-paralogons signal
Old duplications Recent duplications
Hox-bearing chromosomes: 50% of genes duplicated after the fish-tetrapod split are tandem duplicates (separated by < 10 genes), whereas only 6% of genes duplicated before the split are tandem duplicates.
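The tandem-duplicate criterion (paralogs separated by fewer than 10 genes) can be expressed as a short check. Gene positions are assumed to be (chromosome, gene index) tuples, and "separated by" is read as the number of intervening genes; both are assumptions of this sketch, as are the function names.

```python
def is_tandem(pos_a, pos_b, max_separation=10):
    """True if two paralogs lie on the same chromosome with fewer than
    max_separation genes between them.

    Each position is a (chromosome, gene_index) tuple.
    """
    chrom_a, idx_a = pos_a
    chrom_b, idx_b = pos_b
    intervening = abs(idx_a - idx_b) - 1
    return chrom_a == chrom_b and intervening < max_separation


def tandem_fraction(pairs, max_separation=10):
    """Fraction of paralogous pairs classified as tandem duplicates."""
    if not pairs:
        return 0.0
    return sum(is_tandem(a, b, max_separation) for a, b in pairs) / len(pairs)


pairs = [(("chr2", 5), ("chr2", 6)), (("chr2", 5), ("chr7", 6))]
print(tandem_fraction(pairs))
```

Computing this fraction separately for pairs that duplicated before and after the fish-tetrapod split gives the kind of contrast quoted above.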
Conclusions
• 2R hypothesis is the most likely scenario
• Most likely scenario: two rounds of closely spaced auto-tetraploidization events
  – Some paralogous pairs within tetra-paralogons extend over longer regions than others
    • So octaploidy unlikely, since pairs of regions would have had to lose the same sets of genes
  – Phylogenetic trees are not consistently nested
    • So allotetraploidy or two distantly spaced autotetraploidy events unlikely
  – Tree topologies within paralogous blocks not always congruent
    • Gene loss and rediploidization processes probably spanned the two duplication events
Identification of functional elements
• Coding sequences
  – Relatively easy to identify
    • Many gene prediction programs available
    • General gene structure known, e.g.,
      – TATA boxes
      – Splice donor-acceptor sites
  – ESTs and cDNAs available to aid gene prediction
• Non-coding functional sequences
  – Much harder to identify
  – No common structure of regulatory regions
  – TF binding sites are short and ubiquitous
• Comparative genomics
  – A genomic sequence that provides a function that is under selection and tends to be conserved between species is called a “functional sequence” (e.g., a protein-coding region or a transcription factor binding site)
Coding region comparison
• Discover new genes
  – Annotate gene structure
    • Exon-intron structure
• Compare gene content
  – Find genes common to sets of organisms
  – Find genes unique to an organism
• We can study how genomes vary within a species to identify genes central to particular processes
  – Can also compare strains, e.g., E. coli K12 and O157:H7 (pathogenic)
• Discover gene function
• Can find missing genes in metabolic pathways
Predicting structure of ‘new’ genes
• Many gene finding programs
  – Look for start sites, termination sites, splice sites
  – Analyze codon usage, exon length
• Comparative method A
  – Find syntenic regions
  – Predict genes using conventional gene finding techniques in both species
  – Genes predicted in both species are probably real
Gene prediction
• Comparative method B
  – Find syntenic regions
  – Perform pattern filtering
    • Coding exons tend to be well conserved
    • Conservation higher in first and second positions of codons
  – Advantage: can deal with sequencing errors
Identification of paralogues and orthologues in other species
• Best Reciprocal Hit (BRH)
• Reciprocal Hit by Synteny (RHS)
  – Identification of adjacent orthologues
• Domain checking: internal quality control
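The Best Reciprocal Hit criterion can be sketched as follows: gene a in one genome and gene b in the other are called orthologues if b is a's top-scoring hit and a is b's top-scoring hit. The input format (nested dictionaries of similarity scores, e.g. parsed from an all-against-all search) and the function names are assumptions of this sketch.

```python
def best_hits(scores):
    """Map each query to its top-scoring subject.

    scores has the form {query: {subject: score}}; queries with no hits are skipped.
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


human_vs_mouse = {"h1": {"m1": 200, "m2": 150}, "h2": {"m2": 180}, "h3": {"m1": 90}}
mouse_vs_human = {"m1": {"h1": 200, "h3": 90}, "m2": {"h1": 150, "h2": 180}}
print(best_reciprocal_hits(human_vs_mouse, mouse_vs_human))
```

In the toy data h3 also hits m1, but m1's best hit is h1, so h3 is left unpaired; cases like this are where a synteny-based criterion such as RHS can add evidence.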
Mouse Human
http://www.nbn.ac.za/
Identification of orthologues
[Figure: human-mouse orthologue identification; labels: BRH; RHS; Orphan; Others; Matches to some other chromosome]
http://www.nbn.ac.za/
Genome sequences of 3 strains of E. coli
Welch et al., PNAS 2002
4,288 genes
5,063 genes
5,016 genes
Rich in genes encoding potential fimbrial adhesins, autotransporters, iron-sequestration systems, phase-switch recombinases.
Includes genes encoding fimbrial adhesins, iron uptake systems, as well as toxins and toxin-transport systems and integrases.
Conserved non-coding elements
• PhastCons: phylo-HMM
• Alignment of five vertebrate genomes
• 3-8% of the human genome is conserved in other vertebrates (about 1.18 million elements)
  – Distinguish from “ultra-conserved” coding elements (exons)
  – Many highly conserved elements (HCEs) overlap with 3’ UTRs
  – Many HCEs show potential to have local RNA secondary structure
    • Possible function: post-transcriptional modifications/regulation, or coding for functional RNAs
  – Also, many HCEs are in gene deserts: possible distal regulatory elements?
Siepel et al., Genome Research 2005