GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES
Shin-Han Shiu
Plant Biology / QBMI
Michigan State University
Genomes and gene contents
[Figure: gene counts of the genomes shown: 30,000; 25,000; 10,000; 6,000; 45,000; 17,000]
Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters from Markov clustering, using all-against-all BLAST E values as the distance measure
Gene function and duplication
What’s the consequence?
Focus I: Duplication Mechanism and Loss Rate
[Diagram: gene duplications: mechanisms, preferential retention, consequences]
Duplication mechanisms
Whole genome duplication
Tandem duplication
Segmental duplication
Replicative transposition
Lineage-specific gains in plants and animals
Organism      Lineage-specific gains   Normalized gain*   # of genes in families analyzed   % total
Rice          10115                    6743               28467                             35.5 (23.7)**
Arabidopsis   5984                     3990               21936                             27.3 (18.2)**
Human         811                      811                21954                             3.7
Mouse         1265                     1265               24041                             5.3
*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses refer to percentage total based on normalized gains.
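The normalization in the footnotes can be sketched in a few lines of Python. This is a minimal sketch, assuming the normalized gain is simply the raw gain scaled by the ratio of the two divergence times (100/150 for the plant pair); the function and variable names are illustrative and not from the slides.

```python
# Divergence times quoted in the table footnote (million years ago).
ARABIDOPSIS_RICE_MYA = 150
HUMAN_MOUSE_MYA = 100

def normalized_gain(gain, divergence_mya, reference_mya=HUMAN_MOUSE_MYA):
    """Scale a raw gain count to the human-mouse divergence time (assumed method)."""
    return gain * reference_mya / divergence_mya

def percent_total(gain, genes_analyzed):
    """Gains as a percentage of the genes in the families analyzed."""
    return 100.0 * gain / genes_analyzed

# Rice row of the table.
rice_gain, rice_genes = 10115, 28467
rice_norm = normalized_gain(rice_gain, ARABIDOPSIS_RICE_MYA)

print(round(rice_norm))                                 # normalized gain
print(round(percent_total(rice_gain, rice_genes), 1))   # % total, raw gain
print(round(percent_total(rice_norm, rice_genes), 1))   # % total, normalized gain
```

For the rice row this gives the 6743, 35.5, and 23.7 shown in the table. For Arabidopsis the same formula gives about 3,989, within rounding of the 3,990 on the slide.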
Substantially more recent duplicates in plants than in animals
Mostly due to frequent whole genome duplications in plants
Gain vs. Loss
3 rounds of whole-genome duplications in the Arabidopsis lineage
~82% of duplicates from the last round were lost in the past 40 million years
[Figure: 15,000* doubling to 30,000, 60,000, and 120,000]
Genome duplications + tandem duplications - gene losses = Arabidopsis gene content: 21,000**
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
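As a back-of-envelope check on the figure, the sketch below doubles the ancestral count three times and compares the result with the observed gene content. It is only an illustration: it ignores tandem duplications, and the shortfall it reports is cumulative over all three rounds, so it is not the derivation of the ~82% last-round loss quoted on the slide. Variable names are illustrative.

```python
# Numbers taken from the slide; the calculation itself is an assumption-laden sketch.
ancestral_groups = 15_000   # orthologous groups shared with rice
wgd_rounds = 3              # whole-genome duplications in the Arabidopsis lineage
observed_genes = 21_000     # Arabidopsis genes in shared families

# Expected gene count if every duplicate from every round had been kept.
no_loss_expectation = ancestral_groups * 2 ** wgd_rounds

# Fraction of that expectation actually present, and the cumulative shortfall.
retained_fraction = observed_genes / no_loss_expectation
cumulative_shortfall = 1 - retained_fraction

print(no_loss_expectation)
print(f"retained: {retained_fraction:.1%}, missing: {cumulative_shortfall:.1%}")
```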
“Age” distribution of animal duplicates
Steady decay in the number of duplicates
Frequent TD, SD, and RT
Ks: rate of synonymous substitutions, i.e. nucleotide substitutions at codon sites that do not affect amino acid identity
Shiu et al., 2006
Plant duplicate “age” distribution
Apparent peak at ~0.18 instead of zero Ks
Frequent WGD, TD, SD (maybe), and RT (in some plants)
Shiu et al., 2004
Genome remodeling in polyploids
Natural and synthetic polyploids
[Figure: genome sizes ~348 Mb, ~203 Mb, ~314 Mb, ~257 Mb; 20,000 yr]
Experimental approaches
Genome-wide polymorphism monitored by tiling array
[Diagram: genome covered by tiled probes (gap, resolution); array with ~6 million features]
Genome-wide Single Feature Polymorphism
Mid-parent (MP) vs. Arabidopsis suecica (As)
Polyploid    SFP
Natural      58,517
Synthetic    503
[Figure: SFPs by feature type: gene, pseudogene, transposon]
Duplication or deletion
MP duplication or As deletion
Genome Survey Sequencing
Sequence ~40-60 Mb of the Arabidopsis suecica genome
  0.15-0.2X coverage, will be done next week!
Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant
Ultra-high throughput
  20-30 Mb per run, each run 5 hours
  Will be 100 Mb per run early 2007
Cost efficient
  ~$0.3/kb
Read length rather limited
  ~100 bp per read now
  Will be ~200 bp early 2007
For more information contact:
  Andreas Weber (aweber@msu.edu)
  David DeWitt (dewittd@msu.edu)
  Or Shin-Han Shiu (shius@msu.edu)
Seminar on instrumentation:
  9/29, Friday, 1pm, 1415 BPS
Summary: Gene duplication and polyploidy
Gene duplication has occurred frequently in eukaryotes, but most duplicates are lost.
In plants, whole genome duplication is common, but gene loss has also occurred frequently.
After 4 generations, only a very small number of SFPs are identified in synthetic polyploids.
After 20,000 generations, most coding genes do not have clustered sequence polymorphisms that are indicative of deletion.
Clustered polymorphisms are mostly located in pseudogenes and transposons.
Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.
Focus II: Differential Retention of Duplicates
[Diagram: gene duplications: mechanisms, preferential retention, consequences]
Duplicate genes in the genome
Arabidopsis gene families*
*: Clusters from Markov clustering, using all-against-all BLAST E values as the distance measure
Large gene families in plants
One of the largest gene families
Normalized gain: % expanded OGs
Large family sizes do not necessarily indicate higher expansion rates
Ancestral family sizes and gene gains
Large ancestral families tend to have more lineage-specific gains, but with many exceptions
Differential expansion of functional categories
GO: Gene Ontology
  Protein ubiquitination
  Polysaccharide biosynthesis
  Cell wall modification
  Transcriptional regulation
  Biotic stress response
  Secondary metabolism
Differences in Duplicability
Category    Arabidopsis    Human
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity
Duplicability
  The propensity for the retention of a duplicate gene
  Computational analysis of genome-wide trend
Kinase superfamily sizes among eukaryotes
Organism                       Number of genes   Kinase superfamily   Percent of total genes
Arabidopsis thaliana           25,814            1,041                4.0
Oryza sativa subsp. indica     ~35,000           1,607                3.6
Chlamydomonas reinhardtii      ~12,200           414                  3.4
Plasmodium falciparum          5,334             94                   1.8
Plasmodium yoelii              7,681             70                   0.9
Caenorhabditis elegans         19,484            417                  2.1
Drosophila melanogaster        13,808            262                  1.9
Anopheles gambiae              15,088            216                  1.4
Ciona intestinalis             15,852            316                  2.0
Fugu rubripes                  33,609            632                  1.9
Mus musculus                   22,444            495                  2.2
Homo sapiens                   22,980            472                  2.1
Saccharomyces cerevisiae       6,449             113                  1.8
Candida albicans               6,164             95                   1.5
Neurospora crassa              10,082            104                  1.9
Schizosaccharomyces pombe      4,945             109                  2.2
Shiu & Bleecker, 2003
Kinase families in rice and Arabidopsis
Gene count differences among families indicate differential expansion
Shiu et al., 2004
Estimation of ancestral RLK family size
[Figure, panels A and B: 440 speciation points; rice, Arabidopsis]
[Figure, panels A and B: WAK; LRR VIII, X, XII]
Kinase phylogeny of Arabidopsis and rice RLKs
Shiu et al., 2004
Development vs. resistance/defense RLKs
Shiu et al., 2004
Contradiction
Plant genes involved in development tend to have high duplicability
Developmental RLKs: low duplicability
Resistance/defense RLKs: high duplicability
Animal tyrosine kinases: low duplicability
Transcription factors: high duplicability
Selection for expansion
Depends on the level of variation in the signals
Summary: differential retention
Longevity and duplicability of plant genes
Duplicability   Longevity   Examples
High            High        Transcription factors
High            Low         Resistance genes
Low             High        Enzymes in central metabolic pathways
Low             Low         ??
Focus III: Functional Consequences
[Diagram: gene duplications: mechanisms, preferential retention, consequences]
Functional Consequences of Duplication
Functional divergence and conservation
  Is it because of changes in cis-regulatory elements or coding sequences?
How are duplicates retained: subfunctionalization or neofunctionalization?
Divergence in gene expression
Develop pipelines for cis-element prediction and
[Diagram elements: expression data; clusters of genes with similar expression profiles; over-represented sequence motifs in 5’ regions; machine learning; motif functional prediction; cis-regulatory logic; experimental validations]
Divergence in post-translational modification
Conservation of phosphorylation sites across species
  SACE: budding yeast
  CAGL: Candida glabrata
  CAAL: Candida albicans
  CATR: Candida tropicalis
  NECR: Neurospora crassa
  DEHA: Debaryomyces hansenii
Detailed Functional Studies of Duplicate Genes
Functional analyses of DDF1 and DDF2 transcription factors
  Derived from recent whole genome duplication in Arabidopsis
  Related to the well-known CBF factors involved in cold and drought stress
[Diagram: DDF studies (promoter GFP, knockouts, over-expression studies, interacting proteins, binding targets), shown for Arabidopsis thaliana and Arabidopsis lyrata]
Focus IV: Protein space
[Diagram: gene duplications: mechanisms, preferential retention, consequences]
Tiling array analysis of transcriptome
Human Chr 21, 22
Kapranov et al., 2002
Posterior probability p(F|coding)
Performance of the CI measure
Known Arabidopsis exon and intron 90-300 bp
Arabidopsis small proteins that are not annotated
  Correctly predict 19 out of 20 (95%).
Yeast sORF with translation evidence
  Correctly predict 98 out of 114 (86%)
In “intergenic” sequences of Arabidopsis genome
  3,274 sORF identified
Coupling with tiling array expression
Hybridization intensities for feature types
Summary: Novel coding genes
Many unannotated regions in the genomes are expressed.
Using the CI measure, many proteins from yeast and Arabidopsis that were not annotated but have evidence of expression are identified correctly.
Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.
Using tiling array data, we found that many of these novel coding regions are expressed.
Acknowledgement
Lab members
  Kousuke Hanada
  Melissa Lehti-Shiu
  Cheng Zou
  Emily Eckenrode
University of Chicago
  Justin Borevitz
  Xu Zhang
University of Wisconsin
  Sara Patterson
  Rick Vierstra
University of Missouri
  Scott Peck
Michigan State University
  Many…
  Rong Jin, Comp Sci & Eng
  Yue-Hua Cui, Stat & Prob
  Startup fund
Recent completion …
Genome remodeling in polyploids
Genome duplication occurs frequently in plants
  What is the fate of duplicates?
  How fast do gene losses occur?
  Is there any preference in genes retained?
[Diagram: genes A, B, C, D, E and their duplicate copies A1-E1 and A2-E2 at times t1 and t2; number of genes Ng = 5, 10, 8, 5]
Comparing degrees of expansion
Combined set: Arabidopsis, ~25,000 proteins; rice prediction, ~66,000 genes
Gene/domain families: shared or unique
Pairwise distance
Putative orthologous groups
Example category GO:0001: ui = 1 (unexpanded), ei = 4 (expanded)
All orthologous groups: total unexpanded = Σ ui, total expanded = Σ ei
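The tallies on this slide can be sketched as follows. This is a minimal sketch, assuming an orthologous group counts as "expanded" when it holds more than one gene in the lineage of interest; the records, the cutoff, and all names are hypothetical. The toy data are chosen so that GO:0001 reproduces ui = 1 and ei = 4 from the slide.

```python
from collections import defaultdict

# Hypothetical records: (orthologous group id, GO term, gene count in the lineage).
records = [
    ("OG1", "GO:0001", 1),
    ("OG2", "GO:0001", 3),
    ("OG3", "GO:0001", 2),
    ("OG4", "GO:0001", 5),
    ("OG5", "GO:0001", 2),
    ("OG6", "GO:0002", 1),
    ("OG7", "GO:0002", 1),
    ("OG8", "GO:0002", 4),
]

def tally(records):
    """Count unexpanded (u) and expanded (e) orthologous groups per GO term."""
    u = defaultdict(int)
    e = defaultdict(int)
    for _og, go, n_genes in records:
        if n_genes > 1:   # assumption: more than one gene means the group expanded
            e[go] += 1
        else:
            u[go] += 1
    return u, e

u, e = tally(records)
total_unexpanded = sum(u.values())   # sum of u_i over all categories
total_expanded = sum(e.values())     # sum of e_i over all categories
pct_expanded = {go: 100.0 * e[go] / (u[go] + e[go]) for go in set(u) | set(e)}

print(dict(u), dict(e), total_unexpanded, total_expanded)
print(pct_expanded)
```

Comparing each category's counts against the totals is what would let one ask whether a category is expanded more often than the genome-wide background; the slides do not state which statistical test was used, so none is shown here.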
Major questions on gene duplication
When: timing of gene duplications, e.g. N = 10
Domain gains in rice and Arabidopsis
Gain in one lineage does not necessarily predict gain in the other
Identify novel small coding genes
Determine base composition probabilities
Coding sequences give the CDS parameters; non-coding sequences give the NCDS parameters
\[ P_c(AAA) = \frac{\#\ \text{of}\ AAA}{\#\ \text{of all}\ NNN} \]
\[ P_c(T \mid AAA) = \frac{P_c(AAAT)}{P_c(AAA)} \]
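A small Python sketch of these two estimates follows. It assumes plain k-mer counting over a set of training sequences; the toy sequences and function names are hypothetical. The conditional probability is computed directly from counts of the context followed by each base, which matches the ratio Pc(AAAT)/Pc(AAA) when both frequencies are taken over the same windows. The slides use separate tables per codon position (c1 to c6 and n); this sketch pools all positions for brevity.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def triplet_probability(seqs, triplet):
    """P(triplet) = (# of triplet) / (# of all NNN)."""
    counts = kmer_counts(seqs, 3)
    return counts[triplet] / sum(counts.values())

def conditional_probability(seqs, context, base):
    """P(base | context): count of context+base over count of context followed by any base."""
    counts = kmer_counts(seqs, len(context) + 1)
    followed = sum(counts[context + b] for b in "ACGT")
    return counts[context + base] / followed

toy_cds = ["AAATAAAT", "AAAGAAAT"]   # hypothetical toy "coding" sequences
print(triplet_probability(toy_cds, "AAA"))
print(conditional_probability(toy_cds, "AAA", "T"))
```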
Calculate posterior probability
Feature tables: c1, c2, c3, c4, c5, c6 (coding) and n (non-coding)
\[ P(CDS \mid S) = \frac{P(S \mid CDS)\,P(CDS)}{P(S \mid CDS)\,P(CDS) + P(S \mid NCDS)\,P(NCDS)} \]
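Setting up the Bayes’
Priors
S = ATG TTC TAC TTT GG…
\[ P(CDS) = P(NCDS) = \frac{1}{2} \]
\[ P(CDS_1) = P(CDS_2) = \dots = P(CDS_6) = \frac{1}{6} \cdot \frac{1}{2} \]
\[ P(S \mid CDS)\,P(CDS) = \sum_{m=1}^{6} P(S \mid CDS_m)\,P(CDS_m) \]
\[ P(S \mid CDS_1) = P_{c1}(ATG)\,P_{c1}(T \mid ATG)\,P_{c2}(T \mid TGT)\,P_{c3}(C \mid GTT)\,P_{c1}(T \mid TTC) \dots \]
\[ P(S \mid CDS_2) = P_{c2}(ATG)\,P_{c2}(T \mid ATG)\,P_{c3}(T \mid TGT)\,P_{c1}(C \mid GTT)\,P_{c2}(T \mid TTC) \dots \]
…
\[ P(S \mid CDS_6) = P_{c6}(ATG)\,P_{c6}(T \mid ATG)\,P_{c4}(T \mid TGT)\,P_{c5}(C \mid GTT)\,P_{c6}(T \mid TTC) \dots \]
\[ P(S \mid NCDS) = P_{n}(ATG)\,P_{n}(T \mid ATG)\,P_{n}(T \mid TGT)\,P_{n}(C \mid GTT)\,P_{n}(T \mid TTC) \dots \]
\[ P(CDS \mid S) = \frac{P(S \mid CDS)\,P(CDS)}{P(S \mid CDS)\,P(CDS) + P(S \mid NCDS)\,P(NCDS)} \]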
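The Bayes setup above can be sketched in Python as below. This is a simplified sketch under several assumptions that are not from the slides: the models are trained on two tiny toy sequences, a pseudocount of 1 smooths unseen transitions, the first triplet is treated as uniform in every model (so the initial P(ATG) term cancels), and the three reverse-strand frames are scored by applying the forward coding tables to the reverse complement rather than by separate c4 to c6 tables. Priors follow the slide: 1/2 for non-coding and 1/12 for each of the six coding frames.

```python
import math
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def train(seqs, periodic):
    """Count base-after-triplet transitions; coding tables are split by codon phase."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for s in seqs:
        for i in range(3, len(s)):
            phase = i % 3 if periodic else 0
            counts[phase][s[i - 3:i]][s[i]] += 1.0
    return counts

def log_likelihood(seq, counts, offset, periodic, pseudo=1.0):
    """log P(S | model), skipping the first triplet (assumed uniform in every model)."""
    total = 0.0
    for i in range(3, len(seq)):
        phase = (i + offset) % 3 if periodic else 0
        table = counts[phase][seq[i - 3:i]]
        numerator = table.get(seq[i], 0.0) + pseudo
        denominator = sum(table.values()) + 4 * pseudo
        total += math.log(numerator / denominator)
    return total

def posterior_cds(seq, cds_counts, ncds_counts):
    """P(CDS | S) with priors 1/12 per coding frame and 1/2 for non-coding."""
    logs = []
    for strand in (seq, revcomp(seq)):          # frames 1-3, then 4-6
        for offset in range(3):
            logs.append(math.log(1 / 12) + log_likelihood(strand, cds_counts, offset, True))
    logs.append(math.log(1 / 2) + log_likelihood(seq, ncds_counts, 0, False))
    top = max(logs)                              # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logs]
    return sum(weights[:6]) / sum(weights)

# Hypothetical toy training data, for illustration only.
cds_model = train(["ATG" + "GCT" * 8 + "TAA"], periodic=True)
ncds_model = train(["AT" * 15], periodic=False)

print(posterior_cds("GCT" * 6, cds_model, ncds_model))   # resembles the coding toy data
print(posterior_cds("AT" * 9, cds_model, ncds_model))    # resembles the non-coding toy data
```

Working in log space and subtracting the maximum before exponentiating avoids underflow when the sequence is long, since each likelihood is a product of many small probabilities.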
Coding Likelihood (CL)
Sliding windows of a sequence (windows 1, 2, 3, 4, …, n)
Simulation based on NCDS (introns)
\[ CL = \frac{1}{n} \sum_{i=1}^{n} P(CDS \mid S_i) \]
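A sliding-window score of this kind might be computed as in the sketch below. It assumes CL is the mean of the per-window posterior P(CDS|S) over n windows, which is one reading of the slide's formula. The window size, step, and the `toy_posterior` stand-in (a GC fraction, not the Bayes posterior) are hypothetical and only there to make the example runnable.

```python
def sliding_windows(seq, size, step):
    """All windows of length `size`, starting every `step` bases."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]

def coding_likelihood(seq, posterior, size=12, step=3):
    """Mean of posterior(window) over the sliding windows (assumed definition of CL)."""
    windows = sliding_windows(seq, size, step)
    if not windows:
        raise ValueError("sequence is shorter than the window size")
    return sum(posterior(w) for w in windows) / len(windows)

def toy_posterior(window):
    """Stand-in score: fraction of G/C bases. NOT the Bayes posterior from the slides."""
    return sum(base in "GC" for base in window) / len(window)

print(coding_likelihood("GCGCGCGCGCGCGCG", toy_posterior))
print(coding_likelihood("ATATATATATATATA", toy_posterior))
```

In practice the `posterior` argument would be a function returning P(CDS|S) for a window, and the resulting CL values could be compared against a distribution simulated from intron sequences, as the slide suggests.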
Divergence in post-translational modification
Conservation of phosphorylation sites across species