Comparative & Evolutionary Genomics/Transcriptomics
Bioinformatics
Embryonic Stem Cells
Neural Progenitor Cells
Neural Cells
Human and Great Ape
Rice
Email: [email protected]
Genomics Research Center, Academia Sinica
www.sinica.edu.tw/~trees
(Neurodegenerative disease)
DNA sequence: A, C, G, T (4 letters)
RNA sequence: A, C, G, U (uracil) (4 letters)
DNA nucleotide: phosphoric acid, deoxyribose, and a nitrogenous base
Nitrogenous bases
Purines: adenine (A), guanine (G)
Pyrimidines: cytosine (C), thymine (T)
5' ACCGTGTGGCAGTGCACAGGTATTTGGCCATAGACA 3'
3' TGGCACACCGTCACGTGTCCATAAACCGGTATCTGT 5'
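The two strands above pair base by base (A with T, C with G). A minimal Python sketch of that pairing follows; the function names `complement` and `reverse_complement` are illustrative and not from the slides.

```python
# Watson-Crick pairing: A<->T, C<->G.
PAIR = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the paired bases, read in the same left-to-right order."""
    return strand.translate(PAIR)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


top = "ACCGTGTGGCAGTGCACAGGTATTTGGCCATAGACA"  # 5' -> 3', as on the slide
bottom = complement(top)                       # 3' -> 5', as drawn under it
print(bottom)
print(reverse_complement(top))
```

Reading the lower strand 5' to 3' gives the reverse complement, which is how the partner strand is usually written in sequence files.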
The human genome comprises about 3 x 10^9 nucleotides.
2007/07/12 Genomics Research Center 4
[Figure: the 512 x 512 Lena image shown as a grid of pixel values (10 11 20 16 17 21 22 15 11 12 18 22 ...), alongside a sequence written in the letters A, C, G, T]
Central dogma
DNA level (genome) -> transcription -> RNA level (transcriptome) -> translation -> protein level (proteome) -> function
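The two steps of the central dogma can be sketched in a few lines of Python. This is a simplified illustration that assumes the standard genetic code and a coding-strand input that starts in frame; the example sequence `ATGGCCTGA` is made up for the demonstration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna: str) -> str:
    """DNA coding strand -> mRNA: same letters, with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> protein, reading codons in frame until the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGA")  # hypothetical three-codon gene
print(mrna, translate(mrna))
```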
Gene Structure
(A) Single-exon gene (intronless gene)
Intergenic sequence | Gene: 5' UTR, ORF (translation start to stop), 3' UTR | Intergenic sequence
Transcription starts at the 5' end of the 5' UTR.
(B) Multiple-exon gene
Intergenic sequence | Gene: 5' Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 3' | Intergenic sequence
Average length of human gene components
Gene (27 kb): promoter, transcription start, Exon 1, Intron 1, Exon 2, Intron 2, ..., Intron 8, Exon 9
Average 8.8 exons; exon 145 bp; intron 3365 bp
mRNA: 5' UTR (300 bp), ORF (1340 bp), 3' UTR (770 bp)
Lander, E.S., et al., 2001. Initial sequencing and analysis of the human genome. Nature 409, 860-921.
Selection
Natural selection: purifying (negative) selection, neutral selection, positive selection, balancing selection
Artificial selection: domestication
Research directions
Complexity of evolutionary genomics: analysis of conflicting arguments in evolution
Complexity of comparative genomics: genomic variations between human and non-human primates
Complexity of transcriptome: post-transcriptional events
[Figure: alternatively spliced variants (ASVs). One genomic sequence gives rise to Transcripts 1, 2, and 3; the cassette exons are marked.]
Part I: Conflicting arguments in evolution
Alternatively and constitutively spliced exons are subject to different evolutionary forces (ASEs vs. CSEs; Molecular Biology and Evolution, 23(3), 2006; BMC Bioinformatics, 7:259, 2006; Molecular Biology and Evolution, 24(7):1443-6, 2007; BMC Evolutionary Biology, 7(1):179, 2007.)
• Do ASEs evolve faster or slower than CSEs?
• ASEs evolve faster than CSEs at the protein level, but the reverse is true at the RNA level.
• Ks values in ASEs: neutral
• Ks values in CSEs: accelerated
• These trends hold for mammals with different molecular clocks.
Complexity of evolutionary genomics
Gene family size conservation is a good indicator of evolutionary rates (singletons vs. duplicate genes; Bioinformatics 25(11), 1419-1421, 2009; Molecular Biology and Evolution, 27(8), 2010)
• Do singletons evolve faster or slower than duplicate genes?
Three gene groups:
H=C=M>1: duplicate genes with family size conservation
H=C=M=1: singletons
dup-HCM: duplicate genes without family size conservation

H=C=M>1 vs. H=C=M=1 vs. dup-HCM
Percentage of essential genes: H=C=M>1 > H=C=M=1 > dup-HCM
Evolutionary rates (Ka and Ka/Ks): H=C=M>1 < H=C=M=1 < dup-HCM
Expression level: H=C=M>1 > H=C=M=1 > dup-HCM
Expression breadth: H=C=M>1 > H=C=M=1 > dup-HCM
Gene compactness: H=C=M>1 < H=C=M=1 < dup-HCM
Position-dependent correlations between DNA methylation and the evolutionary rates of mammalian coding exons (PNAS 109(39), 15841-15846, 2012)
• Positive correlation: DNA methylation vs. the level of sequence divergence
• Positive correlation: DNA methylation vs. gene expression level
• Negative correlation: gene expression level vs. protein evolutionary rates
Part II: Genomic variations between human and non-human primates
Complexity of comparative genomics
Human-specific insertions and deletions inferred from mammalian genome sequences (Genome Research, 17(1), 2007; Nucleic Acids Research, 35:W633-8, 2007; Encyclopedia of Life Sciences, 2008; Genome Biology and Evolution, Vol. 2009:415, 2009)
Human-specific insertions and deletions inferred from mammalian genome sequences (Genome Research 18(7), 2008)
Analysis of protein-coding coincident SNPs (co-SNPs) between primates (manuscript in preparation)
[Figure: phylogenetic tree with approximate divergence times from the human lineage, in million years ago (MYA): chicken 310, rodents (mouse, rat) 75, macaque 25, orangutan 14, gorilla 8, chimpanzee and bonobo 6. Nested clades: Mammal, Primate, Hominidae, Great apes.]
Chimpanzee, bonobo (pygmy chimpanzee), gorilla: Africa
Orangutan: Borneo and Sumatra
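Humans have 23 pairs of chromosomes: 1, 2, …, 22, X, Y
Chimpanzees have 24 pairs of chromosomes: 1, 2, …, 23, X, Y
Human chromosome 2 corresponds to chimpanzee chromosomes 12 and 13.
The human and chimpanzee genomes are about the same size.
Human and chimpanzee DNA is more than 98% similar.
Almost all human genes are highly similar to their chimpanzee counterparts.
Human vs. mouse: 88%
Human vs. chicken: 60%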
Why do we analyze human-chimpanzee genetic differences?
Despite the small genetic differences, humans and chimpanzees exhibit markedly different phenotypes, especially in human-specific diseases.
1. Previous studies have reported that chimpanzees are resistant to the late complications of hepatitis B/C and to progression from HIV infection to AIDS.
2. Furthermore, the neurofibrillary tangles characteristic of human Alzheimer's disease have not been found in chimpanzees.
An insertion event in the human CMAH gene inactivated it. The event may have affected brain expansion during human evolution. (PNAS, 99(18), 2002)
Insertions and deletions (indels) between human and chimpanzee sequences

Human       agttcgataattcggcta
Chimpanzee  agttcg----ttcggcta

Human       agttcg----ttcggcta
Chimpanzee  agttcggatattcggcta
Difference rate = (difference length / aligned length) x 100%
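A small Python sketch of the difference-rate formula, applied to the human-chimpanzee example alignment. It assumes that "difference length" means the number of aligned columns where the two sequences differ, counting each gap column as one difference; the slides do not spell out the counting rule, so treat that as an assumption.

```python
def difference_rate(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned columns where the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    aligned_length = len(seq_a)
    # Assumption: every differing column, gap or mismatch, counts once.
    difference_length = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * difference_length / aligned_length


human = "agttcgataattcggcta"
chimp = "agttcg----ttcggcta"
print(f"{difference_rate(human, chimp):.2f}%")  # 4 differing columns out of 18
```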
Human-specific insertions and deletions

Human-specific insertion
Human       agttcgataattcggcta
Chimpanzee  agttcg----ttcggcta
Mouse       agttcg----ttcggata
Rat         agttcg----ttcggata
Dog         agtgag----tgctgcta

Human-specific deletion
Human       agttcg----ttcggcta
Chimpanzee  agttcggatattcggcta
Mouse       agttcggatattcggata
Rat         agttcggatattcggata
Dog         agtgaggatatgctgcta
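The idea behind the multi-species alignments can be sketched as a simple parsimony rule: an indel is called human-specific only when the chimpanzee and all outgroups agree with each other and disagree with human. The Python below is an illustrative simplification, not the published pipeline; the function name `lineage_specific_indels` is hypothetical.

```python
from itertools import groupby

GAP = "-"


def lineage_specific_indels(alignment, target="Human"):
    """Return (start, end, label) blocks where only the target differs in gap state."""
    target_seq = alignment[target]
    others = [seq for name, seq in alignment.items() if name != target]
    if not others or any(len(seq) != len(target_seq) for seq in others):
        raise ValueError("need at least one other species, all aligned to equal length")
    labels = []
    for i, base in enumerate(target_seq):
        other_gaps = [seq[i] == GAP for seq in others]
        if base != GAP and all(other_gaps):
            labels.append("insertion")   # bases only in the target
        elif base == GAP and not any(other_gaps):
            labels.append("deletion")    # gap only in the target
        else:
            labels.append(None)          # shared or ambiguous column
    events = []
    for label, run in groupby(enumerate(labels), key=lambda pair: pair[1]):
        columns = [index for index, _ in run]
        if label is not None:
            events.append((columns[0], columns[-1] + 1, label))
    return events


insertion_case = {
    "Human": "agttcgataattcggcta", "Chimpanzee": "agttcg----ttcggcta",
    "Mouse": "agttcg----ttcggata", "Rat": "agttcg----ttcggata",
    "Dog": "agtgag----tgctgcta",
}
deletion_case = {
    "Human": "agttcg----ttcggcta", "Chimpanzee": "agttcggatattcggcta",
    "Mouse": "agttcggatattcggata", "Rat": "agttcggatattcggata",
    "Dog": "agtgaggatatgctgcta",
}
print(lineage_specific_indels(insertion_case))
print(lineage_specific_indels(deletion_case))
```

Columns are reported as 0-based, end-exclusive ranges. Requiring every outgroup to agree is a strict choice; a real analysis would also have to handle alignment errors and missing data.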
Using Gene Ontology (GO) analysis:
HS indels located in CDSs are enriched in genes related to
1. virus infection, including genes associated with latent virus infection, regulation of viral life cycle, and viral infectious cycle (Biological Process)
2. viral interaction complex, viral replication complex, and viral transcriptional complex (Cellular Component)
Human-specific (HS) insertions and deletions inferred from mammalian genome sequences
Feng-Chi Chen, Chueng-Jong Chen, Wen-Hsiung Li*, and Trees-Juen Chuang*, Genome Research 17(1):16-22, 2007. (SCI: 13.608)
A well-known hypothesis: sequences conserved between two species -> functionally important -> potential coding regions
Human
Mouse
Comparative approaches: Sgp2, TWINSCAN, DoubleScan, SLAM, SGP-1/-2, Ka/Ks ratio test, PSEP, ...
The chimpanzee is our closest relative in nature.
Alignable sequence divergence between human and chimpanzee: 1.23%
Human
Chimp
Pseudogenes
Pseudogenes are genomic DNA sequences that are similar to functional genes but have lost the ability to encode functional protein products.
They are regarded as defunct relatives of functional genes.
Two types of pseudogenes:
1. Duplicated pseudogenes
Original gene (Exon 1, Exon 2, Exon 3) -> segmental duplication -> two copies (each with Exon 1, Exon 2, Exon 3)
One copy: transcription -> mRNA (Exon 1, Exon 2, Exon 3) -> translation -> protein product (functional)
Other copy (x): function loss (e.g., a disrupted reading frame or premature stop codons)
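The two kinds of function loss named above can be screened for with a short Python check. This is a rough heuristic under stated assumptions (the input is a coding sequence already in frame, and the last codon is the normal stop); the example sequences are made up, and real pseudogene annotation relies on alignment to the parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds: str):
    """Return the 0-based codon index of the first in-frame stop before the last codon, or None."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            return index
    return None


def looks_disabled(cds: str) -> bool:
    """True if the length breaks the reading frame or a premature stop is present."""
    return len(cds) % 3 != 0 or first_premature_stop(cds) is not None


intact = "ATGGCCAAATGA"      # hypothetical: ATG GCC AAA TGA
nonsense = "ATGGCCTAATGA"    # hypothetical: third codon mutated to TAA
frameshift = "ATGGCAAATGA"   # hypothetical: one base deleted, length 11

for name, seq in [("intact", intact), ("nonsense", nonsense), ("frameshift", frameshift)]:
    print(name, looks_disabled(seq))
```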
2. Processed pseudogenes (PPGs)
Gene (Exon 1, Exon 2, Exon 3) -> transcription -> mRNA (Exon 1, Exon 2, Exon 3) -> translation -> protein product (functional)
mRNA -> retrotransposition -> copy (Exon 1, Exon 2, Exon 3) inserted into the genome (x)
Lack of introns, promoter, and regulatory elements -> not transcribed
Key concept:
PPGs must have been expressed at the time of pseudogenization
Yao-Ting Huang, Feng-Chi Chen*, Chiuan-Jung Chen, Hsin-Liang Chen and Trees-Juen Chuang* (2008). Genome Research 18(7), 1163-1170. (SCI: 13.608)
[Figure: PPGs and their genes of origin in human and chimpanzee, with a CENTP exon and a well-annotated transcript marked. Case 1: exon deletion event (exon loss). Case 2: pseudoexon event (* = stop codon). The ancestral state before speciation shows ASV 1 and a possible ASV 2 of the gene of origin; pseudogenization of ASV 1 gives rise to the PPG.]
Part III: Post-transcriptional events
• Cis-splicing events
• Trans-splicing events
• RNA editing events
• ...
These events provide an essential source of diversity in the transcriptome and proteome.
Complexity of transcriptome
Cis-splicing
Gene/alternatively spliced variant finding in mammals: CRASA (Genome Research 13, 313-322, 2003); PSEP (Bioinformatics 20, 3064-3079, 2004); ESTviewer (Bioinformatics 21, 2510-2513, 2005); ENACE (BMC Bioinformatics, 7:136, 2006); and in plants: PGAA & RiceViewer (Plant Physiology 143(3), 1086-1095, 2007)
Trans-splicing
Trans-splicing is a small class of post-transcriptional events in higher eukaryotes that generates transcripts whose exon order is inconsistent with that of the corresponding DNA template (i.e., non-co-linear transcripts).
Difficulties: (1) in vitro artifacts and (2) genetic rearrangements
[Figure: mRNA 1 has exons 1, 2, 3, 4 in the same 5'-to-3' order at the DNA level and the RNA level (co-linear). mRNA 2 is a trans-spliced RNA in which exon 1 is joined to exons 2, 3, 4 in an order that differs between the DNA level and the RNA level (non-co-linear).]
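Non-co-linearity can be stated as a simple ordering test: a transcript is co-linear when its exons appear in the same relative order as on the DNA template. The Python sketch below illustrates only that definition with a hypothetical shuffled exon order; it does not address the two difficulties listed on the slide (in vitro artifacts and genetic rearrangements).

```python
def is_colinear(exons_in_transcript, exons_on_dna) -> bool:
    """True if the transcript uses exons in strictly increasing DNA order."""
    position = {exon: index for index, exon in enumerate(exons_on_dna)}
    indices = [position[exon] for exon in exons_in_transcript]
    return all(a < b for a, b in zip(indices, indices[1:]))


dna_order = [1, 2, 3, 4]
cis_spliced = [1, 2, 3, 4]       # same order as the DNA template
skipped_exon = [1, 2, 4]         # exon skipping is still co-linear
shuffled = [2, 3, 4, 1]          # hypothetical non-co-linear order

print(is_colinear(cis_spliced, dna_order))
print(is_colinear(skipped_exon, dna_order))
print(is_colinear(shuffled, dna_order))
```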
Trans-splicing
New findings:
(1) We have developed a pipeline (“TSscan”) that integrates different types of NGS data from different hESC lines to effectively minimize potential false positives.
(2) We found the first case of trans-spliced lincRNAs (long intergenic non-coding RNAs).
(3) We showed that trans-splicing may conspicuously affect the pluripotency maintenance of hESCs.
Integrative transcriptome sequencing identifies trans-splicing events with important roles in human embryonic stem cell pluripotency. (manuscript submitted)
A big question:
Is an observed non-co-linear chimeric RNA functional, junk, or just an artifact?
Most observed (even experimentally validated) chimeric RNAs (from trans-splicing or gene fusion) are still in vitro artifacts. (Manuscript in preparation)
When you can't trust the DNA: RNA editing alters transcript sequences
RNA editing (A-to-I RNA editing)
RNA editing, which is mediated by ADAR (adenosine deaminase acting on RNA) enzymes, converts adenosines (A) at particular RNA loci into inosines (I), which are in turn recognized as guanosines (G).
DNA level:      GTCAAGTGTA
RNA level:      GTCAIGTGTA
Recognized as:  GTCAGGTGTA
We developed a new pipeline to identify RNA editing with high accuracy across 20 species from nematodes to humans.
Evolutionary analysis showed rapid expansion of RNA editing loci along with the growth of repetitive elements in the genomes, particularly in great apes.
A consistent trend in vertebrates: the brain has the highest editing frequency. (manuscript in preparation)
Unpublished
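Because inosine is read as guanosine, an A-to-I editing site shows up as an A in the genome and a G in the sequenced transcript. The Python sketch below finds such positions in the slide's example; it is only an illustration of the signal, not the pipeline described above, which would also need to rule out SNPs, sequencing errors, and strand effects.

```python
def a_to_g_candidates(dna: str, cdna: str):
    """Return 0-based positions where the genome has A and the transcript read has G."""
    if len(dna) != len(cdna):
        raise ValueError("sequences must be aligned to the same length")
    return [
        index
        for index, (genome_base, read_base) in enumerate(zip(dna.upper(), cdna.upper()))
        if genome_base == "A" and read_base == "G"
    ]


dna_level = "GTCAAGTGTA"
sequenced_rna = "GTCAGGTGTA"  # inosine at the edited site is read as G
print(a_to_g_candidates(dna_level, sequenced_rna))
```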
Email: [email protected]
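How to join us?
Academia Sinica International Graduate Ph.D. Program: Bioinformatics Program
Academia Sinica International Graduate Ph.D. Program: Biodiversity Program
Academia Sinica - National Taiwan University joint Genome and Systems Biology Degree Program (M.S./Ph.D.)
Academia Sinica - National Yang-Ming University Genome Sciences Program (Ph.D.)
Academia Sinica - National Defense University Genome Sciences Program (Ph.D.)
Genomics Research Center, Academia Sinica
Full scholarships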