
Comparative & Evolutionary Genomics/Transcriptomics

Bioinformatics

Embryonic Stem Cells

Neural Progenitor Cells

Neural Cells

Human and Great Ape

Rice

Email: [email protected]

Genomics Research Center, Academia Sinica

www.sinica.edu.tw/~trees

(Neurodegenerative disease)


DNA sequence: A, C, G, T (4 letters)
RNA sequence: A, C, G, U (uracil) (4 letters)

DNA nucleotide: phosphate (phosphoric acid), deoxyribose, nitrogenous base

Nitrogenous bases

Purines: adenine (A), guanine (G)

Pyrimidines: cytosine (C), thymine (T)

5' ACCGTGTGGCAGTGCACAGGTATTTGGCCATAGACA 3'
3' TGGCACACCGTCACGTGTCCATAAACCGGTATCTGT 5'
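The base pairing between the two strands shown on this slide (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration; the function names `complement` and `reverse_complement` are hypothetical and not part of the slides.

```python
# Watson-Crick pairing for the four DNA letters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the complementary strand, written in the same left-to-right order."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complementary strand written 5' to 3'."""
    return complement(seq)[::-1]


top_strand = "ACCGTGTGGCAGTGCACAGGTATTTGGCCATAGACA"  # 5' -> 3', from the slide
bottom_strand = complement(top_strand)               # 3' -> 5', as drawn on the slide
print(bottom_strand)
```

The slide draws the bottom strand 3' to 5', which is why `complement` (without reversal) reproduces it.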

The human genome includes about 3 × 10^9 nucleotides.

2007/07/12 Genomics Research Center 4

[Figure: a 512 x 512 image (Lena) stored as an array of pixel values (10 11 20 16 ...), shown alongside a sequence stored as the letters A, C, G, T.]

Central dogma

DNA level (genome) → Transcription → RNA level (transcriptome) → Translation → Protein level (proteome) → Function
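The two steps of the central dogma can be illustrated with a small Python sketch. It assumes the standard genetic code and a coding-strand DNA input; the function names and the example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """DNA level -> RNA level: replace T with U on the coding strand."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """RNA level -> protein level: read codons until the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")  # hypothetical open reading frame
print(mrna, translate(mrna))
```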

Gene Structure

Genic regions (Gene A, Gene B) are separated by intergenic regions.

(A) Single-exon gene (intronless gene)

Intergenic sequence | Gene: 5' UTR, ORF (translation start to stop), 3' UTR | Intergenic sequence

Transcription starts at the beginning of the 5' UTR.

(B) Multiple-exon gene

Intergenic sequence | Gene: 5' Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 3' | Intergenic sequence

Gene Structure: the average human gene

Source: Lander, E.S., et al., 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.

Gene (about 27 kb): Promoter, transcription start, 5' UTR, Exon 1, Intron 1, Exon 2, Intron 2, ..., Intron 8, Exon 9, 3' UTR

Average number of exons: 8.8
Average internal exon length: 145 bp
Average intron length: 3365 bp

Average length of human gene components (mRNA):

5' UTR: 300 bp
ORF: 1340 bp
3' UTR: 770 bp
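As a rough worked example, the averages quoted on this slide from Lander et al. (5' UTR 300 bp, ORF 1340 bp, 3' UTR 770 bp, gene extent 27 kb) show how little of an average gene ends up in the mature mRNA. The variable names are illustrative, and dividing one average by another is only a back-of-the-envelope estimate.

```python
# Average lengths quoted on the slide (Lander et al., 2001).
FIVE_PRIME_UTR_BP = 300
ORF_BP = 1340
THREE_PRIME_UTR_BP = 770
GENE_EXTENT_BP = 27_000  # about 27 kb from transcription start to end

# The mature mRNA is the sum of its three components.
mrna_length = FIVE_PRIME_UTR_BP + ORF_BP + THREE_PRIME_UTR_BP

# Rough share of the genomic extent that is retained in the mRNA.
exonic_fraction = mrna_length / GENE_EXTENT_BP

print(f"Average mRNA length: {mrna_length} bp")
print(f"Share of the gene retained in the mRNA: {exonic_fraction:.1%}")
```

Under these averages, more than 90% of a typical gene's extent is intronic.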


Mutation

Selection


Selection

Natural selection
• purifying (negative) selection
• neutral selection
• positive selection
• balancing selection

Artificial selection
• domestication


Research directions

• Complexity of evolutionary genomics: analysis of conflicting arguments in evolution

• Complexity of comparative genomics: genomic variations between human & non-human primates

• Complexity of transcriptome: post-transcriptional events

Alternatively Spliced Variants (ASVs)

[Figure: a genomic sequence containing cassette exons is spliced into Transcript 1, Transcript 2, and Transcript 3, which differ in which cassette exons they include.]
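How cassette exons give rise to several transcripts from one genomic sequence can be sketched as follows. This is a toy enumeration under the assumption that every cassette exon is independently included or skipped; the function name and exon labels are hypothetical.

```python
from itertools import product


def spliced_variants(exons, cassette):
    """Enumerate transcripts obtained by including or skipping each cassette exon.

    exons    -- exon names in genomic order
    cassette -- names of exons that may be skipped
    """
    optional = [exon for exon in exons if exon in cassette]
    variants = []
    for choice in product([True, False], repeat=len(optional)):
        keep = dict(zip(optional, choice))
        # Constitutive exons are always kept; cassette exons follow `keep`.
        variants.append([exon for exon in exons if keep.get(exon, True)])
    return variants


for transcript in spliced_variants(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
    print("-".join(transcript))
```

Real genes usually express only a subset of these combinations, so the output is an upper bound rather than a prediction.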

Part I: Conflicting arguments in evolution

Alternatively and constitutively spliced exons are subject to different evolutionary forces (ASEs vs. CSEs; Molecular Biology and Evolution, 23(3), 2006; BMC Bioinformatics, 7:259, 2006; Molecular Biology and Evolution, 24(7):1443-6, 2007; BMC Evolutionary Biology, 7(1):179, 2007)

• Do ASEs evolve faster or slower than CSEs?

• ASEs evolve faster than CSEs at the protein level, but the reverse is true at the RNA level.

• Ks values in ASEs: neutral

• Ks values in CSEs: accelerated

• These trends hold for mammals with different molecular clocks.
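Ks and Ka summarize synonymous and nonsynonymous change between two coding sequences. The sketch below only counts differing codons of each kind in a gap-free, in-frame alignment; it is not how Ks or Ka are actually estimated (real estimators such as the Nei-Gojobori method normalize by the number of sites and correct for multiple substitutions). The function name and example sequences are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(cds_a: str, cds_b: str):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned, gap-free, and in frame")
    synonymous = nonsynonymous = 0
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid: change is silent at the protein level
        else:
            nonsynonymous += 1   # amino acid replaced
    return synonymous, nonsynonymous


print(count_codon_differences("ATGGCCAAA", "ATGGCTAGA"))
```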

Complexity of evolutionary genomics


Gene family size conservation is a good indicator of evolutionary rates (singletons vs. duplicate genes; Bioinformatics 25(11), 1419-1421, 2009; Molecular Biology and Evolution, 27(8), 2010)

• Do singletons evolve faster or slower than duplicate genes?


Three groups of genes are compared:

Duplicate genes with family size conservation: H=C=M>1
Singletons: H=C=M=1
Duplicate genes without family size conservation: dup-HCM

H=C=M>1 vs. H=C=M=1 vs. dup-HCM

Percentage of essential genes: H=C=M>1 > H=C=M=1 > dup-HCM

Evolutionary rates (Ka and Ka/Ks): H=C=M>1 < H=C=M=1 < dup-HCM

Expression level: H=C=M>1 > H=C=M=1 > dup-HCM

Expression breadth: H=C=M>1 > H=C=M=1 > dup-HCM

Gene compactness: H=C=M>1 < H=C=M=1 < dup-HCM
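The three group labels can be read as a simple rule on gene family sizes. The sketch below assumes that H, C, and M stand for the number of family members in each of the three compared species, which is an interpretation of the labels rather than something the slide spells out; the function name is hypothetical.

```python
def classify_family(h: int, c: int, m: int) -> str:
    """Assign a gene family to one of the slide's three groups from its sizes."""
    if h == c == m == 1:
        return "H=C=M=1"   # singleton in all three species
    if h == c == m and h > 1:
        return "H=C=M>1"   # duplicated, family size conserved
    if max(h, c, m) > 1:
        return "dup-HCM"   # duplicated, family size not conserved
    return "unclassified"  # e.g. a gene missing from one species


for sizes in [(1, 1, 1), (3, 3, 3), (2, 3, 2)]:
    print(sizes, classify_family(*sizes))
```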


Position-dependent correlations between DNA methylation and the evolutionary rates of mammalian coding exons (PNAS 109(39), 15841-15846, 2012)

• Positive correlation: DNA methylation vs. the level of sequence divergence

• Positive correlation: DNA methylation vs. gene expression level

• Negative correlation: gene expression level vs. protein evolutionary rates


Part II: Genomic variations between human and non-human primates

Complexity of comparative genomics


• Human-specific insertions and deletions inferred from mammalian genome sequences (Genome Research, 17(1), 2007; Nucleic Acids Research, 35:W633-8, 2007; Encyclopedia of Life Sciences, 2008; Genome Biology and Evolution, Vol. 2009:415, 2009)

• Human-specific insertions and deletions inferred from mammalian genome sequences (Genome Research 18(7), 2008)

• Analysis of protein-coding coincident SNPs (co-SNPs) between primates (manuscript in preparation)


[Figure: phylogenetic tree with approximate divergence times from the human lineage, in million years ago (MYA): chicken 310, rodents (mouse, rat) 75, macaque 25, orangutan 14, gorilla 8, chimpanzee and bonobo 6. Clade labels: Mammal, Primate, Hominidae, Great apes.]

Great apes and where they live:

Chimpanzee, bonobo (pygmy chimpanzee), gorilla: Africa

Orangutan: Borneo and Sumatra


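Humans have 23 pairs of chromosomes: 1, 2, ..., 22, X, Y
Chimpanzees have 24 pairs of chromosomes: 1, 2, ..., 23, X, Y

Human chromosome 2 corresponds to chimpanzee chromosomes 12 and 13.
The human and chimpanzee genomes are about the same size.

Human and chimpanzee DNA is more than 98% similar.

Almost all human genes are highly similar to their chimpanzee counterparts. Human vs. mouse: 88%; human vs. chicken: 60%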

Why do we analyze human-chimpanzee genetic differences?

Despite the small genetic differences, humans and chimpanzees exhibit fairly different phenotypes, especially with respect to human-specific diseases.

1. Previous studies have reported that chimpanzees are resistant to the late complications of hepatitis B/C and to progression from HIV infection to AIDS.

2. Furthermore, the neurofibrillary tangles characteristic of human Alzheimer's disease have not been found in chimpanzees.

An insertion event in the human CMAH gene inactivated it. The event may have affected brain expansion during human evolution. (PNAS, 99(18), 2002)

Insertions or deletions (indels) between human and chimpanzee sequences

Human       agttcgataattcggcta
Chimpanzee  agttcg----ttcggcta

Human       agttcg----ttcggcta
Chimpanzee  agttcggatattcggcta


Difference rate = (difference length / aligned length) × 100%
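The difference rate can be computed directly from a pairwise alignment. The sketch below assumes that "difference length" means the number of aligned columns in which the two sequences differ (gap columns included); the slide does not define it more precisely, and the function name is hypothetical.

```python
def difference_rate(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned columns in which two aligned sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal aligned length")
    # Assumption: a column counts as different if the characters differ,
    # including columns where one sequence has a gap ("-").
    difference_length = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return difference_length / len(seq_a) * 100


human = "agttcgataattcggcta"
chimpanzee = "agttcg----ttcggcta"
print(f"{difference_rate(human, chimpanzee):.2f}%")
```

For the human-chimpanzee example from the slides, 4 of the 18 aligned columns differ.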

Human-specific insertions or deletions

Human-specific insertion

Human       agttcgataattcggcta
Chimpanzee  agttcg----ttcggcta
Mouse       agttcg----ttcggata
Rat         agttcg----ttcggata
Dog         agtgag----tgctgcta

Human-specific deletion

Human       agttcg----ttcggcta
Chimpanzee  agttcggatattcggcta
Mouse       agttcggatattcggata
Rat         agttcggatattcggata
Dog         agtgaggatatgctgcta
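The logic behind calling an indel human-specific can be sketched with the two five-species alignments on this slide: if only human carries the extra bases, it is a human-specific insertion; if only human lacks them, it is a human-specific deletion. The function below is a simplified illustration of that parsimony argument, not the published method; its name and the column-range interface are assumptions.

```python
def classify_human_indel(alignment: dict, start: int, end: int) -> str:
    """Classify the indel spanning alignment columns [start, end)."""

    def is_gap(species: str) -> bool:
        return set(alignment[species][start:end]) == {"-"}

    other_species_gapped = [is_gap(s) for s in alignment if s != "Human"]
    if not is_gap("Human") and all(other_species_gapped):
        return "human-specific insertion"
    if is_gap("Human") and not any(other_species_gapped):
        return "human-specific deletion"
    return "not human-specific"


insertion_example = {
    "Human":      "agttcgataattcggcta",
    "Chimpanzee": "agttcg----ttcggcta",
    "Mouse":      "agttcg----ttcggata",
    "Rat":        "agttcg----ttcggata",
    "Dog":        "agtgag----tgctgcta",
}
deletion_example = {
    "Human":      "agttcg----ttcggcta",
    "Chimpanzee": "agttcggatattcggcta",
    "Mouse":      "agttcggatattcggata",
    "Rat":        "agttcggatattcggata",
    "Dog":        "agtgaggatatgctgcta",
}
print(classify_human_indel(insertion_example, 6, 10))
print(classify_human_indel(deletion_example, 6, 10))
```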


Human-specific (HS) insertions and deletions inferred from mammalian genome sequences

Feng-Chi Chen, Chueng-Jong Chen, Wen-Hsiung Li*, and Trees-Juen Chuang*, Genome Research 17(1):16-22, 2007. (SCI: 13.608)

Using Gene Ontology (GO) analysis, HS indels located in CDSs are enriched in genes related to:

1. virus infection, including genes associated with latent virus infection, regulation of viral life cycle, and viral infectious cycle (Biological Process)

2. viral interaction complex, viral replication complex, and viral transcriptional complex (Cellular Component)

A well-known hypothesis: sequences conserved between two species (e.g., human and mouse) → functionally important → potential coding regions

Comparative approaches: Sgp2, TWINSCAN, DoubleScan, SLAM, SGP-1/-2, Ka/Ks ratio test, PSEP, ...

The chimpanzee is our closest relative in nature.

Alignable sequence divergence between human and chimpanzee: 1.23%

Pseudogenes

Pseudogenes are genomic DNA sequences that are similar to functional genes but have lost the ability to encode functional protein products.

They are regarded as defunct relatives of functional genes.

Two types of pseudogenes:

1. Duplicated pseudogenes

Original gene: Exon 1, Exon 2, Exon 3

Segmental duplication produces two copies: Exon 1, Exon 2, Exon 3 | Exon 1, Exon 2, Exon 3

One copy: transcription → mRNA (Exon 1, Exon 2, Exon 3) → translation → protein product (functional)

Other copy: function loss (e.g., a disrupted reading frame or premature stop codons) → x (no functional product)
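The two examples of function loss named on this slide, a disrupted reading frame and premature stop codons, can be screened for with a short check. This is a simplified sketch that assumes the input is a coding sequence starting in frame; the function name is hypothetical, and real pseudogene annotation compares the sequence against its parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def disabling_features(cds: str) -> list:
    """List simple signs that a coding sequence can no longer encode its protein."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is skipped because a stop codon is expected there.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index + 1}")
            break
    return problems


print(disabling_features("ATGGCCTAA"))     # intact toy ORF
print(disabling_features("ATGTAAGCCTAA"))  # stop codon in the middle
```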

2. Processed pseudogenes (PPGs)

Gene (Exon 1, Exon 2, Exon 3) → transcription → mRNA (Exon 1, Exon 2, Exon 3) → translation → protein product (functional)

mRNA → retrotransposition → intronless copy (Exon 1, Exon 2, Exon 3) inserted into the genome

The inserted copy lacks introns, a promoter, and regulatory elements → not transcribed (x)

Key concept: PPGs must have been expressed at the time of pseudogenization, because a PPG is a retrotransposed copy of an mRNA.

Yao-Ting Huang, Feng-Chi Chen*, Chiuan-Jung Chen, Hsin-Liang Chen and Trees-Juen Chuang* (2008). Genome Research 18(7), 1163-1170. (SCI: 13.608)


[Figure: PPGs are compared with their genes of origin in human and chimpanzee to identify CENTP-annotated transcripts. Case 1: exon deletion event (exon loss). Case 2: pseudoexon event (* marks a stop codon). Diagram labels: ancestral state, ASV 1, ASV 2?, gene of origin, pseudogenization, speciation, CENTP exon, well-annotated transcript.]

Part III: Post-transcriptional events

• Cis-splicing events

• Trans-splicing events

• RNA editing events

• ...

Providing an essential source of diversity in transcriptome and proteome

Complexity of transcriptome


Cis-splicing

Gene/alternatively spliced variant finding in mammals: CRASA (Genome Research 13, 313-322, 2003); PSEP (Bioinformatics 20, 3064-3079, 2004); ESTviewer (Bioinformatics 21, 2510-2513, 2005); ENACE (BMC Bioinformatics, 7:136, 2006); and in plants: PGAA & RiceViewer (Plant Physiology 143(3), 1086-1095, 2007)


Trans-splicing

Trans-splicing is a small class of post-transcriptional events in higher eukaryotes that generates transcripts whose exon order is inconsistent with the corresponding DNA template (i.e., in a non-co-linear fashion).

Difficulties: (1) in vitro artifacts and (2) genetic rearrangements

[Figure: mRNA 1 has exons 1 2 3 4 at the DNA level and 1 2 3 4 at the RNA level (5' to 3'). mRNA 2, a trans-spliced RNA, has exons 1 2 3 4 at the DNA level but 2 3 4 1 at the RNA level.]
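The notion of a non-co-linear transcript can be made concrete with a small check on exon order. The sketch assumes that the exons of a transcript have already been mapped to their positions on one strand of the DNA template; the function name is hypothetical, and a failed check alone cannot tell trans-splicing apart from in vitro artifacts or genetic rearrangements.

```python
def is_colinear(exon_order: list) -> bool:
    """True if the exons appear in the transcript in the same order as on the DNA.

    exon_order -- genomic exon indices, listed 5' to 3' along the transcript
    """
    return all(a < b for a, b in zip(exon_order, exon_order[1:]))


print(is_colinear([1, 2, 3, 4]))  # exon order matches the DNA template
print(is_colinear([1, 3, 4]))     # exon skipping keeps the order
print(is_colinear([2, 3, 4, 1]))  # order inconsistent with the template
```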


Trans-splicing

New findings:

(1) We have developed a pipeline ("TSscan") that integrates different types of NGS data from different hESC lines to effectively minimize potential false positives.

(2) We found the first case of trans-spliced lincRNAs (long intergenic non-coding RNAs).

(3) We showed that trans-splicing may conspicuously affect the pluripotency maintenance of hESCs.

Integrative transcriptome sequencing identifies trans-splicing events with important roles in human embryonic stem cell pluripotency. (manuscript submitted)

A big question:

Is an observed non-co-linear chimeric RNA functional, junk, or just artificial?

Most observed (even experimentally validated) chimeric RNAs (trans-splicing or gene fusion) are still in vitro artifacts. (Manuscript in preparation)


When you can't trust the DNA: RNA editing alters transcript sequences

RNA editing, which is mediated by ADAR (adenosine deaminase acting on RNA) enzymes, converts adenosines (A) at particular RNA loci into inosines (I), which in turn are recognized as guanosines (G).

We developed a new pipeline to identify RNA editing with high accuracy across 20 species from nematodes to humans.

Evolutionary analysis showed rapid expansion of RNA editing loci along with the growth of repetitive elements in the genomes, particularly in great apes.

A consistent trend in vertebrates: the brain has the highest editing frequency. (manuscript in preparation)

RNA editing (A-to-I RNA editing)

DNA level: GTCAAGTGTA
RNA level: GTCAIGTGTA (the inosine is read as G: GTCAGGTGTA)
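Because inosine is read as G, an A-to-I editing site shows up as an A-to-G mismatch between the DNA and the sequenced RNA. The sketch below finds such mismatches for the example on this slide. It is not the pipeline mentioned in the talk; the function name is hypothetical, and real pipelines must also rule out SNPs, sequencing errors, and misalignments.

```python
def candidate_a_to_i_sites(dna: str, rna_read: str) -> list:
    """Return 0-based positions where the DNA has A and the RNA read shows G."""
    if len(dna) != len(rna_read):
        raise ValueError("sequences must have the same aligned length")
    return [
        position
        for position, (dna_base, rna_base) in enumerate(zip(dna.upper(), rna_read.upper()))
        if dna_base == "A" and rna_base == "G"
    ]


dna_level = "GTCAAGTGTA"
rna_as_sequenced = "GTCAGGTGTA"  # the inosine at the edited site is read as G
print(candidate_a_to_i_sites(dna_level, rna_as_sequenced))
```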


Unpublished


Email: [email protected]

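How to join us?

Academia Sinica International Graduate Ph.D. Program: Bioinformatics Program

Academia Sinica International Graduate Ph.D. Program: Biodiversity Program

Academia Sinica and National Taiwan University joint Genome and Systems Biology Degree Program (master's / Ph.D.)

Academia Sinica and National Yang-Ming University Genome Sciences Program (Ph.D.)

Academia Sinica and National Defense University Genome Sciences Program (Ph.D.)

Genomics Research Center, Academia Sinica

Full scholarships available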