Todd D. Taylor, Ph.D.
Genome Annotation and Comparative Analysis Team
Computational and Experimental Systems Biology Group
RIKEN Genomic Sciences Center
Bioinformatics and Comparative Genome Analysis Course
Institut Pasteur Tunis - Tunisia
April 2, 2007
Human
• Chromosome 21 (Nature, May 2000): 17 of 33.5 Mb
• Chromosome 18p (Nature, September 2005): 16 Mb
• Chromosome 11q (Nature, March 2006): 81 Mb
• ~4-5% contribution to the Human Genome Project
Chimpanzee
• Chromosome 22q (Nature, May 2004): 33.5 Mb (syntenic to human chr21)
• Chromosome Y (Nature Genetics, January 2006)
Development of novel methods for gene and promoter prediction
• Identifying genes missed by other high-throughput methods
• Identification of unique regulatory mechanisms
Looking for similarities
• Compare with distant species, like mouse
• Regions that are conserved may be important
Looking for differences
• Compare with close species, like primates
• Regions that are different may be important
Of course, there are exceptions to every rule!
[Figure: phylogenetic tree of the amniotes. Taxa: Homo, Pan, Gorilla, Pongo, gibbons, Old World monkeys, New World monkeys, prosimians, Lagomorpha, rodents, Metatheria, Prototheria, Sauropsida (Reptilia + Aves). Clade labels: Hominidae, Hominoidea, Catarrhini, Anthropoidea, Primates, Eutheria (Placentalia), Mammalia, Amniota (amniotes). Divergence times marked: 5 MYa, ~250 MYa, ~350 MYa.]
• Heterodonty
• Mammary glands
• Homoeothermic
• Hair
• Placentation (in most), amnion, internal fertilization
• Sweat and sebaceous glands
• Anucleate red blood cells
34% maps to identical sequence in human genome
Hiram Clawson and Kate Rosenbloom (UCSC). 09 June 2006
95% maps to identical sequence in human genome
Hiram Clawson and Kate Rosenbloom (UCSC). 09 June 2006
Nobrega, et al. Science 302, 413 (2003)
• Size
• Intelligence
• Language
• Ageing
• Disease susceptibility
  – Cancer
  – Schizophrenia
  – Autism
  – Triplet expansion diseases
  – AIDS
  – Hepatitis
Newton, April 2002 issue
Science 295, 131-134 (2002)
• 1.23% substitution
• Number of simple repetitive sequences
• Insertion of Alu and L1 elements
• Unique sequences
• Local duplications
• Translocations
• Inversions
• Fewer CpG islands predicted in chimp
Compare with small ‘representative’ human chromosome (21)
Clone-based sequencing strategy
• Map chimp BAC-end sequences to human chr. 21
• Screen libraries for additional clones to fill gap regions
• 3 gaps, over 99% coverage
[Figure: two plots of Human Chr21 q-arm against Chimp Chr22 q-arm, with identity shown on a scale from 85% to 100%. Scale bars: 5 Mb (first plot) and 1 Mb (second plot).]
[Figure: base change, insertion size (bp) and insertion frequency plotted along HSA21q and PTR22q. x-axis: Position (Mb), 0 to 32. y-axes: base changes or insertion size per bp (0.00 to 0.25) and insertion frequency per bp (0.0000 to 0.0050).]
Chimpanzee Sequencing & Analysis Consortium. Nature (2005) 437:69-87
Overall          1.44%
SINE/Alu         1.81%
LINE/L1          1.38%
CpG islands      2.26%
Simple repeats   4.06%
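Per-class percentages like these can be obtained by counting mismatched columns in an alignment, restricted to annotated regions. The following is a minimal sketch of that counting, not the pipeline used for the slides. The function name `substitution_rate`, the toy sequences and the region labels are all hypothetical.

```python
def substitution_rate(seq_a, seq_b, start=0, end=None):
    """Percent of aligned, gap-free columns in [start, end) where the bases differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    end = len(seq_a) if end is None else end
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a[start:end], seq_b[start:end]):
        if a == "-" or b == "-":
            continue  # indel column: not counted as a substitution
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * mismatches / compared


# Hypothetical toy alignment and region annotation; not data from the slides.
human = "ACGTACGTACGTACGTACGT"
chimp = "ACTTACGTAC-TACGAACGT"
regions = {"overall": (0, 20), "toy_repeat": (0, 10), "toy_unique": (10, 20)}

for label, (region_start, region_end) in regions.items():
    rate = substitution_rate(human, chimp, region_start, region_end)
    print(label, round(rate, 2))
```

Gap columns are skipped so that insertions and deletions, which the slides report separately, do not inflate the substitution figure.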
                      Base change   Insertion frequency
Base change           1.000         -
Insertion frequency   0.907         1.000
Insertion size        0.051         0.013
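The slide does not say how these coefficients were computed. As an illustration only, the sketch below assumes a Pearson correlation over per-window values along the chromosome. The window values are invented toy numbers, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-window measurements (one value per window); not slide data.
windows = {
    "base_change": [0.012, 0.015, 0.011, 0.018, 0.014],
    "insertion_frequency": [0.0011, 0.0014, 0.0010, 0.0017, 0.0013],
    "insertion_size": [3.0, 1.0, 4.0, 2.0, 5.0],
}

names = list(windows)
for i, row in enumerate(names):
    # Print only the lower triangle, as in the table above.
    cells = [f"{pearson(windows[row], windows[col]):.3f}" for col in names[: i + 1]]
    print(row, *cells)
```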
                     HSA21q        PTR22q
Size (bp)            33,102,702    32,799,845
Base content G+C%    40.94%        41.01%
CG dinucleotide      361,259       358,450
CpG islands          950           885

Gap rows (# of sequence gaps, estimated total size of clone gaps, # of clone gaps): the values are run together in the source as "1473,108" and "3" for HSA21q, and "2274,311" and "2" for PTR22q.

Repeats           HSA21q                             PTR22q
                  bp           #        ID#          bp           #        ID#
SINEs             3,647,427    15,574   15,131       3,614,185    15,481   9,551
Young Alus *1     21,798       75       75           3,122        12       12
LINEs             5,848,427    13,758   8,731        5,737,082    13,671   6,223
Young L1s *2      92,171       59       52           78,653       64       53
LTRs              3,612,930    9,975    7,269        3,551,044    9,838    5,324
DNA elements      949,215      4,169    3,363        943,348      4,187    2,887
RNAs              8,625        98       97           8,672        99       98
Satellite         17,246       23       20           14,773       20       17
Others            30,452       41       38           34,852       49       42
Total             14,114,322   43,638   34,649       13,903,956   43,345   24,142
                  42.6%                              42.4%
*1 AluYa5, AluYa8, AluYb8 and AluYb9
*2 L1HS and L1PA2
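The repeat percentages (42.6% and 42.4%) follow from dividing the total repeat bp by the chromosome-arm sizes quoted in the table. The short check below uses only numbers from the slide. Treating the "Young Alus" and "Young L1s" rows as subsets of SINEs and LINEs is an assumption, though the class rows do sum to the stated totals without them.

```python
# Sizes and repeat bp per class, copied from the table above.
size_bp = {"HSA21q": 33_102_702, "PTR22q": 32_799_845}

repeat_bp = {
    "HSA21q": {
        "SINEs": 3_647_427, "LINEs": 5_848_427, "LTRs": 3_612_930,
        "DNA elements": 949_215, "RNAs": 8_625, "Satellite": 17_246,
        "Others": 30_452,
    },
    "PTR22q": {
        "SINEs": 3_614_185, "LINEs": 5_737_082, "LTRs": 3_551_044,
        "DNA elements": 943_348, "RNAs": 8_672, "Satellite": 14_773,
        "Others": 34_852,
    },
}


def repeat_fraction(chrom):
    """Percent of the chromosome arm covered by the listed repeat classes."""
    return 100.0 * sum(repeat_bp[chrom].values()) / size_bp[chrom]


for chrom in size_bp:
    total = sum(repeat_bp[chrom].values())
    print(chrom, total, f"{repeat_fraction(chrom):.1f}%")
```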
Family      Subfamily     HS21   PTR22
LINE/L1     L1HS          11     2
LTR/ERV1    HERVIP10FH    14     5
            MER41A-int    10     2
            MER4A1-int    5      0
            MER83B-int    11     0
            MER87         32     12
SINE/Alu    AluYa5        23     3
            AluYb8        37     2
            AluYb9        7      1
DNA/MER2    Tigger3       42     67
LTR/ERV1    LTR49-int     11     23
LTR/MaLR    MLT1E-int     0      5
Human-specific characteristics have been acquired during the 5 million years since the divergence between Pan and Homo.
Phylogeny of Hominidae
[Figure: tree drawn along a time axis for Pongo (Orangutan), Gorilla, Pan (Chimpanzee) and Homo (Human), with 5-6 MYa marked. Branch labels: Human(?), Chimpanzee, Gorilla, Orangutan.]
Homo       ACGTGTTTGAAATATTACTGATTGTAA
Pan        ACGAGTTTGAAATATTATTGATTGTAA
Gorilla    ACGTGTTTGAATCATTATTGATTGTAA
Orangutan  ACGTGTTTAAATTATTATTGGTTGCAA
LCA        ACGTGTTTGAAATATTATTGATTGTAA
[Figure: the same tree of Gorilla, Pan (Chimpanzee), Homo and Pongo (Orangutan) along a time axis, with the LCA and the outgroup marked.]
(LCA: The Last Common Ancestor)
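The alignment shows how outgroups assign a Homo/Pan difference to one lineage: where the two differ, the base shared with Gorilla and Orangutan is taken as ancestral. The sketch below applies that simple parsimony rule to the five sequences from the slide. The function name and the 0-based positions are illustrative choices, and sites where the outgroups disagree are left unresolved as "N".

```python
def lineage_specific_changes(homo, pan, outgroups):
    """Infer the Homo/Pan ancestor and assign each difference to one lineage."""
    ancestor = []
    changes = []  # (0-based position, lineage, ancestral base, derived base)
    for i, (h, p) in enumerate(zip(homo, pan)):
        if h == p:
            ancestor.append(h)
            continue
        outgroup_bases = {seq[i] for seq in outgroups}
        if outgroup_bases == {h}:
            ancestor.append(h)          # Homo kept the ancestral base
            changes.append((i, "Pan", h, p))
        elif outgroup_bases == {p}:
            ancestor.append(p)          # Pan kept the ancestral base
            changes.append((i, "Homo", p, h))
        else:
            ancestor.append("N")        # outgroups disagree: unresolved
    return "".join(ancestor), changes


homo = "ACGTGTTTGAAATATTACTGATTGTAA"
pan = "ACGAGTTTGAAATATTATTGATTGTAA"
gorilla = "ACGTGTTTGAATCATTATTGATTGTAA"
orangutan = "ACGTGTTTAAATTATTATTGGTTGCAA"

lca, changes = lineage_specific_changes(homo, pan, [gorilla, orangutan])
print(lca)
for position, lineage, ancestral, derived in changes:
    print(f"site {position}: {ancestral}->{derived} on the {lineage} lineage")
```

On these sequences the rule reproduces the LCA row of the slide and reports one change on each lineage.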
[Figure: HSA21q profile of base changes or insertion size per bp (0.00 to 0.25) and insertion frequency per bp (0.0000 to 0.0050) against position (0 to 32), shown with the labels Human, Chimpanzee, Gorilla and Orangutan.]
IN/DEL examination based on 10,292,002 finished sequences (RIKEN)

                                  total   PCR primers designable   good amplification for both*
insertion to the human sequence   267     158                      139
insertion to the chimp sequence   222     147                      128
total                             489     305                      267

* positive amplification found for both chimp and human template DNA
[Figure: PCR gel images for four examples, with lanes grouped by species (Pt, Hs, Gg, Pp).
Example 1: Deletion in Human Lineage
Example 2: Insertion in Human Lineage
Example 3: Deletion in Chimp Lineage
Example 4: Allelic Deletion in Chimp Lineage
Size labels shown on the gels: 106, 117, 129, 154, 980, 1200, 1300, 1900, 2400, 2900, 4200.]
284 genes
• 223 known
• 19 novel CDS
• 25 novel transcripts
• 12 putative
• 5 predicted
85 pseudogenes
We lacked information for 6 genes located in sequencing gaps
6 hsa21 genes are absent from the ptr22 sequence (H2BFS, 5 KAP genes from the 21q22.1 cluster)
4 hsa21 genes appear to be pseudogenes in chimp
3 ptr22 pseudogenes are absent from the hsa21 sequence
1 hsa21 pseudogene has a complete ORF in ptr22
83% of genes have at least one amino acid replacement
10% of the potential ptr22 proteins are predicted to have a different length
• Amino acid insertion or deletion
• Different start codon
• Different stop codon
• Other, more complex rearrangement
Shorter in chimp: ADAMTS5
Longer in chimp: C21orf30
• 17 bp deletion in chimpanzee
• Human and chimpanzee splice sites are different
• Splice-site diversity
C21orf71
C21orf9
TCP10L
C21orf96
FLJ32835
[Figure: sequence identity of the human chr21 genes, ordered according to their chromosomal position.]
Human-specific replacements
1. KIAA0184 2. COL6A2 3. HUNK 4. AGPAT3 5. DSCR3 6. PWP2H 7. STCH 8. SLC5A3 9. CHAF1B 10. SIM2 11. KCNE2 12. APP 13. C21orf98 14. C21orf61 15. IFNAR1 16. UBASH3A 17. TMPRSS3 18. DSCR1 19. C21orf7 20. ADARB1 21. TSGA2 22. IFNAR2 23. C21orf63 24. KCNE1 25. C21orf2 26. C21orf55 27. ATP5A 28. CLDN8 29. C21orf56 30. DNMTA1
Chimp-specific replacements
1. BACE2 2. TIAM1 3. BACH1 4. FAM3B 5. C21orf33 6. ADAMTS1 7. C21orf103 8. ITGB2 9. HLCS 10. DNMT3L 11. IFNGR2 12. PPIA3L 13. C21orf59 14. MRPL39 15. CLDN17 16. KRTAP11-1 17. CCT8 18. DSCR2 19. TFF2 20. BTG3 21. HSF2BP 22. C21orf115
[Figure: two panels. First panel: y-axis 0.0 to 2.5, x-axis 0 to 34 (axis names not recoverable). Second panel: histogram of gene frequency (0% to 35%) against Ka / Ks (0 to >1.1).]
Chimpanzee Sequencing & Analysis Consortium. Nature (2005) 437:69-87
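Ka/Ks contrasts amino-acid-changing (nonsynonymous) with silent (synonymous) change in coding sequence. The sketch below covers only the first step, classifying differing codons between two aligned coding sequences using the standard genetic code. It is not a full Ka/Ks estimator: it does not count synonymous and nonsynonymous sites or correct for multiple hits. The function name and the toy sequences are hypothetical, and the input is assumed to be in-frame, ungapped and limited to A, C, G and T.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_differences(cds_a, cds_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    synonymous = 0
    nonsynonymous = 0
    for pos in range(0, len(cds_a), 3):
        codon_a = cds_a[pos:pos + 3].upper()
        codon_b = cds_b[pos:pos + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # different codon, same amino acid
        else:
            nonsynonymous += 1   # amino acid replacement
    return synonymous, nonsynonymous


# Hypothetical toy coding sequences; not genes from the slides.
human_cds = "ATGGCTAAA"   # Met-Ala-Lys
chimp_cds = "ATGGCCAGA"   # Met-Ala-Arg
print(classify_codon_differences(human_cds, chimp_cds))
```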
Correlate phenotype with genotype
Using Affymetrix arrays, it was shown that the amount of transcript per gene varies in a species-specific manner (Enard et al. 2001).
-> What DNA sequence differences are responsible for the observed differences in transcript levels?
[Figure: gene structure showing the enhancer, promoter, transcription start site (TSS), 5' UTR and 3' UTR.]
• Transcriptional control
• RNA stability
237 genes annotated for chromosome 21
189 annotated genes represented on the Affymetrix A-E arrays (Hellmann, Pääbo)
[Figure: expression differences in brain and liver for IFNAR2, IFNGR2, ETS2, ITSN, C21orf97, DSCR1, LSS, TTC3 and CXADR, on a scale from higher in chimp to higher in human. Legend: annotated genes, detected genes, upregulated (in human), downregulated (in human).]
• Identifying cis-regulatory elements in the human genome is a major challenge of the post-genomic era
  – Promoters and enhancers that regulate gene expression in normal and diseased cells and tissues
• Inter-species sequence comparisons have emerged as a major technique for identifying human regulatory elements
  – Particularly those to the sequenced mouse, chicken and fish genomes
• A significant fraction of empirically defined human regulatory modules is
  – Too weakly conserved in other mammalian genomes, such as the mouse, to distinguish them from nonfunctional DNA
  – Completely undetectable in nonmammalian genomes
• Identification of such significantly divergent functional sequences will require complementary methods in order to complete the functional annotation of the human genome
  – Deep intra-primate sequence comparison is a novel alternative to the commonly used distant species comparisons
Non-coding sequences with primate-specific conservation include three regulatory elements
Nature (2003) 424:788-793
• Transcript A-B combines at least one exon (complete or partial overlap) from both Gene A & Gene B
  – Usually only supported by a few mRNA/EST sequences, and rarely by a CCDS
• Currently, about 32 known cases found by searching NCBI Entrez (including 8 from chr 11 recently submitted by our group)
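The definition above (a transcript sharing at least one exon, by complete or partial overlap, with each of two genes) can be expressed as an interval-overlap test. The sketch below is a minimal illustration under simplifying assumptions: all features lie on the same chromosome and strand, and coordinates are half-open (start, end) pairs. The gene names, coordinates and function names are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def child_genes(transcript_exons, gene_exons):
    """Names of genes with at least one exon overlapped by a transcript exon."""
    hits = set()
    for gene, exons in gene_exons.items():
        if any(overlaps(t, e) for t in transcript_exons for e in exons):
            hits.add(gene)
    return hits


def is_conjoined(transcript_exons, gene_exons):
    """A transcript is a conjoined candidate if it hits exons of two or more genes."""
    return len(child_genes(transcript_exons, gene_exons)) >= 2


# Hypothetical annotation: two neighbouring genes on one strand.
gene_exons = {
    "GeneA": [(100, 200), (300, 400)],
    "GeneB": [(1000, 1100), (1300, 1400)],
}

fused = [(100, 200), (350, 400), (1050, 1100)]   # partial overlaps count
ordinary = [(100, 200), (300, 400)]

print(sorted(child_genes(fused, gene_exons)), is_conjoined(fused, gene_exons))
print(sorted(child_genes(ordinary, gene_exons)), is_conjoined(ordinary, gene_exons))
```

A real screen would also need to check strand, splice structure and the strength of the supporting mRNA/EST evidence, which this sketch leaves out.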
[Figure: exon/intron diagram of Child gene A, Child gene B and Conjoined Gene A – B.]
Fused transcript formed by combining the exons of two or more distinct genes (child genes)
Chr1 SRP9 – EPHX1 fusion (1 EST evidence-DA417873)
Alternate splicing and novel exons observed in fused mRNA
Number of mRNAs examined: 456 (326 conjoined genes)

At least one exon* from both child genes conserved in   Number
Chimpanzee mRNAs                                        125 (69 conjoined genes)
Mouse mRNAs                                             30 (15 conjoined genes)
Both Chimpanzee and Mouse mRNAs                         25 (11 conjoined genes)

27% conjoined genes conserved in Chimpanzee
6.5% conjoined genes conserved in Mouse
* Exons considered were part of conjoined gene mRNAs
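The quoted percentages can be checked against the counts on the slide. Dividing by the 456 mRNAs examined gives about 27.4% (chimpanzee) and 6.6% (mouse), close to the 27% and 6.5% shown. Dividing the conjoined-gene counts by 326 gives about 21.2% and 4.6%. So the slide's figures seem to follow the mRNA counts, although that reading is an inference from the arithmetic rather than something the slide states.

```python
# Counts copied from the slide: (mRNAs, conjoined genes).
examined = {"mRNAs": 456, "conjoined genes": 326}
conserved = {
    "Chimpanzee": {"mRNAs": 125, "conjoined genes": 69},
    "Mouse": {"mRNAs": 30, "conjoined genes": 15},
    "Both": {"mRNAs": 25, "conjoined genes": 11},
}


def percent(part, whole):
    """Part as a percentage of whole."""
    return 100.0 * part / whole


for species, counts in conserved.items():
    for unit, count in counts.items():
        print(f"{species:<10} {unit:<16} {count:>4} / {examined[unit]} "
              f"= {percent(count, examined[unit]):.1f}%")
```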
• RIKEN
  – Yoshiyuki Sakaki
  – Tulika P. Srivastava
  – Vineet K. Sharma
  – Asao Fujiyama
  – Masahira Hattori
  – Atsushi Toyoda
  – Yoko Kuroki
  – Yasushi Totoki
  – Hideki Noguchi
  – Hidemi Watanabe
  – Takehiko Itoh (MRI)
• Chimpanzee Chr 22 Sequencing Consortium
  – Chinese National Human Genome Center at Shanghai, China
  – KRIBB Genome Research Center, Daejeon, Korea
  – National Yang Ming University Genome Research Center, Taipei, Taiwan
  – National Institute of Genetics, Mishima, Japan
  – RIKEN Genomic Sciences Center, Yokohama, Japan
  – GBF, Dept. of Genome Analysis, Braunschweig, Germany
  – Institute for Molecular Biotechnology, Jena, Germany
  – Max-Planck Institute for Molecular Genetics, Berlin, Germany