finding genes by comparing genomes
roderic guigó i serra, imim/upf/crg, barcelona
number of genes on chromosome 22
• initial annotation 545 Dunham et al., 1999
• genscan+RT-PCR 590 Das et al., 2001
• genscan+microarrays 730 Shoemaker et al., 2001
• reviewed annotation 726 chr22 team, sanger, 2001
• mouse shotgun data +20 (our data)
• geneid predictions 794
• genscan predictions 1128
number of genes in the human genome
• Consortium 30,000-40,000 2001
• Celera 27,000-38,000 2001
• Consortium+Celera 50,000 Hogenesch et al., 2001
• DBsearches 65,000-75,000 Wright et al., 2001
• HumanGenomeSciences 90,000-120,000 Haseltine, 2001
sequence conservation and coding function
• rosetta (Batzoglou et al., 2000)
• cem (Bafna and Huson, 2000)
• sgp1 (Wiehe et al., 2001)
• twinscan (Korf et al., 2001)
• slam (Pachter et al., 2001)
• doublescan (Meyer and Durbin, 2002)
• sgp2 (Parra et al., 2003)
comparative gene prediction
1. THE GENE PREDICTION IS THE RESULT OF THE SEQUENCE ALIGNMENT
given two homologous genomic sequences, infer the exonic structure in each sequence maximizing the score of the alignment of the resulting amino acid sequences.
This problem is usually solved through a complex extension of the classical dynamic programming algorithm for sequence alignment.
• blayo et al., 2002
• pedersen and scharl, 2002
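The classical dynamic programming algorithm that these methods extend can be sketched as below. This is only a minimal global alignment (Needleman-Wunsch) scorer for two amino acid sequences; the function name and the match, mismatch and gap values are illustrative assumptions, and the extension that also chooses the exon boundaries in each genomic sequence is not shown.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of sequences a and b.

    The scoring values are placeholders, not those of any published program.
    """
    # prev[j] is the best score for aligning the prefix of a seen so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # first residues of the two predicted proteins shown later in the deck
    print(global_alignment_score("MSTNICSF", "MSTNNCTF"))
```

In the approach described on this slide, the quantity being maximized is of this kind, but the search also ranges over the possible exonic structures of both sequences.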
comparative gene prediction
2. GENE PREDICTION AND SEQUENCE ALIGNMENT ARE PRODUCED SIMULTANEOUSLY
given two homologous genomic sequences, pair hidden Markov models (pair HMMs) for sequence alignment and generalized HMMs (GHMMs) for gene prediction are combined into so-called generalized pair HMMs (GPHMMs).
• progen – novichkov et al., 2001
• slam – pachter et al., 2001
• doublescan – meyer and durbin, 2002
comparative gene prediction
3. GENE PREDICTION IS SEPARATED FROM SEQUENCE ALIGNMENT
first, the alignment is obtained between two homologous genomic sequences using some generic sequence alignment program, such as tblastx, sim4 or glass
then, gene structures are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
• rosetta – batzoglou et al., 2000
• cem – bafna and huson, 2000
• sgp-1 – wiehe et al., 2001
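The compatibility requirement can be illustrated with a small filter. This is a sketch under a simplifying assumption: "fall in the aligned regions" is read here as full containment of the exon in one aligned region, which is not necessarily the criterion used by the programs listed. The function name and the coordinate convention (inclusive start and end) are hypothetical.

```python
def exons_in_aligned_regions(exons, aligned_regions):
    """Keep candidate exons that lie entirely inside some aligned region.

    exons, aligned_regions: lists of inclusive (start, end) coordinates
    on the same genomic sequence.
    """
    kept = []
    for ex_start, ex_end in exons:
        if any(r_start <= ex_start and ex_end <= r_end
               for r_start, r_end in aligned_regions):
            kept.append((ex_start, ex_end))
    return kept


if __name__ == "__main__":
    candidate_exons = [(100, 200), (250, 400), (500, 560)]
    regions = [(90, 210), (480, 600)]
    print(exons_in_aligned_regions(candidate_exons, regions))
```

Gene structures would then be assembled only from the exons that survive the filter.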
comparative gene prediction
4. GENE PREDICTION IS (EVEN MORE) SEPARATED FROM SEQUENCE ALIGNMENT
This approach does not require the comparison of two homologous genomic sequences. Rather, a query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms.
• twinscan – korf et al., 2001
• sgp-2 – parra et al., 2003
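The idea of modifying ab initio exon scores with comparison results can be sketched as follows. The weighting scheme here (adding a fraction of each overlapping HSP score, prorated by overlap length) and the `weight` value are assumptions chosen for illustration; they are not the formulas used by twinscan or sgp-2.

```python
def rescore_exons(exons, hsps, weight=0.2):
    """Raise the score of exons supported by similarity hits.

    exons: list of (start, end, score) from an ab initio predictor.
    hsps:  list of (start, end, score) hits on the query sequence.
    Coordinates are inclusive. Returns a new list of (start, end, score).
    """
    rescored = []
    for ex_start, ex_end, ex_score in exons:
        bonus = 0.0
        for h_start, h_end, h_score in hsps:
            overlap = min(ex_end, h_end) - max(ex_start, h_start) + 1
            if overlap > 0:
                # share of the HSP score proportional to the overlapped part
                bonus += h_score * overlap / (h_end - h_start + 1)
        rescored.append((ex_start, ex_end, ex_score + weight * bonus))
    return rescored


if __name__ == "__main__":
    exons = [(100, 199, 5.0), (300, 399, 2.0)]
    hsps = [(150, 249, 50.0)]
    for exon in rescore_exons(exons, hsps):
        print(exon)
```

Exons without any supporting hit keep their original score, so the underlying predictor still decides in regions where the informant genome gives no evidence.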
syntenic gene prediction (sgp2)
[figure tracks: query sequence, tblastx HSPs, geneid exons, HSP projections, SGP exons]
programs based on mouse human genome sequence comparisons improve gene predictions
          sensitivity  specificity
genscan   0.79         0.46
twinscan  0.80         0.62
SGP       0.79         0.66
Accuracy on human chromosome 22
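For reference, accuracy measures of this kind can be computed at the nucleotide level as sketched below. The sketch assumes the convention common in the gene-finding literature, where sensitivity is the fraction of annotated coding nucleotides that are predicted and specificity is the fraction of predicted coding nucleotides that are annotated; the function name and the interval representation are hypothetical.

```python
def nucleotide_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the nucleotide level.

    predicted, annotated: lists of inclusive (start, end) coding intervals.
    """
    pred = {pos for start, end in predicted for pos in range(start, end + 1)}
    real = {pos for start, end in annotated for pos in range(start, end + 1)}
    true_positives = len(pred & real)
    sensitivity = true_positives / len(real) if real else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    sn, sp = nucleotide_accuracy(predicted=[(1, 200)], annotated=[(1, 100)])
    print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

With this definition, a program that overpredicts keeps a high sensitivity but loses specificity, which is the pattern the table shows for genscan.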
how accurate are the sgp predictions: nucleotide level
how accurate are the sgp predictions: exon level
gene prediction programs predict a large number of genes
predictions in the mouse genome

                                      TWINSCAN   SGP
total                                 48462      47055
novel                                 17562      21942
multiexonic, long, no low complexity  3171       4543
  human ts / human sgp                954        2983
  orphans                             2217       1560

                      TWINSCAN                             SGP
pool                  intron aligned  human ts  orphans    orphans  human sgp  intron aligned
predictions           317             637       2217       1560     1931       1052
away from an ensembl  231             482       1971       1417     1706       857
and a large number of novel genes ...
...with exons...
[same diagram of predictions in the mouse genome, now also showing 10987 (TWINSCAN) and 12158 (SGP) predictions between the novel step and the multiexonic, long, no low complexity step]
that look like fine proteins
almost every mouse gene has a human orthologous counterpart
          |1b
chr1_2213 MSTNICSFKDRCVSILCCKFCKQVLSSRGMKAVLLADTEIDLFSTDIPPTNAVDFTGRCY
          **** *:*******************************:************:*** ****
chr1_1808 MSTNNCTFKDRCVSILCCKFCKQVLSSRGMKAVLLADTDIDLFSTDIPPTNTVDFIGRCY
          |1b

          |2b |3a
chr1_2213 FTKICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNRHFWMFHSQAVYDINRLDSTGV
          ** *********************************** ***********.*****:***
chr1_1808 FTGICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNGHFWMFHSQAVYGINRLDATGV
          |2b |3a

chr1_2213 NVLLRGNLPEIEESTDEDVLNISAEECIR
          *:** ***** **.***:.*:***** **
chr1_1808 NLLLWGNLPETEECTDEETLEISAEEYIR
orthologous human mouse genes have conserved exonic structure.
• 85% of the orthologous pairs have an identical number of exons
• 91% of the orthologous exons have identical length
• 99.5% of the orthologous exons have identical phase
• there are a few cases of intron insertion/deletion (22)
• U12 introns appear to be strongly conserved between human and mouse
• non-canonical GC-AG introns are less conserved
data on 1506 human/mouse refseq orthologues
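A comparison of this kind can be sketched for a single orthologous pair as below. The representation of each gene as a list of (length, phase) tuples, one per coding exon, and the function name are assumptions made for the example; the slide does not say how the statistics above were actually computed.

```python
def compare_structures(human_exons, mouse_exons):
    """Compare the exonic structure of two orthologous genes.

    Each argument is a list of (length, phase) tuples in transcript order.
    Returns (same_exon_count, exons_with_same_length, exons_with_same_phase);
    the last two are counted only when the exon counts agree.
    """
    same_count = len(human_exons) == len(mouse_exons)
    same_length = same_phase = 0
    if same_count:
        for (h_len, h_phase), (m_len, m_phase) in zip(human_exons, mouse_exons):
            same_length += h_len == m_len
            same_phase += h_phase == m_phase
    return same_count, same_length, same_phase


if __name__ == "__main__":
    human = [(120, 0), (87, 0), (200, 2)]
    mouse = [(120, 0), (90, 0), (200, 2)]
    print(compare_structures(human, mouse))
```

Summing these counts over all pairs in a data set gives percentages of the form reported in the list above.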
we will target genes with conserved intron positions
           |2a
chr10_1592 LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                               *  .   ****:** ** ******
chr19_1200 ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
           |1a

chr10_1592 SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
           *************************** ***:***************************
chr19_1200 SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG

           |3b
chr10_1592 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
           ************************************************************
chr19_1200 GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
           |2b

           |4b
chr10_1592 VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
           .****    .   : :********************** ************ .** . .*
chr19_1200 ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
           |3c
sequence conservation and coding function
orthologous splice sites are more conserved than expected solely from their splicing function
prediction of splice sites
we will target genes with conserved intron positions
the final pools
[same diagram of predictions in the mouse genome; final pools: TWINSCAN intron aligned 317, human ts 637, orphans 2217; SGP orphans 1560, human sgp 1931, intron aligned 1052]
rtpcr: targeting conserved intron positions
rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers
pool            predictions  tested  positive  success rate
intron aligned  1428         214     133       62%
similar         2125         38      4         11%
orphan          3425         63      2         3%
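The success rates in the table follow directly from the tested and positive counts, as this short check shows. The dictionary layout and function name are only for illustration.

```python
# (tested, positive) counts per pool, copied from the table above
POOLS = {
    "intron aligned": (214, 133),
    "similar": (38, 4),
    "orphan": (63, 2),
}


def success_rate(tested, positive):
    """Percentage of tested predictions confirmed by RT-PCR."""
    return 100.0 * positive / tested


if __name__ == "__main__":
    for name, (tested, positive) in POOLS.items():
        print(f"{name}: {success_rate(tested, positive):.0f}%")
```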
about 1000 human genes not in ensembl
• low support by ESTs: 34% match EST sequences
• low representation in other vertebrate genomes: 33% have sequence matches in fish genomes
• restricted expression patterns
restricted expression patterns
Code  B H K Y V S M L T K E O  %Id  Homology
3B1   ● ●  38%  Dystrophin-like; with ZZ domain
3B3   ● ● ● ● ●  25%  Novel aquaporin; similar to Drosophila CG12251
3C3   ● ● ● ● ●  25%  TEP1 (telomerase associated); probable ATPase
3C5   ● ●  47%  Voltage-dependent calcium channel gamma subunit
4B3   ● ● ●  34%  Interferon-induced / fragilis transmembrane family
4C6   ● ● ● ● ●  30%  Interleukin 22-binding protein CRF2-10
4G4   ● ● ● ●  64%  Nna1p, nuclear ATP/GTP-binding protein
5B5   ● ● ●  43%  Likely aminophospholipid flippase (transporting ATPase)
1E3   ● ● ● ● ●  40%  N-acetylated-α-linked-acidic dipeptidase (NAALADase)
6C4   ● ●  42%  Not-type homeobox; poss. involved in notochord development
6F5   ● ● ●  66%  Drosophila brain-specific homeobox protein (bsh)
11F2  ● ● ● ● ●  29%  Human GABA-B receptor 2, neurotransmitter release regulator
5A2   ● ● ● ●  41%  Skate liver organic solute transporter beta
11B6  ● ● ●  55%  Interferon-activatable protein 203; nuclear protein
12B3  ● ● ● ● ● ● ● ●  25%  Fatty acid desaturase; maintains membrane integrity
11F6  ● ● ● ● ● ● ●  44%  Rat vanilloid receptor type 1 like protein 1
12E3  ● ●  52%  Fizzy/CDC20; modulates degradation of cell-cycle proteins
12F1  ● ● ● ● ●  43%  Otoferlin (mutated in DFNB9, nonsyndromic deafness)
12H1  ● ● ●  45%  Fruitfly additional sex combs; a Polycomb group protein
12C4  ● ● ●  43%  C. elegans C15C8.2; single-minded-like; HLH and PAS domains
12D2  ●  41%  Cytosolic phospholipase A2, group IVB
12A5  ●  38%  Fruitfly GH15686p; Ent2-like nucleoside transporter
12E5  ● ● ● ●  32%  Relaxin 3 preproprotein; prohormone of the insulin family
11A1  ● ● ● ● ●  89%  Mouse BET3, involved in ER to Golgi transport
11A2  ● ● ● ● ● ●  70%  Vacuolar ATP synthase subunit S1
11B2  ● ● ● ● ● ●  54%  Myosin light chain kinase, skeletal muscle
11G2  36%  Dapper / frodo (transduces Wnt signals by interacting with Dsh)
limitations: sensitivity of the procedure
                     twinscan                 ensembl                  sgp2
initial predictions  48464                    23026                    48451
multiexonic genes    36831                    17565                    38979
                     25320 (69%)  16368 (94%)  16952 (97%)  21184 (54%)
ortholog pairs       24743                                             30927
                     21099 (85%)  15355 (87%)  16757 (95%)  19831 (64%)
intron aligned       17271                                             18056
                     16337 (94%)  13709 (78%)  15112 (86%)  15977 (88%)
[figure: two panels, "TS success" and "SGP success", plotting Dn/Ds against success (0, 1)]
specificity of the prediction can be improved: Ka/Ks ratio
further work
• scale the procedure. Try to find rtpcr evidence for (almost) every human gene not yet confirmed
• intronless genes
• human specific gene families (if any)
• genes with non-canonical splicing
selenoproteins
Selenoproteins are proteins that incorporate selenocysteine (Sec), the 21st amino acid.
• Function: mostly redox enzymes
• Distribution: 3 domains of life
• Number: 22 families in mammals
selenoproteins
• UGA (STOP) is the codon for Sec
• There is a tRNA(Sec) whose anticodon (UCA) recognizes the UGA codon
• Recoding:
1. RNA structure: the SECIS element
2. SECIS binding proteins
selenoproteins
the SECIS element
computational search for selenoproteins
[figure: SECIS pattern, dSelG]
using geneid to search for selenoproteins
1. Predict SECIS (PatScan)
2. Gene prediction with
   1. TGA in-frame
   2. SECIS
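The in-frame TGA part of the search can be illustrated with a toy open reading frame scan. This is not how geneid works internally: it is a simplified sketch that looks only at the forward strand, ignores splicing, and does not check for a SECIS element downstream. The function name and coordinate convention are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orfs_with_inframe_tga(seq):
    """Find ORFs that read through exactly one in-frame TGA.

    Returns (start, end, tga_pos) tuples with 0-based coordinates; end is
    exclusive and includes the terminating stop codon.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        tga_pos = None
        for pos in range(start + 3, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon == "TGA" and tga_pos is None:
                tga_pos = pos  # tentatively recoded as selenocysteine
            elif codon in STOPS:
                if tga_pos is not None:
                    hits.append((start, pos + 3, tga_pos))
                break  # any further stop ends this reading frame
    return hits


if __name__ == "__main__":
    print(orfs_with_inframe_tga("ATGAAATGACCCTAA"))
```

In the procedure outlined above, a candidate of this kind would be retained only if a predicted SECIS element is also present.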
genome wide search in drosophila
SECIS predicted 35876
SECIS thermo assessment 1220
Genes predicted 12194
Predicted selenoproteins (4)
Real selenoproteins 3
dSelG, dSelM
dSelG and dSelM: experimental verification
dSelM has selenoprotein homologues in vertebrates
COMPARATIVE GENE PREDICTION
IMIM/UPF/CRG: Genís Parra, Josep F. Abril, Roderic Guigó
University of Geneva: Manolis Dermitzakis, Alexandre Reymond, Robert Lyle, Catherine Ucla, Stylianos Antonarakis
GlaxoSmithKline: Pankaj Agarwal
University of Oxford: Chris Ponting
Washington University: Evan Keibler, Michael Brent
SELENOPROTEINS
Universitat de Barcelona: Montserrat Corominas, Florenci Serras, Marta Morey, Sergi Bertran
University of Lincoln: Vadim Gladishev, Gregory Kruikov
Harvard University: Marla Berry, Nadia Morozova
IMIM/UPF/CRG: Sergi Castellano