finding genes by comparing genomes roderic guigó i serra imim/upf/crg, barcelona


Page 1: finding genes by comparing genomes

finding genes by comparing genomes

roderic guigó i serra
imim/upf/crg, barcelona

Page 2: finding genes by comparing genomes

number of genes on chromosome 22

• initial annotation 545 Dunham et al., 1999

• genscan+RT-PCR 590 Das et al., 2001

• genscan+microarrays 730 Shoemaker et al., 2001

• reviewed annotation 726 chr22 team, sanger, 2001

• mouse shotgun data +20 (our data)

• geneid predictions 794

• genscan predictions 1128

Page 3: finding genes by comparing genomes

number of genes in the human genome

• Consortium 30,000-40,000 2001

• Celera 27,000-38,000 2001

• Consortium+Celera 50,000 Hogenesch et al., 2001

• DBsearches 65,000-75,000 Wright et al., 2001

• HumanGenomeSciences 90,000-120,000 Haseltine, 2001

Page 4: finding genes by comparing genomes

sequence conservation and coding function

Page 5: finding genes by comparing genomes

sequence conservation and coding function

Page 6: finding genes by comparing genomes

comparative gene prediction

• rosetta (Batzoglou et al., 2000)

• cem (Bafna and Huson, 2000)

• sgp1 (Wiehe et al., 2001)

• twinscan (Korf et al., 2001)

• slam (Pachter et al., 2001)

• doublescan (Meyer and Durbin, 2002)

• sgp2 (Parra et al., 2003)

Page 7: finding genes by comparing genomes

comparative gene prediction

1. THE GENE PREDICTION IS THE RESULT OF THE SEQUENCE ALIGNMENT

given two homologous genomic sequences, infer the exonic structure in each sequence maximizing the score of the alignment of the resulting amino acid sequences.

This problem is usually solved through a complex extension of the classical dynamic programming algorithm for sequence alignment.

• blayo et al., 2002

• pedersen and scharl, 2002
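As a point of reference, the classical dynamic programming recurrence that these methods extend can be sketched as below. This is only the plain global alignment score (Needleman-Wunsch style), not the exon-aware extension used by the cited programs; the function name and the scoring parameters are illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b.

    Classical dynamic programming with a linear gap penalty.
    The scoring values are illustrative assumptions, not taken from the slides.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("MSTNICSF", "MSTNNCTF"))
```

The comparative methods on this slide replace the single alignment matrix with one that also tracks candidate exon boundaries in both sequences, which is what makes the extension complex.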

Page 8: finding genes by comparing genomes

comparative gene prediction

2. GENE PREDICTION AND SEQUENCE ALIGNMENT ARE PRODUCED SIMULTANEOUSLY

given two homologous genomic sequences, pair hidden Markov models (pair HMMs) for sequence alignment and generalized HMMs (GHMMs) for gene prediction are combined into so-called generalized pair HMMs.

• progen – novichkov et al., 2001

• slam – pachter et al., 2001

• doublescan – meyer and durbin, 2002

Page 9: finding genes by comparing genomes

comparative gene prediction

3. GENE PREDICTION IS SEPARATED FROM SEQUENCE ALIGNMENT

first, the alignment is obtained between two homologous genomic sequences using some generic sequence alignment program, such as tblastx, sim4 or glass

then, gene structures are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.

• rosetta – batzoglou et al., 2000

• cem – bafna and huson, 2000

• sgp-1 – wiehe et al., 2001
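The compatibility condition (predicted exons must fall in the aligned regions) can be illustrated with a simple interval check. This is a minimal sketch with hypothetical names and inclusive coordinates; the cited programs apply their own, more elaborate criteria.

```python
def exons_in_aligned_regions(exons, aligned_regions):
    """Keep only the exons fully contained in some aligned region.

    exons, aligned_regions: lists of (start, end) pairs, inclusive coordinates.
    Both the representation and the strict containment rule are assumptions
    made for illustration.
    """
    kept = []
    for ex_start, ex_end in exons:
        if any(a_start <= ex_start and ex_end <= a_end
               for a_start, a_end in aligned_regions):
            kept.append((ex_start, ex_end))
    return kept


if __name__ == "__main__":
    candidate_exons = [(100, 180), (300, 420), (900, 950)]
    alignment_blocks = [(90, 200), (310, 500)]
    print(exons_in_aligned_regions(candidate_exons, alignment_blocks))
```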

Page 10: finding genes by comparing genomes

comparative gene prediction

4. GENE PREDICTION IS (EVEN MORE) SEPARATED FROM SEQUENCE ALIGNMENT

This approach does not require the comparison of two homologous genomic sequences. Rather, a query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by underlying "ab initio" gene prediction algorithms.

• twinscan – korf et al., 2001

• sgp-2 – parra et al., 2003
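The idea of modifying ab initio exon scores with comparison results can be sketched as follows. This is not the actual twinscan or sgp-2 scoring scheme: the additive form, the `weight` parameter, and the tuple layout are all assumptions chosen to keep the example small.

```python
def rescore_exons(exons, hsps, weight=0.2):
    """Add a similarity bonus to each ab initio exon score.

    exons: list of (start, end, score) from an ab initio predictor.
    hsps:  list of (start, end, score) similarity hits on the query sequence.
    Each overlapping hit contributes its score scaled by the fraction of the
    hit that overlaps the exon. The weighting is a hypothetical choice.
    """
    rescored = []
    for start, end, score in exons:
        bonus = 0.0
        for h_start, h_end, h_score in hsps:
            overlap = min(end, h_end) - max(start, h_start) + 1
            if overlap > 0:
                bonus += h_score * overlap / (h_end - h_start + 1)
        rescored.append((start, end, score + weight * bonus))
    return rescored


if __name__ == "__main__":
    exons = [(100, 199, 5.0), (300, 399, 4.0)]
    hsps = [(150, 249, 50.0)]
    for exon in rescore_exons(exons, hsps):
        print(exon)
```

Exons supported by conservation rise in score and are more likely to be assembled into the final gene structure, while unsupported exons keep their ab initio score.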

Page 11: finding genes by comparing genomes

syntenic gene prediction (sgp2)

[Figure: tracks labeled Query Sequence, tblastx HSPs, geneid Exons, HSPs Projections, SGP Exons]

Page 12: finding genes by comparing genomes

programs based on mouse human genome sequence comparisons improve gene predictions

            sensitivity   specificity
genscan     0.79          0.46
twinscan    0.80          0.62
SGP         0.79          0.66

Accuracy on human chromosome 22
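For readers unfamiliar with these measures, the sketch below computes nucleotide-level sensitivity and specificity from exon coordinates. It assumes the convention common in gene-finding evaluations, where specificity is TP / (TP + FP), that is, the fraction of predicted coding nucleotides that are annotated as coding; the slides do not state which definition was used, and the function names are illustrative.

```python
def coding_positions(exons):
    """Expand a list of inclusive (start, end) exons into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(predicted_exons, annotated_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = TP / annotated coding nucleotides
    specificity = TP / predicted coding nucleotides (assumed convention)
    """
    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(annotated_exons)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    print(nucleotide_accuracy([(51, 150)], [(1, 100)]))
```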

Page 13: finding genes by comparing genomes

how accurate are the sgp predictions: nucleotide level

Page 14: finding genes by comparing genomes

how accurate are the sgp predictions: exon level

Page 15: finding genes by comparing genomes

gene prediction programs predict a large number of genes

predictions in the mouse genome

                                        TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity        3171    4543
  human ts / human sgp                       954    2983
    intron aligned                           317    1052
    human ts / human sgp                     637    1931
  orphans                                   2217    1560

away from an ensembl                    TWINSCAN     SGP
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417

Page 16: finding genes by comparing genomes

and a large number of novel genes ...

predictions in the mouse genome

                                        TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic, long, no low complexity        3171    4543
  human ts / human sgp                       954    2983
    intron aligned                           317    1052
    human ts / human sgp                     637    1931
  orphans                                   2217    1560

away from an ensembl                    TWINSCAN     SGP
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417

Page 17: finding genes by comparing genomes

...with exons...

predictions in the mouse genome

                                        TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic                                10987   12158
multiexonic, long, no low complexity        3171    4543
  human ts / human sgp                       954    2983
    intron aligned                           317    1052
    human ts / human sgp                     637    1931
  orphans                                   2217    1560

away from an ensembl                    TWINSCAN     SGP
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417

Page 18: finding genes by comparing genomes

that look like fine proteins

predictions in the mouse genome

                                        TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic                                10987   12158
multiexonic, long, no low complexity        3171    4543
  human ts / human sgp                       954    2983
    intron aligned                           317    1052
    human ts / human sgp                     637    1931
  orphans                                   2217    1560

away from an ensembl                    TWINSCAN     SGP
  intron aligned                             231     857
  human ts / human sgp                       482    1706
  orphans                                   1971    1417

Page 19: finding genes by comparing genomes

almost every mouse gene has a human orthologous counterpart

predictions in the mouse genome

                                        TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic                                10987   12158
multiexonic, long, no low complexity        3171    4543
  human ts / human sgp                       954    2983
    intron aligned                           317    1052
    human ts / human sgp                     637    1931
  orphans                                   2217    1560

Page 20: finding genes by comparing genomes

orthologous human mouse genes have conserved exonic structure

|1b
chr1_2213   MSTNICSFKDRCVSILCCKFCKQVLSSRGMKAVLLADTEIDLFSTDIPPTNAVDFTGRCY
            **** *:*******************************:************:*** ****
chr1_1808   MSTNNCTFKDRCVSILCCKFCKQVLSSRGMKAVLLADTDIDLFSTDIPPTNTVDFIGRCY
|1b

|2b |3a
chr1_2213   FTKICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNRHFWMFHSQAVYDINRLDSTGV
            ** *********************************** ***********.*****:***
chr1_1808   FTGICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNGHFWMFHSQAVYGINRLDATGV
|2b |3a

chr1_2213   NVLLRGNLPEIEESTDEDVLNISAEECIR
            *:** ***** **.***:.*:***** **
chr1_1808   NLLLWGNLPETEECTDEETLEISAEEYIR

Page 21: finding genes by comparing genomes

orthologous human mouse genes have conserved exonic structure.

• 85% of the orthologous pairs have identical number of exons

• 91% of the orthologous exons have identical length

• 99.5% of the orthologous exons have identical phase

• there are a few cases of intron insertion/deletion (22)

• U12 introns appear to be strongly conserved between human and mouse

• non-canonical GC-AG are less conserved.

data on 1506 human/mouse refseq orthologues
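The three statistics above (exon number, exon length, exon phase) could be tallied for a single orthologous pair along the lines of this sketch. It assumes each gene is given as an ordered list of (length, phase) tuples and that exons are paired by position, which is a simplification; the names are hypothetical and this is not the procedure used to produce the figures on the slide.

```python
def compare_exon_structure(exons_a, exons_b):
    """Compare the exonic structure of two orthologous genes.

    exons_a, exons_b: ordered lists of (length, phase) tuples.
    Exons are paired by position (an assumption); with unequal exon counts
    only the leading exons present in both genes are compared.
    """
    pairs = list(zip(exons_a, exons_b))
    return {
        "same_exon_count": len(exons_a) == len(exons_b),
        "compared": len(pairs),
        "identical_length": sum(1 for a, b in pairs if a[0] == b[0]),
        "identical_phase": sum(1 for a, b in pairs if a[1] == b[1]),
    }


if __name__ == "__main__":
    human = [(120, 0), (85, 1), (200, 2)]
    mouse = [(120, 0), (88, 1), (200, 2)]
    print(compare_exon_structure(human, mouse))
```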

Page 22: finding genes by comparing genomes

we will target genes with conserved intron positions

|2a
chr10_1592  LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                *  .   ****:** ** ******
chr19_1200  ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
|1a

chr10_1592  SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
            *************************** ***:***************************
chr19_1200  SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG

|3b
chr10_1592  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            ************************************************************
chr19_1200  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
|2b

|4b
chr10_1592  VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
            .****    .   : :********************** ************ .** . .*
chr19_1200  ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
|3c

Page 23: finding genes by comparing genomes

sequence conservation and coding function

Page 24: finding genes by comparing genomes

orthologous splice sites are more conserved than expected solely from their splicing function

Page 25: finding genes by comparing genomes

orthologous splice sites are more conserved than expected solely from their splicing function

Page 26: finding genes by comparing genomes

prediction of splice sites

Page 27: finding genes by comparing genomes

we will target genes with conserved intron positions

Page 28: finding genes by comparing genomes

the final pools

predictions in the mouse genome

                                        TWINSCAN     SGP
total                                      48462   47055
novel                                      17562   21942
multiexonic                                10987   12158
multiexonic, long, no low complexity        3171    4543
  human ts / human sgp                       954    2983
    intron aligned                           317    1052
    human ts / human sgp                     637    1931
  orphans                                   2217    1560

Page 29: finding genes by comparing genomes

rtpcr: targeting conserved intron positions

|2a
chr10_1592  LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                *  .   ****:** ** ******
chr19_1200  ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
|1a

chr10_1592  SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
            *************************** ***:***************************
chr19_1200  SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG

|3b
chr10_1592  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            ************************************************************
chr19_1200  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
|2b

|4b
chr10_1592  VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
            .****    .   : :********************** ************ .** . .*
chr19_1200  ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
|3c

Page 30: finding genes by comparing genomes

rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers

pool             predictions   tested   positive   success rate
intron aligned   1428          214      133        62%
similar          2125          38       4          11%
orphan           3425          63       2          3%
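The success rates in the table are the positive amplifications divided by the number of predictions tested, rounded to a whole percent. A short check, with the counts copied from the table and a hypothetical helper name:

```python
def success_rate(tested, positive):
    """Percentage of tested predictions that gave a positive RT-PCR result."""
    return round(100 * positive / tested)


# (tested, positive) counts per pool, taken from the table above
POOLS = {
    "intron aligned": (214, 133),
    "similar": (38, 4),
    "orphan": (63, 2),
}

if __name__ == "__main__":
    for pool, (tested, positive) in POOLS.items():
        print(f"{pool}: {success_rate(tested, positive)}%")
```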

Page 31: finding genes by comparing genomes

rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers

Page 32: finding genes by comparing genomes

about 1000 human genes not in ensembl

• low support by ESTs: 34% match EST sequences

• low representation in other vertebrate genomes: 33% have sequence matches in fish genomes

• restricted expression patterns

Page 33: finding genes by comparing genomes

restricted expression patterns

Page 34: finding genes by comparing genomes
Page 35: finding genes by comparing genomes
Page 36: finding genes by comparing genomes

Code  B H K Y V S M L T K E O  %Id  Homology
3B1   - - - - - - - ● ● - - -  38%  Dystrophin-like; with ZZ domain
3B3   - - - ● - ● - ● ● ● - -  25%  Novel aquaporin; similar to Drosophila CG12251
3C3   - - ● - ● - - ● ● - ● -  25%  TEP1 (telomerase associated); probable ATPase
3C5   - - - - - - - - ● - ● -  47%  Voltage-dependent calcium channel gamma subunit
4B3   - - ● - - ● - - ● - - -  34%  Interferon-induced / fragilis transmembrane family
4C6   - ● - - - ● - ● ● ● - -  30%  Interleukin 22-binding protein CRF2-10
4G4   ● - - - - - - - ● ● ● -  64%  Nna1p, nuclear ATP/GTP-binding protein
5B5   - - - - - ● - - ● ● - -  43%  Likely aminophospholipid flippase (transporting ATPase)
1E3   ● - - ● ● - - ● - - ● -  40%  N-acetylated-α-linked-acidic dipeptidase (NAALADase)
6C4   - - - - - - - ● ● - - -  42%  Not-type homeobox; poss. involved in notochord development
6F5   ● ● ● - - - - - - - - -  66%  Drosophila brain-specific homeobox protein (bsh)
11F2  ● - - - - ● - ● ● ● - -  29%  Human GABA-B receptor 2, neurotransmitter release regulator
5A2   - - ● - ● ● - - ● - - -  41%  Skate liver organic solute transporter beta
11B6  - - - ● - - - ● - - ● -  55%  Interferon-activatable protein 203; nuclear protein
12B3  ● - - ● ● ● - ● ● ● ● -  25%  Fatty acid desaturase; maintains membrane integrity
11F6  ● ● - - - ● - ● ● ● ● -  44%  Rat vanilloid receptor type 1 like protein 1
12E3  - - - - - - - ● ● - - -  52%  Fizzy/CDC20; modulates degradation of cell-cycle proteins
12F1  - ● - - - ● ● ● ● - - -  43%  Otoferlin (mutated in DFNB9, nonsyndromic deafness)
12H1  ● ● - - - - - - ● - - -  45%  Fruitfly additional sex combs; a Polycomb group protein
12C4  ● - - - - - - ● - - ● -  43%  C. elegans C15C8.2; single-minded-like; HLH and PAS domains
12D2  - - - - - ● - - - - - -  41%  Cytosolic phospholipase A2, group IVB
12A5  ● - - - - - - - - - - -  38%  Fruitfly GH15686p; Ent2-like nucleoside transporter
12E5  ● - - ● - - - ● - - - ●  32%  Relaxin 3 preproprotein; prohormone of the insulin family
11A1  - - ● ● ● - ● - - - - ●  89%  Mouse BET3, involved in ER to Golgi transport
11A2  ● ● - - - - - ● ● - ● ●  70%  Vacuolar ATP synthase subunit S1
11B2  - - - - - - ● ● ● ● ● ●  54%  Myosin light chain kinase, skeletal muscle
11G2                           36%  Dapper / frodo (transduces Wnt signals by interacting with Dsh)

Page 37: finding genes by comparing genomes

limitations: sensitivity of the procedure

                      twinscan       ensembl        sgp2
initial predictions   48464          23026          48451
multiexonic genes     36831          17565          38979
                      25320 (69%)    16368 (94%)    16952 (97%)    21184 (54%)
ortholog pairs        24743                         30927
                      21099 (85%)    15355 (87%)    16757 (95%)    19831 (64%)
intron aligned        17271                         18056
                      16337 (94%)    13709 (78%)    15112 (86%)    15977 (88%)

Page 38: finding genes by comparing genomes

specificity of the prediction can be improved: Ka/Ks ratio

[Figure: two panels, "TS success" and "SGP success"; y axis Dn/Ds, x axis 0 / 1]

Page 39: finding genes by comparing genomes

further work

• scale the procedure. Try to find rtpcr evidence for (almost) every human gene not yet confirmed

• intronless genes

• human specific gene families (if any)

• genes with non-canonical splicing

Page 40: finding genes by comparing genomes

selenoproteins

Selenoproteins are proteins that incorporate selenocysteine (Sec), the 21st amino acid.

•Function: mostly redox enzymes

•Distribution: 3 domains of life

•Number: 22 families in mammals

Page 41: finding genes by comparing genomes

selenoproteins

• UGA (STOP) is the codon for Sec

• There is a tRNASec whose anticodon (UCA) pairs with the UGA codon

• Recoding:

1. RNA structure: the SECIS element

2. SECIS binding proteins

Page 42: finding genes by comparing genomes

selenoproteins

Page 43: finding genes by comparing genomes

the SECIS element. computational search for selenoproteins

[Figure: dSelG; SECIS pattern]

Page 44: finding genes by comparing genomes

using geneid to search for selenoproteins

1. Predict SECIS (PatScan)

2. Gene prediction with
   1. TGA in-frame
   2. SECIS
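The combination of the two signals can be illustrated with a toy scan: read a coding sequence codon by codon, treat TGA as a possible selenocysteine codon rather than a stop, and keep it only if a predicted SECIS element lies downstream. This is a minimal sketch under those assumptions; it does not reproduce how geneid or PatScan work, and the function name and inputs are hypothetical.

```python
def inframe_tga_with_secis(cds, secis_starts):
    """Return offsets of in-frame TGA codons that have a SECIS downstream.

    cds:          DNA string read in frame from its first base.
    secis_starts: start offsets of predicted SECIS elements (same coordinates).
    Reading stops at the first TAA or TAG; TGA is treated as a candidate
    Sec codon instead of a terminator (illustrative assumption).
    """
    hits = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        if codon in ("TAA", "TAG"):
            break  # unambiguous stop codon ends the reading frame
        if codon == "TGA" and any(start > i for start in secis_starts):
            hits.append(i)
    return hits


if __name__ == "__main__":
    sequence = "ATGGCCTGAGGCTAAGGG"
    print(inframe_tga_with_secis(sequence, secis_starts=[40]))
```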

Page 45: finding genes by comparing genomes

genome wide search in drosophila

SECIS predicted: 35876

SECIS thermo assessment: 1220

Genes predicted: 12194

Predicted selenoproteins: (4)

Real selenoproteins: 3

Page 46: finding genes by comparing genomes

dSelG

Page 47: finding genes by comparing genomes

dSelM

Page 48: finding genes by comparing genomes

dSelG and dSelM: experimental verification

Page 49: finding genes by comparing genomes

dSelM has selenoprotein homologues in vertebrates

Page 50: finding genes by comparing genomes

COMPARATIVE GENE PREDICTION

IMIM/UPF/CRG

Genís Parra, Josep F. Abril, Roderic Guigó

University of Geneva

Manolis Dermitzakis, Alexandre Reymond, Robert Lyle, Catherine Ucla, Stylianos Antonarakis

GlaxoSmithKline

Pankaj Agarwal

University of Oxford

Chris Ponting

Washington University

Evan Keibler, Michael Brent

SELENOPROTEINS

Universitat de Barcelona

Montserrat Corominas, Florenci Serras, Marta Morey, Sergi Bertran

University of Lincoln

Vadim Gladyshev, Gregory Kryukov

Harvard University

Marla Berry, Nadia Morozova

IMIM/UPF/CRG

Sergi Castellano