Cédric Notredame (18/04/23)

Finding Genes In a Genome

Cédric Notredame

Naked Genome

All Dressed Up!

Naked Genomes are Useless

Naked Genome => Accurate Annotation => Useful Genome

-Experimental Methods: ESTs, THS, DNA Chips…

-Computational Methods: Homology, Ab-Initio

ANNOTATION

-Where are the genes ?

-What do they do: Biochemistry ?

-When do they do it: Regulation ?

-Who do they do it for: Metabolic ?

Outline

Naked Genome => Fully Dressed Sequence

1. Cleaning the genome

2. Similarity methods

3. Experimental Methods

4. Ab-initio Methods

5. How Good Are The Methods?

Eukaryotes

Prokaryotes

Outline

Eukaryotes

Prokaryotes

Gene Fishing in Prokaryotic Genomes

What is a Prokaryotic Gene ?

[Figure: prokaryotic gene layout, labelled Promoter, RBS, ATG, ORF, STOP, Terminator, mRNA, Protein]

What is a Prokaryotic Gene: Operon

1-Ab-initio:
 -ORFing
 -Codon Bias

2-Homology Based Methods

3-Regulatory Sequence Detection
 -Non Coding
 -Short Genes

[Figure: gene diagram labelled Promoter, RBS, STOP, Terminator, mRNA]

Prokaryotic Genomes

-High Gene Density: Haemophilus influenzae: 85%

-No Introns

-Operons

In a prokaryotic genome, any ORF longer than 300 nt can SAFELY be considered to be a gene

Prokaryotic Genomes

Clean-up

ORFing

Homology Search

Gene Prediction

Promoter Detection

Cleaning Your DNA Sequence

Cleaning a DNA Sequence

Is My Sequence Contaminated ?

-Cloning may lead to the inclusion of Vector Sequences.

-These sequences must be removed

Paste in your new sequence

Crop

Our sequence displays two vector contaminations
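
The cropping step can be sketched in a few lines of Python. This is only an illustration: it assumes the contaminated regions have already been reported by a vector-screening search as 0-based (start, end) intervals with an exclusive end, and the function name `crop_contamination` is hypothetical.

```python
def crop_contamination(seq, hits):
    """Return the clean segments left after removing contaminated intervals.

    hits: list of (start, end) pairs, 0-based, end exclusive (assumed format).
    Overlapping or unsorted hits are handled by sorting and tracking the
    right-most position already removed.
    """
    segments = []
    pos = 0
    for start, end in sorted(hits):
        if start > pos:
            segments.append(seq[pos:start])
        pos = max(pos, end)
    if pos < len(seq):
        segments.append(seq[pos:])
    return segments


# Toy sequence: vector (V) at both ends, genuine insert (G) in the middle.
toy = "VVVVVGGGGGGGGGGVVVVV"
print(crop_contamination(toy, [(0, 5), (15, 20)]))
```

Returning a list of segments rather than one joined string avoids silently gluing together regions that were not adjacent in the clone.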

Contamination Matters

Contaminations Look Like Horizontal Transfers

BUT Genuine Genome may Contain Similarity to the Cloning vector (Antibiotics Resistance)

-Wrong Phylogeny

-Error Propagation in Secondary Databases

-Eukaryote Genomes can also be cleaned this way

ORFing Prokaryotic Genomes

Prokaryotic Genomes: ORFing

Where are the ORFs In my Sequence ?

Prokaryotic Genomes: ORFing

ATG (Start) Codons

STOP Codons
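
A minimal ORF scanner along these lines might look as follows. It is a sketch, not the method of any particular tool: it assumes ATG as the only start codon, the three standard stop codons, and the 300 nt threshold quoted earlier as the default; the names `find_orfs` and `reverse_complement` are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Yield (strand, start, end, orf) for each ATG...STOP stretch >= min_len nt.

    Coordinates are 0-based, end exclusive, and refer to the scanned strand
    (the reverse complement for strand "-"). In each frame the first ATG
    after the previous stop codon is taken as the start.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None


# Toy example: one 306 nt ORF embedded in a short sequence.
toy = "CC" + "ATG" + "GCT" * 100 + "TAA" + "GG"
for strand, start, end, orf in find_orfs(toy):
    print(strand, start, end, len(orf))
```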

Prokaryotic Genomes: GORF

www.ncbi.nih.gov/gorf/gorf.html

Prokaryotic Genomes: GORF

TO COG

TO BLAST

GORF: Can You Trust it ???

Random ORF: Random 3rd Position

Real ORF: Biased 3rd Position
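
One simple way to look at this signal is to tabulate the base composition at the third codon position of a candidate ORF. The sketch below assumes an in-frame, non-empty ORF; the function names and the GC3 summary are illustrative choices, not part of GORF.

```python
from collections import Counter


def position_composition(orf, position):
    """Base frequencies at one codon position (0, 1 or 2) of an in-frame ORF."""
    bases = orf.upper()[position::3]
    counts = Counter(bases)
    total = sum(counts.values())
    return {base: counts.get(base, 0) / total for base in "ACGT"}


def gc3(orf):
    """Fraction of G or C at the third codon position."""
    comp = position_composition(orf, 2)
    return comp["G"] + comp["C"]


print(position_composition("GGCGCCACG", 2))
print(gc3("GGCGCCACG"))
```

A third-position composition close to that of the whole genome suggests a random ORF; a strong skew is what a real coding region is expected to show.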

Prokaryotic Genomes: GORFing cDNAs

Works with Bacterial Genomes

Good enough for ~85% proteome

Works with Eukaryotic cDNA

BUT…

-Will NOT detect SHORT genes

-Will NOT detect Non Coding Genes

Ab-Initio Gene Predictions in Prokaryotic Genomes

Predicting Genes

What are the sequences in my genome that LOOK LIKE Genes

Using The Codon Biases

Coding Regions Do NOT look Like Random DNA:

-Codon Bias

Real Genes Use Mostly the Optimal Codons
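
Codon bias can be measured by counting, for each amino acid, how often each of its synonymous codons is used in a set of trusted genes. The sketch below assumes the standard genetic code and in-frame input sequences; `codon_usage` is an illustrative name.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(orfs):
    """Map each observed codon to its share among the codons of its amino acid."""
    counts = Counter()
    for orf in orfs:
        orf = orf.upper()
        for i in range(0, len(orf) - 2, 3):
            codon = orf[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    per_aa = defaultdict(int)
    for codon, n in counts.items():
        per_aa[CODON_TABLE[codon]] += n
    return {codon: n / per_aa[CODON_TABLE[codon]] for codon, n in counts.items()}


# Toy gene that strongly prefers GGT over GGC for glycine.
print(codon_usage(["GGTGGTGGTGGC"]))
```

The codon with the highest share for each amino acid is the "optimal" codon in this sense; a candidate ORF that uses mostly optimal codons looks more like a real gene.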

Predicting Genes

ALL the characteristics of a Gene can be Built into a model

Hidden Markov Model

Hidden Markov Model

-Each Nucleotide has a STATE: Coding/Non Coding …

-This STATE is HIDDEN

-The HMM tries to UNCOVER the STATE of each Nucleotide.

Hidden Markov Model

Occasionally Dishonest Casino …

-This STATE is HIDDEN in the data

Observation: 122234455666125654151661661515566616166661

State : FFFFFFFFLLLLFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLL
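
Uncovering the hidden state of each roll is usually done with the Viterbi algorithm. The sketch below is an illustration under assumed parameters that are not given on the slide: a fair die (F) with probability 1/6 per face, a loaded die (L) that shows a six half of the time, and switching probabilities of 0.05 (F to L) and 0.10 (L to F).

```python
import math

STATES = "FL"  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}                                        # assumed
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}  # assumed
EMIT = {                                                            # assumed
    "F": {str(d): 1 / 6 for d in range(1, 7)},
    "L": {**{str(d): 0.1 for d in range(1, 6)}, "6": 0.5},
}


def viterbi(rolls):
    """Most probable sequence of hidden states for a string of die rolls."""
    if not rolls:
        return ""
    # Work in log space to avoid underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][roll]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("122234455666125654151661661515566616166661"))
```

Gene finders apply the same idea with nucleotides as observations and states such as Coding and Non Coding in place of the two dice.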

GeneMark

Simplified HMM for Coding Regions

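[Figure: a start state S and an end state E flank two columns of 64 codon states. Each codon state carries the probability of the codon given its amino acid, for example:]

Amino acid   Codon   Probability
G            GGG     0.02
G            GGA     0.00
G            GGT     0.60
G            GGC     0.38
W            TGG     1.00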

Simplified HMM for Coding Regions

[Figure: HMM diagram labelled with Emission Probabilities and Transition Probabilities]

Simplified HMM for Coding Regions

Probability of the sequence GGG-TGG given the model:

$P(\text{GGG-TGG} \mid \text{Model}) = P(\text{GGG}) \times P(\text{GGG} \to \text{TGG}) \times P(\text{TGG})$

HMM order 5: the 6th nucleotide depends on the 5 previous ones

Takes into account Codon Bias AND dipeptide composition
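
The "order 5" idea can be illustrated with a plain fifth-order Markov chain trained on known coding sequences. This is a simplified sketch, not GeneMark's actual model: it ignores codon position, uses add-one pseudocounts (an assumption), and the function names are hypothetical.

```python
import math
from collections import Counter


def train_markov(sequences, order=5, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    context_counts, kmer_counts = Counter(), Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set("ACGT"):
                kmer_counts[context + base] += 1
                context_counts[context] += 1

    def prob(context, base):
        # Pseudocounts keep unseen 6-mers from getting probability zero.
        return (kmer_counts[context + base] + pseudocount) / (
            context_counts[context] + 4 * pseudocount)

    return prob


def log_likelihood(seq, prob, order=5):
    """Log probability of seq (after its first `order` bases) under the model."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


coding_model = train_markov(["GGTGGC" * 50])  # toy training set
print(log_likelihood("GGTGGC" * 5, coding_model))
print(log_likelihood("ACGTAC" * 5, coding_model))
```

In practice one such model is trained on coding regions and another on non-coding regions, and a window is called coding when the first model gives the higher likelihood.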

Translate Predicted Genes into Proteins

Text Output

http://opal.biology.gatech.edu/GeneMark/

Non Standard FASTA

GLIMMER: An alternative to GeneMark

Main Problems

GeneMark and HMM predictions

Works Very Well

Good enough for ~99% proteome

BUT…

-Will NOT detect Some SHORT genes

-Will NOT detect Non Coding Genes

Which Program ???

The established programs ALL work well

No point in fighting if your users have their mind set on a brand…

If Your Gene is NON-Coding…

The only existing models for NON-Coding genes are those for tRNA

Homology Based Gene Prediction in Prokaryotic Genomes

BLASTx

Which portions of my Genome look like a Known Gene/Protein?

blastx: nucleotide (translated into protein) VS protein

Non Coding, but works only for highly similar sequences (>70%)
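
The translation step behind blastx can be pictured as a six-frame translation of the nucleotide query, each frame then being compared with the protein database. The sketch below shows only that translation step, assuming the standard genetic code; it is not BLAST code and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate from the first base; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```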

BlastX and HMM predictions

Very Reliable on Prokaryotes

Can Help in Eukaryotes

BUT…

Needs Homology

Depends on the databases

Finding Promoters in Prokaryotic Genomes

Promoter Hunting

Are There known promoters in my Sequence ?

Ideal for

-Finding Small Proteins

-Finding Non Coding Genes

prodoric.tu-bs.de/

rsat.ulb.ac.be/rsat/RSA_home.cgi

Fishing Genes in Eukaryotic Genomes

1-Transcript Based Methods

2-Homology Based Methods

3-Ab-initio:
 -HMMs

4-Regulatory Sequence Detection

[Figure: eukaryotic gene labelled Promoter, six exons, and alternative mRNA forms]

Eukaryote Genomes

Clean-up

Transcripts

Prediction

Homology

Promoter Detection

Know your Opponent …

Exons are longer in Vertebrates

Introns are longer in Vertebrates

-100 bp in Fungi

-1000 bp in Vertebrates

Genes contain more Introns in Mammals

Cleaning Eukaryotic Genomes

Avoiding Repeats

Repeats: Transposable elements, simple repeats

RepeatMasker: Smith and Waterman Clean-up.

Plus: Remove lots of noise.

Minus: Changes Sequence Statistics.

Homology Based Gene Prediction in Eukaryotic Genomes

Homology Based Predictions

Which portions of my Genome look like a Known Protein?

Three Tools

GeneWise: Most Common

Procrustes: Most Sophisticated

BlastX/TBlastX: Simplest

BLASTX

blastx: Genome (translated into protein) VS protein

TBLASTX: Exon Fishing

tblastx: Genome (translated into protein) VS ESTs (translated into protein)

Procrustes

[Figure: a Protein aligned to a genomic sequence]

40% id

GeneWise

www.ebi.ac.uk/Wise2/advanced.html

[Figure: a Protein aligned to a genomic sequence]

Transcript Based Gene Prediction in Eukaryotic Genomes

Gene indices

Using Established ESTs Collections

[Figure: putative mRNA (5' UTR, exon 1, exon 2, 3' UTR, AAAAAA...) with the expressed sequence tags (ESTs) that cover it]

ESTs give us an Insight into this Complexity

1-Cluster the ESTs to reconstitute a gene

Gene indices: Typical Workflow

Inputs: EMBL database and EST Collection, each followed by Quality clipping

1. BLAST search, clustering

2. Assembly, Consensus sequences

3. Visualization

Gene indices Alignment

consensus

Gene indices Alignment Software

Phrap (Phil Green)

CAP3 (X. Huang)

TIGR assembler

GAP4 (R. Staden)

Gene indices: Consensus sequences

Reduced error rate

Long Consensus

Efficient database search

exon/intron boundaries

Alternative Splicing
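
The reduced error rate of a consensus comes from voting across overlapping reads. A minimal sketch, assuming the ESTs of one cluster have already been aligned and padded with "-" to the same length (real assemblers such as those listed earlier do far more than this); `consensus` is an illustrative name.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length, gap-padded reads."""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if counts:  # skip columns covered by no read
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


reads = [
    "ACGTAC--",
    "ACGAACGT",  # one sequencing error at the fourth position
    "--GTACGT",
]
print(consensus(reads))
```

The single error in the second read is outvoted, and the consensus is longer than any individual read.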

Gene indices

Clustering of EST and mRNA sequences of an organism to reduce redundancy in sequence data.

Goal: One cluster = One Gene

UniGene (NCBI)

TIGR Gene Indices

STACK (SANBI)

GeneNest (DKFZ, MPI)
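
The "one cluster, one gene" goal can be illustrated with single-linkage clustering. The sketch below is a toy stand-in for the BLAST-based clustering used by real gene indices: it assumes two ESTs belong together as soon as they share an exact k-mer, ignores the reverse strand, and uses hypothetical names throughout.

```python
def cluster_ests(ests, k=12):
    """Single-linkage clustering: ESTs sharing any k-mer end up in one cluster.

    Returns a sorted list of clusters, each a list of indices into `ests`.
    """
    parent = list(range(len(ests)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST that contained it
    for idx, est in enumerate(ests):
        est = est.upper()
        for i in range(len(est) - k + 1):
            kmer = est[i:i + k]
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


gene_a = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGA"
ests = [gene_a[:20], gene_a[10:], "TTTTTTTTTTTTCCCCCCCCCCCCAAAAAAAA"]
print(cluster_ests(ests, k=8))
```

Single linkage is deliberately permissive: a chain of overlapping ESTs is merged even when its two ends do not overlap, which is what allows a full transcript to be reconstituted from short tags.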

GeneNest

genenest.molgen.mpg.de

TIGR Gene Indices

Alignment scheme

www.tigr.org

UniGene

www.ncbi.nih.nlm.gov/UniGene

Gene indices

Gene indices: Applications

Detection of exon/intron boundaries

Detection of alternative splicing

Detection of Single Nucleotide Polymorphisms

Genome annotation

Analysis of gene expression

Design of DNA-chips/arrays

Mapping of EST consensus sequences on genomic DNA

genomic sequence

exons

consensus sequence (mRNA)

Comparing Your Genome with Transcripts

How to do It? How Long?

BLAST: 36 hours
 -Popular and well described
 -HSPs tend to mangle Introns

EST_GENOME: 80 hours
 -Dynamic Program. post process
 -Slow and sometimes hard to use

BLAT: 0.5 hours
 -Next Generation
 -Look for nearly identical seq.

SIM4 (pbil.univ-lyon1.fr/sim4.php)
 -Similar to BLAT (slower)
 -Allows Large Gaps

Mapping cDNA on genomic DNA

splicenest.molgen.mpg.de

Alternative Splicing

genomic sequence

exons

consensus sequence (mRNA)

splice variant

Alternative Splicing

Splice variants of APECED gene

[Figure: SpliceNest view showing number of sequences, genomic sequence, alternative variants]

splicenest.molgen.mpg.de

Alternative Splicing (additional exon)

Splice variants of adenylosuccinate lyase

1-skipped exon

2-unspliced ?

3-gene prediction errors ?

splicenest.molgen.mpg.de

Alternative Splicing (alternative donor site)

Alternative Splicing (unknown gene Hs16936)

Ab-Initio Gene Prediction in Eukaryotic Genomes

Three Categories of Methods

•Rule Based

–Uses explicit set of rules to make decisions

–GeneFinder

•Neural Network

–Uses a data set to build rules.

–Grail

•HMM

–Finding the state of each Nucleotide (Coding…)

–Genscan

Rule Based - GeneFinder

• CodonBias -> score1

• Splice Site description -> score2

• ORFs -> score3

Proba (Gene)= F(score1, score2, score3..)

Neural Networks - Grail

• Train Neural Network on Known Genes: Discriminate

• GrailExp:

 • Measure several coding potentials

  • Blast

  • Coding Potential

  • …

 • Feed all the scores into the Neural Network

HMM-Genscan

• Genscan models genes with a Hidden Markov Model that covers both coding and non-coding regions.

•Fifth order inhomogeneous HMM

–Fifth order: uses 6-tuples (two codons)

–Inhomogeneous: each codon position (0,1,2) has its own parameters

Burge, C. and S. Karlin, Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997. 268(1): p. 78-94

Your Genomic Sequence

A Collection of Proteins

Evaluating Eukaryote Gene Prediction

PMID:11042160

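Nucleotide Accuracy

Sensitivity: $Sn = \frac{TP}{TP + FN}$

Specificity: $Sp = \frac{TP}{TP + FP}$

Approximate correlation: $AC = 0.5 \times \left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FP} + \frac{TN}{TN + FN} \right) - 1$

Correlation coefficient: $CC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$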
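
The four nucleotide-level measures named on this slide can be computed directly from the counts of true/false positives and negatives. The sketch below uses the usual gene-prediction definitions, where specificity is TP / (TP + FP) and the correlation coefficient has (TP × TN) − (FN × FP) in its numerator; the function name is illustrative and all four denominators are assumed to be non-zero.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return Sn, Sp, AC and CC for nucleotide-level gene predictions."""
    sn = tp / (tp + fn)                 # sensitivity
    sp = tp / (tp + fp)                 # specificity (gene-prediction sense)
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "AC": ac, "CC": cc}


# Toy counts: 50 coding and 50 non-coding nucleotides, 10 errors of each kind.
print(nucleotide_accuracy(tp=40, fp=10, tn=40, fn=10))
```

With these balanced toy counts AC and CC coincide (both 0.6); they diverge when coding and non-coding nucleotides are present in very different proportions.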

Exon Accuracy

•Blastx:

•Good Gene Hunter/Poor Modeler

•GeneWise

•Best Homology Gene Modeling

•GenScan:

•Distracted by Complete Genome

•Use GenomeScan instead

•GenScan:

•Ab-Initio Methods are more Robust

http://www.cs.ubc.ca/~rogic/evaluation.html

PMID: 8786136

•High and Low GC contents can confuse Predictions
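
Before trusting a prediction it can help to check the GC content of the region. A small sketch of a sliding-window GC profile; the window and step sizes are arbitrary illustrative defaults and `gc_windows` is a hypothetical name.

```python
def gc_windows(seq, window=1000, step=500):
    """Return (start, GC fraction) for successive windows along the sequence."""
    seq = seq.upper()
    profile = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        if chunk:
            gc = sum(chunk.count(base) for base in "GC") / len(chunk)
            profile.append((start, gc))
    return profile


# Toy sequence: a GC-rich half followed by an AT-rich half.
toy = "GGCC" * 5 + "AATT" * 5
print(gc_windows(toy, window=20, step=20))
```

Regions whose GC fraction is far from the values seen in the training set of a gene finder are the ones where its predictions deserve extra scrutiny.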

Annnnnnd The Winner is …

http://www.cbs.dtu.dk/services/HMMgene/

Working on a Genome

www.sanger.ac.uk/Software/Artemis/

Wrapping It Up

Predicting Genes

Using Homology: BlastX, Procrustes

ORFing: GORF

Cleaning Up Data

Predicting Genes

Promoter Prediction: PRODORIC

Transcript Based Predictions: GeneNest, UniGene, BLAT

Ab-Initio Predictions with HMMs: GenomeScan and HMMgene
