Cédric Notredame (18/04/23)
Naked Genomes are Useless
Useful Genome
Accurate Annotation
-Experimental Methods: ESTs, THS, DNA Chips…
-Computational Methods: Homology, Ab-Initio
ANNOTATION
-Where are the genes ?
-What do they do: Biochemistry ?
-When do they do it: Regulation ?
-Who do they do it for: Metabolic ?
Outline
Naked Genome => Fully Dressed Sequence
1. Cleaning the genome
2. Similarity methods
3. Experimental Methods
4. Ab-initio Methods (Prokaryotes, Eukaryotes)
5. How Good Are The Methods?
What is a Prokaryotic Gene ?
[Figure: prokaryotic gene. Promoter, RBS, ATG, ORF, STOP, Terminator; the gene is transcribed into mRNA and the ORF is translated into Protein]
1-Ab-initio: ORFing, Codon Bias
2-Homology Based Methods
3-Regulatory Sequence Detection: Non Coding, Short Genes
[Figure: prokaryotic gene with Promoter, RBS, mRNA, STOP and Terminator]
Prokaryotic Genomes
-High Gene Density: Haemophilus influenzae: 85%
-No Introns
-Operons
In a prokaryotic genome, any ORF longer than 300 nt can SAFELY be considered to be a gene
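The 300 nt rule above can be sketched as a six-frame ORF scan. This is a minimal illustration, not the method of any particular program: the function names, the use of ATG as the only start codon and the three standard stop codons are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Scan the six reading frames for ATG...STOP ORFs of at least min_len nt.

    Returns (strand, start, end, length) tuples. Coordinates are 0-based and
    refer to the scanned strand, so "-" hits are positions on the reverse
    complement, not on the input sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first in-frame ATG still open
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start  # stop codon included
                    if length >= min_len:
                        orfs.append((strand, start, i + 3, length))
                    start = None
    return orfs
```

Calling find_orfs(genome) with the default threshold keeps only ORFs of 300 nt or more; lowering min_len quickly increases the number of spurious hits.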
Prokaryotic Genomes
Clean-up
ORFing
Homology Search
Gene Prediction
Promoter Detection
Cleaning a DNA Sequence
Is My Sequence Contaminated ?
-Cloning may lead to the inclusion of Vector Sequences.
-These sequences must be removed
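A toy version of vector screening is sketched below: it flags and masks the parts of a read that share exact k-mers with a vector sequence. Real clean-up relies on alignment against a vector database; the function names, the k value and the exact-match criterion are assumptions for illustration only, and only the forward strand is checked.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def vector_hits(read, vector, k=12):
    """Return merged (start, end) intervals of read covered by vector k-mers."""
    read, vector = read.upper(), vector.upper()
    vector_kmers = kmers(vector, k)
    hits = []
    for i in range(len(read) - k + 1):
        if read[i:i + k] in vector_kmers:
            if hits and i <= hits[-1][1]:
                # Overlapping or adjacent match: extend the previous interval.
                hits[-1] = (hits[-1][0], i + k)
            else:
                hits.append((i, i + k))
    return hits


def mask(read, intervals):
    """Replace the given intervals of read with N."""
    chars = list(read)
    for start, end in intervals:
        chars[start:end] = "N" * (end - start)
    return "".join(chars)
```

A hit inside a genuine genome is not automatically contamination, so flagged intervals are best reviewed before they are removed.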
Contamination Matters
Contaminations Look Like Horizontal Transfers
BUT a Genuine Genome may Contain Similarity to the Cloning vector (Antibiotics Resistance)
-Wrong Phylogeny
-Error Propagation in Secondary Databases
-Eukaryote Genomes can also be cleaned this way
GORF: Can You Trust it ???
Random ORF Random 3rd Position
Real ORF Biased 3rd Position
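One simple way to look at third-position bias is to compare G+C content at the three codon positions of a candidate ORF. The slide does not say which statistic GORF uses, so the measure below (GC content per codon position) and the function name are assumptions chosen for illustration.

```python
def gc_by_codon_position(orf):
    """Return the G+C fraction at codon positions 1, 2 and 3.

    The sequence is assumed to start in frame; a trailing partial codon
    is ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # drop any incomplete last codon
    fractions = []
    for position in range(3):
        bases = orf[position:usable:3]
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)
```

In a random ORF the three fractions should be similar; in a real ORF from an organism with a marked codon bias, the third position typically stands out.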
Prokaryotic Genomes: GORFing cDNAs
Works with Bacterial Genomes
Good enough for ~85% proteome
Works with Eukaryotic cDNA
BUT…
-Will NOT detect SHORT genes
-Will NOT detect Non Coding Genes
Predicting Genes
What are the sequences in my genome that LOOK LIKE Genes
Using The Codon Biases
Coding Regions Do NOT look Like Random DNA:
-Codon Bias
Predicting Genes
ALL the characteristics of a Gene can be Built into a model
Hidden Markov Model
Hidden Markov Model
-Each Nucleotide has a STATE: Coding/Non Coding …
-This STATE is HIDDEN
-The HMM tries to UNCOVER the STATE of each Nucleotide.
Hidden Markov Model
Occasionally Dishonest Casino …
-This STATE is HIDDEN in the data
Observation: 122234455666125654151661661515566616166661
State : FFFFFFFFLLLLFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLL
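The fair (F) / loaded (L) labelling above can be recovered with the Viterbi algorithm. The sketch below is a minimal version; the slide gives no parameters, so the start, transition and emission probabilities are the usual textbook values for this example and should be read as assumptions.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die

# Assumed parameters (not given on the slide).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {str(face): 1 / 6 for face in range(1, 7)},
    "L": {**{str(face): 0.1 for face in range(1, 6)}, "6": 0.5},
}


def viterbi(observations):
    """Return the most probable F/L state string for a non-empty roll string."""
    # scores[state] = best log-probability of any path ending in that state
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for symbol in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANS[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][symbol])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Gene finders apply the same idea with states such as coding and non-coding in place of fair and loaded, and nucleotides in place of die rolls.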
Simplified HMM for Coding Regions
[Figure: S (start) -> codon state -> codon state -> E (end); each codon state covers the 64 Codons]
Example probabilities shown for each codon state:
-Amino acid G: GGG 0.02, GGA 0.00, GGT 0.6, GGC 0.38
-Amino acid W: TGG 1.00
Proba of seq (GGG-TGG Given Model) = Proba(GGG) * Proba(GGG->TGG) * Proba(TGG)
HMM order 5: 6th Nucleotide depends on the 5 previous
Takes into account Codon Bias AND dipeptide Composition
Simplified HMM for Coding Regions
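The product Proba(GGG) * Proba(GGG->TGG) * Proba(TGG) can be worked through in a few lines. The codon probabilities below are the ones shown for the G and W states; the transition value 0.05 is a made-up placeholder, since the slides give no transition probabilities.

```python
# Codon probabilities taken from the slide (glycine and tryptophan states).
EMISSION = {"GGG": 0.02, "GGA": 0.00, "GGT": 0.60, "GGC": 0.38, "TGG": 1.00}

# Hypothetical codon-to-codon transition probability (not from the slides).
TRANSITION = {("GGG", "TGG"): 0.05, ("GGA", "TGG"): 0.05}


def sequence_probability(codons, emission, transition):
    """Probability of a codon chain under the simplified model.

    P(c1) * product over consecutive pairs of P(prev -> cur) * P(cur).
    """
    probability = emission[codons[0]]
    for prev, cur in zip(codons, codons[1:]):
        probability *= transition[(prev, cur)] * emission[cur]
    return probability


p = sequence_probability(["GGG", "TGG"], EMISSION, TRANSITION)
```

With these numbers the rare GGG codon dominates the result, which is how codon bias enters the score; the transition term is where dipeptide composition enters.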
Translate Predicted Genes into Proteins Text Output
http://opal.biology.gatech.edu/GeneMark/
GeneMark and HMM predictions
Works Very Well
Good enough for ~99% proteome
BUT…
-Will NOT detect Some SHORT genes
-Will NOT detect Non Coding Genes
Which Program ???
The established programs ALL work well
No point in fighting if your users have their mind set on a brand…
If Your Gene is NON-Coding…
The only existing models for NON-Coding genes are those for tRNA
BLASTx
What are the portions of my Genome That Look like a Known Gene/Protein?
blastx: nucleotide query (translated to protein) VS protein database
Non Coding, but works only for highly similar sequences (>70%)
BlastX and HMM predictions
Very Reliable on Prokaryotes
Can Help in Eukaryotes
BUT…
-Needs Homology
-Depends on the databases
Promoter Hunting
Are There known promoters in my Sequence ?
Ideal for
-Finding Small Proteins
-Finding Non Coding Genes
2-Homology Based Methods
1-Transcript Based Methods
2-Homology Based Methods
3-Ab-initio: HMMs
4-Regulatory Sequence Detection
[Figure: eukaryotic gene with Promoter and six exons, spliced into alternative mRNA forms]
Eukaryote Genomes
Clean-up
Transcripts
Prediction
Homology
Promoter Detection
Introns are longer in Vertebrates
-100 bp in Fungi
-1000 bp in Vertebrates
Avoiding Repeats
Repeats: Transposable elements, simple repeats
RepeatMasker: Smith and Waterman Clean-up.
Plus: -Remove lots of noise.
Minus: -Changes Sequence Statistics.
Homology Based Predictions
What are the portions of my Genome That Look like a Known Protein?
Three Tools
GeneWise: Most Common
Procrustes: Most Sophisticated
BlastX/TBlastX: Simplest
[Figure: putative mRNA covered by expressed sequence tags (ESTs): 5'UTR, exon 1, exon 2, 3'UTR, poly-A tail (AAAAAA...)]
ESTs give us an Insight into this Complexity
1-Cluster the ESTs to reconstitute a gene
Gene indices: Typical Workflow
EMBL database
Quality clipping
BLAST search, clustering
EST Collection
Quality clipping
Assembly, Consensus sequences
Visualization
Gene indices Alignment Software
Phrap (Phil Green)
CAP3 (X. Huang)
TIGR assembler
GAP4 (R. Staden)
Gene indices: Consensus sequences
Reduced error rate
Long Consensus
Efficient database search
exon/intron boundaries
Alternative Splicing
Gene indices
Clustering of EST and mRNA sequences of an organism to reduce redundancy in sequence data.
-UniGene (NCBI)
-TIGR Gene Indices
-STACK (SANBI)
-GeneNest (DKFZ, MPI)
Goal: One cluster, One Gene
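The "one cluster, one gene" idea can be sketched as single-linkage clustering: two ESTs join the same cluster when they share enough exact k-mers. This is only a toy stand-in for the BLAST-based clustering used by real gene indices; the function names, k and the min_shared threshold are assumptions, and reverse-complement matches are ignored.

```python
def kmer_set(seq, k):
    """Return the set of k-mers of an EST."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Group ESTs by shared k-mers; returns sorted lists of EST indices."""
    parent = list(range(len(ests)))  # union-find forest over EST indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    sets = [kmer_set(est, k) for est in ests]
    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if len(sets[i] & sets[j]) >= min_shared:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The all-against-all comparison is quadratic in the number of ESTs, which is acceptable for a demonstration but not for a full EST collection.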
Gene indices: Applications
Detection of exon/intron boundaries
Detection of alternative splicing
Detection of Single Nucleotide Polymorphisms
Genome annotation
Analysis of gene expression
Design of DNA-chips/arrays
Mapping of EST consensus sequences on genomic DNA
genomic sequence
exons
consensus sequence( mRNA)
Comparing Your Genome with Transcripts
How to do It? How Long?
BLAST: 36 hours
-Popular and well described
-HSPs tend to mangle Introns
EST_GENOME: 80 hours
-Dynamic Program. post process
-Slow and sometimes hard to use
BLAT: 0.5 hours
-Next Generation
-Look for nearly identical seq.
SIM4 (pbil.univ-lyon1.fr/sim4.php)
-Similar to BLAT (slower)
-Allows Large Gaps
Alternative Splicing
genomic sequence
exons
consensus sequence( mRNA)
splice variant
Alternative Splicing
Splice variants of APECED gene
[Figure: number of sequences, genomic sequence, alternative variants]
splicenest.molgen.mpg.de
Alternative Splicing (additional exon)
1-skipped exon
Splice variants of adenylosuccinate lyase
2-unspliced ?
3-gene prediction errors ?
splicenest.molgen.mpg.de
Three Categories of Methods
•Rule Based
–Uses explicit set of rules to make decisions
–GeneFinder
•Neural Network
–Uses a data set to build rules.
–Grail
•HMM
–Finding the state of each Nucleotide (Coding…)
–Genscan
Rule Based - GeneFinder
• CodonBias -> score1
• Splice Site description -> score2
• ORFs -> score3
Proba (Gene)= F(score1, score2, score3..)
Neural Networks - Grail
• Train Neural Network on Known Genes: Discriminate
• GrailExp:
– Measure several coding potentials
– Blast
– Coding Potential
– …
• Feed all the scores into the Neural Network
HMM - Genscan
• Genscan models genes with a Hidden Markov Model that covers both coding and non-coding regions.
• Fifth order inhomogeneous HMM
– Fifth order: use 6-tuples (two codons)
– Inhomogeneous: each position is special (0,1,2)
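The two terms can be illustrated with a small frame-aware Markov chain: each nucleotide is predicted from the 5 previous ones (so 6-tuples are counted), with a separate table for each codon position 0, 1, 2. This is a sketch of the idea only, not Genscan's actual model; the function names and the add-one pseudocount are assumptions.

```python
import math
from collections import defaultdict


def train(coding_seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon position).

    Training sequences are assumed to start in frame. Returns a function
    prob(phase, context, base).
    """
    # One table per codon position: counts[phase][context][base]
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1

    def prob(phase, context, base):
        seen = counts[phase].get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=5):
    """Log-probability of an in-frame sequence under the trained model."""
    seq = seq.upper()
    return sum(
        math.log(prob(i % 3, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )
```

Comparing log_likelihood under a coding model with the same quantity under a non-coding model gives a coding-potential score for a window of sequence.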
Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268(1): 78-94
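Nucleotide Accuracy
Sensitivity: $Sn = \frac{TP}{TP + FN}$
Specificity: $Sp = \frac{TP}{TP + FP}$
Approximate correlation: $AC = 0.5 \left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FP} + \frac{TN}{TN + FN} \right) - 1$
Correlation coefficient: $CC = \frac{(TP \cdot TN) - (FN \cdot FP)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$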
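The four nucleotide-level measures can be computed from per-nucleotide labels as sketched below. The labels "C" (coding) and "N" (non-coding) and the function names are assumptions for the example; the correlation coefficient uses the standard form with TP*TN - FN*FP in the numerator, and specificity follows the gene-prediction convention TP / (TP + FP).

```python
import math


def confusion(true_labels, predicted_labels, positive="C"):
    """Count (TP, FP, TN, FN) over two equal-length label strings."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Return Sn, Sp, AC and CC; all four denominators must be non-zero."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = 0.5 * (sn + sp + tn / (tn + fp) + tn / (tn + fn)) - 1
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return {"Sn": sn, "Sp": sp, "AC": ac, "CC": cc}
```

AC is useful when CC is undefined in practice, for example when a program predicts no coding nucleotides at all and one of the CC factors becomes zero; a production version would need to handle those zero denominators explicitly.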
•Blastx:
•Good Gene Hunter/Poor Modeler
•GeneWise
•Best Homology Gene Modeling
Predicting Genes
Using Homology: BlastX, Procrustes
ORFing: GORF
Cleaning Up Data
Predicting Genes
Promoter Prediction: PRODORIC
Transcript Based Predictions: GeneNest, UniGene, BLAT
Ab-Initio Predictions with HMMs: GenomeScan and HMMgene