Gene Prediction
Computational Genomics
February 6, 2012


OUTLINE

1. Background
   - Gene prediction
   - Protein coding sequences
   - Gene structure and ORF
   - Prokaryotic gene model
   - Biology of Haemophilus haemolyticus

2. Gene Prediction Approaches
   - Ab initio gene prediction
   - Homology-based gene prediction
   - RNA gene prediction

3. Gene Prediction Improvement

4. Strategy

What is Gene Prediction?

Finding DNA sequences that encode proteins

Protein-coding genes, RNA genes

Functional elements -> Regulatory regions

Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

Why develop gene finders?

As of May 2010, GOLD (www.genomesonline.org) reported 1,072 complete published bacterial genomes, and 4,289 bacterial genome projects were known to be ongoing.

Technological improvements in high-throughput DNA sequencing are rapidly increasing the public availability of prokaryotic and eukaryotic genomes.

Almost 2,000 genomes had been completely sequenced by 2011.

Sequencing projects are growing exponentially.

Bacterial genomes are sequenced mainly because the organisms are highly virulent to humans, animals, or plants, or because they can be applied to bioremediation or bioenergy production.

The growing amount of nucleotide sequence data also requires the concurrent development of adequate bioinformatics tools, for a comprehensive understanding of the genetic information the sequences encode as well as of the underlying biology.

Extracting knowledge from data

What is a Gene?

A gene is a linear collection of exons that are incorporated into a specific mRNA. Such a definition does not work for alternatively processed transcription units.

A gene is an elementary unit of heredity which is indivisible in the functional sense.

A gene codes for a discrete functional macromolecule (protein) or a functional RNA.

Prokaryotic Gene Model: ORF genes

• Small genomes, high gene density
  ▫ H. influenzae genome is 85% genic

• Operons
  ▫ One transcript, many genes

• No introns
  ▫ One gene, one protein

• Open Reading Frames
  ▫ One ORF per gene
  ▫ ORF with start and stop codons

[Figure: eukaryotic gene structure compared with prokaryotic gene structure]

Haemophilus haemolyticus: what do we know about our target system?

Gram-negative bacterium

Facultative anaerobe

Shape: coccobacillus

Emerging pathogen

Closely related to H. influenzae


H. haemolyticus is most closely related to H. influenzae

16S rRNA gene

infB gene

Multilocus Sequence Analysis (MLSA)

Why study Haemophilus haemolyticus?

1. Genetic Diversity

2. Emerging Pathogen

3. Intrinsic Biological Value

How does Gene Prediction work?

Gene Prediction Methods

Open Reading Frames

ORF (Open Reading Frame): a sequence delimited by an in-frame AUG and a stop codon, which in turn defines a putative amino acid sequence.

Simple first step in gene finding:

Translate the genomic sequence in six frames. Identify the stop codons in each frame.

Regions without stop codons are ORFs.

The longest ORF from a Met codon is a good prediction of a protein-encoding sequence.
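The six-frame ORF scan can be sketched in a few lines of Python. This is a minimal illustration, not any published gene finder: it works on DNA (so the start codon is ATG rather than AUG), assumes the three standard stop codons, ignores alternative start codons, and the function names and the `min_len` cutoff are hypothetical choices made for the example.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a production gene finder).
STOPS = {"TAA", "TAG", "TGA"}   # assumption: standard stop codons
START = "ATG"                   # assumption: only the canonical start codon
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) for ORFs in one frame.

    Within each stop-free region only the first ATG is used, which gives
    the longest ORF ending at that stop codon.
    """
    orf_start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == START and orf_start is None:
            orf_start = i
        elif codon in STOPS:
            if orf_start is not None:
                yield orf_start, i + 3
            orf_start = None


def find_orfs(seq, min_len=30):
    """Scan both strands in three frames each; coordinates are strand-local."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_len:
                    found.append((strand, frame, start, end, s[start:end]))
    return found
```

Calling `find_orfs(genome_string)` returns one tuple per ORF; coordinates on the "-" strand refer to the reverse-complemented sequence, which keeps the sketch short at the cost of an extra conversion step for real annotation.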

ORF Scanning

• Use only sequence information.

• Identify coding exons.

• Integrate coding statistics to differentiate between coding and non-coding regions. (Real exons are expected to show codon bias.)

• Calculate the likelihood that a triplet is in a coding region.

* Works relatively well for prokaryotic genomes, where the non-coding component is small and there are no introns.
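One common way to turn codon bias into a coding statistic is a log-odds score: compare how often each codon appears in known coding sequences with how often it appears in background sequence. The sketch below assumes that approach; the function names, the pseudocount, and the toy training data are illustrative assumptions, and input is assumed to contain only A, C, G, and T.

```python
import math
from collections import Counter

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame sequences.

    The pseudocount (an assumption of this sketch) keeps unseen codons
    from getting zero probability.
    """
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(window, coding_freq, background_freq):
    """Sum of log(P(codon | coding) / P(codon | background)) over a window.

    Positive scores suggest the window looks more like coding sequence.
    """
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        score += math.log(coding_freq[codon] / background_freq[codon])
    return score
```

In practice the coding model would be trained on a set of trusted genes from the organism and the score evaluated in all six frames of a sliding window.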

Predicting Prokaryotic Protein-Coding Genes

The principal difficulties are:

• detection of the initiation site (AUG)
• alternative start codons
• gene overlap
• undetected small proteins

In spite of these difficulties, prokaryotic gene prediction can reach 99% accuracy.

Gene prediction is easier and more accurate in prokaryotes than in eukaryotes because prokaryotic gene structure is much simpler.

Protein Coding Methods

Finding Genes in Prokaryotic DNA

Ab initio methods

• Intrinsic gene prediction method.

• Inspect the input sequence and search for traces of gene presence.

• Extract information on gene locations using statistical patterns inside and outside gene regions, as well as patterns typical of the gene boundaries.

• Ab initio algorithms implement intelligent methods to represent these patterns as a model of the gene structure in the organism.

Markov model based tools

• Several highly accurate prokaryotic gene-finding methods are based on Markov model algorithms.

What are Hidden Markov Models?

• Hidden Markov models (HMMs) are discrete Markov processes where every state generates an observation at each time step.

• A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. [wiki]

Markov Model (Discrete Markov Process)

• A discrete Markov process is a sequence of random variables q1,…,qt that take values in a discrete set S={s1,…,sN} where the Markov property holds.

• Markov property: $P(q_t \mid q_{t-1}, \dots, q_1) = P(q_t \mid q_{t-1})$

• Parameters
  ▫ Initial state probabilities: πi
  ▫ State transition probabilities: aij

From Markov Model to HMM

• HMMs are discrete Markov processes in which each state also emits an observation according to some probability distribution, so the model must be augmented with emission probabilities.

• Parameters
  ▫ Initial state probabilities: πi
  ▫ State transition probabilities: aij
  ▫ Emission probabilities: ei(k)

Markov Model: each state emits an observation with 100% probability.

Hidden Markov Model: each state emits an observation according to a certain probability distribution.

HMM Example – Agnostic Drink Stand (1/2)

HMM Example – Agnostic Drink Stand (2/2)

Suppose we observed the following sequence:

Vodka, Vodka, Coke, Vodka, Vodka, Vodka, Water, Water, Water, Water, Coke, Water, Coke, Coke, Water, Coke, Coke, Water, Coke, Coke, Coke, Vodka, Coke, Water, Vodka, Coke

How might we infer the hidden states?

A possible labeling:

[Figure: the same sequence, with each observation marked by its inferred hidden state]

HMM Example in Sequencing Analysis

When the HMM and the Observation Sequence Are Known

• Given HMM parameters θ and an observation sequence X1:T, which state sequence Q1:T best explains the observations? That is, find the Q that maximizes P(Q|X,θ).

• Viterbi algorithm
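The Viterbi algorithm answers this question by dynamic programming over the states at each position. The following is a minimal log-space sketch; the dictionary-based parameter layout and the function name are assumptions made for the example, and it expects a non-empty observation sequence.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation sequence.

    start_p[s], trans_p[r][s] and emit_p[s][x] are plain probability
    dictionaries (hypothetical layout). Log space avoids underflow on
    long sequences.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialisation: best log-probability of ending in each state at position 0.
    scores = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []

    # Recursion: extend the best path into each state, one observation at a time.
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[-1][r] + log(trans_p[r][s]))
            row[s] = scores[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][x])
            ptr[s] = prev
        scores.append(row)
        backpointers.append(ptr)

    # Termination and traceback.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]
```

For a toy two-state model with an AT-rich "N" (non-coding) state and a GC-rich "C" (coding) state, the returned path marks where the sequence most plausibly switches from one state to the other.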

How Do We Get HMM Parameters?

• Training an HMM from labeled sequences
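When every observation comes with its state label, maximum-likelihood training reduces to counting and normalizing. The sketch below assumes that setting; the function name and data layout are hypothetical, and no pseudocounts are applied, so events never seen in training get no entry at all.

```python
from collections import Counter, defaultdict


def train_hmm(labeled):
    """Estimate HMM parameters from (observations, states) pairs by counting.

    Each pair holds two equal-length sequences. Returns
    (start_p, trans_p, emit_p) as nested dictionaries of relative
    frequencies.
    """
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)

    for obs, states in labeled:
        start[states[0]] += 1
        for x, s in zip(obs, states):
            emit[s][x] += 1                      # emission counts
        for a, b in zip(states, states[1:]):
            trans[a][b] += 1                     # transition counts

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (
        normalize(start),
        {s: normalize(c) for s, c in trans.items()},
        {s: normalize(c) for s, c in emit.items()},
    )
```

For gene finding, the labels would typically come from an annotated reference genome, with states such as coding and intergenic.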

Designing an HMM for Gene Prediction

• The number of states in the model
  ▫ Start codon
  ▫ Stop codon
  ▫ Intragenic codon
  ▫ Intergenic region

• The number of distinct observation symbols per state

• State transition probability distribution

• Observation symbol probability distribution

• Initial state distribution

• Nth-order Markov model

Ab Initio Gene Prediction Software

• GeneMark.hmm

• GeneMarkS

• EasyGene

Limitations of Current Methods

• HMM has local averaging effect

• Training process is slow and is case-sensitive

• Algorithms are trained with sequences from known genes (overfitting problem)

• MLE + Viterbi is not optimal (several tools have used the scaling factor to tweak the performance)

• Overlapping genes

Comparison of the Gene Finders

Tool        Developed for        Output file formats
Prodigal    Bacteria & archaea   GBK, GFF, SCO
GeneMarkS   Prokaryotes          Algorithm-specific
RAST        Bacteria & archaea   GTF, GFF3, GenBank, EMBL
Glimmer3    Prokaryotes          Algorithm-specific
EasyGene    Prokaryotes          GFF3
AMIGene     Prokaryotes          EMBL, GenBank, GFF

Homology based methods

Tools:
• BLAST
• SGP2
• BLAT

Advantages:
• Simplest.
• Characterized by high accuracy.
• Helps find the gene loci and also annotates the region.

Disadvantages:
• Requires huge amounts of extrinsic data and finds only half of the genes. Many genes still have no significant homology to known genes.

Steps:
1. Similarity search against the database
2. Multiple sequence alignment

Searching against the Database

• Steps
  o Use a heuristic (approximate) algorithm to discard most irrelevant sequences. (Based on Smith-Waterman algorithm)
  o Perform the exact algorithm on the small group of remaining sequences.

• Representative algorithms
  o FASTA (Lipman & Pearson 1985): the first fast sequence-searching algorithm for comparing a query sequence against a database
  o BLAST: Basic Local Alignment Search Tool (Altschul et al. 1990)
  o Gapped BLAST (Altschul et al. 1997)

FASTA and BLAST

• First, identify very short (almost) exact matches.

• Next, the best short hits from the first step are extended to longer regions of similarity.

• Finally, the best hits are optimized using the Smith-Waterman algorithm.

FASTA

• Find runs of identities.

• Score and discard low-scoring runs.

• Eliminate segments unlikely to be part of the alignment; apply banded Smith-Waterman to calculate the opt score.

BLAST

• As sensitive as FASTA but much faster

• Confine attention to segment pairs that contain a word pair of length w with a score of at least T

• Phase 1: Compile a list of word pairs above threshold

• Phase 2: Scan the database for matching word hits

• Phase 3: Extend the hits

BLAST Phase 1: List of Word Pairs

• Compile a list of word pairs (w=3) above threshold T = 15

• Example: a query sequence …FSGTWYA…

A list of words (w=3) is:

FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
NTW

Neighborhood word hits for GTW (T=15):

Word   Pair scores   Total
GTW    6, 5, 11      22      > threshold
GSW    6, 1, 11      18      > threshold
ATW    0, 5, 11      16      > threshold
NTW    0, 5, 11      16      > threshold
GTY    6, 5, 2       13      < threshold
GTM    6, 5, -1      10      < threshold
DAW    -1, 0, 11     10      < threshold
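The neighborhood table for the query word GTW can be reproduced with a short script. The pair scores below are copied from the slide's worked example and cover only the residue pairs needed here, so this is not a complete substitution matrix; the function names are hypothetical.

```python
# Residue-pair scores taken from the worked example; only the pairs that
# appear there are listed (an assumption of this sketch).
SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("G", "A"): 0, ("G", "N"): 0, ("G", "D"): -1,
    ("T", "S"): 1, ("T", "A"): 0,
    ("W", "Y"): 2, ("W", "M"): -1,
}


def pair_score(a, b):
    """Look up a symmetric residue-pair score."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def word_score(query_word, candidate):
    """Score two equal-length words position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words scoring at least `threshold` against the query word."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


hits = neighborhood("GTW", ["GTW", "GSW", "ATW", "NTW", "GTY", "GTM", "DAW"], 15)
```

A real implementation would enumerate all 20^w candidate words against a full matrix such as BLOSUM62 instead of a hand-picked candidate list.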

BLAST Phase 3: Extend the Hit

• When you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in either direction.

• Keep track of the score (use a scoring matrix). Stop when the score drops below some cutoff value X.

• High-scoring Segment Pairs (HSPs)

KENFDKARFSGTWYAMAKKDPEG   Query sequence
MKGLDIQKVAGTWYSLAMAASD.   Hit in the database

Hit! Extend in both directions.
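Hit extension can be sketched as an ungapped walk away from the seed in each direction. Two assumptions are made here: a flat match/mismatch score (+5/-4, hypothetical values) stands in for a real scoring matrix, and the stopping rule is read as "stop once the running score falls more than X below the best score seen so far".

```python
def extend_one_side(query, subject, qi, si, step, x_drop, match, mismatch):
    """Walk in one direction (step = +1 or -1); return (best gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break                      # fell too far below the best score
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, word_len,
               x_drop=10, match=5, mismatch=-4):
    """Extend a seed word in both directions and return the resulting HSP.

    Returns (score, query segment, subject segment).
    """
    seed = sum(
        match if query[q_start + k] == subject[s_start + k] else mismatch
        for k in range(word_len)
    )
    right, right_len = extend_one_side(
        query, subject, q_start + word_len, s_start + word_len, 1,
        x_drop, match, mismatch)
    left, left_len = extend_one_side(
        query, subject, q_start - 1, s_start - 1, -1,
        x_drop, match, mismatch)
    q_seg = query[q_start - left_len:q_start + word_len + right_len]
    s_seg = subject[s_start - left_len:s_start + word_len + right_len]
    return seed + left + right, q_seg, s_seg


hsp = extend_hit("KENFDKARFSGTWYAMAKKDPEG", "MKGLDIQKVAGTWYSLAMAASD", 10, 10, 3)
```

With a substitution matrix that rewards conservative replacements, the extension on the example pair would likely run further than it does under this flat scoring.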

Gapped BLAST

• Try to connect HSPs by aligning the sequences in between them.

• The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected together into one alignment.

THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

How to Interpret BLAST Results

• E-value
  ▫ Expected number of alignments with score at least S
  ▫ Number of database hits you expect to find by chance

$E = K m n e^{-\lambda S}$

• Increases linearly with the length of the query sequence and the database
• Decreases exponentially with the score of the alignment

m = length of query; n = length of database; S = score

K, λ: statistical parameters dependent upon the scoring system and background residue frequencies

[Figure: number of alignments plotted against score, annotated with the size of the database, your score, and the expected number of random hits]
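The E-value relationship, E = K·m·n·e^(−λS), is easy to explore numerically. In the sketch below the default values of K and λ are placeholders chosen only for illustration; real values depend on the scoring system and the background residue frequencies, and the function name is hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, K=0.041, lam=0.267):
    """Expected number of chance alignments scoring at least `score`.

    K and lam are illustrative placeholder values (an assumption of this
    sketch), not parameters for any particular search.
    """
    return K * query_len * db_len * math.exp(-lam * score)


# Doubling the query length doubles E; raising the score shrinks it exponentially.
for s in (30, 40, 50, 60):
    print(s, expected_hits(s, query_len=100, db_len=10**6))
```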

From E-value to P-value

• P-value: probability of obtaining, at random, a score greater than a given score S

$P(S' > S) = 1 - e^{-E}$

which is approximately equal to the E-value when E is small.

• Very small E-values are very similar to P-values. However, E-values of about 1 to 10 are far easier to interpret than the corresponding P-values.

E-value   P-value
10        0.99995460
5         0.99326205
2         0.86466472
1         0.63212056
0.1       0.09516258 (about 0.1)
0.05      0.04877058 (about 0.05)
0.001     0.00099950 (about 0.001)
0.0001    0.0001000
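The conversion P = 1 − e^(−E) can be checked directly. This short sketch recomputes the table; the function name is hypothetical, and `math.expm1` is used so that very small E-values keep their precision.

```python
import math


def p_value(e_value):
    """Convert a BLAST E-value to a P-value using P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8g}{p_value(e):.8f}")
```

For E below about 0.01 the two numbers agree to within about half a percent, which is why small E-values can be read as probabilities.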

BLAST and BLAST-like programs

• Traditional BLAST (formerly blastall): nucleotide, protein, translations
  ▫ blastn: nucleotide query vs. nucleotide database
  ▫ blastp: protein query vs. protein database
  ▫ blastx: nucleotide query vs. protein database
  ▫ tblastn: protein query vs. translated nucleotide database
  ▫ tblastx: translated query vs. translated database

• Megablast: nucleotide only
  ▫ Contiguous megablast: nearly identical sequences
  ▫ Discontiguous megablast: cross-species comparison

• Position Specific BLAST Programs: protein only
  ▫ Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
  ▫ Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs

Multiple Sequence Alignment

• Smith-Waterman algorithm
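Smith-Waterman itself aligns two sequences locally; multiple alignment methods build on such pairwise comparisons. A minimal sketch follows, assuming a flat match/mismatch score and a linear gap penalty (all three values are hypothetical choices, not parameters from the slides).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell is the best local alignment score ending at (i, j),
    # floored at zero so an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until the score reaches zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

The quadratic table makes this exact method too slow for whole-database searches, which is the motivation for the heuristic filtering used by FASTA and BLAST.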

Carrillo-Lipman Algorithm

Progressive Alignment Methods

• Feng-Doolittle progressive multiple alignment [1987]
  ▫ Pairwise alignment of all pairs of the N sequences
  ▫ Construct a guide tree from the distance matrix
  ▫ Align the sequences based on the tree

Non-protein-coding gene prediction

A non-coding RNA (ncRNA) is a functional molecule that is not translated into a protein. The term small RNA (sRNA) is often used for bacterial ncRNA.

These are transcripts whose function lies in the RNA sequence itself rather than in carrying information for protein synthesis.

For example, small interfering RNAs (siRNAs) help protect the genome: they recognize invading foreign RNAs/DNAs based on sequence specificity and help degrade the foreign RNAs.

Non-protein Coding Gene Tools

• tRNA
  – tRNAscan-SE

• rRNA
  – RNAmmer

• sRNA
  – sRNATarget
  – sRNAPredict

Gene prediction improvement pipeline (GenePRIMP)

• GenePRIMP is a computational, evidence-based postprocessing pipeline that identifies erroneously predicted genes.

• The gene anomalies reported include:
  ▫ Short genes
  ▫ Long genes
  ▫ Broken genes
  ▫ Interrupted genes
  ▫ Unique genes
  ▫ Dubious genes

[Figure] (a) GenePRIMP data flow. (b) BLAST alignments of short, long, broken and interrupted genes. Unique genes have no hits to known proteins (nr database). Dubious genes are unique genes that are shorter than 30 amino acids.
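The two anomaly classes that depend only on hit status and length can be expressed as a simple rule. This is an illustration of the caption's definitions, not GenePRIMP's actual code; the function name and arguments are hypothetical.

```python
def classify_unique(length_aa, has_nr_hit, dubious_cutoff_aa=30):
    """Label a predicted gene using the caption's definitions.

    Returns None when the protein has a hit to a known protein,
    "dubious" for a unique gene shorter than the cutoff, and
    "unique" for any other gene without hits.
    """
    if has_nr_hit:
        return None
    return "dubious" if length_aa < dubious_cutoff_aa else "unique"
```

The other classes (short, long, broken, interrupted) depend on how a gene call aligns against its BLAST hits and are not covered by this sketch.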


GenePRIMP analysis of gene calls

Comparison of five gene-calling applications

Strategy

Gene Prediction

Gene Prediction Improvement
