Gene Prediction in silico Nita Parekh BIRC, IIIT, Hyderabad


Goal

The ultimate goal of molecular cell biology is to understand the physiology of living cells in terms of the information that is encoded in the genome of the cell.

How can computational approaches help in achieving this goal?

A gene codes for a protein

DNA:      CCTGAGCCAACTATTGATGAA
             | transcription
mRNA:     CCUGAGCCAACUAUUGAUGAA
             | translation
Protein:  PEPTIDE
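The DNA to mRNA to protein flow shown above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `transcribe` and `translate` are hypothetical, and the codon table is the standard genetic code written in the conventional TCAG ordering.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Return the mRNA with the same sequence as the coding DNA strand (T -> U)."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate an mRNA codon by codon in frame 1; '*' marks a stop codon."""
    dna = mrna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The example sequence is the one on the slide; each of its seven codons maps to one letter of PEPTIDE in the one-letter amino acid code.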

What is Computational Gene Finding?

Given an uncharacterized DNA sequence, find:

– Which region codes for a protein?
– Which DNA strand is used to encode the gene?
– Which reading frame is used in that strand?
– Where does the gene start and end?
– Where are the exon-intron boundaries in eukaryotes?
– (optionally) Where are the regulatory sequences for that gene?

Search space - 2-5% of Genomic DNA (~100 – 1000 Mbp)

Need for Computational Gene Prediction

It is the first step towards getting at the function of a protein.

It also helps accelerate the annotation of genomes.

Deoxyribonucleic acid (DNA)

– is a blueprint of the cell

Composed of four basic units, called nucleotides

Each nucleotide contains

- a sugar, a phosphate and one of the 4 bases:

Adenine (A), Thymine (T), Guanine (G), Cytosine (C)

For all computational purposes, a DNA sequence is considered to be a string on a 4-letter alphabet: A, T, G, C

ACGCTGAATAGC

The aim is to find grammar & syntax rules of DNA language based on the 4-letter alphabet,

- similar to English grammar, which is used to form meaningful sentences

Biological Sequences

Order of occurrence of bases: not completely random

- Different regions of the genome exhibit different patterns of the four bases, A, T, G, C, e.g., protein coding regions, regulatory regions, intron/exon boundaries, repeat regions, etc.

Aim: identifying these various patterns to infer their functional roles

Assumption in biological sequence analysis:

- strings carrying information will be different from random strings

If a hidden pattern can be identified in a string, it must be carrying some functional information

Example

This is a lecture on bioinformatics

asjd lkjfl jdjd sjftye nvcrow nzcdjhspu

Frequency of letters

A   7.3%     N   7.8%
B   0.9%     O   7.4%
C   3.0%     P   2.7%
D   4.4%     Q   0.3%
E  13.0%     R   7.7%
F   2.8%     S   6.3%
G   1.6%     T   9.3%
H   3.5%     U   2.7%
I   7.4%     V   1.3%
J   0.2%     W   1.6%
K   0.3%     X   0.5%
L   3.5%     Y   1.9%
M   2.5%     Z   0.1%

Other statistics

Frequencies of the most common first letter of a word, last letter of a word, doublets, triplets etc.

20 most used words in

– Written English: the of to in and a for was is that on at he with by be it an as his

– Spoken English: the and I to of a you that in it is yes was this but on well he have for
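The letter statistics above are simple counts, and the same counting idea carries over to bases, doublets and triplets in DNA. The sketch below is only an illustration (the function name `letter_frequencies` is hypothetical); it compares the two example strings from the slide.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: relative frequency}, ignoring case and non-letters."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.items()}

meaningful = "This is a lecture on bioinformatics"
scrambled = "asjd lkjfl jdjd sjftye nvcrow nzcdjhspu"

for label, text in [("meaningful", meaningful), ("scrambled", scrambled)]:
    freqs = letter_frequencies(text)
    # Three most frequent letters, ties broken alphabetically.
    top = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
    print(label, [(letter, round(f, 3)) for letter, f in top])
```

Two strings this short cannot reproduce the English-wide table, so the comparison is qualitative only; reliable frequencies need a large sample of text (or sequence).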

Parallels in DNA language

ATGGTGGTCATGGCGCCCCGAACCCTCT
TCCTGCTGCTCTCGGGGGCCCTGACCCT
GACCGAGACCTGGGCGGGTGAGTGCGGG
GTCAGGAGGGAAACAGCCCCTGCGCGG
AGGAGGGAGGGGCCGGCCCGGCGGG

GTCTCAACCCCTCCTCGCCCCCAGGCTC
CCACTCCATGAGGTATTTCAGCGCCGCC
GTGTCCCGGCCCGGCCGCGGGGAGCCC
CGCTTCATCGCCATGGGCTACGTGGACG
ACACGCAGTTCGTGCGGTTC

Parallels in DNA language

ATG GTG GTC ATG GCG CCC CGA ACC CTC TTC CTG CTG CTC TCG GGG GCC CTG ACC CTG ACC GAG ACC TGG GCG GGT GAG TGC GGG GTC AGG AGG GAA ACA GCC CCT GCG CGG AGG AGG GAG GGG CCG GCC CGG CGG…

GTC TCA ACC CCT CCT CGC CCC CAG GCT CCC ACT CCA TGA GGT ATT TCA GCG CCG CCG TGT CCC GGC CCG GCC GCG GGG AGC CCC GCT TCA TCG CCA TGG GCT ACG TGG ACG ACA CGC AGT TCG TGC GGT TC…

This task needs to be automated because of the large genome sizes:

Smallest genome: Mycoplasma genitalium, 0.5 x 10^6 bp

Human genome: 3 x 10^9 bp (not the largest)

Finding genes in Prokaryotes

• each gene is one continuous stretch of bases

• most of the DNA sequence codes for protein (70% of the H. influenzae bacterium genome is coding)

Finding genes in Prokaryotes

Gene prediction in prokaryotes is comparatively simple and involves:

• identifying long reading frames

• using codon frequencies

Finding genes in Eukaryotes

• the coding region is usually discontinuous
• composed of alternating stretches of exons and introns
• Only 2-3% of the human genome (~3 x 10^9 bp) codes for proteins

Finding genes in Eukaryotes

The gene finding problem is complicated by:

• interweaving exons and introns – stop codons may exist in intronic regions, making it difficult to identify the correct ORF

• a gene region may encode many proteins – due to alternative splicing

• exon length need not be a multiple of three – resulting in a frameshift between exons

• a gene may be intron-less (single-exon genes)

• relatively low gene density - only 2-5% of the human genome codes for proteins

Methods for Identifying Coding Regions

Finding Open Reading Frames (ORFs)

Homology Search
• DNA vs. Protein Searches

Content-based methods:
• Coding statistics, viz., codon usage bias, periodicity in base occurrence, etc.

Signal-based methods:
• CpG islands
• Start/Stop signals, promoters, poly-A sites, intron/exon boundaries, etc.

Integration of these methods

Finding Open Reading Frames (ORF)

Once a gene has been sequenced it is important to determine the correct open reading frame (ORF).

Every region of DNA has six possible reading frames, three in each direction

The reading frame that is used determines which amino acids will be encoded by a gene.

Typically only one reading frame is used in translating a gene, and this is often the longest open reading frame

Finding Open Reading Frames (ORF)

Detecting a relatively long sequence devoid of stop codons indicates a coding region

An open reading frame starts with a start codon (atg) in most species and ends with a stop codon (taa, tag or tga)

Once the open reading frame is known the DNA sequence can be translated into its corresponding amino acid sequence using the genetic code

The codons are triplets of bases

The Genetic Code

Finding Open Reading Frames (ORF)

Consider the following sequence of DNA:

5´ TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3´   “Forward Strand”

Its complementary strand is:

3´ AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5´   “Reverse Strand”

The DNA sequence can be read in six reading frames - three in the forward and three in the reverse direction, depending on the start position
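The six reading frames can be enumerated programmatically. The following is a minimal sketch, assuming an uppercase A/C/G/T alphabet; the names `reverse_complement` and `six_frames` are illustrative and not taken from any tool mentioned in these slides. The reverse strand is handled by taking the reverse complement, so that every frame is read 5´ to 3´.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the complementary strand read in its own 5'->3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return {(strand, frame): [codons]} for the six reading frames.

    Frames are numbered 1-3 by start position; a trailing partial codon
    is dropped.
    """
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            usable = len(seq) - (len(seq) - offset) % 3
            frames[(strand, offset + 1)] = [
                seq[i:i + 3] for i in range(offset, usable, 3)
            ]
    return frames

forward = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT"
for key, codons in six_frames(forward).items():
    print(key, " ".join(codons))
```

Reading the reverse strand via its reverse complement avoids the right-to-left bookkeeping that is needed when the strand is written 3´ to 5´.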


5´ TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3´

Three reading frames in the forward direction:

1. TCA ATG TAA CGC GCT ACC CGG AGC TCT GGG CCC AAA TTT CAT CCA CT
   (start codon ATG, immediately followed by stop codon TAA)

2. CAA TGT AAC GCG CTA CCC GGA GCT CTG GGC CCA AAT TTC ATC CAC T

3. AAT GTA ACG CGC TAC CCG GAG CTC TGG GCC CAA ATT TCA TCC ACT

3´ AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5´

Three reading frames in the reverse direction (written 3´ to 5´, so each codon is read from right to left):

1. AG TTA CAT TGC GCG ATG GGC CTC GAG ACC CGG GTT TAA AGT AGG TGA
   (AGT reads 5´ TGA 3´: stop codon)

2. A GTT ACA TTG CGC GAT GGG CCT CGA GAC CCG GGT TTA AAG TAG GTG
   (GAT reads 5´ TAG 3´: stop codon)

3. AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT AAA GTA GGT
   (GTA reads 5´ ATG 3´: start codon; AGT reads 5´ TGA 3´: stop codon)


In this case the longest open reading frame (ORF) is the 3rd reading frame of the complementary strand:

AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT AAA GTA

When read 5´ to 3´, the longest ORF is:

ATG AAA TTT GGG CCC AGA GCT CCG GGT AGC GCG TTA CAT TGA
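The search for the longest ORF in this worked example can be sketched as follows. This is a simplified illustration, not the algorithm of ORF Finder or getorf: an ORF is taken to be the stretch from the first ATG in a frame to the next in-frame stop codon, and the names `find_orfs`, `longest_orf` and the `min_codons` parameter are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna):
    """Return the complementary strand read in its own 5'->3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_codons=2):
    """Yield (strand, frame, orf) for every ATG...stop ORF in the six frames."""
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the first ATG of the currently open ORF
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOP_CODONS and start is not None:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame + 1, seq[start:i + 3]
                    start = None

def longest_orf(dna):
    """Return the (strand, frame, orf) tuple with the longest ORF, or None."""
    return max(find_orfs(dna), key=lambda hit: len(hit[2]), default=None)

forward = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT"
print(longest_orf(forward))
# ('-', 3, 'ATGAAATTTGGGCCCAGAGCTCCGGGTAGCGCGTTACATTGA')
```

For the slide's sequence this reports the 14-codon ORF in the third frame of the reverse strand; the only other ORF is the two-codon ATG TAA in forward frame 1. ORFs that run off the end of the sequence without a stop codon are ignored in this sketch.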

Finding Long ORFs

The first step in distinguishing between a coding and a non-coding region is to look at the frequency of stop codons

Sequence similarity search (database search)

When no sequence similarity is found, an ORF can still be considered gene-like according to some statistical features:

- the three-base periodicity

- higher G+C content

- signal sequence patterns
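Why the frequency of stop codons is informative can be seen from a back-of-the-envelope calculation. It rests on a simplifying assumption that is not in the original slides: bases in non-coding DNA are treated as independent and equally likely, so that each of the 64 codons has probability 1/64. Three of the 64 codons are stop codons, hence

$$P(\text{stop}) = \frac{3}{64}, \qquad P(\text{no stop in } n \text{ consecutive codons}) = \left(\frac{61}{64}\right)^{n}$$

and the expected spacing between stop codons in a given frame is $64/3 \approx 21$ codons. For $n = 100$ codons,

$$\left(\frac{61}{64}\right)^{100} \approx 0.008$$

so under this assumption a stop-free stretch of 100 codons arises by chance less than 1% of the time, which is why long ORFs are taken as candidate coding regions. Real genomes deviate from the uniform model (for example, G+C-rich sequence contains fewer stop codons), so the numbers are indicative only.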

Finding Long ORFs

Once a long ORF, or all ORFs above a certain threshold, are identified,

- these ORF sequences are called putative coding sequences

- translate each ORF using the Universal Genetic code to obtain amino acid sequence

- search against the protein database for homologs

Finding genes in Prokaryotes

Drawbacks:

The addition or deletion of one or more bases will cause all the codons scanned to be different

- sensitive to frame shift errors

Fails to identify very small coding regions

Fails to identify the occurrence of overlapping long ORFs on opposite DNA strands (genes and ‘shadow genes’)

Web-based tools

ORF Finder (NCBI)

http://www.ncbi.nih.gov/gorf/gorf.html

EMBOSS

getorf - Finds and extracts open reading frames

plotorf - Plot potential open reading frames

sixpack - Display a DNA sequence with 6-frame translation and ORFs

http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/getorf.html

Homology Search

This involves sequence-based database searching

• DNA database searching

• Protein database searching

Homology Search

Why search databases?

When one obtains a new DNA sequence, one needs to know:

- whether it already exists in the databanks

- whether it has any homologous sequences (i.e., sequences derived from a common ancestry) in the databases

Given a putative coding ORF, search for homologous proteins – proteins similar in their folding or structure or function.


DNA vs. Protein Searches

Use protein for database similarity searches whenever possible

Three main search tools used for database search:

• BLAST - algorithm by Karlin & Altschul

http://www.ncbi.nlm.nih.gov/BLAST/

• FastA - algorithm by Pearson & Lipman

http://www.ebi.ac.uk/fasta33/

• Smith-Waterman (SW) algorithm

- dynamic programming algorithm
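The dynamic programming idea behind Smith-Waterman local alignment can be sketched compactly. The version below only returns the best local alignment score, uses a linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions for illustration; production searches use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)  # DP row for prefix a[:i-1]
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0, so an alignment
            # may start anywhere in either sequence.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTCC"))  # 8: the shared ACGT
```

Only two rows of the matrix are kept because the score alone is requested; recovering the aligned segments would require storing the full matrix and tracing back from the best cell.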

Limitations of Homology Search

Only a limited number of genes are available in various databases.

Currently only 50% of the sequences are found to be similar to previously known sequences.

It should always be kept in mind that similarity-based methods are only as reliable as the databases that are searched, and apparent homology can be misleading at times

Content-based Methods

At the core of all gene identification programs

– there exist one or more coding measures

A coding statistic - a function that computes the likelihood that the sequence is coding for a protein.

A good knowledge of core coding statistics is important to understand how gene identification programs work.

Classification of Coding Measures

Coding statistics measure
• base compositional bias
• periodicity in base occurrence
• codon usage bias

Main distinction is between
• measures dependent on a model of coding DNA
• measures independent of such a model.

Model dependent coding statistics capture the specific features of coding DNA:

Unequal usage of codons in the coding regions - a universal feature of the genomes

Dependencies between nucleotide positions

Base compositional bias between codon positions

- requires a representative sample of coding DNA from the species under consideration to estimate the model's parameters
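A model-dependent statistic based on unequal codon usage could be sketched as below. This is one simple formulation chosen for illustration, not the measure used by any particular program: codon frequencies are estimated from a training set of in-frame coding sequences, and a candidate is scored by its log-likelihood ratio against uniform codon usage (1/64). The function names and the `pseudocount` parameter are assumptions.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_usage(coding_sequences, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # The pseudocount keeps unseen codons from getting probability zero.
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def codon_usage_score(seq, usage):
    """Log-likelihood ratio of seq (read in frame 1) vs. uniform codon usage."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return sum(math.log(usage[c] * 64) for c in codons if c in usage)

# Toy training set; a real model needs many genes from the target species.
usage = codon_usage(["ATGGCTGCTGCTTAA"])
print(codon_usage_score("GCTGCTGCT", usage) > codon_usage_score("GCAGCAGCA", usage))
```

A positive score means the sequence uses codons that are over-represented in the training genes; scoring all three frames of a window and comparing them is a common way to use such a statistic.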

Markov Models

Dependencies between nucleotide positions in coding regions

- can be explicitly described by means of Markov Models

In Markov Models

- the probability of a nucleotide at a particular codon position depends on the nucleotide(s) preceding it.

Probability of a DNA sequence $x = x_1 x_2 \ldots x_L$ of length L:

$$P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$

where the transition probabilities are

$$a_{st} = P(x_i = t \mid x_{i-1} = s)$$
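The product formula for P(x) is usually evaluated in log space to avoid numerical underflow on long sequences. The sketch below implements the homogeneous first-order chain exactly as written in the formula; the parameter values are made up for illustration, and a coding model would use separate transition tables for each codon position.

```python
import math

def markov_log_probability(seq, initial, transition):
    """log P(x) = log P(x_1) + sum over i = 2..L of log a_{x_{i-1} x_i}."""
    log_p = math.log(initial[seq[0]])
    for previous, current in zip(seq, seq[1:]):
        log_p += math.log(transition[previous][current])
    return log_p

# Hypothetical parameters: uniform start, and a chain that favours
# staying on the same base (each row sums to 1).
initial = {base: 0.25 for base in "ACGT"}
transition = {
    s: {t: (0.4 if s == t else 0.2) for t in "ACGT"} for s in "ACGT"
}

print(markov_log_probability("AACG", initial, transition))
```

Comparing the log probability of a sequence under a coding model and under a non-coding model (a log-odds score) is the usual way such chains are used to classify a region.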

Markov Models

Table III: Probabilities of the four nucleotides at the different codon positions conditioned on the nucleotide in the preceding codon position

Model independent coding statistics capture only the “universal” features of coding DNA:

Position Asymmetry – how asymmetric is the distribution of nucleotides at the 3 triplet positions

Periodic Correlation - correlations between nucleotide positions

- do not require a sample of coding DNA

Signal-based Methods

[Figure: splice-site signals GT and AG]

Signal – a string of DNA recognized by the cellular machinery

Signals for gene identification

There are many signals associated with genes, each of which suggests but does not prove the existence of a gene

Most of these signals can be modeled using weight matrices

Signals for gene identification

CpG Islands – identify the 2% of the genome that codes for proteins

Start & Stop Codons – signify the start & end of a coding region

Transcription Start Site – to identify the start of coding region

Donor & Acceptor Sites – signify the start & end of intronic regions

Cap Site – found in the 5’ UTR

Signals for gene identification

Promoters – to initiate transcription (found in 5’ UTR region)

Enhancers – regulate gene expression (found in 5’ or 3’ UTR regions, intronic regions, or up to a few Kb away from the gene)

Transcription Factor Binding Sites – short DNA sequences where proteins bind to initiate the transcription/translation process

Poly-A Site – identifies the end of coding region (found in 3’ UTR region)
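Of the signals listed, CpG islands are among the simplest to quantify. The slides do not give a detection criterion, so the sketch below uses a widely cited one as an assumption: within a window (typically at least 200 bp), G+C content above 50% and an observed/expected CpG ratio above 0.6. The function names are illustrative.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def cpg_observed_expected(seq):
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def looks_like_cpg_island(window, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a single window of sequence."""
    return gc_content(window) > min_gc and cpg_observed_expected(window) > min_ratio

print(looks_like_cpg_island("CGCGCGCG"), looks_like_cpg_island("AATTAATT"))
```

In practice the test is applied to sliding windows along the genome and overlapping positive windows are merged; the tiny strings here only exercise the arithmetic.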

Promoter Detection

Not all ORFs are genes

True coding regions have specific sequences upstream of the start site known as promoters, where the RNA polymerase binds to initiate transcription, e.g., in E. coli:

[Figure: consensus patterns]

No two patterns are identical

Not all genes have these patterns

Positional Weight Matrix for TATA box

      1    2    3    4    5    6
A     2   95   26   59   51    1
C     9    2   14   13   20    3
G    10    1   16   15   12    0
T    79    2   44   13   17   96
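A weight matrix like the one above can be used to scan a sequence for candidate sites. The sketch below treats each column of the table as percentages (every column sums to 100) and scores a 6-mer by its log-odds against a uniform background; the pseudocount, the background of 0.25 and the function names are assumptions added for illustration.

```python
import math

# Percent occurrence of each base at the six positions (values from the table).
PWM_PERCENT = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 12, 0],
    "T": [79, 2, 44, 13, 17, 96],
}
WIDTH = 6

def log_odds(base, position, pseudocount=1.0, background=0.25):
    """log2 of (smoothed column frequency / background frequency)."""
    p = (PWM_PERCENT[base][position] + pseudocount) / (100 + 4 * pseudocount)
    return math.log2(p / background)

def score_site(site):
    """Sum of per-position log-odds for a 6-base site."""
    return sum(log_odds(base, i) for i, base in enumerate(site.upper()))

def best_site(seq):
    """Return (score, start) of the best-scoring window, or None if seq is too short."""
    windows = [(score_site(seq[i:i + WIDTH]), i) for i in range(len(seq) - WIDTH + 1)]
    return max(windows, default=None)

consensus = "".join(max("ACGT", key=lambda b: PWM_PERCENT[b][i]) for i in range(WIDTH))
print(consensus)                        # TATAAT
print(best_site("GGCGCTATAATGCGC")[1])  # 5
```

The pseudocount prevents the zero entry (G at position 6) from producing a score of minus infinity. In a real scan a score threshold would be chosen, since every sequence has some best-scoring window.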

Complications in Gene PredictionComplications in Gene Prediction

The problem of gene identification is further complicated in case of eukaryotes by the vast variation that is found in the structure of genes.

On average, a vertebrate gene is about 30 Kb long. Of this, the coding region is only about 1 Kb.

The coding region typically consists of 6 exons, each about 150bp long.

These are average statistics


Huge variations from the average are observed

The biggest human gene, dystrophin, is 2.4 Mb long.

The human blood coagulation factor VIII gene is ~186 Kb. It has 26 exons with sizes varying from 69 bp to 3106 bp, and its 25 introns range in size from 207 to 32,400 bp.

An average 5’ UTR is 750 bp long, but it can be longer and span several exons (e.g., in the MAGE family).

On average, the 3’ UTR is about 450 bp long, but in the case of the gene for Kallmann syndrome, for example, the length exceeds 4 Kb

Some facts about human genes

- Comprise about 3% of the genome
- Average gene length: ~8,000 bp
- Average of 5-6 exons/gene
- Average exon length: ~200 bp
- Average intron length: ~2,000 bp
- ~8% genes have a single exon
- Some exons can be as small as 1 or 3 bp


In higher eukaryotes gene finding becomes far more difficult because

• It is now necessary to combine multiple ORFs to obtain a spliced coding region.

• Alternative splicing is not uncommon.

• Exons can be very short, and introns can be very long.

Given the nature of genomic sequence in humans, where large introns are known to exist, there is definitely a need for highly specific gene finding algorithms.

GENSCAN: http://genes.mit.edu/GENSCAN.html

Probabilistic model based on HMM. The different states of the model correspond to different functional units on a gene, e.g., promoter region, exon, intron etc.

It uses a homogeneous 5th order Markov model for non-coding regions and a 3-periodic (inhomogeneous) 5th order Markov model for coding regions

Signals are modeled by weight matrices, weight arrays and maximal dependence decomposition techniques.

Species: Vertebrates, Maize, Arabidopsis. Trained on human genes; accuracy is lower for non-vertebrates

Gene prediction programs

Fgene: Uses Dynamic Programming to find the optimal combinations of exons, promoters, and polyA sites detected by a pattern recognition algorithm.

Species: Human, Drosophila, Nematode, Yeast, Plant

http://dot.imgen.bcm.tmc.edu.9331/gene-finder/gf.html

GrailExp: Uses HMM as the underlying computational technique to determine its genomic predictions. Uses Neural Network to combine the information from various gene finding signals

Species: Human, Mouse

http://compbio.ornl.gov/grailexp/
