Gene Prediction in silico
Nita Parekh
BIRC, IIIT, Hyderabad
Goal
The ultimate goal of molecular cell biology is to understand the physiology of living cells in terms of the information that is encoded in the genome of the cell
How can computational approaches help in achieving this goal?
A gene codes for a protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
mRNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
What is Computational Gene Finding?
Given an uncharacterized DNA sequence, find:
– Which region codes for a protein?
– Which DNA strand is used to encode the gene?
– Which reading frame is used in that strand?
– Where does the gene start and end?
– Where are the exon-intron boundaries in eukaryotes?
– (optionally) Where are the regulatory sequences for that gene?
Search space - 2-5% of Genomic DNA (~ 100 – 1000 Mbp)
Need for Computational Gene Prediction
It is the first step towards getting at the function of a protein.
It also helps accelerate the annotation of genomes.
Deoxyribonucleic acid (DNA)
– is a blueprint of the cell
Composed of four basic units called nucleotides
Each nucleotide contains
- a sugar, a phosphate and one of the 4 bases:
Adenine (A), Thymine (T), Guanine (G), Cytosine (C)
For all computational purposes, a DNA sequence is considered to be a string over a 4-letter alphabet: A, T, G, C
ACGCTGAATAGC
The aim is to find the grammar & syntax rules of the DNA language based on the 4-letter alphabet,
- similar to the rules of English grammar used to form meaningful sentences
Biological Sequences
Order of occurrence of bases: not completely random
- Different regions of the genome exhibit different patterns of the four bases, A, T, G, C
e.g., protein coding regions, regulatory regions, intron/exon boundaries, repeat regions, etc.
Aim: identifying these various patterns to infer their functional roles
Assumption in biological sequence analysis:
- strings carrying information will be different from random strings
If a hidden pattern can be identified in a string, it must be carrying some functional information
Example
This is a lecture on bioinformatics
asjd lkjfl jdjd sjftye nvcrow nzcdjhspu
Frequency of letters
A.  7.3%    N.  7.8%
B.  0.9%    O.  7.4%
C.  3.0%    P.  2.7%
D.  4.4%    Q.  0.3%
E. 13.0%    R.  7.7%
F.  2.8%    S.  6.3%
G.  1.6%    T.  9.3%
H.  3.5%    U.  2.7%
I.  7.4%    V.  1.3%
J.  0.2%    W.  1.6%
K.  0.3%    X.  0.5%
L.  3.5%    Y.  1.9%
M.  2.5%    Z.  0.1%
Other statistics
Frequencies of the most common first letter of a word, last letter of a word, doublets, triplets etc.
20 most used words in
– Written English
the of to in and a for was is that on at he with by be it an as his
– Spoken English
the and I to of a you that in it is yes was this but on well he have for
Parallels in DNA language
ATGGTGGTCATGGCGCCCCGAACCCTCT
TCCTGCTGCTCTCGGGGGCCCTGACCCT
GACCGAGACCTGGGCGGGTGAGTGCGGG
GTCAGGAGGGAAACAGCCCCTGCGCGG
AGGAGGGAGGGGCCGGCCCGGCGGG
GTCTCAACCCCTCCTCGCCCCCAGGCTC
CCACTCCATGAGGTATTTCAGCGCCGCC
GTGTCCCGGCCCGGCCGCGGGGAGCCC
CGCTTCATCGCCATGGGCTACGTGGACG
ACACGCAGTTCGTGCGGTTC
Parallels in DNA language
ATG GTG GTC ATG GCG CCC CGA ACC CTC TTC CTG CTG CTC TCG GGG GCC CTG ACC CTG ACC GAG ACC TGG GCG GGT GAG TGC GGG GTC AGG AGG GAA ACA GCC CCT GCG CGG AGG AGG GAG GGG CCG GCC CGG CGG…
GTC TCA ACC CCT CCT CGC CCC CAG GCT CCC ACT CCA TGA GGT ATT TCA GCG CCG CCG TGT CCC GGC CCG GCC GCG GGG AGC CCC GCT TCA TCG CCA TGG GCT ACG TGG ACG ACA CGC AGT TCG TGC GGT TC…
This task needs to be automated because of the large genome sizes:
Smallest genome: Mycoplasma genitalium 0.5 x 10^6 bp
Human genome: 3 x 10^9 bp (not the largest)
Finding genes in Prokaryotes
• each gene is one continuous stretch of bases
• most of the DNA sequence codes for protein
(70% of the H. influenzae bacterium genome is coding)
Finding genes in Prokaryotes
Gene prediction in prokaryotes is comparatively simple and involves:
• identifying long reading frames
• using codon frequencies
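As a minimal sketch of the "codon frequencies" step, the snippet below counts how often each codon occurs in one reading frame. The function name `codon_frequencies` and the choice of the first 27 bases of the example sequence from these slides are assumptions made for illustration; a real gene finder would compare such frequencies against a table estimated from known genes of the same species.

```python
from collections import Counter

def codon_frequencies(seq, frame=0):
    """Return the relative frequency of each codon read from `frame` (0, 1 or 2)."""
    seq = seq.upper()
    # Step through the sequence three bases at a time, ignoring a trailing partial codon.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# First nine codons of the example sequence shown earlier in the slides.
freqs = codon_frequencies("ATGGTGGTCATGGCGCCCCGAACCCTC")
for codon, f in sorted(freqs.items()):
    print(codon, round(f, 3))
```

Changing `frame` to 1 or 2 gives the counts for the other two forward frames; strongly different codon usage between frames is one hint about which frame is coding.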
Finding genes in Eukaryotes
• the coding region is usually discontinuous
• composed of alternating stretches of exons and introns
• Only 2-3 % of the human genome (~3 x 10^9 bp) codes for proteins
Finding genes in Eukaryotes
The gene finding problem is complicated by:
the existence of interweaving exons and introns – stop codons may exist in intronic regions, making it difficult to identify the correct ORF
a gene region may encode many proteins – due to alternative splicing
Exon length need not be a multiple of three – resulting in a frameshift between exons
A gene may be intron-less (single-exon genes)
Relatively low gene density - only 2 - 5% of the human genome codes for proteins
Methods for Identifying Coding Regions
Finding Open Reading Frames (ORFs)
Homology Search
• DNA vs. Protein Searches
Content-based methods:
• Coding statistics, viz., codon usage bias, periodicity in base occurrence, etc.
Signal-based methods:
• CpG islands
• Start/Stop signals, promoters, poly-A sites, intron/exon boundaries, etc.
Integration of these methods
Finding Open Reading Frames (ORF)
Once a gene has been sequenced it is important to determine the correct open reading frame (ORF).
Every region of DNA has six possible reading frames, three in each direction
The reading frame that is used determines which amino acids will be encoded by a gene.
Typically only one reading frame is used in translating a gene, and this is often the longest open reading frame
Finding Open Reading Frames (ORF)
A relatively long sequence devoid of stop codons indicates a possible coding region
An open reading frame starts with a start codon (atg) in most species and ends with a stop codon (taa, tag or tga)
Once the open reading frame is known, the DNA sequence can be translated into its corresponding amino acid sequence using the genetic code
Codons are triplets of bases
The Genetic Code
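A small sketch of transcription and translation with the standard genetic code may help here. The codon table is built from the conventional 64-letter amino acid string ordered over the bases T, C, A, G; the function names `transcribe` and `translate` are chosen for this illustration, and the sketch assumes the sequence is already in the correct reading frame.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def transcribe(dna):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(dna):
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous bases
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "CCTGAGCCAACTATTGATGAA"   # example from the "A gene codes for a protein" slide
print(transcribe(dna))          # CCUGAGCCAACUAUUGAUGAA
print(translate(dna))           # PEPTIDE
```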
Finding Open Reading Frames (ORF)
Consider the following sequence of DNA:
5´ TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3´ “Forward Strand”
Its complementary strand is:
3´ AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5´ “Reverse Strand”
The DNA sequence can be read in six reading frames - three in the forward and three in the reverse direction, depending on the start position
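The six reading frames can be enumerated with a few lines of code. This is a sketch, with the helper names `reverse_complement` and `six_frames` made up for the illustration; frames on the reverse strand are numbered from the 5´ end of that strand, and it assumes the input contains only A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Map (strand, frame number) -> list of complete codons in that frame."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset + 1)] = [
                s[i:i + 3] for i in range(offset, len(s) - 2, 3)
            ]
    return frames

forward = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT"
frames = six_frames(forward)
for (strand, number), codons in frames.items():
    print(strand, number, " ".join(codons))
```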
Finding Open Reading Frames (ORF)
5´ TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3´
Three reading frames in the forward direction:
1. TCA ATG TAA CGC GCT ACC CGG AGC TCT GGG CCC AAA TTT CAT CCA CT
2. CAA TGT AAC GCG CTA CCC GGA GCT CTG GGC CCA AAT TTC ATC CAC T
3. AAT GTA ACG CGC TAC CCG GAG CTC TGG GCC CAA ATT TCA TCC ACT
Start codon: ATG in frame 1 (immediately followed by the stop codon TAA)
Finding Open Reading Frames (ORF)
3´ AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5´
Three reading frames in the reverse direction:
1. AG TTA CAT TGC GCG ATG GGC CTC GAG ACC CGG GTT TAA AGT AGG TGA
2. A GTT ACA TTG CGC GAT GGG CCT CGA GAC CCG GGT TTA AAG TAG GTG A
3. AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT AAA GTA GGT GA
In frame 3, read 5´ to 3´ (right to left), GTA is the start codon (ATG) and AGT is the stop codon (TGA)
Finding Open Reading Frames (ORF)
In this case the longest open reading frame (ORF) is the 3rd reading frame of the complementary strand:
AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT AAA GTA
When read 5´ to 3´, the longest ORF is:
ATG AAA TTT GGG CCC AGA GCT CCG GGT AGC GCG TTA CAT TGA
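The search for the longest ORF in this example can be sketched as follows. The sketch takes "ORF" to mean ATG up to the first in-frame stop codon, scans both strands, and uses illustrative function names (`find_orfs`, `longest_orf`); it is not the algorithm of any particular tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_orfs(seq):
    """Yield (strand, frame, orf) for every ATG that has an in-frame stop codon."""
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for start in range(len(s) - 2):
            if s[start:start + 3] != "ATG":
                continue
            # Walk codon by codon until the first in-frame stop.
            for end in range(start + 3, len(s) - 2, 3):
                if s[end:end + 3] in STOP_CODONS:
                    yield strand, start % 3 + 1, s[start:end + 3]
                    break

def longest_orf(seq):
    """Return the longest (strand, frame, orf) hit, or None if there is no ORF."""
    return max(find_orfs(seq), key=lambda hit: len(hit[2]), default=None)

forward = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT"
print(longest_orf(forward))
# ('-', 3, 'ATGAAATTTGGGCCCAGAGCTCCGGGTAGCGCGTTACATTGA')
```

The result agrees with the slide: the longest ORF lies in the third frame of the reverse strand, while the forward strand only contains the two-codon ORF ATG TAA.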
Finding Long ORFs
The first step in distinguishing between a coding and a non-coding region is to look at the frequency of stop codons
Sequence similarity search (database search)
When no sequence similarity is found, an ORF can still be considered gene-like according to some statistical features:
the three-base periodicity
higher G+C content
signal sequence patterns
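Why the frequency of stop codons is informative can be seen from a back-of-the-envelope calculation. The sketch below assumes a random sequence in which all four bases are equally likely and independent (an idealization, not a property of real genomes), so that 3 of the 64 possible codons are stops.

```python
def prob_no_stop(n_codons, p_stop=3 / 64):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons

def expected_codons_to_first_stop(p_stop=3 / 64):
    """Mean number of codons read up to and including the first stop (geometric)."""
    return 1 / p_stop

print(round(expected_codons_to_first_stop(), 1))   # about 21 codons
for n in (20, 50, 100, 300):
    print(n, prob_no_stop(n))
```

Under this assumption a stop codon turns up roughly every 21 codons, and a stop-free run of 100 codons has a probability of less than 1%, which is why long ORFs are unlikely to arise by chance.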
Finding Long ORFs
Once a long ORF / all ORFs above a certain threshold are identified,
- these ORF sequences are called putative coding sequences
- translate each ORF using the Universal Genetic code to obtain the amino acid sequence
- search against the protein database for homologs
Finding genes in Prokaryotes
Drawbacks:
The addition or deletion of one or more bases will cause all the codons scanned to be different
- sensitive to frame shift errors
Fails to identify very small coding regions
Fails to identify the occurrence of overlapping long ORFs on opposite DNA strands (genes and ‘shadow genes’)
Web-based tools
ORF Finder (NCBI)
http://www.ncbi.nih.gov/gorf/gorf.html
EMBOSS
getorf - Finds and extracts open reading frames
plotorf - Plot potential open reading frames
Sixpack - Display a DNA sequence with 6-frame translation and ORFs
http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/getorf.html
Homology Search
This involves Sequence-based Database Searching
• DNA Database searching
• Protein Database searching
Homology Search
Why search databases?
When one obtains a new DNA sequence, one needs to know:
whether it already exists in the databanks
whether it has any homologous sequences (i.e., sequences derived from a common ancestry) in the databases
Given a putative coding ORF, search for homologous proteins – proteins similar in their folding or structure or function.
Homology Search
DNA vs. Protein Searches
Use protein for database similarity searches whenever possible
Homology Search
Three main search tools used for database search:
• BLAST - algorithm by Karlin & Altschul
http://www.ncbi.nlm.nih.gov/BLAST/
• FastA - algorithm by Pearson & Lipman
http://www.ebi.ac.uk/fasta33/
• Smith-Waterman (SW) algorithm
- dynamic programming algorithm
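To make the dynamic programming idea behind Smith-Waterman concrete, here is a minimal score-only sketch. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for the illustration; real searches use substitution matrices and affine gap penalties, and BLAST and FastA are heuristics that approximate this exhaustive computation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))   # 12
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the alignment itself would require storing the full matrix and tracing back from the best cell.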
Limitations of Homology Search
Only a limited number of genes are available in the various databases.
Currently only 50% of the sequences are found to be similar to previously known sequences.
It should always be kept in mind that similarity-based methods are only as reliable as the databases that are searched, and apparent homology can be misleading at times
Content-based Methods
At the core of all gene identification programs
– there exist one or more coding measures
A coding statistic - a function that computes the likelihood that the sequence is coding for a protein.
A good knowledge of core coding statistics is important to understand how gene identification programs work.
Classification of Coding Measures
Coding statistics measure
• base compositional bias
• periodicity in base occurrence
• codon usage bias
Main distinction is between
• measures dependent on a model of coding DNA
• measures independent of such a model.
Model dependent coding statistics capture the specific features of coding DNA:
Unequal usage of codons in the coding regions - a universal feature of the genomes
Dependencies between nucleotide positions
Base compositional bias between codon positions
- requires a representative sample of coding DNA from the species under consideration to estimate the model's parameters
Markov Models
Dependencies between nucleotide positions in coding regions
- can be explicitly described by means of Markov Models
In Markov Models
- the probability of a nucleotide at a particular codon position depends on the nucleotide(s) preceding it.
Transition probabilities:
$$a_{st} = P(x_i = t \mid x_{i-1} = s)$$
Probability of a DNA sequence $x$ of length $L$:
$$P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$
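The probability of a sequence under a first-order Markov model, i.e. the initial probability times a product of transition probabilities, can be sketched as below. Everything beyond that product is an assumption of the sketch: the training sequence is made up, a pseudocount of 1 avoids zero probabilities, the first base is given a uniform probability, and a single transition matrix is used, whereas a model of coding DNA would keep a separate matrix for each of the three codon positions.

```python
import math

ALPHABET = "ACGT"

def train_transitions(sequences, pseudocount=1.0):
    """Estimate a_st = P(x_i = t | x_{i-1} = s) from example sequences."""
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {
        s: {t: counts[s][t] / sum(counts[s].values()) for t in ALPHABET}
        for s in ALPHABET
    }

def log_probability(seq, transitions, initial=None):
    """log P(x) = log P(x_1) + sum over i = 2..L of log a_{x_{i-1} x_i}."""
    p_first = initial[seq[0]] if initial else 1 / len(ALPHABET)
    logp = math.log(p_first)
    for s, t in zip(seq, seq[1:]):
        logp += math.log(transitions[s][t])
    return logp

# Hypothetical training data, for illustration only.
model = train_transitions(["ACACAC"])
score = log_probability("ACAC", model)
print(score)
```

Working with logarithms avoids numerical underflow for long sequences; comparing the log-probability under a coding model with that under a non-coding model gives a log-likelihood ratio that can serve as a coding statistic.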
Markov Models
Table III: Probabilities of the four nucleotides at the different codon positions conditioned to the nucleotide in the preceding codon position
Model independent coding statistics capture only the “universal” features of coding DNA:
Position Asymmetry – how asymmetric is the distribution of nucleotides at the 3 triplet positions
Periodic Correlation - correlations between nucleotide positions
- do not require a sample of coding DNA
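One simple way to quantify position asymmetry is sketched below. The formula is an assumption of this sketch rather than something given in the slides: for each base, take its frequency at each of the three codon positions and sum the squared deviations from the mean over the three positions, then add up the four bases. No training set is needed, which is what makes the measure model independent.

```python
def position_asymmetry(seq):
    """Sum over bases of the squared spread of their frequencies across codon positions."""
    seq = seq.upper()
    n_codons = len(seq) // 3
    if n_codons == 0:
        return 0.0
    score = 0.0
    for base in "ACGT":
        # Frequency of this base at codon positions 1, 2 and 3.
        freqs = [
            sum(1 for i in range(pos, 3 * n_codons, 3) if seq[i] == base) / n_codons
            for pos in range(3)
        ]
        mean = sum(freqs) / 3
        score += sum((f - mean) ** 2 for f in freqs)
    return score

print(position_asymmetry("ACGACGACG"))   # strongly periodic: high asymmetry
print(position_asymmetry("AAAAAAAAA"))   # no positional preference: zero
```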
Signal-based Methods
[Figure labels: GT, AG]
Signal – a string of DNA recognized by the cellular machinery
Signals for gene identification
There are many signals associated with genes, each of which suggests but does not prove the existence of a gene
Most of these signals can be modeled using weight matrices
Signals for gene identification
CpG Islands – identify the 2% of the genome that codes for proteins
Start & Stop Codons – signify the start & end of a coding region
Transcription Start Site – to identify the start of coding region
Donor & Acceptor Sites - signify the start & end of intronic regions
Cap Site – found in the 5’ UTR
Signals for gene identification
Promoters – to initiate transcription (found in 5’ UTR region)
Enhancers – regulate gene expression (found in 5’ or 3’ UTR regions, intronic regions, or up to a few Kb away from the gene)
Transcription Factor Binding Sites – short DNA sequences where proteins bind to initiate the transcription/translation process
Poly-A Site – identify the end of coding region (found in 3’ UTR region)
Promoter Detection
Not all ORFs are genes
True coding regions have specific sequences upstream of the start site known as promoters, where the RNA polymerase binds to initiate transcription, e.g., in E. coli:
[Figure: consensus patterns]
No two patterns are identical
Not all genes have these patterns
Positional Weight Matrix for TATA box
Position   1    2    3    4    5    6
A          2   95   26   59   51    1
C          9    2   14   13   20    3
G         10    1   16   15   12    0
T         79    2   44   13   17   96
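A weight matrix like the one above can be used to score candidate sites. The sketch below treats the entries as percentages (each column of the table sums to 100), converts them to log-odds scores against a uniform background, and slides the matrix along a sequence. The pseudocount, the uniform background of 0.25 and the test sequence are assumptions of the sketch.

```python
import math

# Percentages from the weight matrix table above, one list per base.
PWM_PERCENT = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 12, 0],
    "T": [79, 2, 44, 13, 17, 96],
}
WIDTH = 6

def log_odds(site, background=0.25, pseudocount=1.0):
    """Log2-odds score of a 6-base site; the pseudocount avoids log(0) for G at position 6."""
    score = 0.0
    for pos, base in enumerate(site.upper()):
        p = (PWM_PERCENT[base][pos] + pseudocount) / (100 + 4 * pseudocount)
        score += math.log2(p / background)
    return score

def best_site(seq):
    """Return (start, site, score) of the highest-scoring window in seq."""
    windows = [(log_odds(seq[i:i + WIDTH]), i) for i in range(len(seq) - WIDTH + 1)]
    score, start = max(windows)
    return start, seq[start:start + WIDTH], score

consensus = "".join(
    max("ACGT", key=lambda b: PWM_PERCENT[b][pos]) for pos in range(WIDTH)
)
print(consensus)                        # TATAAT
print(best_site("GGCGCTATAATGCCG"))     # window starting at index 5
```

In practice a score threshold would be applied instead of taking the single best window, since, as noted above, a signal suggests but does not prove the presence of a gene.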
Complications in Gene Prediction
The problem of gene identification is further complicated in eukaryotes by the vast variation found in the structure of genes.
On average, a vertebrate gene is 30 kb long. Of this, the coding region is only about 1 kb.
The coding region typically consists of 6 exons, each about 150 bp long.
These are average statistics
Complications in Gene Prediction
Huge variations from the average are observed
The biggest human gene, dystrophin, is 2.4 Mb long.
The human blood coagulation factor VIII gene is ~186 kb. It has 26 exons with sizes varying from 69 bp to 3106 bp, and its 25 introns range in size from 207 to 32,400 bp.
An average 5’ UTR is 750 bp long, but it can be longer and span several exons (e.g., in the MAGE family).
On average, the 3’ UTR is about 450 bp long, but in the gene for Kallmann syndrome, for example, its length exceeds 4 kb
Some facts about human genes
Comprise about 3% of the genome
Average gene length: ~8,000 bp
Average of 5-6 exons/gene
Average exon length: ~200 bp
Average intron length: ~2,000 bp
~8% of genes have a single exon
Some exons can be as small as 1 or 3 bp
Complications in Gene Prediction
In higher eukaryotes the gene finding becomes far more difficult because
• It is now necessary to combine multiple ORFs to obtain a spliced coding region.
• Alternative splicing is not uncommon.
• Exons can be very short, and introns can be very long.
Given the nature of genomic sequence in humans, where large introns are known to exist, there is definitely a need for highly specific gene finding algorithms.
GENSCAN: http://genes.mit.edu/GENSCAN.html
Probabilistic model based on HMM. The different states of the model correspond to different functional units on a gene, e.g., promoter region, exon, intron etc.
It uses a homogeneous 5th order Markov model for non-coding regions and a 3-periodic (inhomogeneous) 5th order Markov model for coding regions
Signals are modeled by weight matrices, weight arrays and maximal dependence decomposition techniques.
Species: Vertebrates, Maize, Arabidopsis. Trained on human genes; accuracy is lower for non-vertebrates
Gene prediction programs
Fgene: Uses Dynamic Programming to find the optimal combinations of exons, promoters, and polyA sites detected by a pattern recognition algorithm.
Species: Human, Drosophila, Nematode, Yeast, Plant
http://dot.imgen.bcm.tmc.edu.9331/gene-finder/gf.html
GrailExp: Uses HMM as the underlying computational technique to determine its genomic predictions. Uses Neural Network to combine the information from various gene finding signals
Species: Human, Mouse
http://compbio.ornl.gov/grailexp/
Gene prediction programs