kanupriya-tiwari
8/9/2019 CL662 PW 02 Gene Finding
Gene Finding
What is the problem in gene finding?
biologicalspellingismuchmoresloppythanenglishspellingproteinswiththesamefunctionfromtwodifferentorganismsarealmostalwaysspeltdifferentlysimilarlyindnamanyinterestingsignalsvarygreatlyevenwithinthesamegenome
Biological spelling is much more sloppy than English spelling. Proteins with the same function from two different organisms are almost always spelt differently. Similarly, in DNA, many interesting signals vary greatly, even within the same genome.
Sequences are observer dependent
Observer A: DNA sequencer
ATGATTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTACT
Observer B: RNA polymerase
ATGA | TTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTA | CT
(transcription starts after ATGA and stops before the final CT)
Observer C: Ribosome
ATGATTCT | AGGAG | AATCGTCTAATCGA | ATGGCA-------TAA | AGTCTACT
(AGGAG: ribosomal binding site; ATG: start codon; TAA: stop codon)
We observe this:
ATGATTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTACT
We need to infer about:
ATGA | TTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTA | CT
(transcription start and transcription stop)
or
ATGATTCT | AGGAG | AATCGTCTAATCGA | ATGGCA-------TAA | AGTCTACT
(ribosomal binding site, start codon, stop codon)
To our benefit
There is a great deal of order in biological sequences.
Conserved stretches of sequence are recognized by various biomolecules that are part of the information decoding / processing machinery.
Our goal: Find the subtle similarities and patterns.
What are genes?
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATT
Genes are individual stretches of DNA that encode the
sequence of amino acids comprising a particular protein.
The 64 possible nucleotide triplets (codons) represent the
20 amino acids using a degenerate code.
Start Codon: ATG
Stop Codon: TGA, TAA, TAG
In eukaryotes, coding regions are separated by non-
coding regions (introns).
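The start and stop codons listed above are enough to sketch a naive open reading frame (ORF) scan. The following is only an illustration under stated assumptions: the function name `find_orfs` and the `min_codons` threshold are hypothetical, only the given strand is scanned, and introns are ignored, so it is closer to the prokaryotic case.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=10):
    """Return (start, end, frame) for each ATG...stop stretch on the given strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # number of codons from ATG up to (not including) the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# ATG AAA TTT GGG TAA sits in frame 2 of this toy sequence
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=1))
```

Every hit is only a candidate: an ORF of sufficient length is a necessary signal, not proof of a gene.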
Finding Genes (or regions of interest in DNA sequence)
Important Signals
Prokaryotes
The start codon
The stop codon
Eukaryotes
The start codon
all donor sites (the beginning of each intron)
all acceptor sites (the end of each intron)
The stop codon
Problem: Not every ATG is a valid start codon:
Verify the start codon (analyze the region around the start codon).
Identify additional subtle signals.
Some useful signals for Gene Finding
[Figure: DNA, mRNA and protein levels, annotated with the upstream activating sequences (UAS), the TATA box, the mRNA expression start and end, the ribosomal binding site, and the points where protein synthesis starts and stops.]
A logo of the RBS and start codon in E. coli genes.
Key concepts used in Gene Detection.
Gene Finding Methods
Content-based methods: Overall, bulk properties of the sequence, e.g. codon bias, hexamer frequency.
Site-based (signal-sensing) methods: E.g. donor and acceptor splice sites, binding sites for transcription factors, polyA tracts, RBS, start and stop codons.
Comparative methods: Translated sequences are subjected to database searches against protein sequences.
Using the subtle signals in automated gene finding
[Figure: DNA, mRNA and protein levels, marking the upstream activating sequences (UAS), TATA box, mRNA expression start and end, RBS, and the Start and Stop positions.]
Homology to known genes: DNA sequences are likely to be protein-coding regions if they are homologous to known protein-coding regions in other genomes.
Codon bias: Of the 64 codons in the degenerate code, certain codons are preferentially used by each species.
Amino acid bias.
Start and stop codons. Length of the region between a start and a stop codon.
Prokaryotes: Ribosomal binding site in the vicinity of the start codon.
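Codon bias can be turned into a simple content score. The sketch below is an assumption-laden illustration, not the method of any particular gene finder: it estimates codon frequencies from known in-frame coding sequences (with add-one smoothing) and scores a candidate region by its log-odds against a background model. The function names and the uniform background are hypothetical choices.

```python
from collections import Counter
from math import log

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

def codon_counts(seq):
    """Count the complete codons of seq, read in frame 0."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame sequences, with smoothing."""
    counts = Counter()
    for s in seqs:
        counts.update(codon_counts(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def log_odds(seq, coding_freq, background_freq):
    """Sum of log(coding / background) over the codons of seq.

    Codons containing other characters (e.g. N) are skipped.
    """
    return sum(n * log(coding_freq[c] / background_freq[c])
               for c, n in codon_counts(seq).items() if c in coding_freq)

# Toy training set; a real one would hold experimentally verified genes.
coding = codon_frequencies(["GCGGCGGCGAAA"])
background = {c: 1.0 / 64 for c in ALL_CODONS}
print(log_odds("GCGGCG", coding, background) > 0)  # uses the favoured codon
```

A positive score means the region uses codons the training genes favour; the same idea extends to hexamer frequencies.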
The cloverleaf structure of a tRNA molecule showing those features that are used by tRNAscan for detection.
The structure of a tRNA molecule: (A) Phe-tRNA molecule
showing the arrangement of the base pairing and loops
and (B) the 3-D structure.
The start and stop signals for prokaryotic transcription:
Start signal: short nucleotide sequences that bind transcription enzymes.
Stop signal: short loop structure preventing the transcription apparatus from continuing.
Frequency of occurrence of different amino acid codons in genes and intergenic DNA.
Gene Search by Homology
Similarity of a portion of DNA to a known sequence can be used as both positive and negative evidence that the region is coding.
Positive evidence: Similarity to known genes in other organisms (comparative evidence).
Negative evidence: Similarity to repeat sequences (e.g. as flagged by RepeatMasker).
Can provide clues about gene location and function.
Can locate only about half of all human genes currently.
Important methods used in Gene Detection.
The start and stop signals for eukaryotic transcription
A schematic of the splicing of an intron.
A segment of E. coli genome that has been fully
annotated, illustrated using the Artemis program.
A detailed view of a tRNA coding region and
the secondary structure of the tRNA molecule.
Eukaryotic DNA to protein.
Schematic representation of the ALDH10
gene with exons colored blue.
A flowchart of steps involved in the
identification and annotation of gene sequences.
Problems in gene finding
Length of a gene is variable.
All signals are probabilistic and have inherent (sometimes unknown) variability.
It is difficult to quantify the acceptable level of variability for each signal.
Example: An ATG does not always mean a valid start codon.
There are exceptions to (almost) every rule.
Example: Protein-coding regions without an RBS in the vicinity of the start codon may still get expressed.
Partial sequence classification (Tagging)
The tagging problem:
Given: A set of tags L
Training examples of sequences showing the break-up of each sequence into the set of tags
Learn to break up a sequence into tags (classification of parts of sequences)
Examples:
Text segmentation: Break a sequence of words forming an address string into subparts such as road and city name.
Continuous speech recognition: Identify words in continuous speech.
Gene finding: Identify the boundaries of the protein-coding regions in a DNA sequence, identify exons / introns, etc.
Discrete Markov Processes
A system is described at any time t as being in one of a set of N distinct states, S1, S2, ..., SN.
At each step the system undergoes a change of state (possibly back to the same state).
Full description: specification of the current state as well as all predecessor states.
First-order Markov chain: the description is truncated to just the current state and its predecessor state:
$P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots] = P[q_t = S_j \mid q_{t-1} = S_i]$
Transition probabilities:
$a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N$
with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$.
[Figure: state diagram with four states S1 to S4 and example transitions a21, a34, a41.]
Example of an Observable Sequence: Weather Prediction
The weather on day t is characterized by one of three states: S1 = Rain/Snow, S2 = Cloudy, S3 = Sunny.
Transition probabilities:
$A = \{a_{ij}\} = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$
Initial state probabilities: $\pi_i = P[q_1 = S_i], \quad 1 \le i \le N$
Given that the weather on day 1 is sunny, what is the probability that the weather for the next 5 days will be sun-sun-rain-rain-sun?
Find P(O|Model) = product of all the relevant transition probabilities.
Each state is directly observable in a Markov chain.
[Figure: state diagram of the three weather states; labelled transitions include a23 and a32.]
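The weather question can be worked through numerically. This sketch assumes the transition matrix reads, row by row, (0.4, 0.3, 0.3), (0.2, 0.6, 0.2), (0.1, 0.1, 0.8) for the states rain/snow, cloudy, sunny, and that day 1 is known to be sunny, so the initial probability is 1. The function name `sequence_probability` is hypothetical.

```python
RAIN, CLOUDY, SUNNY = 0, 1, 2

# A[i][j] = P(weather tomorrow is j | weather today is i); each row sums to 1
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]

def sequence_probability(states, A, initial_prob=1.0):
    """Probability of a fully observed state sequence in a first-order chain."""
    p = initial_prob
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

# Day 1 is sunny, followed by sun-sun-rain-rain-sun on the next five days.
observed = [SUNNY, SUNNY, SUNNY, RAIN, RAIN, SUNNY]
p = sequence_probability(observed, A)
print(p)  # a33 * a33 * a31 * a11 * a13 = 0.8 * 0.8 * 0.1 * 0.4 * 0.3
```

Under these assumptions the product is 0.00768, i.e. slightly less than 1 chance in 100.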
Example of Hidden Markov Chain:
State is not directly observable
Players A and B. A has a set of coins with different biases.
A repeatedly:
picks an arbitrary coin,
tosses it an arbitrary number of times.
B observes H/T (symbols) and guesses the transition points and biases.
The actual event (which coin is being tossed) is hidden from B.
HMMs are doubly stochastic models: both the occurrence of a state and the symbol observed in that state are random.
Elements of an HMM
Observed sequence (represented as symbols)
O = O1 O2 ... OT (T = duration of the sequence)
Sequence of states (typically hidden)
Q = q1 q2 ... qT
N, the number of states in the model. Although the states are hidden, there is often some physical significance attached to them. S = {S1, S2, ..., SN}
M, the number of distinct observation symbols per state. V = {v1, v2, ..., vM}
The state transition probability distribution A = {aij}
The observation symbol probability distribution in state j, B = {bj(k)}, also called the emission frequency matrix.
The initial state distribution π = {πi}
Thus, the model parameters: λ = (A, B, π)
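The elements listed above map directly onto a small data structure. The sketch below is illustrative only: the class name `HMM`, its methods, and the two-coin parameter values are all made up for the example (they echo the coin-tossing setting, with a fair and a biased coin as hidden states and H/T as symbols).

```python
import random
from dataclasses import dataclass

@dataclass
class HMM:
    states: list   # S = {S1..SN}
    symbols: list  # V = {v1..vM}
    A: list        # A[i][j] = P(next state j | current state i)
    B: list        # B[j][k] = P(symbol k | state j)
    pi: list       # pi[i]   = P(first state is i)

    def validate(self, tol=1e-9):
        """Check that pi and every row of A and B is a probability vector."""
        rows = [self.pi] + self.A + self.B
        return all(abs(sum(r) - 1.0) < tol and min(r) >= 0 for r in rows)

    def sample(self, T, rng):
        """Draw hidden state indices Q and observed symbol indices O of length T."""
        q, o = [], []
        state = rng.choices(range(len(self.states)), weights=self.pi)[0]
        for _ in range(T):
            q.append(state)
            o.append(rng.choices(range(len(self.symbols)), weights=self.B[state])[0])
            state = rng.choices(range(len(self.states)), weights=self.A[state])[0]
        return q, o

# Hypothetical parameters: player A mostly keeps tossing the same coin.
coins = HMM(states=["fair", "biased"], symbols=["H", "T"],
            A=[[0.9, 0.1], [0.2, 0.8]],
            B=[[0.5, 0.5], [0.9, 0.1]],
            pi=[0.5, 0.5])
q, o = coins.sample(20, random.Random(0))
print("".join(coins.symbols[k] for k in o))  # what player B gets to see
```

Player B sees only the printed H/T string; the list `q` is the hidden part of the doubly stochastic process.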
Three basic problems for HMM
1. Given the observation sequence O = O1 O2 ... OT and a model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model? A correct solution is feasible.
Example: profile HMMs: classification of a protein sequence based on competing HMMs for different protein families.
2. Given the observation sequence O = O1 O2 ... OT and a model λ, how do we choose a corresponding state sequence Q = q1 q2 ... qT? No single correct solution exists; some optimality criterion must be applied.
Example: Finding the protein-coding regions and exon / intron boundaries in an anonymous sequence of DNA.
3. How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ), where O is the training sequence? Toughest problem.
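For the first problem, one standard way to compute P(O|λ) efficiently is the forward recursion, which sums over all hidden state paths in O(N²T) time instead of enumerating them. The slides do not name the algorithm, so treat this as a sketch of a common choice; the function name and the two-state toy parameters are assumptions, and no scaling is applied, so long sequences would underflow.

```python
def forward(obs, A, B, pi):
    """Return P(O | lambda) for a non-empty list of symbol indices obs.

    alpha[i] holds P(O1..Ot, q_t = S_i | lambda) for the current t.
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical two-state model with two symbols (0 and 1).
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.5, 0.5], [0.9, 0.1]]
pi = [0.5, 0.5]
print(forward([0, 0, 1], A, B, pi))
```

To classify a protein sequence among competing family models, one would evaluate it under each model and compare the resulting probabilities.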
Two sequence mining problems in biology
1. Finding genes in DNA sequences (2nd
Problem in HMM)
2. Classifying proteins according to family (1st
Problem in HMM)
The 3rd Problem in HMM needs to be tackled in
both 1 and 2 above.
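For the gene-finding case (the 2nd problem), a widely used optimality criterion is the single most probable state path, found with the Viterbi algorithm. The slides do not specify the criterion, so this is a sketch of one common choice. The two-state model below (non-coding vs. coding, with invented emission probabilities that make C and G more likely in coding regions) is purely illustrative.

```python
from math import log, inf

def viterbi(obs, A, B, pi):
    """Return (most probable state path, its log-probability) for obs."""
    n = len(pi)
    lg = lambda x: log(x) if x > 0 else -inf
    delta = [lg(pi[i]) + lg(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev, ptr, delta = delta, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + lg(A[i][j]))
            ptr.append(best)
            delta.append(prev[best] + lg(A[best][j]) + lg(B[j][o]))
        back.append(ptr)
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1], max(delta)

# Hypothetical model: state 0 = non-coding, state 1 = coding; symbols A, C, G, T.
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.4, 0.1, 0.1, 0.4], [0.1, 0.4, 0.4, 0.1]]
pi = [0.5, 0.5]
index = {"A": 0, "C": 1, "G": 2, "T": 3}
path, score = viterbi([index[b] for b in "ATACGGCTA"], A, B, pi)
print(path)
```

In this toy run the four middle bases (CGGC) are tagged as coding, because two state switches cost less than four poorly fitting emissions.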
HMM for genes with introns (spliced genes)
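[Figure: HMM architecture for spliced genes. A start model, a coding model (codon states c c c) and a stop model, plus three intron models, one for each position at which an intron can interrupt a codon. Each intron model consists of a donor model (GTxxxxx), an interior intron state, and an acceptor model (xxxxxxxAG).]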
Hidden Markov model of a prokaryotic nucleotide
sequence used in the GeneMark.hmm algorithm.
Mathematical Problem Statement for Gene Finding
For an anonymous DNA sequence
S = {b1,b2,.., bL}
where, bi = A, T, G, C
Determine the functional role of each nucleotide
A = {a1,a2,.., aL}
where, ai= 0 if non-coding
ai = 1 if coding on direct strand
ai = 2 if coding on complementary strand
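The labelling A can be built from an annotation for use as training or evaluation data. This is a minimal sketch with hypothetical conventions: gene coordinates are 0-based and half-open, the strand is given as "+" (direct) or "-" (complementary), and where genes overlap the later entry simply overwrites the earlier one.

```python
def label_nucleotides(length, genes):
    """Return the list a_1..a_L with 0 = non-coding, 1 = direct, 2 = complementary.

    genes is an iterable of (start, end, strand) tuples.
    """
    labels = [0] * length
    for start, end, strand in genes:
        code = 1 if strand == "+" else 2
        for i in range(start, end):
            labels[i] = code
    return labels

print(label_nucleotides(10, [(2, 5, "+"), (7, 9, "-")]))
```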
Variable Duration HMM for gene finding
The trajectory A is represented as a sequence of M hidden states, the i-th having duration di:
A = {(a1, d1), (a2, d2), ..., (aM, dM)}, where $\sum_{i=1}^{M} d_i = L$
Objective in gene finding:
To find the trajectory
A* = {(a1*, d1*), (a2*, d2*), ..., (aM*, dM*)}
which has the largest probability of occurring simultaneously with sequence S compared to all other possible trajectories.
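The (state, duration) representation is simply a run-length encoding of the per-nucleotide labels. The helper names below are hypothetical; the sketch only shows the change of representation and the constraint that the durations sum to L, not the search for the optimal trajectory A*.

```python
from itertools import groupby

def to_trajectory(labels):
    """Collapse per-nucleotide labels into [(state, duration), ...]."""
    return [(state, len(list(run))) for state, run in groupby(labels)]

def from_trajectory(trajectory):
    """Expand [(state, duration), ...] back into per-nucleotide labels."""
    return [state for state, d in trajectory for _ in range(d)]

labels = [0, 0, 1, 1, 1, 0, 0, 2, 2, 0]
trajectory = to_trajectory(labels)
print(trajectory)                       # M = 5 segments
print(sum(d for _, d in trajectory))    # equals L = len(labels)
```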
Gene Finding & Training Sets
The majority of current gene finding algorithms / programs utilize a species-specific training set to develop the statistical models.
The training set consists of experimentally determined genes.
For organisms such as E. coli, we have a large training set (about 325 known genes from experimental / biochemical verification).
Question: How do we develop models for new organisms for which very few genes are experimentally characterized?