CL662 PW 02 Gene Finding

8/9/2019 CL662 PW 02 Gene Finding

1/39

Gene Finding


2/39

What is the problem in gene finding?

biologicalspellingismuchmoresloppythanenglishspellingproteinswiththesamefunctionfromtwodifferentorganismsarealmostalwaysspeltdifferentlysimilarlyindnamanyinterestingsignalsvarygreatlywithineventhesamegenome


3/39

What is the problem in gene finding?

Biological spelling is much more sloppy than English spelling. Proteins with the same function from two different organisms are almost always spelt differently. Similarly, in DNA, many interesting signals vary greatly within even the same genome.


4/39

Sequences are observer dependent

Observer A: DNA sequencer

ATGATTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTACT

Observer B: RNA polymerase

ATGA [TTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTA] CT

(bracketed region: from transcription start to transcription stop)

Observer C: Ribosome

ATGATTCT [AGGAG] AATCGTCTAATCGA [ATGGCA-------TAA] AGTCTACT

(AGGAG: ribosomal binding site; ATG: start codon; TAA: stop codon)


5/39

We observe this:

ATGATTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTACT

We need to infer:

ATGA [TTCTAGGAGAATCGTCTAATCGAATGGCA-------TAAAGTCTA] CT

(bracketed region: from transcription start to transcription stop)

or

ATGATTCT [AGGAG] AATCGTCTAATCGA [ATGGCA-------TAA] AGTCTACT

(AGGAG: ribosomal binding site; ATG: start codon; TAA: stop codon)


6/39

To our benefit

There is a great deal of order in biological sequences.

Conserved stretches of sequence are recognized by various biomolecules that are part of the information decoding / processing machinery.

Our goal: Find the subtle similarities & patterns.


7/39

What are genes?

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA

TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC

ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG

CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA

GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC

AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATT

Genes are individual stretches of DNA that encode the

sequence of amino acids comprising a particular protein.

The 64 possible nucleotide triplets (codons) represent the

20 amino acids using a degenerate code.

Start Codon: ATG

Stop Codon: TGA, TAA, TAG

In eukaryotes, coding regions are separated by non-coding regions (introns).
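As a minimal sketch of how the start and stop codons listed above delimit candidate coding regions, the following Python scans the three forward reading frames for ATG...stop stretches. The function name `find_orfs` and the `min_codons` cutoff are illustrative assumptions, not part of the lecture; a real gene finder would also scan the complementary strand and weigh further signals.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts every codon in the stretch, start and stop included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame ATG opens a candidate
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next candidate
    return orfs

orfs = find_orfs("CCATGAAATAGCC")
```

Here the only candidate lies in frame 2, because the ATG at position 2 and the TAG at position 8 share that frame.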


8/39

Finding Genes (or regions of interest in DNA sequence)

Important Signals

Prokaryotes

  The start codon

  The stop codon

Eukaryotes

  The start codon

  All donor sites (the beginning of each intron)

  All acceptor sites (the end of each intron)

  The stop codon

Problem: Not every ATG is a valid start codon.

Verify the start codon (analyze the region around the start codon).

Identify additional subtle signals.
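One way to analyze the region around a candidate start codon is to look for a ribosomal binding site a short distance upstream. The sketch below is only an illustration: the motif `AGGAG` is taken from the example sequence on the earlier slide, while the function name, the 20-base window and the one-mismatch tolerance are assumptions chosen for the example.

```python
def has_rbs_upstream(dna, atg_pos, motif="AGGAG", window=20, max_mismatches=1):
    """Return True if a motif-like site occurs within `window` bases upstream of atg_pos."""
    upstream = dna[max(0, atg_pos - window):atg_pos].upper()
    for i in range(len(upstream) - len(motif) + 1):
        site = upstream[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(site, motif))
        if mismatches <= max_mismatches:
            return True
    return False

seq = "ATGATTCTAGGAGAATCGTCTAATCGAATGGCA"
first_atg = seq.find("ATG")          # ATG at the very start: nothing upstream
second_atg = seq.find("ATG", 1)      # ATG that follows the AGGAG site
supported = [has_rbs_upstream(seq, p) for p in (first_atg, second_atg)]
```

Only the second ATG has an upstream site, which matches the ribosome's view of the same sequence.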


9/39

Some useful signals for Gene Finding

[Figure: schematic of DNA, mRNA and protein, marking the upstream activating sequences (UAS), the TATA box, the start and end of mRNA expression, the ribosomal binding site, and the points where protein synthesis starts and stops.]


10/39

A logo of RBS and start codon in E. coli genes.


11/39

Key concepts used in Gene Detection.


12/39

Gene Finding Methods

Content-based methods: Overall, bulk properties of the sequence. E.g. codon bias, hexamer frequency.

Site-based or signal-sensing methods: E.g. donor and acceptor splice sites, binding sites for transcription factors, polyA tracts, RBS, start and stop codons.

Comparative methods: Translated sequences are subjected to database searches against protein sequences.
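The translation step behind the comparative methods can be sketched as follows. This is an assumption-laden illustration rather than the lecture's procedure: it builds the standard genetic code table, translates one forward reading frame, and leaves the database search itself out. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate one forward reading frame; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )

protein = translate("ATGAAATAG")
```

The resulting protein string is what would be handed to a protein database search; translating all three frames on both strands gives the usual six-frame translation.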


13/39

Using the subtle signals in automated gene finding

[Figure: schematic of DNA, mRNA and protein, marking the upstream activating sequences (UAS), the TATA box, the start and end of mRNA expression, the RBS, and the start and stop of the protein coding region.]

Homology to known genes: DNA sequences are likely to be protein coding regions if they are homologous to known protein coding regions in other genomes.

Codon bias: From the 64-codon degenerate code, certain codons are preferentially used by each species.

Amino acid bias.

Start and stop codons. Length of the region between a start and a stop codon.

Prokaryotes: Ribosomal binding site in the vicinity of the start codon.
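To illustrate how codon bias can serve as a content signal, the sketch below estimates codon frequencies from known coding sequences and scores a window by its log-odds against a uniform 1/64 background. The training sequence, the pseudo-frequency for unseen codons and the uniform background are all assumptions made for the example; practical tools use species-specific background models.

```python
import math
from collections import Counter

def codon_frequencies(coding_seqs):
    """Relative frequency of each in-frame codon across the given coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def coding_score(window, freqs, pseudo=1e-3):
    """Sum of log-odds of each in-frame codon versus a uniform 1/64 background."""
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        p = freqs.get(window[i:i + 3], pseudo)   # unseen codons get a small frequency
        score += math.log(p / (1 / 64))
    return score

# Toy training set (hypothetical): a single short "gene".
freqs = codon_frequencies(["ATGAAAAAATAA"])
typical = coding_score("AAAAAA", freqs)      # uses the preferred codon
atypical = coding_score("CCCCCC", freqs)     # uses a codon never seen in training
```

A positive score means the window uses codons the way the training genes do; a negative score means it does not.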


14/39

The cloverleaf structure of a tRNA molecule showing those features that are used by tRNAscan for detection.


15/39

The structure of a tRNA molecule: (A) Phe-tRNA molecule

showing the arrangement of the base pairing and loops

and (B) the 3-D structure.


16/39

The start and stop signals for prokaryotic transcription:

Start signal: short nucleotide sequences that bind transcription enzymes. Stop signal: short loop structure preventing the transcription apparatus from continuing.


17/39

Frequency of occurrence of different amino acid codons in genes and intergenic DNA.


18/39

Gene Search by Homology

Similarity of a portion of DNA to a known sequence can be used as both positive and negative evidence that the region is coding.

Positive evidence: Similarity to known genes in other organisms (comparative evidence).

Negative evidence: Similarity to repeat sequences (e.g. as flagged by RepeatMasker).

Can provide clues about gene location and function.

Can locate only about half of all human genes currently.


19/39

Important methods used in Gene Detection.


20/39

The start and stop signals for eukaryotic transcription


21/39

A schematic of the splicing of an intron


22/39

A segment of E. coli genome that has been fully

annotated, illustrated using the Artemis program.


23/39

A detailed view of a tRNA coding region and

the secondary structure of the tRNA molecule.


24/39

Eukaryotic DNA to protein.


25/39

Schematic representation of the ALDH10

gene with exons colored blue.


26/39

A flowchart of steps involved in the

identification and annotation of gene sequences.


27/39

Problems in gene finding

Length of a gene is variable.

All signals are probabilistic and have inherent (sometimes unknown) variability.

It is difficult to quantify the acceptable level of variability for each signal.

  Example: An ATG does not always mean a valid start codon.

There are exceptions to (almost) every rule.

  Example: Protein coding regions without an RBS in the vicinity of the start codon may still get expressed.


28/39

Partial sequence classification (Tagging)

The tagging problem:

Given: A set of tags L, and training examples of sequences showing the breakup of each sequence into the set of tags.

Learn to break up a sequence into tags (classification of parts of sequences).

Examples:

Text segmentation: Break a sequence of words forming an address string into subparts such as road and city name.

Continuous speech recognition: Identify words in continuous speech.

Gene finding: Identify boundaries of the protein coding regions in a DNA sequence, identify exons / introns, etc.


29/39

Discrete Markov Processes

A system is described at any time as being in one of a set of $N$ distinct states, $S_1, S_2, \ldots, S_N$.

The system undergoes a change of state (possibly back to the same state).

Full description: Specification of the current state as well as all predecessor states.

First-order Markov chain: description truncated to just the current state and the predecessor state.

$$P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots] = P[q_t = S_j \mid q_{t-1} = S_i]$$

Transition probabilities:

$$a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N$$

with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$.

[Figure: state diagram with states S1 to S4 and transitions labeled a21, a34 and a41.]


30/39

Example of an Observable Sequence: Weather Prediction

The weather on a day $t$ is characterized by one of the three states:

S1: Rain / Snow

S2: Cloudy

S3: Sunny

Transition probabilities:

$$A = \{a_{ij}\} = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$$

Initial state probabilities:

$$\pi_i = P[q_1 = S_i], \quad 1 \le i \le N$$

Given that the weather on day 1 is sunny, what is the probability that the weather for the next 5 days will be sun-sun-rain-rain-sun?

Find P(O|Model) = product of all the concerned transition probabilities.

Each state is directly observable in a Markov chain.

[Figure: three-state diagram of S1, S2 and S3 with labeled transitions such as a23 and a32.]
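The question on this slide can be worked through in a few lines. The sketch below assumes the transition matrix reads row by row as (0.4, 0.3, 0.3), (0.2, 0.6, 0.2), (0.1, 0.1, 0.8) for the states rain/snow, cloudy and sunny, and that day 1 is known to be sunny, so the initial state contributes a factor of 1. The function and variable names are illustrative.

```python
STATES = {"rain": 0, "cloudy": 1, "sunny": 2}

# Assumed transition matrix: A[i][j] = P(tomorrow is state j | today is state i).
A = [
    [0.4, 0.3, 0.3],   # from rain/snow
    [0.2, 0.6, 0.2],   # from cloudy
    [0.1, 0.1, 0.8],   # from sunny
]

def sequence_probability(observed, A, initial_prob=1.0):
    """Multiply the transition probabilities along an observed state sequence."""
    prob = initial_prob
    for prev, curr in zip(observed, observed[1:]):
        prob *= A[STATES[prev]][STATES[curr]]
    return prob

# Day 1 is sunny, followed by sun-sun-rain-rain-sun.
days = ["sunny", "sunny", "sunny", "rain", "rain", "sunny"]
p = sequence_probability(days, A)
```

Under these assumptions the product is a33 · a33 · a31 · a11 · a13 = 0.8 × 0.8 × 0.1 × 0.4 × 0.3 = 0.00768.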


31/39

Example of Hidden Markov Chain: State is not directly observable

Players A and B

A has a set of coins with different biases.

A repeatedly:

  Picks an arbitrary coin

  Tosses it an arbitrary number of times

B observes H/T (symbols) and guesses the transition points and biases.

The actual event is hidden from B.

HMMs are doubly stochastic models: occurrence of a state and the observed symbol in that state.
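The coin game can be simulated to make the two layers of randomness concrete. This is a sketch under assumed settings: the coin biases, the cap on tosses per round and the fixed seed are arbitrary choices for the example, and `simulate_coin_game` is a hypothetical name.

```python
import random

def simulate_coin_game(biases, n_rounds, max_tosses=5, seed=0):
    """Player A's side of the game: returns (hidden coin per toss, observed H/T string)."""
    rng = random.Random(seed)
    hidden, observed = [], []
    for _ in range(n_rounds):
        coin = rng.randrange(len(biases))            # hidden choice of coin (the state)
        for _ in range(rng.randint(1, max_tosses)):  # hidden number of tosses
            hidden.append(coin)
            # Second layer of randomness: the symbol emitted in this state.
            observed.append("H" if rng.random() < biases[coin] else "T")
    return hidden, "".join(observed)

# Two coins: one fair, one that lands heads 90% of the time (assumed values).
hidden, observed = simulate_coin_game([0.5, 0.9], n_rounds=4)
```

Player B sees only `observed`; `hidden` is what B has to guess.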


32/39

Elements of an HMM

Observed sequence (represented as symbols): $O = O_1 O_2 \cdots O_T$ ($T$ = duration of the sequence)

Sequence of states (typically hidden): $Q = q_1 q_2 \cdots q_T$

$N$, the number of states in the model. Although the states are hidden, there is often some physical significance attached to the states. $S = \{S_1, S_2, \ldots, S_N\}$

$M$, the number of distinct observation symbols per state. $V = \{v_1, v_2, \ldots, v_M\}$

The state transition probability distribution $A = \{a_{ij}\}$

The observation symbol probability distribution in state $j$, $B = \{b_j(k)\}$, or the emission frequency matrix.

The initial state distribution $\pi = \{\pi_i\}$

Thus, the model parameters: $\lambda = (A, B, \pi)$


33/39

Three basic problems for HMM

1. Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given the model? Correct solution feasible.

   Example: profile HMM, i.e. classification of a protein sequence based on competing HMMs for different protein families.

2. Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda$, how do we choose a corresponding state sequence $Q = q_1 q_2 \cdots q_T$? No single correct solution exists; some optimality criterion needs to be applied.

   Example: Finding the protein coding regions and exon / intron boundaries in an anonymous sequence of DNA.

3. How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$, where $O$ is the training sequence? Toughest problem.
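The standard efficient solution to the first problem is the forward algorithm, which sums over all state paths in O(N²T) time instead of enumerating them. The sketch below is a plain, unscaled version, so it will underflow on long sequences. The two-coin model parameters are made up for illustration.

```python
def forward(obs, A, B, pi):
    """Return P(O | model) for a discrete HMM.

    obs: list of symbol indices; A[i][j]: transition i -> j;
    B[j][k]: probability of symbol k in state j; pi[i]: initial probability.
    """
    N = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
            for j in range(N)
        ]
    # Termination: P(O | model) = sum_i alpha_T(i)
    return sum(alpha)

# Hypothetical two-coin model: state 0 = fair coin, state 1 = biased coin.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.5, 0.5], [0.9, 0.1]]   # columns: 0 = heads, 1 = tails
likelihood = forward([0, 0, 1, 0], A, B, pi)
```

For classification with competing models, the same observation sequence would be scored under each family's HMM and assigned to the one with the highest likelihood.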


34/39

Two sequence mining problems in biology

1. Finding genes in DNA sequences (2nd

Problem in HMM)

2. Classifying proteins according to family (1st

Problem in HMM)

The 3rd Problem in HMM needs to be tackled in

both 1 and 2 above.
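A common optimality criterion for the 2nd problem is the single most probable state path, which the Viterbi algorithm finds by dynamic programming. The sketch below applies it to a toy two-state model (non-coding vs. coding) whose transition and emission values are invented for illustration. It is not the model used by any particular gene finder, and it multiplies raw probabilities, so long sequences would need log-space arithmetic.

```python
def viterbi(obs, A, B, pi):
    """Return (most probable state path, its joint probability with obs)."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    backptrs = []
    for o in obs[1:]:
        ptr, new_delta = [], []
        for j in range(N):
            # Best predecessor state for arriving in state j at this step.
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        backptrs.append(ptr)
        delta = new_delta
    # Trace back from the best final state.
    state = max(range(N), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_prob

# Toy model (assumed values): state 0 = non-coding (AT-rich), state 1 = coding (GC-rich).
SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.4, 0.1, 0.1, 0.4], [0.1, 0.4, 0.4, 0.1]]
obs = [SYMBOLS[b] for b in "TTTTGGGGGGTTTT"]
path, prob = viterbi(obs, A, B, pi)
```

The state changes in `path` are the predicted region boundaries.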


35/39

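HMM for genes with introns (spliced genes)

[Figure: HMM consisting of a start model, a coding model that cycles through the three codon positions (c c c), a stop model, and three intron models. Each intron model is a donor model (GTxxxxx), an interior intron state and an acceptor model (xxxxxxxAG). The three intron models differ in where the intron interrupts the codon: between codons (ccc | ccc), after the second codon position (cc | c) or after the first (c | cc).]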


36/39

Hidden Markov model of a prokaryotic nucleotide

sequence used in the GeneMark.hmm algorithm.


37/39

Mathematical Problem Statement for Gene Finding

For an anonymous DNA sequence

$S = \{b_1, b_2, \ldots, b_L\}$, where $b_i \in \{A, T, G, C\}$

determine the functional role of each nucleotide

$A = \{a_1, a_2, \ldots, a_L\}$, where

$a_i = 0$ if non-coding

$a_i = 1$ if coding on the direct strand

$a_i = 2$ if coding on the complementary strand


38/39

Variable Duration HMM for gene finding

The trajectory $A$ is represented as a sequence of $M$ hidden states $a_i$ having durations $d_i$:

$A = \{(a_1 d_1)(a_2 d_2) \cdots (a_M d_M)\}$, where $\sum_{i=1}^{M} d_i = L$

Objective in gene finding: to find the trajectory

$A^* = \{(a_1^* d_1^*)(a_2^* d_2^*) \cdots (a_M^* d_M^*)\}$

which has the largest probability of occurring simultaneously with sequence $S$ compared to all other possible trajectories.
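The relationship between the per-nucleotide labelling and the state/duration trajectory can be shown with a small run-length encoding. The label vector below is invented for illustration and uses the coding scheme 0 = non-coding, 1 = coding on the direct strand, 2 = coding on the complementary strand.

```python
from itertools import groupby

def to_trajectory(labels):
    """Collapse per-nucleotide labels into (state, duration) pairs."""
    return [(state, len(list(run))) for state, run in groupby(labels)]

# Hypothetical labelling of a 15-nucleotide sequence.
labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 2, 2, 2, 0]
trajectory = to_trajectory(labels)

# The durations must add up to the sequence length L.
total_duration = sum(duration for _, duration in trajectory)
```

Searching over trajectories of this form, rather than over individual labels, is what lets the model attach an explicit length distribution to each state.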


39/39

Gene Finding & Training Sets

The majority of current gene finding algorithms and programs use a species-specific training set to develop the statistical models.

A training set consists of experimentally determined genes.

For organisms such as E. coli, we have a large training set (about 325 known genes from experimental / biochemical verification).

Question: How do we develop models for new organisms for which very few genes are experimentally characterized?
