36
Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Embed Size (px)

Citation preview

Page 1: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Algorithms for Finding Genes

Rhys Price JonesAnne R. Haake

Page 2: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

There’s something about Genes

The human genome consists of about 3 billion base pairs

Some of that can be transcribed and translated to produce proteins

Most of it does not What is it that makes some

sequences of DNA encode protein?

Page 3: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

The Genetic Code Pick a spot in your sequence Group nucleotides from that point

on in triplets. Each triplet encodes for an amino

acid, or a “stop”, or a “start” TGA, TAG, TAA code for “stop” But there are exceptions!

Page 4: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Complications In the 1960s it was thought that a

gene was a linear structure of nucleotides directly linked in triplets to amino acids in the resulting protein

This view did not last long with the discovery in the late 1960s of genes within genes and overlapping genes

In 1977, split genes were discovered

Page 5: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Introns and Exons Most eukaryotic genes are interrupted

by non-coding sections and broken into pieces called exons. The interrupting sequences are called introns

Some researchers have used a biological approach and searched for splicing sites at intron-exon junctions

Catalogs of splice sites were created in the 1980s

But unreliable

Page 6: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Can you tell if it’s a gene? What is it about some sections of DNA that

make them encode for proteins? Compare

TTCTTCTCCAAGAGCAGGGCTTAATTCTATGCTTCCAGGCGAAAGACTGCATGGCTAACAAAGCAACGCCTAACACATTCCTAAGCAATTGGCTTGCACC

GTCTTCCCGAGGGTGTTTCTCCAATGGAAAGAGGCGTCGCTGGGCACCCGCCGGGAACGGCCGGGTGACCACCCGGTCATTGTGAACGGAAGTTTCGAGA

Is there anything that distinguishes the second from the first?

We’ll return to this issue

Page 7: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Before we embark on the biological problem

We will look at an easier analogy Sometimes called the Crooked Casino Basic idea:

sequence of values generated by one or more processes

from the values, try to deduce which process produced which parts of the sequence

Page 8: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

A useful analogy The loaded die I rolled a die 80 times and got

23636626516113441416663242646331422242145353556233136631321454662631116546622542

Some of the time it was a fair die Some of the time the die produced 6s

half the time and 1-5 randomly the rest of the time

Page 9: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Confession (append (fair 20) (loaded 10) (fair 20)

(loaded 10) (fair 20)) (define fair

(lambda (n) (if (zero? n) '() (cons (1+ (random 6)) (fair (1- n))))))

(define loaded (lambda (n) (if (zero? n)

'()(cons (if (< (random 2) 1) 6 (1+ (random 5))) (loaded (1- n))))))

23636626516113441416663242646331422242145353556233136631321454662631116546622542

Page 10: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

There are no computer algorithms to recognize genes reliably Statistical approaches Similarity-based approaches Others

Page 11: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Similarity Methods Searching sequences for regions that

have the “look and feel” of a gene (Dan Gusfield)

An early step is the investigation of ORFs

(in Eukaryotes) look for possible splicing sites and try to assemble exons

Combine sequence comparison and database search, seeking similar genes

Page 12: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Statistical Approaches Look for features that appear frequently in

genes but not elsewhere As for the loaded die sequence above, that is

not 100% reliable But then, what is? This is Biology and it’s had

billions of years to figure out how to outfox us! If you want to learn about nature, to appreciate

nature, it is necessary to understand the language that she speaks in. She offers her information only in one form; we are not so unhumble as to demand that she change before we pay any attention. (R. Feynman)

Page 13: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Relative Frequency of Stop Codons 3 of the possible 64 nucleotide triplets

are stop codons We would thus expect stop codons to

form approximately 5% of any stretch of random DNA, giving an average distance between stop codons of about 20 codons

Very long stretches of triplets containing no stop codons (long ORFs) might, therefore, indicate gene possibilities

Page 14: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Rather like looking for regions where my die was fair By seeking a shortage of 6s Similarly, there is (organism-specific)

codon bias of various kinds, and this can be exploited by gene-finders

People and programs have looked for such bias features as: regularity in codon frequencies regularity in bicodon frequencies periodicity homogeneity vs. complexity

Page 15: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

CpG Islands CG dinucleotides are often written CpG

to avoid confusion with the base pair C-G In the human genome CpG is a rarer bird

than would be expected in purely random sequences (there are chemical reasons for this involving methylation)

In the start regions of many genes, however, the methylation process is suppressed, and CpG dinucleotides appear more frequently than elsewhere

Page 16: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Rather like looking for stretches where my die was loaded

By seeking a plethora of 6s But now we’ll look for high

incidences of CpG

Page 17: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

How can we detect shortage or plethora of CpGs?

Markov Models Assume that the next state

depends only on the previous state If the states are x1 x2 ... xn

We have a matrix of probabilities pij

pij is the probability that state xj will follow xi

Page 18: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Transition matrices CpG islands A C G T A 18 27 43 12C 17 37 27 19G 16 34 37 13T 8 35 38 18

elsewhere A C G T A 30 20 28 21C 32 30 8 30G 25 25 30 20T 18 24 29 29 (adapted from Durbin et al, 1998)

Page 19: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

How do we get those numbers?

empirically lots of hard wet and dry lab work We can think of these numbers as parameters of our model later we’ll pay more attention to

model parameters

Page 20: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

For our loaded die example we have

0 means non-6 loaded 0 60 50 50 6 50 50

unloaded 0 60 83 17 6 83 17

Page 21: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Discrimination Given a sequence x we can calculate a log odds score that it

was generated by one of two models. S(x) = log (P(x | model A) / P(x | model B))

= sum of log((prob xi-1 to xi in A) / (prob xi-1 to xi in B)) Demonstration

(define beta (lambda (x y) (cond ((and (= x 6) (< y 6)) (log (/ .83 .5))) ((and (= x 6) (= y 6)) (log (/ .17 .5))) ((and (< x 6) (< y 6)) (log (/ .83 .5))) ((and (< x 6) (= y 6)) (log (/ .17 .5))))))

(define s (lambda (l) (cond ((null? (cdr l)) 0) (else (+ (beta (car l) (cadr l)) (s (cdr l)))))))

Page 22: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Similar Analysis for CpG Islands Visit the code (show mary) (show gene) (scorecpg mary) (scorecpg gene) Sliding Window

(slidewindow 20 mary) (show (classify mary))

Page 23: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Hidden Markov Models The Markov model we have just seen could be used for

locating CpG islands For example you could calculate the log odds score of a

moving window of, say 50 nucleotides Hidden Markov Models try to combine the two models in

a single model For each state, say x, in the Markov model, we imagine

two states xA and xB for the two models A and B Both new states “emit” the same character But we don’t know whether it was CA or CB that

“emitted” the nucleotide C

Page 24: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

HMMs Hidden Markov Models are so prevalent in

bioinformatics, we’ll just refer to them as HMMs Given

a sequence of “emitted signals” -- the observed sequence and a matrix of transition probabilities and emission

probabilities For a path of hidden states, calculate the conditional

probability of the sequence given the path Then try to find an optimal path for the sequence

such that the conditional probability is maximized

Page 25: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

HMM “combines” Markov Models

G+

C+

T+

T-

G-C-A-

A+

18

27

43

12

17

3727

19

16 34

37

13

8 35

38

18

30

20

28

21

32

308

30

25 25

20

30

18 2429

29

Page 26: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Must be able to move between models

Figure out these transition probabilities

G+

C+

T+

T-

G-C-A-

A+

Page 27: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Add a begin and an end state

and figure even more transition probabilities

G+

C+

T+

T-

G-C-A-

A+

B E

Page 28: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Basic Idea Each state emits a nucleotide (except B

and E) A+ and A- both emit A C+ and C- both emit C You know the sequence of emissions You don’t know the exact sequences of

states Was it A+ C+ C+ G- that produced ACCG Or was it A- C- C+ G-

You want to find the most likely sequence

Page 29: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Parameters of the HMM How do we know the transition

probabilities between hidden states? empirical, hard lab work as above for the CpG island transition

probabilities imperfect data will lead to imperfect

model Is there an alternative?

Page 30: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Machine Learning Grew out of other disciplines, including

statistical model fitting Tries to automate the process as much as

possible use very flexible models lots of parameters let the program figure them out

Based on ... known data and properties training sets

Induction and inference

Page 31: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Bayesian Modeling

Understand the hypotheses (model) Assign prior probabilities to the

hypotheses Use Bayes’ theorem (and the rest of

“probability calculus”) to evaluate posterior probabilities for the hypotheses in light of actual data

Page 32: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Neural Networks Very flexible Inputs Thresholds Weights Output Variations:

additional layers loops

Page 33: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Training Neural Networks

Need set of inputs and desired outputs

Run the network with a given input Compare to desired output Adjust weights to improve Repeat...

Page 34: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Training HMMs Similar idea Begin with guesses for the

parameters of the HMM Run it on your training set, and see

how well it predicts whatever you designed it for

Use the results to adjust the parameters

Repeat...

Page 35: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Hope that

When you provide brand new inputs

The trained network will produce useful outputs

Page 36: Algorithms for Finding Genes Rhys Price Jones Anne R. Haake

Some gene finder programs

GRAIL – neural net discriminant Genie – HMM GENSCAN – “Semi Markov Model” GeneParser – neural net FGENEH etc (Baylor) – Classification by Linear

Discrimination Analysis (LDA) a well-known statistical technique

GeneID – rule-based