Motif Finding [1]: Ch 4.4-4.6, 4.8-4.10, 5.5, 12.2-12.4



Biological Motivation

• Infection by bacteria and other pathogens (germs)

• Organisms have immunity genes, which are usually dormant

• Immunity genes are “switched on” when the organism is infected; they produce proteins that destroy the bacteria and pathogens and cure the infection

• Biologists want to know “Who turned them on?”

• For the fly, substrings similar to TCGGGGATTTCC within the gene (i.e., DNA sequence) turn them on

• TCGGGGATTTCC is called a regulatory motif


Random Sample

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca


Implanting Motif AAAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa


Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga


Implanting Motif AAAAAAAAGGGGGGG with Four Mutations/Changes

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
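The implanting step shown above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `implant`, the background length of 80, and the fixed random seed are hypothetical choices, not part of the slides.

```python
import random

def implant(sequence, motif, mutations, rng):
    """Overwrite a random window of `sequence` with a mutated copy of `motif`."""
    bases = "ACGT"
    mutated = list(motif)
    # Choose distinct positions and change each one to a different nucleotide.
    for pos in rng.sample(range(len(motif)), mutations):
        mutated[pos] = rng.choice([b for b in bases if b != motif[pos]])
    variant = "".join(mutated)
    start = rng.randrange(len(sequence) - len(motif) + 1)
    implanted = sequence[:start] + variant + sequence[start + len(motif):]
    return implanted, start, variant

rng = random.Random(0)  # fixed seed (assumption), only to make the sketch repeatable
background = "".join(rng.choice("acgt") for _ in range(80))
seq, start, variant = implant(background, "AAAAAAAAGGGGGGG", 4, rng)
print(start, variant)
print(seq)
```

Each call produces a different variant of the motif at a different position, which is why the implanted pattern is hard to spot once the case difference is removed.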


Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga


How to Find Regulatory Motif?

• How do we find the regulatory motif in immunity genes?

• What do we know, what don’t we know, and what do we want to find?

• We know:
– There is at least one regulatory motif in each immunity gene DNA sequence
– The motifs look similar
– The length l of the motif

• We don’t know:
– The exact pattern of the motif
– The location of the motif
– The number of occurrences

• Want to find:
– A substring of size l that is close to all regulatory motifs


A Similar Problem

• The Motif Finding Problem is similar to the problem posed by Edgar Allan Poe (1809 – 1849) in his Gold Bug story


The Gold Bug Problem

• Given a secret message:

53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!; 46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*; 4069285);)6!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?34;48)4+;161;:188;+?;

• Decipher the message encrypted in the fragment


Symbol Frequencies in the Gold Bug Message

• Gold Bug Message:

Symbol     8   ;   4   )   +   *   5   6   (   !   1   0   2   9   3   :   ?   `   -   ]   .
Frequency  34  25  19  16  15  14  12  11  9   8   7   6   5   5   4   4   3   2   1   1   1

• English Language (most frequent to least frequent):

e t a o i n s r h l d c u m f p g w y b v k x j q z
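A frequency table like the one above can be produced with a short Python sketch. The function name `symbol_frequencies` is a hypothetical choice, and the input here is only the first part of the secret message, so its counts are smaller than those in the table.

```python
from collections import Counter

def symbol_frequencies(message):
    """Return (symbol, count) pairs, most frequent first, ignoring whitespace."""
    counts = Counter(ch for ch in message if not ch.isspace())
    return counts.most_common()

# First part of the Gold Bug message as it appears on the slide.
fragment = "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
for symbol, count in symbol_frequencies(fragment)[:5]:
    print(symbol, count)
```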


First Attempt

• By simply mapping the most frequent symbols to the most frequent letters of the alphabet:

sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnlt

arhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyentacrmuesotorl

eoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfa

taeoaitdrdtpdeetiwt

• The result does not make sense


l-tuple count

• A better approach:

– Examine frequencies of l-tuples, combinations of 2 symbols, 3 symbols, etc.

– “The” is the most frequent 3-tuple in English and “;48” is the most frequent 3-tuple in the encrypted text

– Make inferences of unknown symbols by examining other frequent l-tuples
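Counting l-tuples amounts to sliding a window of width l over the text. The sketch below is a minimal illustration; the name `ltuple_counts` is hypothetical and the input is only the first part of the message.

```python
from collections import Counter

def ltuple_counts(text, l):
    """Count every (overlapping) substring of length l in text."""
    return Counter(text[i:i + l] for i in range(len(text) - l + 1))

# First part of the Gold Bug message as it appears on the slide.
fragment = "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
print(ltuple_counts(fragment, 3).most_common(3))
```

On the full message the same call would be expected to rank “;48” first, as the slide states.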


The ;48 clue

• Mapping “the” to “;48” and substituting all occurrences of the symbols:

53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t

h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e

)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht

he)h+t161t:1eet+?t
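The substitution above can be reproduced with Python's built-in string translation. This is a small sketch; the helper name `substitute` and the dictionary `known` are illustrative names, not from the slides.

```python
def substitute(ciphertext, known):
    """Replace every symbol that has a known mapping; leave the rest untouched."""
    return ciphertext.translate(str.maketrans(known))

known = {";": "t", "4": "h", "8": "e"}  # the ";48" = "the" clue
fragment = "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
print(substitute(fragment, known))
# 53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
```

Adding further guesses to `known` (for example "(" for "r") and re-running the substitution mirrors the step-by-step inference on the following slides.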


Second Attempt

• Make inferences:

53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!th6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3hthe)h+t161t:1eet+?t

• “thet(ee” most likely means “the tree”
– Infer “(” = “r”
– “th(+?3h” becomes “thr+?3h”
– Can we guess “+” and “?”?


The Solution

• After figuring out all the mappings, the final message is:

AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATWENYONEDEGRE

ESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENTHLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEATHSHEADABEELINE

FROMTHETREETHROUGHTHESHOTFIFTYFEETOUT


The Solution

A GOOD GLASS IN THE BISHOP’S HOSTEL IN THE DEVIL’S SEAT,

TWENTY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH,

MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT EYE OF

THE DEATH’S HEAD A BEE LINE FROM THE TREE THROUGH THE SHOT,

FIFTY FEET OUT.


Motif Finding is harder than the Gold Bug problem

• We don’t have the complete dictionary of motifs yet

• The “genetic” language does not have a standard “grammar”

• Only a small fraction of nucleotide sequences encode motifs; the size of the data is enormous


The Motif Finding Problem

• Given random samples of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

• Find the pattern/motif of length l that is implanted in each of the individual sequences


The Motif Finding Problem

• The patterns revealed with no mutations:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

Consensus string (this is the motif): acgtacgt


The Motif Finding Problem

• The patterns with 2 mutations:

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

What is the consensus string here?


Parameters

cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

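• l = 8 (length of the motif)

• t = 5 (number of sequences in DNA)

• n = 69 (length of each sequence in DNA)

• s = (s1, s2, s3, s4, s5) = (26, 21, 3, 56, 60) (starting positions of the motif in each sequence)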


Scoring Motifs

• For s = (s1, …, st) and DNA

• $Score(s, DNA) = \sum_{i=1}^{l} \max_{k \in \{A, C, G, T\}} count(k, i)$, where count(k, i) is the number of times nucleotide k occurs in column i of the alignment

• Find s with maximum score

• What is the best/worst score?

            a G g t a c T t
            C c A t a c g t
            a c g t T A g t
            a c g t C c A t
            C c g t a c g G
            _______________
         A  3 0 1 0 3 1 1 0
         C  2 4 0 0 1 4 0 0
         G  0 1 4 0 0 0 3 1
         T  0 0 0 5 1 0 1 4
            _______________
Consensus   a c g t a c g t

Score = 3+4+4+5+3+4+3+4 = 30
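The score of a set of starting positions can be computed directly from the definition: cut out the l-mers, then add up the largest count in each column. The sketch below is a minimal illustration; the function name `score` and the choice to pass l explicitly are assumptions. Positions are 1-based, as on the slides.

```python
from collections import Counter

def score(s, dna, l):
    """Sum over alignment columns of the count of the most frequent nucleotide."""
    # s holds 1-based starting positions, one per sequence.
    rows = [seq[start - 1:start - 1 + l].lower() for start, seq in zip(s, dna)]
    return sum(max(Counter(column).values()) for column in zip(*rows))

# The five sequences with two mutations per motif instance, from the slides.
dna = [
    "cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc",
]
print(score((26, 21, 3, 56, 60), dna, 8))  # 30
```

The best possible score is l * t (every column unanimous); with four nucleotides, the worst is reached when each column is split as evenly as possible.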


BruteForceMotifSearch

BruteForceMotifSearch(DNA, t, n, l)
    bestScore ← 0
    for each s = (s1, s2, …, st) from (1, 1, …, 1) to (n-l+1, …, n-l+1)
        if Score(s, DNA) > bestScore
            bestScore ← Score(s, DNA)
            bestMotif ← (s1, s2, …, st)
    return bestMotif

Cost

• (n - l + 1)^t possible sets of starting positions

• In each iteration, O(lt) operations for scoring; total O(lt · n^t)
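A direct Python rendering of the brute-force search might look like the sketch below. It assumes all sequences have the same length n, uses 1-based positions as on the slides, and is only practical for toy inputs because it enumerates all (n - l + 1)^t position vectors. The toy sequences are invented for illustration.

```python
from collections import Counter
from itertools import product

def score(s, dna, l):
    """Sum over alignment columns of the count of the most frequent nucleotide."""
    rows = [seq[start - 1:start - 1 + l] for start, seq in zip(s, dna)]
    return sum(max(Counter(column).values()) for column in zip(*rows))

def brute_force_motif_search(dna, l):
    """Try every vector of starting positions and keep the best-scoring one."""
    n = len(dna[0])  # assumption: all sequences share the same length
    best_score, best_motif = 0, None
    for s in product(range(1, n - l + 2), repeat=len(dna)):
        current = score(s, dna, l)
        if current > best_score:
            best_score, best_motif = current, s
    return best_motif, best_score

toy = ["ttacgt", "acgttt", "tacgta"]  # hypothetical toy input containing "acgt"
print(brute_force_motif_search(toy, 4))  # ((3, 1, 2), 12)
```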


A Different Look

• Given v = “acgtacgt” and starting positions s, compare v with the closest l-mer in each sequence (mismatches shown in upper case):

cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat   (distance 1)

agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc   (distance 0)

aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt   (distance 2)

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca   (distance 0)

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc   (distance 1)

• TotalDistance(v,DNA) = sum over the sequences of the minimum distance between v and any l-mer of the sequence (min for each sequence over all positions) = 1 + 0 + 2 + 0 + 1 = 4
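TotalDistance can be sketched as a Hamming-distance minimum taken per sequence and then summed. The function names `hamming` and `total_distance` are illustrative; the sequences are the five from this slide, lower-cased so that the marked mismatches count as ordinary nucleotides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(v, dna):
    """Sum over sequences of the minimum distance from v to any l-mer."""
    l = len(v)
    return sum(
        min(hamming(v, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )

dna = [seq.lower() for seq in [
    "cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc",
]]
print(total_distance("acgtacgt", dna))  # 4
```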


The Problem

• Input: A t x n matrix DNA, and l, the length of the pattern to find

• Output: A string v of l nucleotides that minimizes TotalDistance(v,DNA) over all strings of that length


Median String Search Brute Force Algorithm

MedianStringSearch(DNA, t, n, l)
    bestWord ← AAA…A
    bestDistance ← ∞
    for each l-mer s from AAA…A to TTT…T
        if TotalDistance(s, DNA) < bestDistance
            bestDistance ← TotalDistance(s, DNA)
            bestWord ← s
    return bestWord

Cost

• 4^l possible l-mers

• Time to compute the minimum distance for each of the t DNA sequences: O(n)

• Total O(nt · 4^l)
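The median string search translates almost line for line into Python. The sketch below enumerates all 4^l candidate words with `itertools.product`, so it is meant for small l only; the function names and the toy input are assumptions for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(v, dna):
    """Sum over sequences of the minimum distance from v to any l-mer."""
    l = len(v)
    return sum(
        min(hamming(v, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )

def median_string_search(dna, l):
    """Try every l-mer from aaa...a to ttt...t and keep the closest one."""
    best_word, best_distance = None, float("inf")
    for letters in product("acgt", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance

toy = ["ttacgt", "acgttt", "tacgta"]  # hypothetical toy input containing "acgt"
print(median_string_search(toy, 4))  # ('acgt', 0)
```

Unlike the brute-force motif search, the loop here does not depend on the number of starting-position combinations, only on the number of candidate words.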


Motif Finding Problem == Median String Problem

Alignment      a G g t a c T t
               C c A t a c g t
               a c g t T A g t
               a c g t C c A t
               C c g t a c g G
               _______________
Profile     A  3 0 1 0 3 1 1 0
            C  2 4 0 0 1 4 0 0
            G  0 1 4 0 0 0 3 1
            T  0 0 0 5 1 0 1 4
               _______________
Consensus      a c g t a c g t

Score          3+4+4+5+3+4+3+4

TotalDistance  2+1+1+0+2+1+2+1

Sum            5 5 5 5 5 5 5 5

• At any column i, Score_i + TotalDistance_i = t

• For l columns, Score + TotalDistance = l * t

• Score = l * t - TotalDistance

• Motif Finding = O(l · n^t); Median String = O(nt · 4^l)
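The column-by-column identity can be checked numerically on the alignment from this slide. The sketch below is only a sanity check for one fixed alignment, with the consensus taken as the most frequent nucleotide in each column; the variable names are illustrative.

```python
from collections import Counter

# The five aligned 8-mers from the slide, lower-cased.
rows = ["aggtactt", "ccatacgt", "acgttagt", "acgtccat", "ccgtacgg"]
t, l = len(rows), len(rows[0])
columns = list(zip(*rows))

# Per-column score: count of the most frequent nucleotide.
col_scores = [max(Counter(col).values()) for col in columns]
# Consensus: the most frequent nucleotide in each column.
consensus = "".join(Counter(col).most_common(1)[0][0] for col in columns)
# Per-column distance: number of rows that disagree with the consensus.
col_distances = [sum(ch != k for ch in col) for col, k in zip(columns, consensus)]

print(consensus)       # acgtacgt
print(col_scores)      # [3, 4, 4, 5, 3, 4, 3, 4]
print(col_distances)   # [2, 1, 1, 0, 2, 1, 2, 1]
print(sum(col_scores) + sum(col_distances) == l * t)  # True
```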


Self Study

• Can you convert the two brute force algorithms to branch and bound algorithms to reduce the number of checks?