32
Exhaustive search (cont’d) CS 466 Saurabh Sinha

Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Embed Size (px)

Citation preview

Page 1: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Exhaustive search (cont’d)

CS 466

Saurabh Sinha

Page 2: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Finding motifs ab initio

• Enumerate all possible strings of some fixed (small) length

• For each such string (“motif”) count its occurrences in the promoters

• Report the most frequently occurring motif

• Does the true motif pop out ?

Chapter 4.5-4.9

Page 3: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Simple statistics

• Consider 10 promoters, each 100 bp long

• Suppose a secret motif ATGCAACT has been “planted”

in each promoter

• Our enumerative method counts every possible “8-mer”

• Expected number of occurrences of an 8-mer is 10 x

100 x (1/4)8 ≈ 0.015

• Most likely, an arbitrary 8-mer will occur at most once,

may be twice

• 10 occurrences of ATGCAACT will stand out

Page 4: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Variation in binding sites

• Motif occurrences will not always be exact copies of the consensus string

• The transcription factor can usually tolerate some variability in its binding sites

• It’s possible that none of the 10 occurrences of our motif ATGCAACT is actualy this precise string

Page 5: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

A new motif model

• To define a motif, lets say we know where the motif occurrence starts in the sequence

• The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

Page 6: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Motifs: Profiles and Consensus

a G g t a c T t

C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G

_________________

A 3 0 1 0 3 1 1 0Profile C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4

_________________

Consensus A C G T A C G T

• Line up the patterns by their start indexes

s = (s1, s2, …, st)

• Construct matrix profile with frequencies of each nucleotide in columns

• Consensus nucleotide in each position has the highest score in column

Page 7: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Profile matrices

• Suppose there were t sequences to begin with

• Consider a column of a profile matrix

• The column may be (t, 0, 0, 0)

– A perfectly conserved column

• The column may be (t/4, t/4, t/4, t/4)

– A completely uniform column

• “Good” profile matrices should have more

conserved columns

Page 8: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Scoring Motifs

• Given s = (s1, … st) and DNA

Score(s,DNA) =

a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0 C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________

Consensus a c g t a c g t

Score 3+4+4+5+3+4+3+4=30

l

t

∑= ∈

l

i GCTAk

ikcount1 },,,{

),(max

Page 9: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Good profile matrices

• Goal is to find the starting positions s=(s1,…st) to maximize the score(s,DNA) of the resulting profile matrix

• This is one formulation of the “motif finding problem”

Page 10: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Another formulation

• Hamming distance between two strings v and w is

dH(v,w) = number of mismatches between v and w

• Given an array of starting positions s=(s1,…st), we

define dH(v, s) = ∑idH(v,si)

• Define: TotalDist(v, DNA) = mins dH(v,s)

• Computing TotalDist is easy

– find closest string to v in each input sequence

Page 11: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

The median string problem

• Find v that minimizes TotalDist(v)

• A double minimization (mins, minv)

• Equivalent to motif finding problem– Show this

Page 12: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Naïve time complexity

• Motif finding problem: Consider every (s1,…st): O((n-l+1)t)

• Median string problem: Consider every l-mer: O(4l). Relatively fast !

• Common form of both expressions: Find a vector of L variables, each variable can take k values: O(kL)

Page 13: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

An algorithm to enumerate !

• Want to generate all strings in {1,2,3,4}L

11…1111…1211…1311…14..

44…44

NEXTLEAF(a, L, k)for i := L to 1 if ai < k ai := ai+1 return a ai := 1return a

Increment the least significant digit; and “carry over” to next position if necessary

ALLLEAVES(L, k)a := (1,..,1)while true output a a := NEXTLEAF(a,L,k) if a = (1,..,1) return

Page 14: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

“Seach Tree” for enumeration

--

Order of steps

11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44

1- 2- 3- 4-

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

12 3

4

12 3

4

Page 15: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Visiting all the vertices in tree

• Not just the leaves, but the internal nodes also– Why? Why not only the leaves of the tree?– We’ll see later.

• PreOrder traversal of a tree• PreOrder(node):

1. Visit node2. PreOrder(left child of node)3. PreOrder(right child of node)

• How about a non- “recursive” version?

Page 16: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Visit the Next Vertex

1. NextVertex(a,i,L,k) // a : the array of digits2. if i < L // i : prefix length 3. a i+1 1 // L: max length4. return ( a,i+1) // k : max digit value5. else6. for j L to 17. if aj < k8. aj aj +19. return( a,j )10. return(a,0)

Page 17: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

In words

• If at an internal node, just go one level deeper.

• If at a leaf node, – go to next leaf– if moved to a non-sibling in this process,

jump up

Page 18: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Bypassing

• What if we wish to skip an entire subtree (at some internal node) during the tree traversal ?

BYPASS(a, i, L, k)for j := i to 1 if aj < k aj := aj+1 return (a,j)return (a,0)

Page 19: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Brute Force Solution for the Motif finding problem

1. BruteForceMotifSearchAgain(DNA, t, n, l)2. s (1,1,…, 1)3. bestScore Score(s,DNA)4. while forever5. s NextLeaf (s, t, n-l+1)6. if (Score(s,DNA) > bestScore)7. bestScore Score(s, DNA)8. bestMotif (s1,s2 , . . . , st) 9. return bestMotif

O(l(n-l+1)t)

Page 20: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Can We Do Better?

• Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si)

• Every row of alignment may add at most l to Score• Optimism: if all subsequent (t-i) positions (si+1, …st) add

(t – i ) * l to Score(s,i,DNA)

• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree– Use ByPass()

• “Branch and bound” strategy– This saves us from looking at (n – l + 1)t-i leaves

Page 21: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Pseudocode for Branch and Bound Motif Search

1. BranchAndBoundMotifSearch(DNA,t,n,l)2. s (1,…,1)3. bestScore 04. i 15. while i > 06. if i < t7. optimisticScore Score(s, i, DNA) +(t – i ) * l8. if optimisticScore < bestScore9. (s, i) Bypass(s,i, n-l +1)10. else 11. (s, i) NextVertex(s, i, n-l +1)12. else 13. if Score(s,DNA) > bestScore14. bestScore Score(s)15. bestMotif (s1, s2, s3, …, st)16. (s,i) NextVertex(s,i,t,n-l + 1)17. return bestMotif

Page 22: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

The median string problem

• Enumerate 4l strings v

• For each v, compute TotalDist(v, DNA)– This requires linear scan of DNA, i.e., O(nt)

• Overall: O(nt4l)

• Improvement by branch and bound ?

• During enumeration of l-mers, suppose we are at some prefix v’, and find that TotalDist(v’,DNA) > BestDistanceSoFar.

• Why enumerate further ?

Page 23: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Bounded Median String Search1. BranchAndBoundMedianStringSearch(DNA,t,n,l )2. s (1,…,1)3. bestDistance ∞4. i 15. while i > 06. if i < l7. prefix string corresponding to the first i nucleotides of s8. optimisticDistance TotalDistance(prefix,DNA)9. if optimisticDistance > bestDistance10. (s, i ) Bypass(s,i, l, 4)11. else 12. (s, i ) NextVertex(s, i, l, 4)13. else 14. word nucleotide string corresponding to s15. if TotalDistance(s,DNA) < bestDistance16. bestDistance TotalDistance(word, DNA)17. bestWord word18. (s,i ) NextVertex(s,i,l, 4)19. return bestWord

Page 24: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Greedy Algorithms

Page 25: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

A greedy approach to the motif finding problem

• Given t sequences of length n each, to find a profile matrix of length l.

• Enumerative approach O(l nt) – Impractical

• Instead consider a more practical algorithm called “GREEDYMOTIFSEARCH”

Chapter 5.5

Page 26: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Greedy Motif Search

• Find two closest l-mers in sequences 1 and 2 and form 2 x l alignment matrix with Score(s,2,DNA)• At each of the following t-2 iterations, finds a “best” l-mer in

sequence i from the perspective of the already constructed (i-1) x l alignment matrix for the first (i-1) sequences

• In other words, it finds an l-mer in sequence i maximizing Score(s,i,DNA)

under the assumption that the first (i-1) l-mers have been already chosen

• Sacrifices optimal solution for speed: in fact the bulk of the time is actually spent locating the first 2 l-mers

Page 27: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Greedy Motif Search pseudocode

• GREEDYMOTIFSEARCH (DNA, t, n, l)• bestMotif := (1,…,1)• s := (1,…,1)• for s1=1 to n-l+1

for s2 = 1 to n-l+1 if (Score(s,2,DNA) > Score(bestMotif,2,DNA)

bestMotif1 := s1

bestMotif2 := s2

• s1 := bestMotif1; s2 := bestMotif2

• for i = 3 to t

for si = 1 to n-l+1 if (Score(s,i,DNA) > Score(bestMotif,i,DNA)

bestMotifi := si

si := bestMotifi • Return bestMotif

Page 28: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

A digression

• Score of a profile matrix looks only at the “majority” base in each column, not at the entire distribution

• The issue of non-uniform “background” frequencies of bases in the genome

• A better “score” of a profile matrix ?

Page 29: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Information Content• First convert a “profile matrix” to a “position

weight matrix” or PWM– Convert frequencies to probabilities

• PWM W: Wk = frequency of base at position k

• q = frequency of base by chance• Information content of W:

Wβk logWβk

qββ ∈{A ,C ,G,T}

∑k

Page 30: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Information Content

• If Wk is always equal to q, i.e., if W is similar to random sequence, information content of W is 0.

• If W is different from q, information content is high.

Page 31: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

Greedy Motif Search• Can be trivially modified to use “Information Content” as the

score

• Use statistical criteria to evaluate significance of Information Content

• At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs– “Beam search”

• The program “CONSENSUS” from Stormo lab.

• Further Reading: Hertz, Hartzell & Stormo, CABIOS (1990) http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/hertz90.pdf

Page 32: Exhaustive search (cont’d) CS 466 Saurabh Sinha. Finding motifs ab initio Enumerate all possible strings of some fixed (small) length For each such string

More on Greedy algorithms in next lecture