43
262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

  • View
    222

  • Download
    1

Embed Size (px)

Citation preview

Page 1: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Rapid Global Alignments

How to align genomic sequences in (more or less) linear time

Page 2: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Page 3: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Main Idea

Genomic regions of interest contain islands of similarity, such as genes

1. Find local alignments

2. Chain an optimal subset of them

3. Refine/complete the alignment

Systems that use this idea to various degrees:

MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others

Page 4: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Saving cells in DP

1. Find local alignments

2. Chain -O(NlogN) L.I.S.

3. Restricted DP

Page 5: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Methods to CHAIN Local Alignments

Sparse Dynamic ProgrammingO(N log N)

Page 6: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

The Problem: Find a Chain of Local Alignments

(x,y) (x’,y’)

requires

x < x’y < y’

Each local alignment has a weight

FIND the chain with highest total weight

Page 7: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Sparse Dynamic Programming – L.I.S.

Let input be w: w1,…, wn

INITIALIZATION:L: 1-indexed array, L[1] w1

B: 0-indexed array of backpointers; B[0] = 0P: array used for traceback// L[j]: smallest last element wi of j-long LIS seen so far

ALGORITHMfor i = 2 to n { Find j such that L[j] < w[i] ≤ L[j+1] L[j+1] w[i]

B[j+1] iP[i] B[j]

}

That’s it!!!• Running time?

Page 8: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Sparse LCS expressed as LIS

Create a sequence w

• Every matching point (i, j), is inserted into w as follows:

• For each column j = 1…m, insert in w the points (i, j), in decreasing row i order

• The 11 example points are inserted in the order given

• a = (y, x), b = (y’, x’) can be chained iff

a is before b in w, and y < y’

15 3 24 16 20 4 24 3 11 18

6

4

2 7

1 8

10

9

5

11

3

4

20

24

3

11

15

11

4

18

20

x

y

Page 9: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Sparse LCS expressed as LIS

Create a sequence w

w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

Consider now w’s elements as ordered lexicographically, where

• (y, x) < (y’, x’) if y < y’

Claim: An increasing subsequence of w is a common subsequence of x and y

15 3 24 16 20 4 24 3 11 18

6

4

2 7

1 8

10

9

5

11

3

4

20

24

3

11

15

11

4

18

20

x

y

Page 10: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Sparse Dynamic Programming for LIS

Example:w = (4,2) (3,3) (10,5) (2,5) (8,6)

(1,6) (3,7) (4,8) (7,9) (5,9) (9,10)

L =1. (4,2)2. (3,3)3. (3,3) (10,5)4. (2,5) (10,5)5. (2,5) (8,6)6. (1,6) (8,6)7. (1,6) (3,7)8. (1,6) (3,7) (4,8)9. (1,6) (3,7) (4,8) (7,9)10. (1,6) (3,7) (4,8) (5,9)11. (1,6) (3,7) (4,8) (5,9) (9,10) Longest common subsequence:

s = 4, 24, 3, 11, 18

15 3 24 16 20 4 24 3 11 18

6

4

2 7

1 8

10

9

5

11

3

4

20

24

3

11

15

11

4

18

20

x

y

Page 11: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Sparse DP for rectangle chaining

• 1,…, N: rectangles

• (hj, lj): y-coordinates of rectangle j

• w(j): weight of rectangle j

• V(j): optimal score of chain ending in j

• L: list of triplets (lj, V(j), j)

L is sorted by lj: smallest (top) to largest (bottom) value

L is implemented as a balanced binary tree

y

h

l

Page 12: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Sparse DP for rectangle chaining

Main idea:

• Sweep through x-coordinates

• To the right of b, anything chainable to a is chainable to b

• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it

• In L, keep rectangles j sorted with increasing lj-coordinates sorted with increasing V(j) score

V(b)V(a)

Page 13: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Sparse DP for rectangle chaining

Go through rectangle x-coordinates, from lowest to highest:

1. When on the leftmost end of rectangle i:

a. j: rectangle in L, with largest lj < hi

b. V(i) = w(i) + V(j)

2. When on the rightmost end of i:

a. k: rectangle in L, with largest lk lib. If V(i) V(k):

i. INSERT (li, V(i), i) in L

ii. REMOVE all (lj, V(j), j) with V(j) V(i) & lj li

i

j

k

Page 14: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Example

x

y

1: 5

3: 3

2: 6

4: 45: 2

2

56

91011

1214

1516

Page 15: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Time Analysis

1. Sorting the x-coords takes O(N log N)

2. Going through x-coords: N steps

3. Each of N steps requires O(log N) time:

• Searching L takes log N• Inserting to L takes log N• All deletions are consecutive, so log N per deletion• Each element is deleted at most once: N log N for all deletions

• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree

Page 16: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Examples

Human Genome BrowserABC

Page 17: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Whole-genome alignment Rat—Mouse—Human

Page 18: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned

Page 19: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Protein Classification

Page 20: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

PDB GrowthN

ew

PD

B s

tru

ctu

res

Page 21: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Protein classification

• Number of protein sequences grow exponentially• Number of solved structures grow exponentially• Number of new folds identified very small (and close to constant)• Protein classification can

Generate overview of structure types Detect similarities (evolutionary relationships) between protein sequences

Morten Nielsen,CBS, BioCentrum, DTU

SCOP release 1.67, Class # folds # superfamilies # families

All alpha proteins 202 342 550

All beta proteins 141 280 529

Alpha and beta proteins (a/b) 130 213 593

Alpha and beta proteins (a+b) 260 386 650

Multi-domain proteins 40 40 55

Membrane & cell surface 42 82 91

Small proteins 72 104 162

Total 887 1447 2630

Page 22: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Protein world

Protein fold

Protein structure classification

Protein superfamily

Protein family

Morten Nielsen,CBS, BioCentrum, DTU

Page 23: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Structure Classification Databases

• SCOP Manual classification (A. Murzin) scop.berkeley.edu

• CATH Semi manual classification (C. Orengo) www.biochem.ucl.ac.uk/bsm/cath

• FSSP Automatic classification (L. Holm)

www.ebi.ac.uk/dali/fssp/fssp.html

Morten Nielsen,CBS, BioCentrum, DTU

Page 24: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Protein Classification

• Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?

Methods

• BLAST / PsiBLAST

• Profile HMMs

• Supervised Machine Learning methods

Fold

Family

Superfamily

Proteins

?

new protein

Page 25: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

PSI-BLAST

Given a sequence query x, and database D

1. Find all pairwise alignments of x to sequences in D

2. Collect all matches of x to y with some minimum significance

3. Construct position specific matrix M• Each sequence y is given a weight so that many similar sequences cannot have

much influence on a position (Henikoff & Henikoff 1994)

4. Using the matrix M, search D for more matches

5. Iterate 1–4 until convergence

Profile M

Page 26: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Profiles & Profile HMMs

• Psi-BLAST builds a profile

• Profile HMMs: more elaborate versions of a profile Intuitively, profile that models gaps

Page 27: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Page 28: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Profile HMMs

• Each M state has a position-specific pre-computed substitution table• Each I state has position-specific gap penalties (and in principle can

have its own emission distributions)• Each D state also has position-specific gap penalties

In principle, D-D transitions can also be customized per position

M1 M2 Mm

Protein Family F

BEGIN I0 I1 Im-1

D1 D2 Dm

ENDIm

Dm-1

Page 29: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Profile HMMs

transition between match states – αM(i)M(i+1)

transitions between match and insert states – αM(i)I(i), αI(i)M(i+1)

transition within insert state – αI(i)I(i)

transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1)

transition within delete state – αD(i)D(i+1)

emission of amino acid b at a state S – εS(b)

M1 M2 Mm

Protein Family F

BEGIN I0 I1 Im-1

D1 D2 Dm

ENDIm

Dm-1

Page 30: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Profile HMMs

transition probabilities ~ frequency of a transition in alignment emission probabilities ~ frequency of an emission in alignment pseudocounts are usually introduced

M1 M2 Mm

Protein Family F

BEGIN I0 I1 Im-1

D1 D2 Dm

ENDIm

Dm-1

aA

Aklkl

k ll

'

'

e aE a

E akk

ka

( )( )

( )'

' '

Page 31: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Alignment of a protein to a profile HMM

To align sequence x1…xn to a profile HMM:

We will find the most likely alignment with the Viterbi DP algorithm

• Define Vj

M(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Mj

VjI(i): score of best alignment of x1…xi to the HMM ending in xi being

emitted from Ij

VjD(i): score of best alignment of x1…xi to the HMM ending in Dj (xi is

the last character emitted before Dj)

• Denote by qa the frequency of amino acid a in a ‘random’ protein

• You can fill-in the details! (or, read Durbin et al. Chapter 5)

Page 32: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

How to build a profile HMM

Page 33: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Resources on the web

• HMMer – a free profile HMM software http://hmmer.wustl.edu/

• SAM – another free profile HMM software http://www.cse.ucsc.edu/research/compbio/sam.html

• PFAM – database of alignments and HMMs for protein families and domains http://www.sanger.ac.uk/Software/Pfam/

• SCOP – a structural classification of proteins http://scop.berkeley.edu/data/scop.b.html

Page 34: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Classification with Profile HMMs

M1 M2 Mm

BEGIN I0 I1 Im-1

D1 D2 Dm

ENDIm

Dm-1

Fold

Family

Superfamily

?M1 M2 Mm

BEGIN I0 I1 Im-1

D1 D2 Dm

ENDIm

Dm-1

M1 M2 Mm

BEGIN I0 I1 Im-1

D1 D2 Dm

ENDIm

Dm-1

new protein

Page 35: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Classification with Profile HMMs

• How generative models work

Training examples ( sequences known to be members of family ): positive Tuning parameters with a priori knowledge Model assigns a probability to any given protein sequence Idea: The sequence from the family (hopefully) yield a higher probability

than sequences outside the family

• Log-likelihood ratio as score

P(X | H1) P(H1) P(H1|X) P(X) P(H1|X) L(X) = log -------------------------- = log --------------------- = log --------------

P(X | H0) P(H0) P(H0|X) P(X) P(H0|X)

Page 36: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Generative Models

Page 37: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Generative Models

Page 38: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Generative Models

Page 39: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Generative Models

Page 40: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Discriminative Models -- SVM

v

Decision Rule:red: vTx > 0

marginIf x1 … xn training examples,

sign(iixiTx) “decides” where x falls

• Train i to achieve best margin

Large Margin for |v| < 1 Margin of 1 for small |v|

Page 41: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

k-mer based SVMs for protein classification

• For given word size k, and mismatch tolerance l, define

K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches

• Define normalized kernel K’(X, Y) = K(X, Y)/ sqrt(K(X,X)K(Y,Y))

• SVM can be learned by supplying this kernel functionA B A C A R D I

A B R A D A B I

X

Y

K(X, Y) = 4

K’(X, Y) = 4/sqrt(7*7) = 4/7 Let k = 3; l = 1

Page 42: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

SVMs will find a few support vectors

v

After training, SVM has determined a small set of sequences, the support vectors, who need to be compared with query sequence X

Page 43: CS262 Lecture 15, Win06, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time

CS262 Lecture 15, Win06, Batzoglou

Benchmarks