42
Sequence Similarity

Sequence Similarity. The Viterbi algorithm for alignment Compute the following matrices (DP) M(i, j):most likely alignment of x 1 …x i with y 1 …y j

Embed Size (px)

Citation preview

Sequence Similarity

The Viterbi algorithm for alignment

• Compute the following matrices (DP) M(i, j): most likely alignment of x1…xi with y1…yj ending in state M

I(i, j): most likely alignment of x1…xi with y1…yj ending in state I

J(i, j): most likely alignment of x1…xi with y1…yj ending in state J

M(i, j) = log( Prob(xi, yj) ) +

max{ M(i-1, j-1) + log(1-2),

I(i-1, j) + log(1-), J(i, j-1) + log(1-) }

I(i, j) = max{ M(i-1, j) + log ,

I(i-1, j) + log }

MP(xi, yj)

IP(xi)

JP(yj)

log(1 – 2)

log(1 – )

log log log log

log(1 – )

log Prob(xi, yj)

One way to view the state paths – State M

x1

xm

y1 yn……

……

State I

x1

xm

y1 yn……

……

State J

x1

xm

y1 yn……

……

Putting it all together

States I(i, j) are connected with states J and M (i-1, j)

States J(i, j) are connected with states I and M (i-1, j)

States M(i, j) are connected with states J and I (i-1, j-1)

x1

xm

y1 yn……

……

Putting it all together

States I(i, j) are connected with states J and M (i-1, j)

States J(i, j) are connected with states I and M (i-1, j)

States M(i, j) are connected with states J and I (i-1, j-1)

Optimal solution is the best scoring path from top-left to bottom-right corner

This gives the likeliest alignment according to our HMM

x1

xm

y1 yn……

……

Yet another way to represent this model

Mx1 Mxm

Sequence X

BEGIN Iy Iy

Ix Ix

END

We are aligning, or threading, sequence Y through sequence X

Every time yj lands in state xi, we get substitution score s(xi, yj)

Every time yj is gapped, or some xi is skipped, we pay gap penalty

From this model, we can compute additional statistics

• P(xi ~ yj | x, y) The probability that positions i, j align, given that sequences x and y align

P(xi ~ yj | x, y) = α: alignmentP(α | x, y) 1(xi ~ yj in α)

We will not cover the details, but

this quantity can also be

calculated with DP

MP(xi, yj)

IP(xi)

JP(yj)

log(1 – 2)

log(1 – )

log log log log

log(1 – )

log Prob(xi, yj)

Fast database search – BLAST

(Basic Local Alignment Search Tool)

Main idea:

1. Construct a dictionary of all the words in the query

2. Initiate a local alignment for each word match between query and DB

Running Time: O(MN)

However, orders of magnitude faster than Smith-Waterman

query

DB

BLAST Original Version

• Dictionary:

All words of length k (~11 nucl.; ~4 aa)

Alignment initiated between words of alignment score T (typically T = k)

• Alignment:

Ungapped extensions until score below statistical threshold

• Output:

All local alignments with score > statistical threshold

……

……

query

DB

query

scan

PSI-BLAST

Given a sequence query x, and database D

1. Find all pairwise alignments of x to sequences in D

2. Collect all matches of x to y with some minimum significance

3. Construct position specific matrix M• Each sequence y is given a weight so that many similar sequences cannot have

much influence on a position (Henikoff & Henikoff 1994)

4. Using the matrix M, search D for more matches

5. Iterate 1–4 until convergence

Profile M

BLAST Variants

• BLASTN – genomic sequences• BLASTP – proteins• BLASTX – translated genome versus proteins• TBLASTN – proteins versus translated genomes• TBLASTX – translated genome versus translated genome• PSIBLAST – iterated BLAST search

http://www.ncbi.nlm.nih.gov/BLAST

Multiple Sequence Multiple Sequence AlignmentsAlignments

Protein Phylogenies

• Proteins evolve by both duplication and species divergence

Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can help improve the pairwise alignments

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment

Example:

x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Sum Of Pairs (cont’d)

• Heuristic way to incorporate evolution tree:

Human

Mouse

Chicken• Weighted SOP:

S(m) = k<l wkl s(mk, ml)

wkl: weight decreasing with distance

Duck

A Profile Representation

• Given a multiple alignment M = m1…mn Replace each column mi with profile entry pi

• Frequency of each letter in • # gaps• Optional: # gap openings, extensions, closings

Can think of this as a “likelihood” of each letter in each position

- A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G

A 1 1 .8 C .6 1 .4 1 .6 .2G 1 .2 .2 .4 1T .2 1 .6 .2- .2 .8 .4 .8 .4

Multiple Sequence Alignments

Algorithms

Multidimensional DP

Generalization of Needleman-Wunsh:

S(m) = i S(mi)

(sum of column scores)

F(i1,i2,…,iN): Optimal alignment up to (i1, …, iN)

F(i1,i2,…,iN) = max(all neighbors of cube)(F(nbr)+S(nbr))

• Example: in 3D (three sequences):

• 7 neighbors/cell

F(i,j,k) = max{ F(i-1,j-1,k-1)+S(xi, xj, xk),F(i-1,j-1,k )+S(xi, xj, - ),F(i-1,j ,k-1)+S(xi, -, xk),F(i-1,j ,k )+S(xi, -, - ),F(i ,j-1,k-1)+S( -, xj, xk),F(i ,j-1,k )+S( -, xj, xk),F(i ,j ,k-1)+S( -, -, xk) }

Multidimensional DP

Running Time:

1. Size of matrix: LN;

Where L = length of each sequence

N = number of sequences

2. Neighbors/cell: 2N – 1

Therefore………………………… O(2N LN)

Multidimensional DP

Running Time:

1. Size of matrix: LN;

Where L = length of each sequence

N = number of sequences

2. Neighbors/cell: 2N – 1

Therefore………………………… O(2N LN)

Multidimensional DP

• How do gap states generalize?

• VERY badly! Require 2N states, one per combination of

gapped/ungapped sequences Running time: O(2N 2N LN) = O(4N LN)

XY XYZ Z

Y YZ

X XZ

Progressive Alignment

• When evolutionary tree is known:

Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles px, py, to generate a new

alignment with associated profile presult

Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles

x

w

y

z

pxy

pzw

pxyzw

Progressive Alignment

• When evolutionary tree is known:

Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles px, py, to generate a new

alignment with associated profile presult

Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles

x

w

y

z

Example

Profile: (A, C, G, T, -)px = (0.8, 0.2, 0, 0, 0)py = (0.6, 0, 0, 0, 0.4)

s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)

Result: pxy = (0.7, 0.1, 0, 0, 0.2)

s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)

Result: px- = (0.4, 0.1, 0, 0, 0.5)

Progressive Alignment

• When evolutionary tree is unknown:

Perform all pairwise alignments Define distance matrix D, where D(x, y) is a measure of evolutionary

distance, based on pairwise alignment Construct a tree Align on the tree

x

w

y

z?

Heuristics to improve alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Iterative Refinement

One problem of progressive alignment:• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTTy: GAC-TT

z: GAACTGw: GTACTG

Frozen!

Now clear correct y = GA-CTT

Iterative Refinement

Algorithm (Barton-Stenberg):

1. For j = 1 to N,Remove xj, and realign to x1…

xj-1xj+1…xN

2. Repeat 4 until convergence

x

y

z

x,z fixed projection

allow y to vary

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTAy: GAC-TTAz: GAACTGAw: GTACTGA

After realigning y:

x: GAAGTTAy: G-ACTTA + 3 matchesz: GAACTGAw: GTACTGA

Variant:Refinement on a tree“tree partitioning”

Iterative Refinement

Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTAy: GAC-TTAz: GAACTGAw: GTACTGA

After realigning y:

x: GAAGTTAy: G-ACTTA + 3 matchesz: GAACTGAw: GTACTGA

Iterative Refinement

Example not handled well:

x: GAAGTTAy1: GAC-TTAy2: GAC-TTAy3: GAC-TTA

z: GAACTGAw: GTACTGA

Realigning any single yi changes nothing

Some Resources

http://www.ncbi.nlm.nih.gov/BLAST

BLAST & PSI-BLAST

http://www.ebi.ac.uk/clustalw/

CLUSTALW – most widely used

http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

MUSCLE – most scalable

http://probcons.stanford.edu/

PROBCONS – most accurate

MUSCLE at a glance

1. Fast measurement of all pairwise distances between sequences • DDRAFT(x, y) defined in terms of # common k-mers (k~3) – O(N2 L logL) time

2. Build tree TDRAFT based on DDRAFT, with a hierarchical clustering method (UPGMA)

3. Progressive alignment over TDRAFT, resulting in multiple alignment MDRAFT

4. Measure distances D(x, y) based on MDRAFT

5. Build tree T based on D

6. Progressive alignment over T, to build M

7. Iterative refinement; for many rounds, do:• Tree Partitioning: Split M on one branch and realign the two resulting profiles• If new alignment M’ has better sum-of-pairs score than previous one, accept

PROBCONS: Probabilistic Consistency-based Multiple

Alignment of Proteins

INSERTINSERT

XXINSERTINSERT

YY

MATCHMATCH

xxiiyyjj

――yyjj

xxii――

INSERTINSERTXX

INSERTINSERTYY

MATCHMATCH

A pair-HMM model of pairwise alignment

Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences

Transition probabilities ~ gap penalties

Emission probabilities ~ substitution matrix

ABRACA-DABRAAB-ACARDI---

xxyy

xxiiyyjj

――yyjj

xxii

――

Computing Pairwise Alignments

• The Viterbi algorithm conditional distribution P(α | x, y) reflects model’s uncertainty over the

“correct” alignment of x and y identifies highest probability alignment, αviterbi, in O(L2) time

Caveat: the most likely alignment is not the most accurate Alternative: find the alignment of maximum expected accuracy

P(α)P(α)

P(α | x, y)P(α | x, y)

ααviterbiviterbi

The Lazy-Teacher Analogy

• 10 students take a 10-question true-false quiz• How do you make the answer key?

Approach #1: Use the answer sheet of the best student! Approach #2: Weighted majority vote!

A- AAB A- A

B+ B+B+B- B- C

4. F4. F 4. T 4. F 4. F

4. F4. F 4. F 4. F 4. T

Viterbi vs. Maximum Expected Accuracy (MEA)

Viterbi

• picks single alignment with highest chance of being completely correct

• mathematically, finds the alignment α that maximizes

Eα*[1{α = α*}]

Maximum Expected Accuracy

• picks alignment with highest expected number of correct predictions

• mathematically, finds the alignment α that maximizes

Eα*[accuracy(α, α*)]

AA

4. T A- AAB A- A

B+ B+B+B- B- C

4. F4. F 4. T 4. F 4. F

4. F4. F 4. F 4. F 4. T