Sequence Similarity


Page 1: Sequence Similarity

Sequence Similarity

Page 2: Sequence Similarity

PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins

[Figure: pair-HMM with three states. MATCH emits an aligned pair ($x_i$, $y_j$); INSERT X emits $x_i$ against a gap; INSERT Y emits $y_j$ against a gap.]

Page 3: Sequence Similarity

A pair-HMM model of pairwise alignment

• Parameterizes a probability distribution, $P(\alpha)$, over all possible alignments of all possible pairs of sequences

• Transition probabilities ~ gap penalties

• Emission probabilities ~ substitution matrix (from BLOSUM)

[Figure: the MATCH / INSERT X / INSERT Y pair-HMM generating an example alignment of x and y]

    x: ABRACA-DABRA
    y: AB-ACARDI---
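To make the link between transition/emission probabilities and alignment scores concrete, here is a minimal sketch that scores one fixed alignment under a three-state pair-HMM. The parameter names (`delta` for gap opening, `epsilon` for gap extension), the choice to start in MATCH, the ban on direct INSERT X to INSERT Y moves, and all numeric values are assumptions made for illustration; they are not the trained PROBCONS parameters.

```python
import math

def alignment_log_prob(row_x, row_y, delta, epsilon, match_emit, gap_emit):
    """Log-probability of one gapped alignment under a 3-state pair-HMM.

    delta:   probability of opening a gap (MATCH -> INSERT X or INSERT Y)
    epsilon: probability of extending a gap (INSERT -> same INSERT)
    match_emit(a, b): emission probability of the aligned pair (a, b)
    gap_emit(a):      emission probability of residue a against a gap
    """
    log_t = {
        ("M", "M"): math.log(1 - 2 * delta),
        ("M", "X"): math.log(delta),
        ("M", "Y"): math.log(delta),
        ("X", "X"): math.log(epsilon),
        ("X", "M"): math.log(1 - epsilon),
        ("Y", "Y"): math.log(epsilon),
        ("Y", "M"): math.log(1 - epsilon),
    }
    state, total = "M", 0.0  # assumption: the chain starts in MATCH
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            nxt, emit = "M", match_emit(a, b)
        elif b == "-":
            nxt, emit = "X", gap_emit(a)   # residue of x against a gap
        else:
            nxt, emit = "Y", gap_emit(b)   # residue of y against a gap
        # A direct X <-> Y move raises KeyError: this sketch disallows it.
        total += log_t[(state, nxt)] + math.log(emit)
        state = nxt
    return total

# Hypothetical emission tables standing in for BLOSUM-derived values.
def match_emit(a, b):
    return 0.04 if a == b else 0.002

def gap_emit(a):
    return 0.05

logp = alignment_log_prob("ABRACA-DABRA", "AB-ACARDI---",
                          0.05, 0.4, match_emit, gap_emit)
```

Each gap opening costs `log(delta)` and each extension `log(epsilon)`, which is the sense in which transition probabilities play the role of affine gap penalties.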

Page 4: Sequence Similarity

Computing Pairwise Alignments

• The conditional distribution $P(\alpha \mid x, y)$ reflects the model's uncertainty over the "correct" alignment of x and y

• The Viterbi algorithm identifies the highest-probability alignment, $\alpha_{\text{viterbi}}$, in $O(L^2)$ time

• Caveat: the most likely alignment is not necessarily the most accurate

• Alternative: find the alignment of maximum expected accuracy

[Figure: the distributions $P(\alpha)$ and $P(\alpha \mid x, y)$, with $\alpha_{\text{viterbi}}$ marked]

Page 5: Sequence Similarity

The Lazy-Teacher Analogy

• 10 students take a 10-question true-false quiz

• How do you make the answer key?
  - Approach #1: Use the answer sheet of the best student!
  - Approach #2: Weighted majority vote!

[Figure: the students' answer sheets, with grades ranging from A to C and each sheet's answer to question 4 (mostly F, a few T)]
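The two approaches can be compared on a toy quiz. This is only an illustrative sketch: the three answer sheets, the integer weights standing in for grades, and the function names are all made up for the example.

```python
def best_student_key(sheets, weights):
    """Approach #1: copy the sheet of the highest-weighted student."""
    best = max(range(len(sheets)), key=lambda s: weights[s])
    return list(sheets[best])

def weighted_majority_key(sheets, weights):
    """Approach #2: per question, keep the answer backed by more total weight.

    Ties are resolved as False in this sketch.
    """
    key = []
    for q in range(len(sheets[0])):
        true_weight = sum(w for sheet, w in zip(sheets, weights) if sheet[q])
        false_weight = sum(w for sheet, w in zip(sheets, weights) if not sheet[q])
        key.append(true_weight > false_weight)
    return key

# Hypothetical data: 3 students, 4 questions (True/False answers).
sheets = [
    [True, False, True, True],    # strongest student
    [True, False, False, False],
    [False, False, True, False],
]
weights = [5, 3, 3]               # stand-ins for the students' grades

key_best = best_student_key(sheets, weights)
key_vote = weighted_majority_key(sheets, weights)
```

Here the two keys differ only on the last question, where the two weaker students together outweigh the strongest one. Copying one sheet mirrors Viterbi; voting question by question mirrors maximum expected accuracy.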

Page 6: Sequence Similarity

Viterbi vs. Maximum Expected Accuracy (MEA)

Viterbi

• picks the single alignment with the highest chance of being completely correct

• mathematically, finds the alignment α that maximizes

$$E_{\alpha^*}\big[\mathbf{1}\{\alpha = \alpha^*\}\big]$$

Maximum Expected Accuracy

• picks the alignment with the highest expected number of correct predictions

• mathematically, finds the alignment α that maximizes

$$E_{\alpha^*}\big[\text{accuracy}(\alpha, \alpha^*)\big]$$

[Figure: the answer sheets from the lazy-teacher analogy, repeated]

Page 7: Sequence Similarity

Computing MEA alignments

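• Define

$$\text{accuracy}(\alpha, \alpha^*) = \frac{\text{\# of correct predicted matches}}{\text{length of shorter sequence}}$$

• Dropping the constant denominator (the length of the shorter sequence), the expected accuracy is

$$
\begin{aligned}
E_{\alpha^*}\big[\text{accuracy}(\alpha, \alpha^*) \mid x, y\big]
&\propto E_{\alpha^*}\Big[\sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha^*\} \,\Big|\, x, y\Big] \\
&= \sum_{\alpha'} P(\alpha' \mid x, y) \sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha'\} \\
&= \sum_{(x_i, y_j) \in \alpha} \sum_{\alpha'} P(\alpha' \mid x, y)\, \mathbf{1}\{(x_i, y_j) \in \alpha'\} \\
&= \sum_{(x_i, y_j) \in \alpha} P\big((x_i, y_j) \in \alpha' \mid x, y\big)
\end{aligned}
$$

• Define M[i, j] = posterior probability that $x_i$ is aligned to $y_j$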

Page 8: Sequence Similarity

Computing MEA alignments

• Define

$$\text{accuracy}(\alpha, \alpha^*) = \frac{\text{\# of correct predicted matches}}{\text{length of shorter sequence}}$$

• $M[i, j] = P(x_i \text{ is aligned to } y_j \mid x, y)$, the posterior probability that $x_i$ is aligned to $y_j$
  - Can be computed with forward and backward dynamic programming in $O(L^2)$ time

• Then, the MEA alignment is the highest-summing path through the matrix M
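A minimal sketch of the "highest-summing path" step, assuming the posterior matrix M has already been computed (the numbers below are hypothetical) and that gapped columns contribute 0 to the sum. The function name `mea_alignment` is illustrative, not taken from the PROBCONS code.

```python
def mea_alignment(M):
    """Return (best sum, matched index pairs) for a posterior matrix M.

    M[i][j] is the posterior probability that x_i is aligned to y_j.
    """
    n = len(M)
    m = len(M[0]) if M else 0
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + M[i - 1][j - 1],  # match x_i with y_j
                          S[i - 1][j],                         # x_i against a gap
                          S[i][j - 1])                         # y_j against a gap
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if S[i][j] == S[i - 1][j - 1] + M[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return S[n][m], pairs

# Hypothetical posteriors for |x| = 2, |y| = 3.
M = [[0.9, 0.05, 0.0],
     [0.1, 0.2, 0.6]]
score, pairs = mea_alignment(M)
expected_accuracy = score / min(len(M), len(M[0]))
```

Dividing the path sum by the length of the shorter sequence gives the expected accuracy as defined above.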


Page 10: Sequence Similarity

The consistency signal

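[Figure: three sequences x, y and z with positions $x_i$, $y_j$, $y_{j'}$ and $z_k$ marked; alignments through $z_k$ give evidence about whether $x_i$ should align to $y_j$ or to $y_{j'}$]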

Page 11: Sequence Similarity

To estimate $P(x_i \sim y_j \mid x, y, z)$

Method 1: triplet-HMM

$$P(x_i \sim y_j \mid x, y, z) = \sum_k P(x_i \sim y_j \sim z_k \mid x, y, z)$$

• Parameters trained with unsupervised EM

• Running time: $O(N^3 L^3)$, where N = # of sequences and L = sequence length

[Figure: triplet-HMM state diagram with states labeled by the sequences they emit from: XYZ, XY, XZ, YZ, X, Y, Z]

Page 12: Sequence Similarity

Probabilistic consistency

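• Goal: go from $P(x_i \text{ is aligned to } y_j \mid x, y)$ to $P(x_i \text{ is aligned to } y_j \mid x, y, z)$

• 2 approaches:
  1) Exact: triplet-HMM, $O(L^3)$ time
  2) Approximate: use independence assumptions

$$
\begin{aligned}
\sum_k P(x_i \sim z_k \text{ and } z_k \sim y_j \mid x, y, z)
&\approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid x, y, z,\, x_i \sim z_k) \\
&\approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y) \quad \text{(assuming independence)}
\end{aligned}
$$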

Page 13: Sequence Similarity

Probabilistic consistency

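• Compute $P(x_i \text{ is aligned to } y_j \mid x, y, z)$

$$P(x_i \sim y_j \mid x, y, z) \approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y)$$

• In matrix form this is a matrix product:

$$P_{xy|z} \approx P_{xz} P_{zy}$$

• Notice that for any given i, most entries over k and j will be close to 0, so the matrices are sparse

• Finally, let

$$P_{xy|S} = \frac{1}{|S|} \sum_{z \in S} P_{xz} P_{zy}$$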
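The averaging of matrix products can be sketched as follows. This is a dense, pure-Python illustration under stated assumptions: the posterior matrices are given as lists of lists, the function names and the `through_z` layout are invented for the example, and the numbers are hypothetical. A practical implementation would exploit the sparsity noted above by skipping near-zero entries.

```python
def mat_mul(A, B):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def consistency_transform(through_z):
    """Average P_xz * P_zy over all intermediate sequences z in S.

    through_z: list of (P_xz, P_zy) pairs, one per z (hypothetical layout).
    """
    products = [mat_mul(P_xz, P_zy) for P_xz, P_zy in through_z]
    rows, cols = len(products[0]), len(products[0][0])
    return [[sum(P[i][j] for P in products) / len(products)
             for j in range(cols)]
            for i in range(rows)]

# Hypothetical posteriors: |x| = 2, |y| = 2, z1 has length 2, z2 has length 1.
P_xz1 = [[1.0, 0.0], [0.0, 1.0]]
P_z1y = [[0.5, 0.0], [0.0, 0.5]]
P_xz2 = [[1.0], [0.0]]
P_z2y = [[1.0, 0.0]]

P_xy_given_S = consistency_transform([(P_xz1, P_z1y), (P_xz2, P_z2y)])
```

In this toy case both intermediate sequences support aligning the first residues of x and y, so that entry ends up larger than the one for the second residues.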


Page 15: Sequence Similarity

Multiple sequence alignment

• A straightforward generalization
  - sum-of-pairs
  - tree-based progressive alignment
  - iterative refinement

[Figure: worked example on three sequences]

Input sequences:

    ABRACADABRA
    ABACARDI
    ABRADABI

Pairwise alignments:

    ABRACA-DABRA
    AB-ACARDI---

    AB-ACARDI---
    ABRA---DABI-

    ABRACADABRA
    ABRA--DABI-

Three-way alignments (two variants appear in the figure, differing in the placement of the final I of ABACARDI):

    ABRACA-DABRA
    AB-ACARDI---
    ABRA---DABI-

    ABRACA-DABRA
    AB-ACARD--I-
    ABRA---DABI-

Page 16: Sequence Similarity

Summary of PROBCONS Algorithm

Given K sequences to be aligned,

(1) Compute M[i, j] for all pairs of sequences, x and y

(2) Use probabilistic consistency to reestimate M[i, j]

(3) Build a tree of the sequences by connecting closest first
  • "Closest" defined according to expected accuracy
  • EA(x, y) = E(accuracy) of the MEA alignment of x and y

(4) Perform progressive alignment along the tree
  • Score of a column: sum-of-pairs M[i, j]

(5) Apply iterative refinement

Page 17: Sequence Similarity

Training/testing methodology

• 3 reference benchmark sets: BAliBASE, PREFAB, SABmark

• PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE

• Quality score:

$$Q(\alpha, \alpha^*) = \frac{\text{\# of correct predicted matches}}{\text{total \# of true matches}}$$
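The quality score can be computed directly from two gapped alignments of the same pair of sequences. The sketch below is illustrative: the helper names are invented, and treating one of the deck's example alignments as the "reference" and the other as the "prediction" is an arbitrary choice for the example, not a statement about which is correct.

```python
def matched_pairs(row_x, row_y):
    """Set of (i, j) residue index pairs aligned in two gapped rows."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs

def quality(predicted, reference):
    """Q = (# of correct predicted matches) / (total # of true matches)."""
    return len(predicted & reference) / len(reference)

# Hypothetical roles: first alignment as reference, second as prediction.
reference = matched_pairs("ABRACA-DABRA", "AB-ACARDI---")
predicted = matched_pairs("ABRACA-DABRA", "AB-ACARD--I-")
q = quality(predicted, reference)
```

The two alignments agree on six of the seven reference matches, so Q is 6/7 in this example.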

Page 18: Sequence Similarity

Evaluation of Algorithm Components

| Algorithm | Quality (74) | Time (sec) |
|---|---|---|
| Viterbi | 0.375 | 0.72 |
| MEA | 0.403 | 1.6 |
| PC ($O(L^3)$) | 0.431 | 584.2 |
| PC x 1 ($O(L^2)$) | 0.422 | 1.7 |
| PC x 2 ($O(L^2)$) | 0.427 | 1.9 |
| Progressive PC x 2 ($O(L^2)$) | 0.432 | 1.9 |
| Progressive PC x 2 ($O(L^2)$) + IR | 0.435 | 3.3 |

[Row-group labels in the slide: "all-pairs pairwise" and "multiple"]

Page 19: Sequence Similarity

Performance of different alignment tools

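| Algorithm | BAliBASE (237) Q | BAliBASE (237) t | PREFAB (1932) Q | PREFAB (1932) t | SABmark (698) Q | SABmark (698) t |
|---|---|---|---|---|---|---|
| Align-m | 0.804 | 19:25 | - | - | 0.352 | 56:44 |
| DIALIGN | 0.832 | 2:53 | 0.572 | 12:25:00 | 0.410 | 8:28 |
| CLUSTALW | 0.861 | 1:07 | 0.589 | 2:57:00 | 0.439 | 2:16 |
| MAFFT | 0.882 | 1:18 | 0.648 | 2:36:00 | 0.442 | 7:33 |
| T-Coffee | 0.883 | 21:31 | 0.636 | 144:51:00 | 0.456 | 59:10 |
| MUSCLE | 0.896 | 1:05 | 0.648 | 3:11:00 | 0.464 | 20:42 |
| PROBCONS | 0.910 | 5:32 | 0.668 | 19:41:00 | 0.505 | 17:20 |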

Page 20: Sequence Similarity

Resources for alignment

Protein Multiple Aligners

• CLUSTALW – most widely used (1994)
  http://www.ebi.ac.uk/clustalw/

• MUSCLE – most scalable (2004)
  http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

• PROBCONS – most accurate (2004)
  http://probcons.stanford.edu/

Some more protein multiple aligners:

MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc.

ProbCons: Chuong (Tom) Do

Page 21: Sequence Similarity

Profile hidden Markov models for sequence families

Page 23: Sequence Similarity

PFAM

Protein FAMilies database of alignments

• Profile HMMs describe each family

• For each family in Pfam you can:
  - Look at multiple alignments
  - View protein domain architectures
  - Examine species distribution
  - Follow links to other databases
  - View known protein structures

Page 24: Sequence Similarity

PFAM

• Pfam-A – curated multiple alignments
  - Grows slowly; quality controlled by experts

• Pfam-B – automatic clustering (ProDom derived)
  - New sequences instantly incorporated; unchecked

• Search by: sequence, keyword, domain, taxonomy

• Browsing by family or genome

• Evolutionary tree

• Source of seed alignments:
  - Pfam-B families
  - Published articles
  - 'Domain hunting' studies

Page 31: Sequence Similarity

Profile HMMs

• Each M state has a position-specific pre-computed substitution table

• Each I state has position-specific gap penalties (and in principle can have its own emission distributions)

• Each D state also has position-specific gap penalties
  - In principle, D-D transitions can also be customized per position

[Figure: profile HMM for protein family F, with BEGIN, match states M1, M2, ..., Mm, insert states I0, I1, ..., Im, delete states D1, D2, ..., Dm, and END]

Page 32: Sequence Similarity

Profile HMMs

• transition between match states: $\alpha_{M(i)M(i+1)}$

• transitions between match and insert states: $\alpha_{M(i)I(i)}$, $\alpha_{I(i)M(i+1)}$

• transition within insert state: $\alpha_{I(i)I(i)}$

• transitions between match and delete states: $\alpha_{M(i)D(i+1)}$, $\alpha_{D(i)M(i+1)}$

• transition within delete state: $\alpha_{D(i)D(i+1)}$

• emission of amino acid b at a state S: $\varepsilon_S(b)$

[Figure: the profile HMM state diagram for protein family F, repeated]

Page 33: Sequence Similarity

Profile HMMs

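• transition probabilities ~ frequency of a transition in the alignment

• emission probabilities ~ frequency of an emission in the alignment

• pseudocounts are usually introduced

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad\qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$

where $A_{kl}$ is the number of transitions from state k to state l and $E_k(a)$ is the number of times state k emits a in the alignment.

[Figure: the profile HMM state diagram for protein family F, repeated]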
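As a small illustration of frequency-based estimation with pseudocounts, the sketch below turns one alignment column into match-state emission probabilities. The add-one (Laplace) pseudocount, the three-letter alphabet, and the function name are assumptions for the example; real tools typically use more refined priors.

```python
from collections import Counter

def emission_probs(column, alphabet, pseudocount=1.0):
    """Estimate e_k(a) = (E_k(a) + pseudocount) / sum over a' of the same.

    column:   the residues observed in one alignment column ('-' is skipped)
    alphabet: all symbols that the state may emit
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts[a] + pseudocount for a in alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Hypothetical column from a 5-sequence alignment over a toy alphabet.
probs = emission_probs("AAAB-", "ABC")
```

With counts A = 3, B = 1, C = 0 and a pseudocount of 1, the estimates are 4/7, 2/7 and 1/7, so the unseen residue C keeps a nonzero probability. The same normalization applies to transition counts.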

Page 34: Sequence Similarity

Alignment of a protein to a profile HMM

To align sequence x1…xn to a profile HMM:

We will find the most likely alignment with the Viterbi DP algorithm

• Define
  - $V_j^M(i)$: score of the best alignment of $x_1 \ldots x_i$ to the HMM ending in $x_i$ being emitted from $M_j$
  - $V_j^I(i)$: score of the best alignment of $x_1 \ldots x_i$ to the HMM ending in $x_i$ being emitted from $I_j$
  - $V_j^D(i)$: score of the best alignment of $x_1 \ldots x_i$ to the HMM ending in $D_j$ ($x_i$ is the last character emitted before $D_j$)

• Denote by $q_a$ the frequency of amino acid a in a 'random' protein

Page 35: Sequence Similarity

Alignment of a protein to a profile HMM

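$$
V_j^M(i) = \log \frac{\varepsilon_{M(j)}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_{j-1}^M(i-1) + \log \alpha_{M(j-1)M(j)} \\
V_{j-1}^I(i-1) + \log \alpha_{I(j-1)M(j)} \\
V_{j-1}^D(i-1) + \log \alpha_{D(j-1)M(j)}
\end{cases}
$$

$$
V_j^I(i) = \log \frac{\varepsilon_{I(j)}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_j^M(i-1) + \log \alpha_{M(j)I(j)} \\
V_j^I(i-1) + \log \alpha_{I(j)I(j)} \\
V_j^D(i-1) + \log \alpha_{D(j)I(j)}
\end{cases}
$$

$$
V_j^D(i) = \max
\begin{cases}
V_{j-1}^M(i) + \log \alpha_{M(j-1)D(j)} \\
V_{j-1}^I(i) + \log \alpha_{I(j-1)D(j)} \\
V_{j-1}^D(i) + \log \alpha_{D(j-1)D(j)}
\end{cases}
$$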
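The three recurrences translate almost line by line into code. The sketch below is a score-only illustration under several assumptions: BEGIN is treated as $M_0$ and END as $M_{m+1}$, parameters are passed in as callables (`match_lo`, `insert_lo`, `log_t` are hypothetical names), and the toy two-column family, its four-letter background of 0.25, and the position-independent transition table are invented for the example.

```python
import math

NEG_INF = float("-inf")

def profile_viterbi(x, m, match_lo, insert_lo, log_t):
    """Best log-odds score of aligning x to a profile HMM with m match states.

    match_lo(j, a), insert_lo(j, a): log(emission / background) at M_j, I_j.
    log_t(src, j, dst): log transition probability from state src at
        position j to the dst state that follows it (M_{j+1}, I_j or D_{j+1}).
    """
    n = len(x)
    VM = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = 0.0  # assumption: BEGIN plays the role of M_0
    for j in range(m + 1):
        for i in range(n + 1):
            if j > 0 and i > 0:
                VM[j][i] = match_lo(j, x[i - 1]) + max(
                    VM[j - 1][i - 1] + log_t("M", j - 1, "M"),
                    VI[j - 1][i - 1] + log_t("I", j - 1, "M"),
                    VD[j - 1][i - 1] + log_t("D", j - 1, "M"))
            if i > 0:
                VI[j][i] = insert_lo(j, x[i - 1]) + max(
                    VM[j][i - 1] + log_t("M", j, "I"),
                    VI[j][i - 1] + log_t("I", j, "I"),
                    VD[j][i - 1] + log_t("D", j, "I"))
            if j > 0:
                VD[j][i] = max(
                    VM[j - 1][i] + log_t("M", j - 1, "D"),
                    VI[j - 1][i] + log_t("I", j - 1, "D"),
                    VD[j - 1][i] + log_t("D", j - 1, "D"))
    # Assumption: END is reached like a match state M_{m+1}.
    return max(VM[m][n] + log_t("M", m, "M"),
               VI[m][n] + log_t("I", m, "M"),
               VD[m][n] + log_t("D", m, "M"))

# Hypothetical two-column family with consensus "AC".
def match_lo(j, a):
    p = 0.7 if a == "AC"[j - 1] else 0.1
    return math.log(p / 0.25)

def insert_lo(j, a):
    return 0.0  # insert emissions equal the background

T = {"M": {"M": 0.8, "I": 0.1, "D": 0.1},
     "I": {"M": 0.6, "I": 0.3, "D": 0.1},
     "D": {"M": 0.6, "I": 0.1, "D": 0.3}}

def log_t(src, j, dst):
    return math.log(T[src][dst])  # position-independent for brevity

score = profile_viterbi("AC", 2, match_lo, insert_lo, log_t)
```

For the sequence "AC" the best path is BEGIN, M1, M2, END, which uses three match-to-match transitions and two consensus emissions. Recovering the alignment itself would additionally require storing back-pointers.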

Page 36: Sequence Similarity

Weight of each sequence

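• One simple weighting scheme is to find how much edge length each leaf contributes
  - Example: edge 1 belongs to a
  - Example: edge 3 belongs both to a and to b; $e_3 \cdot e_1 / (e_1 + e_2)$ goes to a

$$\Delta w_i = e_{\text{current}} \cdot \frac{w_i}{\sum_{\text{leaves } k \text{ below } e_{\text{current}}} w_k}$$

[Figure: tree with labels a through i and edges numbered 1, 2 and 3]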
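The update rule can be applied bottom-up over a tree. In this sketch the tree encoding (a node is `(edge_length, name)` for a leaf or `(edge_length, [children])` for an internal node), the function name, and the edge lengths are all assumptions chosen for illustration; a leaf's own edge is assigned entirely to that leaf, as in the "edge 1 belongs to a" example.

```python
def leaf_weights(node):
    """Distribute edge lengths to the leaves below them, bottom-up.

    node: (edge_length, name) for a leaf, or (edge_length, [children]).
    Assumes the leaves below every internal edge have positive total weight.
    """
    edge, payload = node
    if isinstance(payload, str):
        return {payload: edge}          # a leaf owns its own edge entirely
    weights = {}
    for child in payload:
        weights.update(leaf_weights(child))
    total = sum(weights.values())
    # Share this node's edge in proportion to the current leaf weights.
    return {name: w + edge * w / total for name, w in weights.items()}

# Hypothetical tree: e1 = 2 (leaf a), e2 = 6 (leaf b), e3 = 4 above a and b,
# plus a third leaf c with edge length 5; the root has no edge above it.
tree = (0.0, [(4.0, [(2.0, "a"), (6.0, "b")]), (5.0, "c")])
weights = leaf_weights(tree)
```

Leaf a receives $e_3 \cdot e_1 / (e_1 + e_2) = 4 \cdot 2 / 8 = 1$ from the shared edge, giving weights a = 3, b = 9, c = 5; the weights sum to the total edge length of the tree.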

Page 37: Sequence Similarity

How to build a profile HMM

Page 38: Sequence Similarity

Resources on the web

• HMMer – free profile HMM software
  http://hmmer.wustl.edu/

• SAM – another free profile HMM package
  http://www.cse.ucsc.edu/research/compbio/sam.html

• PFAM – database of alignments and HMMs for protein families and domains
  http://www.sanger.ac.uk/Software/Pfam/

• SCOP – a structural classification of proteins
  http://scop.berkeley.edu/data/scop.b.html