Sequence Similarity


PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins

[Figure: pair-HMM with three states. MATCH emits an aligned pair (xi, yj); INSERT X emits xi against a gap; INSERT Y emits yj against a gap.]

A pair-HMM model of pairwise alignment

Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences

Transition probabilities ~ gap penalties

Emission probabilities ~ substitution matrix (from BLOSUM)

[Figure: example alignment of x and y, built from the emissions (xi, yj), (-, yj), and (xi, -)]

x: ABRACA-DABRA
y: AB-ACARDI---

Computing Pairwise Alignments

• The conditional distribution P(α | x, y) reflects the model’s uncertainty over the “correct” alignment of x and y

• The Viterbi algorithm identifies the highest probability alignment, α_viterbi, in O(L^2) time

Caveat: the most likely alignment is not necessarily the most accurate

Alternative: find the alignment of maximum expected accuracy

[Figure: the distributions P(α) and P(α | x, y), with α_viterbi marked]

The Lazy-Teacher Analogy

• 10 students take a 10-question true-false quiz

• How do you make the answer key?

Approach #1: Use the answer sheet of the best student!

Approach #2: Weighted majority vote!

[Figure: each student's grade (from A down to C) alongside their answer to question 4]

Viterbi vs. Maximum Expected Accuracy (MEA)

Viterbi

• picks single alignment with highest chance of being completely correct

• mathematically, finds the alignment α that maximizes

Eα*[1{α = α*}]

Maximum Expected Accuracy

• picks alignment with highest expected number of correct predictions

• mathematically, finds the alignment α that maximizes

Eα*[accuracy(α, α*)]
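The analogy can be made concrete with a small sketch. The function names, the three answer sheets, and the weights below are hypothetical, chosen only for illustration: the "best student" key plays the role of Viterbi (one complete answer sheet), while the weighted vote plays the role of MEA (each question decided separately).

```python
def best_student_key(sheets, grades):
    """Approach #1: copy the answer sheet of the highest-graded student."""
    best = max(range(len(sheets)), key=lambda s: grades[s])
    return sheets[best]


def weighted_vote_key(sheets, weights):
    """Approach #2: per question, keep the answer with the larger total weight."""
    key = []
    for q in range(len(sheets[0])):
        t_weight = sum(w for sheet, w in zip(sheets, weights) if sheet[q] == "T")
        f_weight = sum(w for sheet, w in zip(sheets, weights) if sheet[q] == "F")
        key.append("T" if t_weight >= f_weight else "F")
    return "".join(key)


# Toy data (assumed): three students, three questions, one weight per student.
sheets = ["TTF", "TFT", "FFT"]
weights = [0.5, 0.3, 0.3]

viterbi_like = best_student_key(sheets, weights)
mea_like = weighted_vote_key(sheets, weights)
```

Here the two keys disagree on questions 2 and 3, because the two weaker students together outweigh the best one.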


Computing MEA alignments

• Define

accuracy(α, α*) = (# of correct predicted matches) / (length of shorter sequence)

E_{α*}[accuracy(α, α*) | x, y] ~ E_{α*}[ ∑_{(xi, yj) in α} 1{(xi, yj) in α*} | x, y ]

= ∑_{α’} P(α’ | x, y) ∑_{(xi, yj) in α} 1{(xi, yj) in α’}

= ∑_{(xi, yj) in α} ∑_{α’} P(α’ | x, y) 1{(xi, yj) in α’}

= ∑_{(xi, yj) in α} P(xi ~ yj | x, y)

• Define M[i, j] = posterior probability that xi is aligned to yj


• Then, the MEA alignment is the highest summing path through the matrix

M[i, j] = P(xi is aligned to yj | x, y)

• M[i, j] can be computed with forward and backward dynamic programming in O(L^2) time
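A minimal sketch of the "highest summing path" step, assuming the posterior matrix M has already been computed and is stored as a list of rows (M[i][j] for 0-based positions i in x and j in y). The function name and the toy matrix are illustrative, not taken from PROBCONS itself.

```python
def mea_alignment(M):
    """Return (score, pairs) for the highest-summing monotone path through M."""
    n, m = len(M), len(M[0])
    # A[i][j] = best sum of posteriors using the first i letters of x and j of y.
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            A[i][j] = max(A[i - 1][j - 1] + M[i - 1][j - 1],  # align xi with yj
                          A[i - 1][j],                        # xi unaligned
                          A[i][j - 1])                        # yj unaligned
    # Trace back to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + M[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif A[i][j] == A[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return A[n][m], pairs


# Toy posterior matrix (assumed values): x has 3 positions, y has 2.
M = [[0.9, 0.05],
     [0.05, 0.1],
     [0.0, 0.8]]
score, pairs = mea_alignment(M)
```

No gap penalties appear in this recurrence: the uncertainty about gaps is already folded into the posteriors.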




The consistency signal

[Figure: three sequences x, y, and z, showing position xi in x, candidate partners yj and yj’ in y, and position zk in z]

To estimate P(xi ~ yj | x, y, z)

Method 1: triplet-HMM

P(xi ~ yj | x, y, z) = ∑k P(xi~yj~zk | x, y, z)

Parameters trained with unsupervised EM

Running time: O(N^3 L^3)

N: # of sequences

L: sequence length


Probabilistic consistency

• Compute P(xi is aligned to yj | x, y) → P(xi is aligned to yj | x, y, z)

• 2 approaches:

1) Exact – triplet HMM, O(L^3) time

2) Approximate – use independence assumptions

∑k P(xi ~ zk and zk ~ yj | x, y, z)

≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk)

≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)   (assuming independence)

Probabilistic consistency

• Compute P(xi is aligned to yj | x, y, z)

P(xi ~ yj | x, y, z) ≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

Notice that for any given i, most entries over k and j will be close to 0, so the matrices are sparse

In matrix form: Pxy|z ≈ Pxz Pzy

Finally, let

Pxy|S = 1/|S| ∑z in S Pxz Pzy
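The averaged matrix product can be sketched in a few lines. This is a dense, pure-Python illustration under stated assumptions: the function names and toy matrices are hypothetical, and a real implementation would exploit the sparsity noted above.

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def consistency_reestimate(pairs):
    """pairs: one (P_xz, P_zy) tuple per z in S. Returns the average of P_xz P_zy."""
    products = [matmul(P_xz, P_zy) for P_xz, P_zy in pairs]
    rows, cols = len(products[0]), len(products[0][0])
    return [[sum(p[i][j] for p in products) / len(products) for j in range(cols)]
            for i in range(rows)]


# Toy posteriors (assumed values) for two intermediate sequences z1 and z2.
P_xz1 = [[1.0, 0.0], [0.0, 1.0]]
P_z1y = [[0.8, 0.1], [0.2, 0.7]]
P_xz2 = [[0.5, 0.5], [0.0, 1.0]]
P_z2y = [[0.6, 0.0], [0.0, 0.6]]

P_xy_given_S = consistency_reestimate([(P_xz1, P_z1y), (P_xz2, P_z2y)])
```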

Multiple sequence alignment

• A straightforward generalization:

sum-of-pairs

tree-based progressive alignment

iterative refinement

[Figure: a three-sequence alignment and its pairwise projections]

ABRACA-DABRA
AB-ACARDI---
ABRA---DABI-

AB-ACARDI---
ABRA---DABI-

ABRACADABRA
ABRA--DABI-


[Figure: tree-based progressive alignment of the sequences ABRACADABRA, ABACARDI, and ABRADABI, producing the alignment below]

ABRACA-DABRA
AB-ACARD--I-
ABRA---DABI-

Summary of PROBCONS Algorithm

Given K sequences to be aligned,

(1) Compute M[i, j] for all pairs of sequences, x and y

(2) Use probabilistic consistency to reestimate M[i, j]

(3) Build a tree of the sequences by connecting closest first

• “Closest” defined according to expected accuracy

• EA(x, y) = E(accuracy) of MEA alignment of x and y

(4) Perform progressive alignment along the tree

• Score of a column: sum-of-pairs M[i, j]

(5) Apply iterative refinement

Training/testing methodology

• 3 reference benchmark sets: BAliBASE, PREFAB, SABmark

• PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE.

• Quality score:

Q(α, α*) = (# of correct predicted matches) / (total # of true matches)
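As an illustration of the quality score, here is a small sketch for pairwise alignments given as two gapped rows. The helper names and the toy alignments are assumptions made for this example; benchmark tools compute the same ratio over reference multiple alignments.

```python
def aligned_pairs(row_x, row_y):
    """Set of (i, j) residue index pairs matched by a gapped pairwise alignment."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_x, row_y):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def quality(predicted, reference):
    """Q = (# of correct predicted matches) / (total # of true matches)."""
    return len(predicted & reference) / len(reference)


# Toy example (assumed): the prediction misplaces the final G.
reference = aligned_pairs("AC-G", "ACTG")
predicted = aligned_pairs("ACG-", "ACTG")
q_score = quality(predicted, reference)
```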

Evaluation of Algorithm Components

Algorithm                           Quality (74)   Time (sec)
Viterbi                             0.375          0.72
MEA                                 0.403          1.6
PC (O(L^3))                         0.431          584.2
PC x 1 (O(L^2))                     0.422          1.7
PC x 2 (O(L^2))                     0.427          1.9
Progressive PC x 2 (O(L^2))         0.432          1.9
Progressive PC x 2 (O(L^2)) + IR    0.435          3.3

Row groups: all-pairs pairwise (Viterbi through PC x 2); multiple (the Progressive rows)

Performance of different alignment tools

             BAliBASE (237)    PREFAB (1932)       SABmark (698)
Algorithm    Q       t         Q       t           Q       t
Align-m      0.804   19:25     -       -           0.352   56:44
DIALIGN      0.832   2:53      0.572   12:25:00    0.410   8:28
CLUSTALW     0.861   1:07      0.589   2:57:00     0.439   2:16
MAFFT        0.882   1:18      0.648   2:36:00     0.442   7:33
T-Coffee     0.883   21:31     0.636   144:51:00   0.456   59:10
MUSCLE       0.896   1:05      0.648   3:11:00     0.464   20:42
PROBCONS     0.910   5:32      0.668   19:41:00    0.505   17:20

Resources for alignment

Protein Multiple Aligners

CLUSTALW – most widely used (1994)
http://www.ebi.ac.uk/clustalw/

MUSCLE – most scalable (2004)
http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

PROBCONS – most accurate (2004)
http://probcons.stanford.edu/

Some more protein multiple aligners:

MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc.

ProbCons: Chuong (Tom) Do


Profile hidden Markov models for sequence families

PFAM

Protein FAMilies database of alignments

• Profile HMMs describe each family

• For each family in Pfam you can:

Look at multiple alignments

View protein domain architectures

Examine species distribution

Follow links to other databases

View known protein structures

PFAM

Pfam-A – curated multiple alignments

Grows slowly; quality controlled by experts

Pfam-B – automatic clustering (ProDom derived)

New sequences instantly incorporated; unchecked

• Search by: Sequence, keyword, domain, taxonomy

• Browsing by family or genome

• Evolutionary tree

• Source of seed alignments:

Pfam-B families

Published articles

‘Domain hunting’ studies

Profile HMMs

• Each M state has a position-specific pre-computed substitution table

• Each I state has position-specific gap penalties (and in principle can have its own emission distributions)

• Each D state also has position-specific gap penalties

In principle, D-D transitions can also be customized per position

[Figure: profile HMM for protein family F, with BEGIN and END states, match states M1 … Mm, insert states I0 … Im, and delete states D1 … Dm]

Profile HMMs

transition between match states – αM(i)M(i+1)

transitions between match and insert states – αM(i)I(i), αI(i)M(i+1)

transition within insert state – αI(i)I(i)

transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1)

transition within delete state – αD(i)D(i+1)

emission of amino acid b at a state S – εS(b)


Profile HMMs

transition probabilities ~ frequency of a transition in alignment

emission probabilities ~ frequency of an emission in alignment

pseudocounts are usually introduced


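a_kl = A_kl / ∑_{l’} A_kl’

e_k(a) = E_k(a) / ∑_{a’} E_k(a’)

where A_kl is the number of k → l transitions and E_k(a) the number of times state k emits a in the alignment (pseudocounts included)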
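A minimal sketch of the emission estimate for one match column, assuming a flat pseudocount added to every letter of the alphabet. The function name, the pseudocount value, and the two-letter alphabet are illustrative choices; tools such as HMMer use more elaborate priors.

```python
from collections import Counter


def emission_probs(column, alphabet, pseudocount=1.0):
    """Estimate e_k(a) = E_k(a) / sum over a' of E_k(a') for one alignment column."""
    counts = Counter(c for c in column if c != "-")  # gaps emit nothing
    total = sum(counts[a] + pseudocount for a in alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


# Toy column (assumed): four sequences, one of which has a gap here.
probs = emission_probs("AAC-", alphabet="AC", pseudocount=1.0)
```

With the pseudocount, a letter never observed in the column still gets a nonzero probability, so the log-odds scores used in the Viterbi recurrences stay finite.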

Alignment of a protein to a profile HMM

To align sequence x1…xn to a profile HMM:

We will find the most likely alignment with the Viterbi DP algorithm

• Define

V_j^M(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Mj

V_j^I(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Ij

V_j^D(i): score of best alignment of x1…xi to the HMM ending in Dj (xi is the last character emitted before Dj)

• Denote by qa the frequency of amino acid a in a ‘random’ protein

Alignment of a protein to a profile HMM

• V_j^M(i) = log (εM(j)(xi) / q_xi) + max of:

V_{j-1}^M(i – 1) + log αM(j-1)M(j)

V_{j-1}^I(i – 1) + log αI(j-1)M(j)

V_{j-1}^D(i – 1) + log αD(j-1)M(j)

• V_j^I(i) = log (εI(j)(xi) / q_xi) + max of:

V_j^M(i – 1) + log αM(j)I(j)

V_j^I(i – 1) + log αI(j)I(j)

V_j^D(i – 1) + log αD(j)I(j)

• V_j^D(i) = max of:

V_{j-1}^M(i) + log αM(j-1)D(j)

V_{j-1}^I(i) + log αI(j-1)D(j)

V_{j-1}^D(i) + log αD(j-1)D(j)
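The three recurrences can be sketched directly as a dynamic program. This is an illustrative sketch, not HMMer's implementation: the function name, the argument layout, the treatment of BEGIN as M0 and END as M(m+1), and the toy model with uniform transitions are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")


def viterbi_profile(x, match_emit, insert_emit, q, log_trans):
    """Best log-odds score of aligning x to a profile HMM with m match states.

    match_emit[j][a]: emission probability of a from Mj (j = 1..m, index 0 unused)
    insert_emit[j][a]: emission probability of a from Ij (j = 0..m)
    q[a]: background frequency of a
    log_trans(s, j, t, k): log transition probability from state s_j to t_k,
        with s, t in "MID"; BEGIN is treated as M0 and END as M(m+1).
    """
    n, m = len(x), len(match_emit) - 1
    V = {s: [[NEG_INF] * (n + 1) for _ in range(m + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # BEGIN, before any character is emitted
    for j in range(m + 1):
        for i in range(n + 1):
            if j > 0 and i > 0:  # xi emitted from Mj
                best = max(V[s][j - 1][i - 1] + log_trans(s, j - 1, "M", j) for s in "MID")
                V["M"][j][i] = math.log(match_emit[j][x[i - 1]] / q[x[i - 1]]) + best
            if i > 0:            # xi emitted from Ij
                best = max(V[s][j][i - 1] + log_trans(s, j, "I", j) for s in "MID")
                V["I"][j][i] = math.log(insert_emit[j][x[i - 1]] / q[x[i - 1]]) + best
            if j > 0:            # Dj is silent, so i does not advance
                V["D"][j][i] = max(V[s][j - 1][i] + log_trans(s, j - 1, "D", j) for s in "MID")
    return max(V[s][m][n] + log_trans(s, m, "M", m + 1) for s in "MID")


# Toy model (assumed): 2 match states over the alphabet {A, C}, uniform transitions.
q = {"A": 0.5, "C": 0.5}
match_emit = [None, {"A": 0.9, "C": 0.1}, {"A": 0.1, "C": 0.9}]
insert_emit = [q, q, q]


def uniform_log_trans(s, j, t, k):
    return math.log(1 / 3)


score = viterbi_profile("AC", match_emit, insert_emit, q, uniform_log_trans)
```

For this toy model the best path is BEGIN → M1 → M2 → END, so the score is two match log-odds terms plus three transition terms. The sketch returns only the score; recovering the alignment would need back-pointers.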

Weight of each sequence

• One simple weighting scheme is to find how much edge length each leaf contributes

Example: edge 1 belongs to a

Example: edge 3 belongs both to a and to b: e3 · e1 / (e1 + e2) goes to a

Δw_i = e_current · w_i / ∑_{leaves k below e_current} w_k

[Figure: tree over leaves a to i, with edges numbered 1, 2, and 3]
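The weighting rule can be sketched as a bottom-up pass over the tree. The tree encoding used here is an assumption made for the example: a leaf is (name, edge_length) and an internal node is (list_of_children, edge_length), where edge_length is the edge above the node.

```python
def leaf_weights(node):
    """Share each edge's length among the leaves below it, in proportion
    to the weight those leaves have accumulated so far."""
    label, edge = node
    if isinstance(label, str):       # leaf: starts with its own edge length
        return {label: edge}
    weights = {}
    for child in label:              # internal node: gather leaves below it
        weights.update(leaf_weights(child))
    total = sum(weights.values())
    if total > 0:
        for leaf in weights:         # delta w_i = e_current * w_i / sum of w_k
            weights[leaf] += edge * weights[leaf] / total
    return weights


# Toy tree (assumed): edges e1 = 1 above a, e2 = 2 above b, e3 = 3 above their
# parent, and a third leaf c with edge length 4; the root has no edge above it.
tree = ([([("a", 1.0), ("b", 2.0)], 3.0), ("c", 4.0)], 0.0)
w = leaf_weights(tree)
```

In this toy tree, a receives e3 · e1 / (e1 + e2) = 1 from edge 3, matching the example above.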

How to build a profile HMM

Resources on the web

• HMMer – a free profile HMM software http://hmmer.wustl.edu/

• SAM – another free profile HMM software http://www.cse.ucsc.edu/research/compbio/sam.html

• PFAM – database of alignments and HMMs for protein families and domains http://www.sanger.ac.uk/Software/Pfam/

• SCOP – a structural classification of proteins http://scop.berkeley.edu/data/scop.b.html

