19
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Multiple Sequence Alignment (cont.)

  • Upload
    cwen

  • View
    38

  • Download
    0

Embed Size (px)

DESCRIPTION

Multiple Sequence Alignment (cont.). (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign. Today’s topic: Approximate Algorithms for Multiple Alignment. Feng-Doolittle alignment - PowerPoint PPT Presentation

Citation preview

Page 1: Multiple Sequence Alignment  (cont.)

Multiple Sequence Alignment (cont.)

(Lecture for CS397-CXZ Algorithms in Bioinformatics)

Feb. 13, 2004

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Page 2: Multiple Sequence Alignment  (cont.)

Today’s topic: Approximate Algorithms for Multiple Alignment

• Feng-Doolittle alignment

• Improvements:

– Profile alignment

– Iterative refinement

• ClustalW (A multiple alignment tool)

Page 3: Multiple Sequence Alignment  (cont.)

Feng-Doolittle Progressive Alignment

• Step 1: Compute all possible pairwise alignments

• Step 2: Convert alignment scores to distances

• Step 3: Construct a “guide tree” by clustering

• Step 4: Progressive alignment based on the guide tree (bottom up)

Note that variations are possible at each step!

Page 4: Multiple Sequence Alignment  (cont.)

Detour…. Some background about clustering

Page 5: Multiple Sequence Alignment  (cont.)

How to Compute Group Similarity?

Given two groups g1 and g2,

Single-link algorithm: s(g1,g2)= similarity of the closest pair

complete-link algorithm: s(g1,g2)= similarity of the farthest pair

average-link algorithm: s(g1,g2)= average of similarity of all pairs

Three Popular Methods:

Page 6: Multiple Sequence Alignment  (cont.)

Three Methods Illustrated

Single-link algorithm

?

g1 g2

complete-link algorithm

……

average-link algorithm

Page 7: Multiple Sequence Alignment  (cont.)

Comparison of the Three Methods

• Single-link– “Loose” clusters

– Individual decision, sensitive to outliers

• Complete-link– “Tight” clusters

– Individual decision, sensitive to outliers

• Average-link – “In between”

– Group decision, insensitive to outliers

• Which one is the best? Depends on what you need!

Page 8: Multiple Sequence Alignment  (cont.)

Feng-Doolittle: Clustering Example

X1

X2

X3

X4

X1 X2 X3 X4

20 15 11 3 4

21 30 5 3 1

22 5 25 12 11

3 4 12 40 9

4 1 11 9 30

X1 X2

X3X4

X1 X2 X3 X4

X5

X5

15 12

10

X5

4.5

X5

max

log log obs randeff

rand

S SD S

S S

Length normalization

max maxmax

( ) ( )

2

S x S yS

Similarity matrix (from pairwise alignment)

Page 9: Multiple Sequence Alignment  (cont.)

Feng-Doolittle: How to generate a multiple alignment?

• At each step consider all possible pairwise alignments and pick the best one (3 cases):

– Sequence vs. sequence

– Sequence vs. group

– group vs. group

• “Once a gap, always a gap”

– gap is replaced by a neutral symbol X

– X can be matched with any symbol, including a gap without penalty

Page 10: Multiple Sequence Alignment  (cont.)

Problems with Feng-Doolittle

• All alignments are completely determined by pairwise alignment (restricted search space)

• No backtracking (subalignment is “frozen”)

– No way to correct an early mistake

– Non-optimality: Mismatches and gaps at highly conserved region should be penalized more, but we can’t tell where is a highly conserved region early in the process

Profile alignment

Iterative refinement

Page 11: Multiple Sequence Alignment  (cont.)

Profile Alignment

• Aligning two alignments/profiles

• Treat each alignment as “frozen”

• Alignment them with a possible “column gap”

,

( ) ( , )

( , ) ( , ) ( , )

k li i i

i i k l N

k l k l k li i i i i i

i k l n i n k l N i k n n l N

S m s m m

s m m s m m s m m

Fixed for any two given alignments Only need to optimize this part

Page 12: Multiple Sequence Alignment  (cont.)

Iterative Refinement

• Re-assigning a sequence to a different cluster/profile

• Repeatedly do this for a fixed number of times or until the score converges

• Essentially to enlarge the search space

Page 13: Multiple Sequence Alignment  (cont.)

ClustalW: A Multiple Alignment Tool

• Essentially following Feng-Doolittle

– Do pairwise alignment (dynamic programming)

– Do score conversion/normalization (Kimura’s model)

– Construct a guide tree (neighbour-journing clustering)

– Progressively align all sequences using profile alignment

Page 14: Multiple Sequence Alignment  (cont.)

ClustalW Heuristics

• Avoid penalizing minority sequences

– Sequence weighting

– Consider “evolution time” (using different sub. Matrices)

• More reasonable gap penalty, e.g.,

– Depends on the actual residues at or around the positions (e.g., hydrophobic residues give higher gap penalty)

– Increase the gap penalty if it’s near a well-conserved region (e.g., perfectly aligned column)

• Postpone low-score alignment until more profile information is available.

Page 15: Multiple Sequence Alignment  (cont.)

Heuristic 1: Sequence Weighting

• Motivation: address sample bias

• Idea:

– Down weighting sequences that are very similar to other sequences

– Each sequence gets a weight

– Scoring based on weights

w1: peeksavtalw2: peeksavlal

w3:egewglvlhvw4:aaektkirsa

( ) ( , ) ( , ) ( , ) ( , )iS m M t v M t i M l v M l i

1 3 1 4 2 3 2 4'( ) ( , ) ( , ) ( , ) ( , )iS m w w M t v w w M t i w w M l v w w M l i

Sequence weighting

Page 16: Multiple Sequence Alignment  (cont.)

Heuristic 2: Sophisticated Gap Weighting

• Initially,– GOP: “gap open penalty”

– GEP: “gap extension penalty”

• Adjusted gap penalty– Dependence on the weight matrix

– Dependence on the similarity of sequences

– Dependence on lengths of the sequences

– Dependence on the difference in the lengths of the sequences

– Position-specific gap penalties

– Lowered gap penalties at existing gaps

– Increased gap penalties near existing gaps

– Reduced gap penalties in hydrophilic stretches

– Residue-specific penalties

Page 17: Multiple Sequence Alignment  (cont.)

Gap Adjustment Heuristics

• Weight matrix:– Gap penalties should be comparable with weights

• Similarity of sequences– GOP should be larger for closely related sequences

• Sequence length– Long sequences tend to have higher scores

• Difference in sequence lengths– Avoid too many gaps in the short sequence

GOP = {GOP+log[min(N,M)]}* (avg residue mismatch score) * (percent identity scaling factor)

N, M = sequence lengths

GEP = GEP *[1.0+|log(N/M)|] N>M

Page 18: Multiple Sequence Alignment  (cont.)

Gap Adjustment Heuristics (cont.)

• Position-specific gap penalties

– Lowered gap penalties at existing gaps

– Increased gap penalties near existing gaps

– Reduced gap penalties in hydrophilic stretches (5 AAs)

– Residue-specific penalties (specified in a table)

GOP = GOP * 0.3 *(no. of sequences without a gap/no. of sequences)

GOP = GOP * {2+[(8-distance from gap) *2]/8}

GOP = GOP * 1/3 If no gaps, and one sequence has a hydrophilic stretch

GOP = GOP * avgFactor If no gaps and no hydrophilic stretch.

Average over all the residues at the position

Page 19: Multiple Sequence Alignment  (cont.)

Heuristic 3: Delayed Alignment of ‘Divergent Sequences

• Divergence measure: Average percentage of identity with any other sequence

• Apply a threshold (e.g., 40% identity) to detect divergent sequences(“outliers”)

• Postpone the alignment of divergent sequences until all of the rest have been aligned