Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th,...

Pairwise Sequence Alignment

Sushmita RoyBMI/CS 576

www.biostat.wisc.edu/bmi576/Sushmita Roy

sroy@biostat.wisc.eduSep 10th, 2013

BMI/CS 576

Key concepts in today’s class

• What is pairwise sequence alignment?• How to handle gaps?

– Gap penalties

• What are the issues in sequence alignment?• What are the different types of alignment problems

– Global alignment– Local alignment

• Dynamic programming solution to finding the best alignment• Given a scoring function, find the best alignment of two

sequences– Global– Local

What is sequence alignment

The task of locating equivalent regions of two or more sequences to maximize their overall similarity

Why sequence alignment?

• Identifying similarity between two sequences

• Sequence specifies the function of a protein

• Similarity in sequence can imply similarity in function.– Assign function to uncharacterized sequences based on characterized

sequences

• Sequence from different species can be compared to estimate the evolutionary relationships between species – We will come back to this in Phylogenetic trees.

Alignment example

T H I S S E Q U E N C E

T H A T S E Q U E N C E

How to compare these two sequences?

T H I S S E Q U E N C E

T H A T I S A S E Q U E N C E

Need to incorporate gaps

_ _ _T H I S S E Q U E N C E

T H I S _ _ _ S E Q U E N C E

Alignment 1: 3 gaps, 8 matches Alignment 2: 3 gaps, 9 matches

Issues in sequence alignment

• What type of alignment?– Align the entire sequence or part of it?

• How to score an alignment?– the sequences we’re comparing typically differ in length– some amino acid pairs are more substitutable than others

• How to find the alignment?– Algorithms for alignment

• How to tell if the alignment is biologically meaningful?– Computing significance of an alignment

Pairwise alignment: task definition

Given– a pair of sequences (DNA or protein)– a scoring function

Do– determine the common substrings in the sequences

such that the similarity score is maximized

Sequence can change through different mutations

• substitutions: ACGA AGGA

• insertions: ACGA ACGGA

• deletions: ACGA AGAGaps

Scoring function

• Scoring function comprises– Substitution matrix s(a,b): indicates the score of aligning a with b– Gap penalty function γ(g) where g is the gap length

• DNA has a simple scoring function– Consider match and mismatch

• Proteins have a more complex scoring function– Some amino acid pairs might be more substitutable versus others– Scores might capture

• similar physical and chemical properties• evolutionary relationships• More next class

Gap penalty function

• Different gap penalty functions require somewhat different scoring algorithms

• Linear score

– γ(g)=-dg, where d is a fixed cost, g is the gap length

– Longer the gap higher the penalty

• Affine score

– γ(g)=-(d +(g-1)e), d and e are fixed penalties and e<d.

– d : gap open penalty

– e : gap extension penalty

Types of pairwise sequence alignment problems

• Global alignment (Needleman-Wunsch algorithm)– Align the entire sequence

• Local alignment (Smith-Waterman algorithm)– Align part of the sequence

Global alignment

• Given two sequences u and v and a scoring function, find the alignment with the maximal score.

• The number of possible alignments is very big.

Number of possible alignments

CAATGA ATTGATSequence 1 Sequence 2

CAATGA__ATTGAT

CAATGAATTGAT

CAATGA_AT_TGAT

Alignment 1 Alignment 2 Alignment 3

Number of possible alignments

• There are

2)!()!2(2

≈=⎟⎟⎠

⎞⎜⎜⎝

possible global alignments for 2 sequences of length n

• E.g. two sequences of length 100 have possible alignments

• But we can use dynamic programming to find an optimal alignment efficiently

7710≈

Dynamic programming (DP)

• Divide and conquer

• Solve a problem using pre-computed solutions for subparts of the problem

Dynamic programming for sequence alignment

• Two sequences: AAAC, AAG• x: AAAC

– xi: Denotes the ith letter

• y: AAG– yi denotes the jth letter in y.

• Assume cost of gap is d.• Assume we have aligned x1..xi to y1..yj

• There are three possibilities

AAAAAG

AA_AAG

AAAAA_

Dynamic programming idea

• Let x’s length be n• Let y’s length be m• construct an (n+1) (m+1) matrix F• F ( j, i ) = score of the best alignment of x1…xi with y1…yj

score of best alignment ofAAA to AG

DP for global alignment with linear gap penalty

F( j,i) = max

F( j −1,i −1) +s(x i,y j )

F( j −1,i) − d

F( j,i −1) − d

⎨ ⎪

⎩ ⎪

F(j,i)

F(j-1,i-1) F(j-1,i)

F(j,i-1)-d

-ds(xi,yj)

DP recurrence relation

Score of the best partial alignment between x1..xi and y1.. yj

DP algorithm sketch: global alignment

• Initialize first row and column of matrix

• Fill in rest of matrix from top to bottom, left to right

• For each F(i, j), save pointer(s) to cell(s) that resulted in best score

• F(m, n) holds the optimal alignment score;

• Trace back from F(m, n) to F(0, 0) to recover alignment

Global alignment example

• Suppose we choose the following scoring scheme

d = 2, where d is the gap penalty

Initializing matrix: global alignment with linear gap penalty

0 -3d-d -2d

Global alignment example

0 -6-2 -4

1 -3-1

-5 -4 -1

Trace back to retrieve the alignment

0 -6-2 -4

1 -3-1

-5 -4 -1

one optimal alignment

Computational complexity

• initialization: O(m), O(n) where sequence lengths are m, n• filling in rest of matrix: O(mn)• traceback: O(m + n)• Since sequences have nearly same length, the computational

complexity is

)( 2nO

Summary of DP

• Maximize F(j,i) using F(j-1,i-1), F(j-1,i) or F(j,i-1)• works for either DNA or protein sequences, although the

substitution matrices used differ• finds an optimal alignment• the exact algorithm (and computational complexity) depends

on gap penalty function

Local alignment

• Look for local regions sequence similarity.– Aligning substrings of x and y.

• Try to find the best alignment over all possible substrings.

• This seems difficult computationally.– But can be solved by DP.

Local alignment motivation

• Comparing protein sequences that share a domain (independently folded unit) but differ elsewhere

• Comparing DNA sequences that share a similar sequence pattern but differ elsewhere

• More sensitive when comparing highly diverged sequences– Selection on the genome differs between regions– Unconstrained regions are not very alignable

Local alignment DP

Similar to global alignment with a few exceptions• DP recurrence is different

• Top and left row are initialized to 0s.

F( j,i) = max

F( j −1,i −1) +s(x i, y j )

F( j −1,i) − d

F( j,i −1) − d

⎨ ⎪ ⎪

⎩ ⎪ ⎪

New in local: starts a new alignment

Local alignment DP algorithm

• Initialization: first row and first column initialized with 0’s

• Traceback:– Find maximum value of F(j, i)

• can be anywhere in matrix

– Stop when we get to a cell with value 0

Local alignment example

Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy sroy@biostat.wisc.edu Sep 10 th,...

Documents

Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014

Cp sushmita

Learning HMM parameters Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Oct 21 st, 2014

Introduction to Molecular Biology, Genetics and Genomics Sushmita Roy sroy@biostat.wisc.edu September 6, 2012 BMI/CS 576

Primer on Probability for Discrete Variables BMI/CS 576 Colin Dewey cdewey@biostat.wisc.edu Fall 2010

Introduction to Phylogenetic trees Colin Dewey BMI/CS 576 colin.dewey@wisc.edu Fall 2015

Markov models and applications Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Oct 7 th, 2014

Flat clustering approaches Colin Dewey BMI/CS 576 colin.dewey@wisc.edu Fall 2015

Introduction to Biological sequences Sushmita Roy sroy@biostat.wisc.edu September 4, 2014 BMI/CS 576

Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey cdewey@biostat.wisc.edu Fall 2010

Hidden Markov Model Parameter Estimation BMI/CS 576 Colin Dewey Fall 2015

Network analysis Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Dec 3 rd, 2013

Hidden Markov models Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Oct 16 th, 2014

Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015

Introduction to Phylogenetic Trees BMI/CS 576 Colin Dewey cdewey@biostat.wisc.edu Fall 2010

Clustering Gene Expression Data BMI/CS 576 Colin Dewey cdewey@biostat.wisc.edu Fall 2010

Module Networks BMI/CS 576 Mark Craven December 2007

Multiple Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 18 th, 2014

Probabilistic methods for phylogenetic trees (Part 2) Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Oct 7 th, 2014

Network analysis and applications Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Dec 2 nd, 2014