
Bioinformatics

Pairwise alignments

  • Introduction
  • Why do alignments?
  • Definitions
  • Scoring alignments
  • Alignment methods
  • Significance of alignments

Definitions

An alignment is a mutual arrangement of sequences that shows where the sequences are similar and where they differ.

An optimal alignment is one that exhibits the most correspondences and the fewest differences. It is the alignment with the highest score. It may or may not be biologically meaningful.

Why do alignments?

Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.

How to measure the similarity

Three kinds of changes can occur at any given position within a sequence:

  • Mutation (substitution)
  • Insertion
  • Deletion

Insertions and deletions (together called indels) have been found to occur in nature at a significantly lower frequency than mutations.

Given 2 DNA sequences v and w:

    v : A T G T T A T    (m = 7)
    w : A T C G T A C    (n = 7)

Written one above the other without gaps, they have 4 matches and 3 mismatches:

    A T G T T A T
    A T C G T A C

Alignment: 2-row representation, with the letters of v in the top row and the letters of w in the bottom row:

    A T - G T T A T -
    A T C G T - A - C

    5 matches, 2 insertions, 2 deletions

Aligning DNA Sequences

    V = ATCTGATG    (n = 8)
    W = TGCATAC     (m = 7)

[Figure: alignment of V and W with its match, mismatch, insertion, and deletion columns labeled; insertions and deletions together are the indels.]

    matches      4
    mismatches   1
    insertions   2
    deletions    3
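The column counts quoted for v and w can be checked mechanically. The following is a minimal Python sketch, not part of the original slides. The function name `classify_columns` is hypothetical, and it assumes that a gap in the top row counts as an insertion and a gap in the bottom row counts as a deletion.

```python
from collections import Counter


def classify_columns(row_v: str, row_w: str, gap: str = "-") -> Counter:
    """Count match, mismatch, insertion, and deletion columns of a two-row alignment."""
    if len(row_v) != len(row_w):
        raise ValueError("both rows of an alignment must have the same length")
    counts = Counter(match=0, mismatch=0, insertion=0, deletion=0)
    for a, b in zip(row_v, row_w):
        if a == gap and b == gap:
            raise ValueError("a column cannot consist of two gaps")
        if a == gap:
            counts["insertion"] += 1  # assumed convention: letter of w with no partner in v
        elif b == gap:
            counts["deletion"] += 1   # assumed convention: letter of v with no partner in w
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# Gapped and ungapped arrangements of v = ATGTTAT and w = ATCGTAC.
gapped = classify_columns("AT-GTTAT-", "ATCGT-A-C")
ungapped = classify_columns("ATGTTAT", "ATCGTAC")
print(dict(gapped))
print(dict(ungapped))
```

Under these assumptions the gapped arrangement yields 5 matches, 2 insertions, and 2 deletions, while the ungapped one yields 4 matches and 3 mismatches.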

Scoring Matrices for Aligning DNA Sequences

  • Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
  • Transversions: substitutions in which a purine (A/G) is replaced by a pyrimidine (C/T), or vice versa.

Identity matrix

         A   T   C   G
    A    1   0   0   0
    T    0   1   0   0
    C    0   0   1   0
    G    0   0   0   1

BLAST matrix

         A   T   C   G
    A    5  -4  -4  -4
    T   -4   5  -4  -4
    C   -4  -4   5  -4
    G   -4  -4  -4   5

Transition-Transversion matrix (match 1, transition -1, transversion -5)

         A   T   C   G
    A    1  -5  -5  -1
    T   -5   1  -1  -5
    C   -5  -1   1  -5
    G   -1  -5  -5   1

Scoring a Sequence Alignment

    Match score:     +1
    Mismatch score:  +0
    Gap penalty:     -1

    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT
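The score of this alignment can be tallied with a short Python sketch. This is an illustration, not code from the slides: the function name `score_alignment` is hypothetical, the default scores are the ones stated for this example (match +1, mismatch 0, gap -1), and every gapped column is assumed to be charged the same penalty.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1, gap_char="-"):
    """Return (score, matches, mismatches, gaps) for a two-row alignment."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            gaps += 1          # column with a space in either row
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * gap
    return score, matches, mismatches, gaps


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
result = score_alignment(top, bottom)
print(result)  # (score, matches, mismatches, gaps)
```

Changing the keyword arguments, for example `mismatch=-1`, shows how sensitive the total is to the scoring scheme.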

    Matches:     18 × (+1)
    Mismatches:   2 × 0
    Gaps:         7 × (-1)

    Score = 18(+1) + 2(0) + 7(-1) = +11

Aligning protein sequences

    FFDGGLQMQMLKDKFPMEGGQKDPKQRI

Amino Acid Substitution Matrices

  • PAM (point accepted mutation): based on global alignment [evolutionary model]

  • BLOSUM (block substitutions): based on local alignments [similarity among conserved sequences]

Part of PAM 250 Matrix

         C   S   T   P   A   G
    C   12
    S    0   2
    T   -2   1   3
    P   -3   1   0   6
    A   -2   1   1   1   2
    G   -3   1   0  -1   1   5

$$\text{log-odds} = \log\left(\frac{\text{chance to see pair in homologous proteins}}{\text{chance to see pair in unrelated proteins by chance}}\right)$$

PAM 250 Matrix

         C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
    C   12
    S    0   2
    T   -2   1   3
    P   -3   1   0   6
    A   -2   1   1   1   2
    G   -3   1   0  -1   1   5
    N   -4   1   0  -1   0   0   2
    D   -5   0   0  -1   0   1   2   4
    E   -5   0   0  -1   0   0   1   3   4
    Q   -5  -1  -1   0   0  -1   1   2   2   4
    H   -3  -1  -1   0  -1  -2   2   1   1   3   6
    R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
    K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
    M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
    I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
    L   -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
    V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
    F   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
    Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
    W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17

Scoring Matrix: Example

         A   R   N   K
    A    5  -2  -1  -1
    R    -   7  -1   3
    N    -   -   7   0
    K    -   -   -   6

Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

    AKRANR
    KAAANK

    -1 + (-1) + (-2) + 5 + 7 + 3 = 11

Sequence Alignment Problem

    T C A T G
    C A T T G

Elements of Dynamic Programming

The dynamic programming method is used to solve optimization problems whose optimal solutions depend on the optimal solutions to their subproblems. It involves:

  • Characterize the structure of the optimal solutions
  • Recursively define the score of an optimal solution in terms of the scores of optimal solutions of subproblems
  • Compute the solution in a bottom-up fashion
  • Trace back the optimal solution

Dynamic Programming

Consider two sequences:

    AAAT
    AGC

To find the optimal solution, if T is aligned with C, we have to find the best alignment between AAA and AG. In general, we have to find the best alignment between AAA and AG, between AAA and AGC, or between AAAT and AG. The best solution depends on the best solutions of the subproblems.

Dynamic Programming

Optimal solutions for the subproblems have to be solved recursively.
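The dependence on subproblems can be made concrete with a small top-down sketch in Python. It is an illustration under stated assumptions, not the slides' own code: the name `best_score` is hypothetical, and the scores (match 1, mismatch -1, gap -2) are assumed values chosen for the example. Memoization via `functools.lru_cache` ensures each pair of prefixes is solved only once.

```python
from functools import lru_cache


def best_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of s and t, computed top-down with memoization."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Score of the best alignment of the prefixes s[:i] and t[:j].
        if i == 0:
            return j * gap  # only spaces can be matched against t[:j]
        if j == 0:
            return i * gap  # only spaces can be matched against s[:i]
        pair = match if s[i - 1] == t[j - 1] else mismatch
        return max(
            solve(i - 1, j - 1) + pair,  # last letters aligned with each other
            solve(i - 1, j) + gap,       # last letter of s aligned with a space
            solve(i, j - 1) + gap,       # last letter of t aligned with a space
        )

    return solve(len(s), len(t))


score = best_score("AAAT", "AGC")
print(score)
```

For AAAT and AGC the three recursive calls correspond exactly to the three subproblems named above: AAA with AG, AAA with AGC, and AAAT with AG.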

Let n be the size of sequence s = AAAT, and m be the size of sequence t = AGC.

Consider subproblems: matching the prefixes of s and t.

  • s has n+1 possible prefixes, including the empty string
  • t has m+1 possible prefixes, including the empty string

Dynamic Programming

We would like to match s[1..i] and t[1..j]. There are three choices:

  • Align s[1..i] with t[1..j-1] and match a space with t[j]
  • Align s[1..i-1] with t[1..j-1] and match s[i] with t[j]
  • Align s[1..i-1] with t[1..j] and match a space with s[i]

Similarity between s and t:

$$
\mathrm{Score}(s[1..i],\, t[1..j]) = \max
\begin{cases}
\mathrm{Score}(s[1..i],\, t[1..j-1]) + \text{gap penalty} \\
\mathrm{Score}(s[1..i-1],\, t[1..j-1]) + \mathrm{score}(s[i],\, t[j]) \\
\mathrm{Score}(s[1..i-1],\, t[1..j]) + \text{gap penalty}
\end{cases}
$$

Definitions

  • Global alignment: Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
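A bottom-up version of the global recurrence, with traceback, might look like the Python sketch below. It is a minimal illustration rather than a reference implementation: the function name `needleman_wunsch` is hypothetical, a single linear gap penalty is assumed, and the default scores (match 1, mismatch -1, gap -2) are assumed example values.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, row_s, row_t)."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # match s[i] with t[j]
                score[i - 1][j] + gap,       # match s[i] with a space
                score[i][j - 1] + gap,       # match t[j] with a space
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


nw_score, nw_s, nw_t = needleman_wunsch("AAAC", "AGC")
print(nw_score)
print(nw_s)
print(nw_t)
```

Several alignments can share the optimal score, so the rows returned depend on the order in which the traceback tries the three moves; only the score is uniquely determined. The two nested loops fill (n+1)(m+1) cells, which gives the O(mn) running time.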

  • Local alignment: Smith-Waterman (1981) is a modification of the dynamic programming algorithm that gives the highest-scoring local match between two sequences.

Example

Let gap = -2, match = 1, mismatch = -1.

             empty    A    A    A    C
    empty      0     -2   -4   -6   -8
    A         -2      1   -1   -3   -5
    G         -4     -1    0   -2   -4
    C         -6     -3   -2   -1   -1

    AAAC
    A-GC

Complexity: O(mn)

Gap Penalty

Scoring Indels: Naive Approach

A fixed penalty σ is given to every indel:

  • -σ for 1 indel
  • -2σ for 2 consecutive indels
  • -3σ for 3 consecutive indels, etc.

This can be too severe a penalty for a series of 100 consecutive indels.

Affine Gap Penalties

In nature, a series of k indels often comes as a single event rather than as a series of k single-nucleotide events:

    ATA___GC        This is more likely.
    ATATATGC

    AT__A_GC        This is less likely.
    ATATATGC

Normal scoring would give the same score for both alignments.

Gap Penalty

  • The gap opening penalty defines the cost of opening a gap in one of the sequences. It must be tuned to the scoring matrix in use.
  • The gap extension penalty is an extra penalty proportional to the length of the gap. It is usually set lower than the gap opening penalty.

  • Optimal penalties vary from sequence to sequence, and finding the most adequate values is a matter of empirical trial and error.

Accounting for Gaps

Gap: a contiguous sequence of spaces in one of the rows.

The score for a gap of length x is -(ρ + σx), where:

  • x is the length of the gap
  • ρ > 0 is the penalty for introducing a gap (gap opening penalty)
  • σ is the penalty for each space in the gap (gap extension penalty)

ρ will be large relative to σ, because you do not want to add too much of a penalty for extending the gap.

Affine Gap Penalties

Gap penalties:

  • -ρ - σ when there is 1 indel
  • -ρ - 2σ when there are 2 indels
  • -ρ - 3σ when there are 3 indels, etc.
  • -ρ - x·σ in general (-gap opening - x gap extensions)

Somewhat reduced penalties (as compared to naive scoring) are given to runs of horizontal and vertical edges.

Affine Gap Penalties and Edit Graph

To reflect affine gap penalties we have to add "long" horizontal and vertical edges to the edit graph. Each such edge of length x should have weight -ρ - x·σ.
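The difference between the naive and the affine scheme can be seen numerically. The Python sketch below is illustrative only: the names `rho` and `sigma` stand for the gap opening and gap extension penalties, the function names are hypothetical, and the values rho = 5 and sigma = 1 are assumed purely for the example. It compares one gap of length 3 (ATA___GC) with two gaps of lengths 2 and 1 (AT__A_GC).

```python
import re


def gap_lengths(row, gap_char="_"):
    """Lengths of the maximal runs of gap characters in one alignment row."""
    return [len(run) for run in re.findall(re.escape(gap_char) + "+", row)]


def linear_gap_score(lengths, sigma):
    """Naive scheme: every indel costs sigma, regardless of how indels are grouped."""
    return sum(-sigma * x for x in lengths)


def affine_gap_score(lengths, rho, sigma):
    """Affine scheme: each gap of length x costs rho + sigma * x."""
    return sum(-(rho + sigma * x) for x in lengths)


rho, sigma = 5, 1  # assumed example values, not taken from the slides

single_gap = gap_lengths("ATA___GC")  # one gap of length 3
split_gaps = gap_lengths("AT__A_GC")  # gaps of length 2 and 1

print(linear_gap_score(single_gap, sigma), linear_gap_score(split_gaps, sigma))
print(affine_gap_score(single_gap, rho, sigma), affine_gap_score(split_gaps, rho, sigma))
```

With these assumed values the naive scheme cannot tell the two alignments apart, whereas the affine scheme charges the opening penalty twice for the split gaps and therefore prefers the single long gap.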

There are many such edges!

Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the length of the sequence).

So the complexity increases from O(n^2) to O(n^3).

Adding Affine Penalty Edges to the Edit Graph

[Figure: edit graph with long horizontal and vertical affine-penalty edges added.]

Optimal alignment

  • Optimal alignment: one that exhibits the most correspondences. It is the alignment with the highest score. It may or may not be biologically meaningful.
  • Global alignment: Needleman-Wunsch (1970) maximizes the number of matches between the sequences along the entire length of the sequences.
  • Local alignment: Smith-Waterman (1981) gives the highest-scoring local match between two sequences.
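For comparison with the global case, here is a minimal Python sketch of local alignment in the style of Smith-Waterman. It is an illustration under assumptions, not the authors' code: the function name `smith_waterman` is hypothetical, a single linear gap penalty is used, and the default scores (match 1, mismatch -1, gap -2) are assumed example values. Every cell is floored at 0, and the traceback starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming; returns (score, row_s, row_t)."""
    n, m = len(s), len(t)
    # H[i][j] = best score of a local alignment ending at s[i-1] and t[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                       # start a new local alignment here
                H[i - 1][j - 1] + pair,  # diagonal move
                H[i - 1][j] + gap,       # letter of s against a space
                H[i][j - 1] + gap,       # letter of t against a space
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


sw_score, sw_s, sw_t = smith_waterman("GATCACCT", "GATACCC")
print(sw_score)
print(sw_s)
print(sw_t)
```

The only changes relative to the global algorithm are the extra 0 in the maximum, the zero-filled first row and column, and the choice of start and stop points for the traceback.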

Local Alignment

  • Problem first formulated: Smith and Waterman (1981)
  • Problem: find an optimal alignment between a substring of s and a substring of t
  • Algorithm: a variant of the basic algorithm for global alignment

Motivation

  • Searching for unknown domains or motifs within proteins from different families
  • Proteins encoded by homeobox genes (conserved only in one region, the homeodomain, about 60 amino acids long)
  • Identifying active sites of enzymes
  • Comparing long stretches of anonymous DNA
  • Querying databases where the query word is much smaller than the sequences in the database
  • Analyzing repeated elements within a single sequence

Local Alignment

Let n be the size of sequence s = GATCACCT, and m be the size of sequence t = GATACCC.

Consider subproblems: matching the suffixes of s and t. How many possible suffixes, including the empty string, does s have? How many does t have?

Answer: s has n+1 possible suffixes and t has m+1, each count including the empty string.

DP for Local Alignment

Match the suffixes of s[1..i] and t[1..j]:

  • Align suffixes of s[1..i] with t[1..j-1] and match a space with t[j]
  • Align suffixes of s[1..i-1] with t[1..j-1] and match s[i] with t[j]
  • Align suffixes of s[1..i-1] with t[1..j] and match a space with s[i]

$$
\mathrm{Score}(s[1..i],\, t[1..j]) = \max
\begin{cases}
\mathrm{Score}(s[1..i],\, t[1..j-1]) + \text{gap penalty} \\
\mathrm{Score}(s[1..i-1],\, t[1..j-1]) + \mathrm{score}(s[i],\, t[j]) \\
\mathrm{Score}(s[1..i-1],\, t[1..j]) + \text{gap penalty} \\
0
\end{cases}
$$

S(i, j) is the highest score for an alignment between suffixes of the two prefixes ending at i and j, that is, the score of the best local alignment that ends at locations i and j.

Local Alignment

Let gap = -2, match = 1, mismatch = -1.

             empty  G  A  T  C  A  C  C  T
    empty      0    0  0  0  0  0  0  0  0
    G          0    1  0  0  0  0  0  0  0
    A          0    0  2  0  0  1  0  0  0
    T          0    0  0  3  1  0  0  0  1
    A          0    0  1  1  2  2  0  0  0
    C          0    0  0  0  2  1  3  1  0
    C          0    0  0  0  1  1  2  4  2
    C          0    0  0  0  1  0  2  3  3

    GATCACCT
    GAT_ACCC

The highest score in the matrix is 4, reached by the local alignment of GATCACC with GAT_ACC.

Local Alignment

Let gap = -2, match = 1, mismatch = -1, with s = ACACACTA and t = AGCACACA.

             empty  A  C  A  C  A  C  T  A
    empty      0    0  0  0  0  0  0  0  0
    A          0    1  0  1  0  1  0  0  1
    G          0    0  0  0  0  0  0  0  0
    C          0    0  1  0  1  0  1  0  0
    A          0    1  0  2  0  2  0  0  1
    C          0    0  2  0  3  1  3  1  0
    A          0    1  0  3  1  4  2  2  2
    C          0    0  2  1  4  2  5  3  1
    A          0    1  0  3  2  5  3  4  4

Smith & Waterman

  • Place each sequence along one axis
  • Place score 0 at the upper-left corner
  • Fill in the 1st row and column with 0s
  • Fill in the matrix with the max of 4 possible values:
      - 0
      - Vertical move: Score + gap penalty
      - Horizontal move: Score + gap penalty
      - Diagonal move: Score + match/mismatch score
  • The optimal align