Sequence Alignment Arun Goja MITCON BIOPHARMA


Sequence Alignment

Arun Goja, MITCON BIOPHARMA


Why do we want to compare sequences?

• Evolutionary relationships
– Phylogenetic trees can be constructed based on comparison of the sequences of a molecule (example: 16S rRNA) taken from different species
– Residues conserved during evolution play an important role

• Prediction of protein structure and function
– Proteins which are very similar in sequence generally have similar 3D structure and function as well
– By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted


WHY ?

sequence alignment

Sequence alignment is important for:

* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly


Over time, genes accumulate mutations

· Environmental factors
• Radiation
• Oxidation

· Mistakes in replication or repair
· Deletions, Duplications
· Insertions
· Inversions
· Point mutations


Deletions

• Codon deletion:
ACG ATA GCG TAT GTA TAG CCG…
– Effect depends on the protein, position, etc.
– Almost always deleterious
– Sometimes lethal

• Frame shift mutation:
ACG ATA GCG TAT GTA TAG CCG…
ACG ATA GCG ATG TAT AGC CG?…
– Almost always lethal
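The difference between the two kinds of deletion can be sketched in a few lines of Python. This is only an illustration: the helper name `codons` is hypothetical, and which codon is removed in the codon-deletion case is an arbitrary choice, since the slide does not say.

```python
def codons(seq):
    """Split a nucleotide string into consecutive groups of three."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


original = "ACGATAGCGTATGTATAGCCG"

# Codon deletion (assumed here: the fourth codon, TAT). The reading frame is kept.
codon_deleted = original[:9] + original[12:]

# Frame shift: a single base is removed (the first T of TAT),
# so every codon after that point is read differently.
frame_shifted = original[:9] + original[10:]

print(codons(original))
print(codons(codon_deleted))
print(codons(frame_shifted))  # ends with the incomplete codon 'CG'
```

The last line reproduces the shifted reading shown on the slide: ACG ATA GCG ATG TAT AGC CG?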


Indels

• Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known:

ACGTCTGATACGCCGTATCGTCTATCT
ACGTCTGAT---CCGTATCGTCTATCT


Comparing two sequences

• Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

• Indels are difficult, must align sequences:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
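Why point mutations are the easy case can be seen in a small sketch: when the two sequences have the same length, a position-by-position comparison is enough. The function name `mismatch_positions` is hypothetical; the sequences are the two from the point-mutation bullet.

```python
def mismatch_positions(a, b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences differ in length; an alignment with gaps is needed")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


seq1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq2 = "ACGTCTGATTCGCCCTATCGTCTATCT"
print(mismatch_positions(seq1, seq2))  # [9, 14, 18]
```

For the indel case the same call raises an error, because the sequences no longer have the same length and gaps must be placed first.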


Causes for sequence (dis)similarity

mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)

insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)

deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)

indel: an insertion or a deletion
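The three events can be written as simple string operations. This is a minimal sketch using 0-based positions; the function names are hypothetical and the inputs are the examples given above.

```python
def substitute(seq, pos, base):
    """Mutation: replace the nucleotide at pos with another one."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq, pos, base):
    """Insertion: place a new nucleotide in front of position pos."""
    return seq[:pos] + base + seq[pos:]


def delete(seq, pos):
    """Deletion: remove the nucleotide at pos."""
    return seq[:pos] + seq[pos + 1:]


print(substitute("ATA", 1, "G"))  # AGA
print(insert("AA", 1, "G"))       # AGA
print(delete("ACTG", 2))          # ACG (shown as AC-G once aligned to ACTG)
```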


Definition

• Homology: related by descent

• Homologous sequence positions

[Figure: the sequences ATTGCGC and ATCCGCC, and the gapped alignment below, illustrating homologous sequence positions]

ATTGCGC
AT-CCGC


Orthologous and paralogous

• Orthologous sequences differ because they are found in different species (a speciation event)

• Paralogous sequences differ due to a gene duplication event

• Sequences may be both orthologous and paralogous


Sequence alignment - meaning

Sequence alignment is used to study the evolution of sequences, such as protein sequences or DNA sequences, from a common ancestor.

Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions.

Sequence alignment also refers to the process of constructing significant alignments in a database of potentially unrelated sequences.


Sequence alignment - definition

Sequence alignment is an arrangement of two or more sequences, highlighting their similarity.

The sequences are padded with gaps (dashes) so that wherever possible, columns contain identical characters from the sequences involved

tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt


Pairwise alignment: the problem

The number of possible pairwise alignments increases explosively with the length of the sequences:

Two protein sequences of length 100 amino acids can be aligned in approximately 10^60 different ways

The time needed to test all possibilities is of the same order of magnitude as the entire lifetime of the universe.
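One common way to arrive at a number of this size, given here as an assumption about how the figure was obtained, is to count the ways the residues of the two sequences can be interleaved, which is the binomial coefficient C(m + n, m). Other counting conventions give different totals. The function name `count_alignments` is hypothetical.

```python
import math


def count_alignments(m, n):
    """Number of ways to interleave two sequences of lengths m and n: C(m + n, m)."""
    return math.comb(m + n, m)


total = count_alignments(100, 100)
print(f"{total:.2e}")  # on the order of 10^59 to 10^60
```

Even at a billion alignments per second, a count of this size cannot be enumerated, which is why exhaustive testing is ruled out.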


Pairwise Alignment

• The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
– There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.


Methods of Alignment

• By hand - slide sequences on two lines of a word processor

• Dot plot
– with windows

• Rigorous mathematical approach
– Dynamic programming (slow, optimal)

• Heuristic methods (fast, approximate)
– BLAST and FASTA
• Word matching and hash tables


Align by Hand

GATCGCCTA_TTACGTCCTGGAC
<--                 -->
AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment


Percent Sequence Identity

• The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G – A G
A C G T G – G C A G

70% identical (7 of the 10 columns match; the other columns are a mismatch or an indel)
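The 70% figure can be checked with a short sketch. It assumes identity is counted over all alignment columns, gap columns included; other conventions divide by the shorter sequence or by the ungapped columns only. The name `percent_identity` is hypothetical, and gaps are written as "-".

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percentage of alignment columns in which both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return 100.0 * matches / len(row_a)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```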


Dotplot:

[Figure: dotplot of the sequences ATTCACATA and TACATTACGTAC, labeled Sequence 1 and Sequence 2]

A dotplot gives an overview of all possible alignments


Dotplot:

[Figure: the same dotplot of ATTCACATA and TACATTACGTAC (Sequence 1 and Sequence 2), with one possible alignment shown]

In a dotplot each diagonal corresponds to a possible (ungapped) alignment
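A dotplot is easy to build by hand or in code. The sketch below marks every pair of identical residues; which sequence goes on which axis is an assumption here, and the names `dotplot` and `render` are hypothetical. No window filtering is applied.

```python
def dotplot(seq1, seq2):
    """Grid with one row per residue of seq2 and one column per residue of seq1.
    A cell is True where the two residues are identical."""
    return [[a == b for a in seq1] for b in seq2]


def render(seq1, seq2):
    """Text version of the dotplot: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq1)]
    for b, row in zip(seq2, dotplot(seq1, seq2)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("ATTCACATA", "TACATTACGTAC"))
```

Runs of '*' along a diagonal in the printed grid are the stretches where an ungapped alignment matches.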


Insertions / Deletions in a Dotplot

[Figure: dotplot of Sequence 1 (TACTGTCAT) and Sequence 2 (TACTGTTCAT)]

T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T


Alignment methods

• Rigorous algorithms = Dynamic Programming
– Needleman-Wunsch (global)
– Smith-Waterman (local)

• Heuristic algorithms (faster but approximate)
• BLAST
• FASTA


Pairwise alignment

Pairwise sequence alignment methods are concerned with finding the best-matching piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid) sequences.

Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples.

This information is useful for answering a variety of biological questions:

1. The identification of sequences of unknown structure or function.

2. The study of molecular evolution.


Dynamic Programming Approach to Sequence Alignment

The dynamic programming approach to sequence alignment always builds on the best result found so far.

It tries to align two sequences by inserting gaps at different locations, so as to maximize the score of the alignment.

The score is determined by a "match award", a "mismatch penalty" and a "gap penalty". The higher the score, the better the alignment.

If both penalties are set to 0, it simply finds an alignment with the maximum number of matches. Maximum match = the largest number of matches one sequence can have with the other when all possible deletions are allowed.

It is used to compare the similarity between two DNA or protein sequences, to predict similarity of their functions.

Examples: Needleman-Wunsch (1970), Sellers (1974), Smith-Waterman (1981)


Global alignment

A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment.

Global alignments are useful mostly for finding closely-related sequences.


Global Alignment

• Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.

• Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

• Global alignment is useful when you want to force two sequences to align over their entire length


Global Alignment

Find the global best fit between two sequences

Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:

A(s,t) =

V I V A L A S V E G A S
| | | |   |   |       |
V I V A D A - V - - I S

The dashes are indels.


The Needleman-Wunsch algorithm

The Needleman-Wunsch algorithm performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences.

The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score.

Dynamic programming is a discipline invented by Richard Bellman (an American mathematician) in 1953.


Local alignment

Local alignment methods find related regions within sequences - they can consist of a subset of the characters within each sequence.

For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B.

This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related.

This is not possible with global alignment methods.


The Smith-Waterman algorithm

The Smith-Waterman algorithm (1981) is for determining similar regions between two nucleotide or protein sequences.

Smith-Waterman is also a dynamic programming algorithm and is a variation of Needleman-Wunsch. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).

However, the Smith-Waterman algorithm is demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn) time and space are required.


Global vs. Local Alignments

• Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.

• Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.


Statistical analysis of alignments

This works identically to gene finding:

* Generate randomized sequences based on the second string

* Determine the optimal alignments of the first sequence with these randomized sequences

* Compute a histogram and rank the observed score in this histogram
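The three steps above can be sketched as a shuffle test. This is an illustration under stated assumptions: global alignment scores with match = 1, mismatch = -1, gap = -2; randomization by shuffling the residues of the second sequence; and the rank reported as the fraction of shuffled scores at least as high as the observed one. The names `global_score` and `shuffle_test` are hypothetical.

```python
import random


def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score, keeping only one matrix row at a time."""
    prev = [j * gap for j in range(len(s) + 1)]
    for i in range(1, len(t) + 1):
        curr = [i * gap]
        for j in range(1, len(s) + 1):
            sub = match if t[i - 1] == s[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_test(first, second, trials=200, seed=0):
    """Return the observed score and the fraction of shuffled sequences
    that score at least as well against the first sequence."""
    rng = random.Random(seed)
    observed = global_score(first, second)
    letters = list(second)
    random_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # randomized sequence with the same composition
        random_scores.append(global_score(first, "".join(letters)))
    at_least = sum(1 for score in random_scores if score >= observed)
    return observed, at_least / trials


observed, fraction = shuffle_test("ACGTTGCAACGT", "ACGTTGCTACGT")
print(observed, fraction)
```

A small fraction suggests the observed score is unlikely to arise from sequences of the same composition by chance.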


The Needleman-Wunsch algorithm

A smart way to reduce the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found (Saul Needleman and Christian Wunsch, 1970).

The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.


Needleman & Wunsch

• Place each sequence along one axis
• Place score 0 at the upper-left corner
• Fill in the 1st row & column with gap penalty multiples
• Fill in the matrix with the max value of 3 possible moves:
– Vertical move: Score + gap penalty
– Horizontal move: Score + gap penalty
– Diagonal move: Score + match/mismatch score
• The optimal alignment score is in the lower-right corner
• To reconstruct the optimal alignment, trace back where the max at each step came from; stop when the origin is reached.
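The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it uses a single linear gap penalty, returns only one of possibly several optimal alignments, and prefers the diagonal move when moves tie. The function name `needleman_wunsch` and the default scores (match = 1, mismatch = -1, gap = -2) are assumptions for the example.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment of s (along the top) and t (down the side).
    Returns (score, aligned_s, aligned_t)."""
    rows, cols = len(t) + 1, len(s) + 1

    def sub(i, j):
        return match if t[i - 1] == s[j - 1] else mismatch

    # score[i][j] is the best score for aligning t[:i] with s[:j].
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # diagonal
                              score[i - 1][j] + gap,            # vertical
                              score[i][j - 1] + gap)            # horizontal

    # Trace back from the lower-right corner to the origin.
    top, side = [], []
    i, j = len(t), len(s)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(s[j - 1]); side.append(t[i - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append("-"); side.append(t[i - 1]); i -= 1
        else:
            top.append(s[j - 1]); side.append("-"); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(side))


print(needleman_wunsch("AAAC", "AGC"))  # (-1, 'AAAC', '-AGC')
```

A different tie-breaking order in the trace-back would return another alignment with the same score.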


Example

• Let gap = -2, match = 1, mismatch = -1.

        empty    A    A    A    C
empty       0   -2   -4   -6   -8
A          -2    1   -1   -3   -5
G          -4   -1    0   -2   -4
C          -6   -3   -2   -1   -1

Resulting alignments:

AAAC
A-GC

AAAC
-AGC


Local Alignment

• Problem first formulated:
– Smith and Waterman (1981)

• Problem:
– Find an optimal alignment between a substring of s and a substring of t

• Algorithm:
– is a variant of the basic algorithm for global alignment


Smith & Waterman

• Place each sequence along one axis
• Place score 0 at the upper-left corner
• Fill in the 1st row & column with 0s
• Fill in the matrix with the max of 4 possible values:
– 0
– Vertical move: Score + gap penalty
– Horizontal move: Score + gap penalty
– Diagonal move: Score + match/mismatch score
• The optimal alignment score is the max in the matrix
• To reconstruct the optimal alignment, trace back where the max at each step came from; stop when a zero is hit
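The steps above can be sketched in Python as follows. As with any such sketch, the details are assumptions: a single linear gap penalty, default scores of match = 1, mismatch = -1, gap = -2, the first highest-scoring cell as the trace-back start, and the diagonal move preferred on ties. The function name `smith_waterman` is hypothetical.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Local alignment of s (along the top) and t (down the side).
    Returns (score, aligned_part_of_s, aligned_part_of_t)."""
    rows, cols = len(t) + 1, len(s) + 1

    def sub(i, j):
        return match if t[i - 1] == s[j - 1] else mismatch

    # First row and column stay 0; no cell is allowed to drop below 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),  # diagonal
                              score[i - 1][j] + gap,            # vertical
                              score[i][j - 1] + gap)            # horizontal
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest cell until a zero is hit.
    top, side = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(s[j - 1]); side.append(t[i - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append("-"); side.append(t[i - 1]); i -= 1
        else:
            top.append(s[j - 1]); side.append("-"); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(side))


print(smith_waterman("GATCACCT", "GATACCC"))  # (4, 'GATCACC', 'GAT-ACC')
```

Only the aligned region is returned; residues outside it are left out of the local alignment.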


Local Alignment

• Let gap = -2, match = 1, mismatch = -1.

Sequences: GATCACCT and GATACCC

        empty  G  A  T  C  A  C  C  T
empty       0  0  0  0  0  0  0  0  0
G           0  1  0  0  0  0  0  0  0
A           0  0  2  0  0  1  0  0  0
T           0  0  0  3  1  0  0  0  1
A           0  0  1  1  2  2  0  0  0
C           0  0  0  0  2  1  3  1  0
C           0  0  0  0  1  1  2  4  2
C           0  0  0  0  1  0  2  3  3

GATCACCT
GAT_ACCC


Pairwise alignment: the solution

”Dynamic programming” (the Needleman-Wunsch algorithm)


Alignment depicted as path in matrix

[Figure: two alignment matrices with TCGCA along the top and TCCA down the side, each showing the path of one alignment]

TCGCA
TC-CA

TCGCA
T-CCA


Alignment depicted as path in matrix

[Figure: alignment matrix with TCGCA along the top and TCCA down the side; one point is labeled “x”]

Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths).

Position labeled “x”: TC aligned with TC

--TC    -TC    TC
TC--    T-C    TC


Creation of an alignment path matrix

• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)

• Three possibilities (s is the substitution score, d is the gap penalty):

• x_i and y_j are aligned, F(i,j) = F(i-1,j-1) + s(x_i, y_j)

• x_i is aligned to a gap, F(i,j) = F(i-1,j) - d

• y_j is aligned to a gap, F(i,j) = F(i,j-1) - d

• The best score up to (i,j) will be the largest of the three options


Dynamic programming: computation of scores

[Figure: alignment matrix with TCGCA along the top and TCCA down the side; the point under consideration is marked x]

Any given point in matrix can only be reached from three possible previous positions (you cannot “align backwards”).

=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.


Dynamic programming: computation of scores

Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from.

Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.

\[
\text{score}(x,y) = \max
\begin{cases}
\text{score}(x,y-1) - \text{gap-penalty} \\
\text{score}(x-1,y-1) + \text{substitution-score}(x,y) \\
\text{score}(x-1,y) - \text{gap-penalty}
\end{cases}
\]


Dynamic programming: example

      A    C    G    T
A     1   -1   -1   -1
C    -1    1   -1   -1
G    -1   -1    1   -1
T    -1   -1   -1    1

Gaps: -2


Dynamic programming: example

T C G C A
: :   : :
T C - C A

1+1-2+1+1 = 2
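The column-by-column sum can be reproduced with a small sketch that scores a finished alignment. It assumes the substitution scores of this example (1 for identical bases, -1 otherwise) and a gap score of -2; the names `SUBSTITUTION`, `GAP` and `alignment_score` are hypothetical.

```python
BASES = "ACGT"
# Substitution scores of the example: 1 on the diagonal, -1 elsewhere.
SUBSTITUTION = {(a, b): (1 if a == b else -1) for a in BASES for b in BASES}
GAP = -2


def alignment_score(row_a, row_b):
    """Sum the score of each column of an alignment; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        total += GAP if "-" in (a, b) else SUBSTITUTION[(a, b)]
    return total


print(alignment_score("TCGCA", "TC-CA"))  # 2
```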


Global versus local alignments

Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm).

Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).

[Figure: Seq 1 and Seq 2 shown under global alignment and under local alignment]


Local alignment overview

• The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative.

• Trace-back is started at the highest value rather than in lower right corner

• Trace-back is stopped as soon as a zero is encountered

\[
\text{score}(x,y) = \max
\begin{cases}
\text{score}(x,y-1) - \text{gap-penalty} \\
\text{score}(x-1,y-1) + \text{substitution-score}(x,y) \\
\text{score}(x-1,y) - \text{gap-penalty} \\
0
\end{cases}
\]


Alignments: things to keep in mind

“Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”.

This is NOT necessarily the biologically most meaningful alignment.

Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc.

Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.