2015 bioinformatics alignments_wim_vancriekinge

FBW

20-10-2015

Wim Van Criekinge

Rat versus mouse RBP

Rat versus bacterial lipocalin

– Henikoff and Henikoff compared the BLOSUM matrices to PAM by evaluating how effectively the matrices can detect known members of a protein family from a database when searching with the ungapped local alignment program BLAST. They conclude that overall the BLOSUM 62 matrix is the most effective.

• However, all the substitution matrices investigated perform better than BLOSUM 62 for a proportion of the families. This suggests that no single matrix is the complete answer for all sequence comparisons.

• It is probably best to complement the BLOSUM 62 matrix with comparisons using PAM 250 and the Overington structurally derived matrices.

– It seems likely that as more protein three-dimensional structures are determined, substitution tables derived from structure comparison will give the most reliable data.

Overview

Available Dot Plot Programs

Dotlet (Java Applet)

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Sequence Alignments

Introduction

Algorithms

What ?

Examples

Properties

Dynamic Programming for Pairwise Alignment

Concept

Example

Needleman-Wunsch(.pl)

Smith-Waterman(.pl)

Multiple Alignment

MSA

Hierarchical Pairwise Alignment

ClustalW, PileUp

Formatting

Interpretation

Alternative Methods

SIM

Blast2

Dali

Global and local alignment

Pairwise sequence alignment can be global or local

Global: the sequences are completely aligned (Needleman and Wunsch, 1970)

Local: only the best sub-regions are aligned (Smith and Waterman, 1981). BLAST uses local alignment.

– In order to characterize protein families, identify shared regions of homology in a multiple sequence alignment; (this happens generally when a sequence search revealed homologies to several sequences)

– Determination of the consensus sequence of several aligned sequences

– Help prediction of the secondary and tertiary structures of new sequences;

– Preliminary step in molecular evolution analysis using Phylogenetic methods for constructing phylogenetic trees – Garbage in, Garbage out

– Chicken/egg

Why we do multiple alignments?


• To find conserved regions

– Local multiple alignment reveals conserved regions

– Conserved regions usually are key functional regions

– These regions are prime targets for drug development

• To do phylogenetic analysis:

– Same protein from different species

– Optimal multiple alignment probably implies history

– Discover irregularities, such as the Cystic Fibrosis gene

VTISCTGSSSNIGAG-NHVKWYQQLPG

VTISCTGTSSNIGS--ITVNWYQQLPG

LRLSCSSSGFIFSS--YAMYWVRQAPG

LSLTCTVSGTSFDD--YYSTWVRQPPG

PEVTCVVVDVSHEDPQVKFNWYVDG--

ATLVCLISDFYPGA--VTVAWKADS--

AALGCLVKDYFPEP--VTVSWNSG---

VSLTCLVKGFYPSD--IAVEWWSNG--


Algorithms and Programs

• Algorithm: a method or a process followed to solve a problem.

– A recipe.

• An algorithm takes the input to a problem (function) and transforms it to the output.

– A mapping of input to output.

• A problem can have many algorithms.

Aryabhata-Euclid’s algorithm: how to find gcd(a,b), the greatest common divisor of a and b

Based on a single observation: if a = b*q + r, then any divisor of a and b is also a divisor of r, and any divisor of b and r is also a divisor of a, so gcd(a,b) = gcd(b,r)

Euclid’s algorithm: use the division algorithm repeatedly to reduce the problem to one you can solve.

Example: gcd(55,35)

55 = 35*1 + 20 so gcd(55,35) = gcd(35,20)

35 = 20*1 + 15 so gcd(35,20) = gcd(20,15)

20 = 15*1 + 5 so gcd(20,15) = gcd(15,5)

15 = 5*3 + 0 done: gcd(55,35) = 5

Pseudocode

GGD.py

def gcd(a, b):
    while a != 0:
        a, b = b % a, a  # parallel assignment
    return b

print(gcd(55, 35))

Bubble Sort Algorithm

One of the simplest sorting algorithms proceeds by walking down the list, comparing adjacent elements, and swapping them if they are in the wrong order. The process is continued until the list is sorted.

More formally:

1. Initialize the size of the list to be sorted to be the actual size of the list.

2. Loop through the list until no element needs to be exchanged with another to reach its correct position.

2.1 Loop (i) from 0 to size of the list to be sorted - 2.

2.1.1 Compare the ith and (i + 1)st elements in the unsorted list.

2.1.2 Swap the ith and (i + 1)st elements if not in order (ascending or descending as desired).

2.2 Decrease the size of the list to be sorted by 1.

Each pass "bubbles" the largest element in the unsorted part of the list to its correct location.

A 13 7 43 5 3 19 2 23 29 ?? ?? ?? ?? ??

Bubble Sort Implementation

Here is an ascending-order implementation of the bubblesort algorithm for integer arrays:

void BubbleSort(int List[], int Size) {
    int tempInt;  // temp variable for swapping list elems
    for (int Stop = Size - 1; Stop > 0; Stop--) {
        for (int Check = 0; Check < Stop; Check++) {  // make a pass
            if (List[Check] > List[Check + 1]) {      // compare elems
                tempInt = List[Check];                // swap if in the
                List[Check] = List[Check + 1];        // wrong order
                List[Check + 1] = tempInt;
            }
        }
    }
}

Bubblesort compares and swaps adjacent elements; simple but not very efficient.

Efficiency note: the outer loop could be modified to exit if the list is already sorted.

"Great algorithms are the poetry of computation"

1946: The Metropolis Algorithm for Monte Carlo. Through the use of random processes, this algorithm offers an efficient way to stumble toward answers to problems that are too complicated to solve exactly.

1947: Simplex Method for Linear Programming. An elegant solution to a common problem in planning and decision-making.

1950: Krylov Subspace Iteration Method. A technique for rapidly solving the linear equations that abound in scientific computation.

1951: The Decompositional Approach to Matrix Computations. A suite of techniques for numerical linear algebra.

1957: The Fortran Optimizing Compiler. Turns high-level code into efficient computer-readable code.

1959: QR Algorithm for Computing Eigenvalues. Another crucial matrix operation made swift and practical.

1962: Quicksort Algorithms for Sorting. For the efficient handling of large databases.

1965: Fast Fourier Transform. Perhaps the most ubiquitous algorithm in use today, it breaks down waveforms (like sound) into periodic components.

1977: Integer Relation Detection. A fast method for spotting simple equations satisfied by collections of seemingly unrelated numbers.

1987: Fast Multipole Method. A breakthrough in dealing with the complexity of n-body calculations, applied in problems ranging from celestial mechanics to protein folding.

From Random Samples, Science page 799, February 4, 2000.

Algorithm Properties

• An algorithm possesses the following properties:

– It must be correct.

– It must be composed of a series of concrete steps.

– There can be no ambiguity as to which step will be performed next.

– It must be composed of a finite number of steps.

– It must terminate.

• A computer program is an instance, or concrete representation, for an algorithm in some programming language.

Measuring Algorithm Efficiency

• Types of complexity

– Space complexity

– Time complexity

• Analysis of algorithms

– The measuring of the complexity of an algorithm

• Cannot compute actual time for an algorithm

– We usually measure worst-case time


[Figure: three algorithms for computing 1 + 2 + … + n for an integer n > 0]

[Figure: the number of operations required by the algorithms as a function of n]

Big Oh Notation

• To say "Algorithm A has a worst-case time

requirement proportional to n"

– We say A is O(n)

– Read "Big Oh of n"

• For the other two algorithms

– Algorithm B is O(n^2)

– Algorithm C is O(1)

• O is derived from order (magnitude)
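The figure with the three algorithms did not survive in this text, so the sketch below is an assumption about what they look like: a single loop for A, a nested loop for B and a closed formula for C. The function names are hypothetical; only the O(n), O(n^2) and O(1) labels come from the slide.

```python
def sum_a(n):
    """Algorithm A: one loop, about n additions, so O(n)."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_b(n):
    """Algorithm B: adds 1 inside a nested loop, about n*(n+1)/2 additions, so O(n^2)."""
    total = 0
    for i in range(1, n + 1):
        for _ in range(i):
            total += 1
    return total


def sum_c(n):
    """Algorithm C: closed formula, a fixed number of operations, so O(1)."""
    return n * (n + 1) // 2


for n in (1, 10, 100):
    print(n, sum_a(n), sum_b(n), sum_c(n))
```

All three return the same value; they differ only in how the number of operations grows with n.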

Picturing Efficiency

[Figures: an O(n) algorithm; an O(n^2) algorithm; another O(n^2) algorithm]


The best alignment: the one with the maximum total score

• Exhaustive …

– All combinations

• Algorithm

– Dynamic programming (much faster)

– Needleman-Wunsch for global alignments (Journal of Molecular Biology, 1970)

– Later adapted by Smith-Waterman for local alignment

• Heuristics

Overview

• Score of an alignment: reward matches and penalize mismatches and spaces.

– e.g., each column gets a (different) value for:

• a match: +1 (both have the same characters);

• a mismatch: -1 (both have different characters); and

• a space in a column: -2.

– The total score of an alignment is the sum of the values assigned to its columns.

A metric …

GACGGATTAG, GATCGGAATAG

GA-CGGATTAG

GATCGGAATAG

+1 (a match), -1 (a mismatch),-2 (gap)

9*1 + 1*(-1)+1*(-2) = 6
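The column-by-column sum above can be checked with a few lines of Python. This is a minimal sketch; the function name `score_alignment` is hypothetical, and the +1 / -1 / -2 values are the ones given on the slide.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # values from the slide


def score_alignment(row1, row2):
    """Sum the per-column values of two already aligned rows ('-' marks a space)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 9 matches, 1 mismatch, 1 gap -> 6
```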

Dynamic programming

Reduce the problem:

the solution to a large problem becomes simpler if we first know the solution to a smaller problem that is a subset of the larger problem

Overview

[Figure: a problem P split into sub-problems P1, P2 and P3]

Dynamic Programming

• Finding optimal solution to search problem

• Recursively computes solution

• Fundamental principle is to produce optimal solutions to smaller pieces of the problem first and then glue them together

• Efficient divide-and-conquer strategy because it uses a bottom-up approach and utilizes a look-up table instead of recomputing optimal solutions to sub-problems
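To illustrate the look-up table idea, the sketch below counts how many gapped alignments exist for two sequences of lengths m and n, once by plain recursion (which recomputes the same sub-problems over and over) and once bottom-up with a table. The counting convention is an assumption made for this example: every column is either a residue pair, a gap in the first sequence or a gap in the second, and alignments that differ only in the order of adjacent gaps are counted separately. The function names are hypothetical.

```python
def count_recursive(m, n):
    """Naive recursion: the same (m, n) sub-problems are solved again and again."""
    if m == 0 or n == 0:
        return 1
    return (count_recursive(m - 1, n - 1)   # last column pairs two residues
            + count_recursive(m - 1, n)     # last column: gap in second sequence
            + count_recursive(m, n - 1))    # last column: gap in first sequence


def count_table(m, n):
    """Bottom-up dynamic programming: each sub-problem is solved once and stored."""
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[m][n]


print(count_recursive(3, 3), count_table(3, 3))
print(count_table(10, 9))  # number of candidate alignments for a 10-mer and a 9-mer
```

The table version needs (m+1)*(n+1) steps, while the recursive version grows exponentially. The same contrast explains why exhaustive enumeration of alignments is replaced by dynamic programming.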


the best alignment between

• a zinc-finger core sequence:

– CKHVFCRVCI

• and a sequence fragment

from a viral polyprotein:

– CKKCFCKCV

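(a dot marks an empty cell)

    C K H V F C R V C I
  +--------------------
C | 1 . . . . 1 . . 1 .
K | . 1 . . . . . . . .
K | . 1 . . . . . . . .
C | 1 . . . . 1 . . 1 .
F | . . . . 1 . . . . .
C | 1 . . . . 1 . . 1 .
K | . 1 . . . . . . . .
C | 1 . . . . 1 . . 1 .
V | . . . 1 . . . 1 . .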


Dynamic Programming

The matrix is filled in backwards, starting from the bottom-right corner: first the last row (V) and the last column (I), then one row at a time moving upward. Each cell receives 1 if the two residues match (0 otherwise) plus the largest value already present in the next row to its right or in the next column below it. The completed matrix is:

    C K H V F C R V C I
  +--------------------
C | 5 3 3 3 2 2 1 1 1 0
K | 4 4 3 3 2 1 1 1 0 0
K | 3 4 3 3 2 1 1 1 0 0
C | 4 3 3 3 2 2 1 1 1 0
F | 3 2 2 2 3 1 1 1 0 0
C | 4 2 2 2 2 2 1 1 1 0
K | 2 3 2 2 2 1 1 1 0 0
C | 2 1 1 1 1 2 1 0 1 0
V | 0 0 0 1 0 0 0 1 0 0

C K H V F C R V C I
C K K C F C - K C V

C K H V F C R V C I
C K K C F C K - C V

C - K H V F C R V C I
C K K C - F C - K C V

C K H - V F C R V C I
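The filled matrix for the zinc-finger example can be reproduced with a short script. This is a sketch under one assumption: the fill rule is inferred from the numbers in the matrix itself (match value plus the best value in the next row to the right or in the next column below), since the slides do not state it in words. The function name `fill_matrix` is hypothetical.

```python
def fill_matrix(seq1, seq2):
    """seq1 labels the columns, seq2 the rows; the matrix is filled from the bottom-right."""
    rows, cols = len(seq2), len(seq1)
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            match = 1 if seq2[i] == seq1[j] else 0
            best = 0
            if i + 1 < rows and j + 1 < cols:
                next_row = m[i + 1][j + 1:]                           # next row, to the right
                next_col = [m[k][j + 1] for k in range(i + 1, rows)]  # next column, below
                best = max(next_row + next_col)
            m[i][j] = match + best
    return m


matrix = fill_matrix("CKHVFCRVCI", "CKKCFCKCV")
for residue, row in zip("CKKCFCKCV", matrix):
    print(residue, "|", " ".join(str(v) for v in row))
```

The value in the top-left cell (5) is the number of matched residues in the best alignments; each alignment is read off by following the highest values from the top-left toward the bottom-right.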




Dynamic Programming

Needleman-Wunsch-Simple.py


The Score Matrix
----------------

Seq1 (j)           1   2   3   4   5   6   7
Seq2 (i)       *   C   K   H   V   F   C   R
      *        0  -1  -2  -3  -4  -5  -6  -7
 1    C       -1   1   0  -1  -2  -3  -4  -5
 2    K       -2   0   2   1   0  -1  -2  -3
 3    K       -3  -1   1   1   0  -1  -2  -3
 4    C       -4  -2   0   0   0  -1   0  -1
 5    F       -5  -3  -1  -1  -1   1   0  -1
 6    C       -6  -4  -2  -2  -2   0   2   1
 7    K       -7  -5  -3  -3  -3  -1   1   1
 8    C       -8  -6  -4  -4  -4  -2   0   0
 9    V       -9  -7  -5  -5  -3  -3  -1  -1


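Each cell matrix(i,j) is the maximum of three candidate scores (a, b, c):

A: diagonal_score = matrix(i-1,j-1) + MATCH if substr(seq1,j-1,1) eq substr(seq2,i-1,1), otherwise matrix(i-1,j-1) + MISMATCH

B: up_score = matrix(i-1,j) + GAP

C: left_score = matrix(i,j-1) + GAP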





Seq1:CKHVFCRVCI

Seq2:CKKCFC-KCV

++--++--+- score = 0
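A minimal sketch of the simple Needleman-Wunsch procedure is given below. It is not the course file Needleman-Wunsch-Simple.py, which is not reproduced in these slides. The scoring values (match +1, mismatch -1, gap -1) are inferred from the score matrix and from the final score of 0, and the function name is hypothetical. When several alignments share the best score, the traceback here may return a different one than the slide shows.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # inferred from the score matrix on the slide


def needleman_wunsch(seq1, seq2):
    """Global alignment; seq1 runs along the columns (j), seq2 along the rows (i)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        matrix[0][j] = j * GAP
    for i in range(1, rows):
        matrix[i][0] = i * GAP

    def pair(i, j):
        return MATCH if seq1[j - 1] == seq2[i - 1] else MISMATCH

    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = max(matrix[i - 1][j - 1] + pair(i, j),  # A: diagonal
                               matrix[i - 1][j] + GAP,             # B: up
                               matrix[i][j - 1] + GAP)             # C: left

    # Trace back from the bottom-right cell to the top-left cell.
    out1, out2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and matrix[i][j] == matrix[i - 1][j - 1] + pair(i, j):
            out1.append(seq1[j - 1]); out2.append(seq2[i - 1]); i -= 1; j -= 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + GAP:
            out1.append("-"); out2.append(seq2[i - 1]); i -= 1
        else:
            out1.append(seq1[j - 1]); out2.append("-"); j -= 1
    return matrix, "".join(reversed(out1)), "".join(reversed(out2))


matrix, aligned1, aligned2 = needleman_wunsch("CKHVFCRVCI", "CKKCFCKCV")
print(aligned1)
print(aligned2)
print("score =", matrix[-1][-1])
```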


Extensions to basic dynamic programming method

use gap penalties

– constant gap penalty for gap > 1

– gap penalty proportional to gap size

• one penalty for starting a gap (gap opening penalty)

• different (lower) penalty for adding to a gap (gap extension penalty)

use BLOSUM62

• instead of MATCH and MISMATCH
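The gap penalty schemes listed above can be compared with a small sketch. The numeric values (-2, -10, -1) and the function names are hypothetical, chosen only for illustration. The convention that the opening penalty already covers the first gap position is also an assumption; some programs charge opening plus extension for the first position.

```python
def constant_gap(length, penalty=-2):
    """Same penalty for any gap, whatever its length."""
    return penalty if length > 0 else 0


def linear_gap(length, per_position=-2):
    """Penalty proportional to gap size."""
    return per_position * length


def affine_gap(length, gap_open=-10, gap_extend=-1):
    """One penalty for starting a gap, a lower one for each added position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for length in (1, 2, 5, 10):
    print(length, constant_gap(length), linear_gap(length), affine_gap(length))
```

With the affine scheme one long gap is cheaper than several short ones of the same total length, which matches the idea that a single insertion or deletion event can span many residues.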

Dynamic Programming: Needleman-Wunsch-Complete.py







Uses of Needleman-Wunsch-Complete.py

• Time Complexity

• Use random proteins to generate

histogram of scores from aligned

random sequences
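The random-protein experiment can be sketched as follows. This is an illustration only, not Needleman-Wunsch-Complete.py: it assumes uniform amino acid frequencies and the simple +1 / -1 / -1 scoring instead of BLOSUM62, and it keeps only two rows of the score matrix because the alignment itself is not needed. All names are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed simple scoring


def global_score(s1, s2):
    """Global alignment score using two rows of the matrix (score only, no traceback)."""
    prev = [j * GAP for j in range(len(s1) + 1)]
    for i in range(1, len(s2) + 1):
        cur = [i * GAP]
        for j in range(1, len(s1) + 1):
            diag = prev[j - 1] + (MATCH if s1[j - 1] == s2[i - 1] else MISMATCH)
            cur.append(max(diag, prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


def random_protein(length, rng):
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def score_histogram(n_pairs=200, length=30, seed=0):
    """Count how often each score occurs among aligned pairs of random sequences."""
    rng = random.Random(seed)
    return Counter(
        global_score(random_protein(length, rng), random_protein(length, rng))
        for _ in range(n_pairs)
    )


hist = score_histogram()
for score in sorted(hist):
    print(f"{score:4d} {'#' * hist[score]}")
```

A score obtained for two real sequences can then be compared with this background distribution to judge whether it is higher than expected by chance.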

Time complexity with Needleman-Wunsch-Complete.py

Sequence Length (aa)    Execution Time (h:mm:ss)
10                      0:00:00.001500
25                      0:00:00.005340
50                      0:00:00.020112
100                     0:00:00.081580
500                     0:00:01.960721
1000                    0:00:07.720884
10000                   0:11:36.344549
100000                  Memory could not be written

Simple version (Match/Mismatch) – no gap extension

Complete version !
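The timings in the table are consistent with an O(n^2) algorithm, as a quick calculation suggests. The sketch below uses only the two largest measured values from the table. The 8 bytes per matrix cell is a hypothetical figure for a rough memory estimate; a Python list of lists needs more.

```python
def to_seconds(hours, minutes, seconds):
    return hours * 3600 + minutes * 60 + seconds


t_1000 = to_seconds(0, 0, 7.720884)     # measured for length 1000
t_10000 = to_seconds(0, 11, 36.344549)  # measured for length 10000

observed_ratio = t_10000 / t_1000
expected_ratio = (10000 / 1000) ** 2    # 10x longer sequences -> 100x more cells

print(f"observed ratio: {observed_ratio:.1f}, expected for O(n^2): {expected_ratio:.0f}")


def matrix_cells(n):
    """Cells in the score matrix for two sequences of length n (including row/column 0)."""
    return (n + 1) ** 2


cells = matrix_cells(100000)
print(f"cells for length 100000: {cells}, about {cells * 8 / 1e9:.0f} GB at 8 bytes per cell")
```

The number of cells for two sequences of length 100000 gives a plausible reason why that run ended with a memory error.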

                             Sequences reported     Sequences reported
                             as related             as unrelated
homologous sequences         True positives         False negatives
non-homologous sequences     False positives        True negatives

Sensitivity:

ability to find

true positives

Specificity:

ability to minimize

false positives
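In terms of the four counts in the table, sensitivity and specificity are usually computed as shown in this sketch. The counts in the example call are hypothetical and serve only as illustration.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of homologous sequences that are reported as related."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-homologous sequences that are reported as unrelated."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search result: 100 homologous and 900 non-homologous database sequences.
tp, fn, fp, tn = 90, 10, 5, 895
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```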

If the sequences are similar, the path

of the best alignment should be very

close to the main diagonal.

Therefore, we may not need to fill the

entire matrix, rather, we fill a narrow

band of entries around the main

diagonal.

An algorithm that fills in a band of

width 2k+1 around the main

diagonal.
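The band idea can be sketched as below: only cells with |i - j| <= k are filled, which gives the band of width 2k+1. For readability this version still allocates the full matrix and simply leaves cells outside the band at minus infinity; an implementation meant to save memory would store only the band. The function name and the +1 / -1 / -1 scoring are assumptions.

```python
NEG_INF = float("-inf")


def banded_score(s1, s2, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score restricted to a band of width 2k+1 around the main diagonal."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the last cell")
    score = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    for i in range(m + 1):
        for j in range(max(0, i - k), min(n, i + k) + 1):  # only cells inside the band
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                best = score[i - 1][j - 1] + (match if s1[j - 1] == s2[i - 1] else mismatch)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)
            if j > 0:
                best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
    return score[m][n]


print(banded_score("CKHVFCRVCI", "CKKCFCKCV", k=1))   # narrow band
print(banded_score("CKHVFCRVCI", "CKKCFCKCV", k=10))  # band covers the whole matrix
```

If the best alignment stays inside the band, the banded score equals the full score while only about (2k+1)*n cells are filled instead of n*m.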

Local alignment

• The concept of ‘local alignment’ was

introduced by Smith & Waterman in 1981

• A local alignment of 2 sequences is an

alignment between parts of the 2

sequences

Two proteins may share only one stretch of high sequence similarity, but be very dissimilar outside that region

A global (N-W) alignment of such sequences would have:

(i) lots of matches in the region of high sequence similarity

(ii) lots of mismatches & gaps (insertions/deletions) outside the region of similarity

It makes sense to find the best local alignment instead

Smith-Waterman.py

• Three changes

– The edges of the matrix are initialized to 0 instead

of increasing gap penalties

– The maximum score is never less than 0, and no

pointer is recorded unless the score is greater

than 0

– The trace-back starts from the highest score in

the matrix (rather than at the end of the matrix)

and ends at a score of 0 (rather than the start of

the matrix)
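The three changes can be seen in the following sketch of a local alignment. It is not the course file Smith-Waterman.py, which is not reproduced in these slides; the function name, the +1 / -1 / -1 scoring and the two test sequences are assumptions for illustration.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (best score, aligned part of seq1, aligned part of seq2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]  # change 1: edges are initialized to 0

    def pair(i, j):
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,                                # change 2: never below 0
                              score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # change 3: trace back from the highest score until a cell with score 0 is reached
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out1.append(seq1[j - 1]); out2.append(seq2[i - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append("-"); out2.append(seq2[i - 1]); i -= 1
        else:
            out1.append(seq1[j - 1]); out2.append("-"); j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


# Two hypothetical sequences that share only the stretch CKFC.
print(smith_waterman("AAACKFCAAA", "GGGCKFCGGG"))
print(smith_waterman("CKHVFCRVCI", "CKKCFCKCV"))
```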



The best alignment:

The one with the maximum total score

Multiple Alignment: n>2

2 to 3: hyperlattice

On its top-left side, the cube is

"covered" by the polyhedron. The

edges 1, 2, 3, 6 and 7 are coming

from the inside, and edges 4 and 5

can be ignored (and are therefore

not labeled in the figure).

• Each node in the k-dimensional hyperlattice is

visited once, and therefore the running time

must be proportional to the number of nodes in

the lattice.

– This number is the product of the lengths of the

sequences.

– eg. the 3-dimensional lattice as visualized.

Computational Complexity of MA by standard Dynamic Programming

• The memory space requirement is even worse.

To trace back the alignment, we need to store the

whole lattice, a data structure the size of a

multidimensional skyscraper.

– In fact, space is the No.1 problem here, bogging down

multiple alignment methods that try to achieve

optimality.

– Furthermore, incorporating a realistic gap model, we

will further increase our demands on space and running

time
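A short calculation shows how quickly the lattice grows. The sketch follows the statement above that the number of nodes is the product of the sequence lengths; the function name and the 8 bytes per node are hypothetical.

```python
import math


def lattice_nodes(lengths):
    """Number of nodes in the hyperlattice: the product of the sequence lengths."""
    return math.prod(lengths)


print(lattice_nodes([2000, 2000]))        # pairwise alignment: 4 million nodes
print(lattice_nodes([2000, 2000, 2000]))  # three sequences: 8 billion nodes
nodes = lattice_nodes([2000] * 8)         # eight sequences of 2000 residues
print(f"{nodes:.2e} nodes, about {nodes * 8 / 1e12:.2e} TB at 8 bytes per node")
```

Every additional sequence multiplies both the running time and the storage by its length, which is why exact dynamic programming is limited to a handful of short sequences.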

Size/Time limits…

• The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.

• The principle is that multiple alignment is achieved by successive application of pairwise methods.

– First do all pairwise alignments (not just one sequence with all others)

– Then combine pairwise alignments to generate the overall alignment

Multiple Alignment Method

• The steps are summarized as follows:

– Compare all sequences pairwise.

– Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.

– Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair and so on. Once an alignment of two sequences has been made, then this is fixed. Thus for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignments of A and C with that of B and D using averaged

scores at each aligned position.
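The first two steps (all pairwise comparisons, then clustering into a joining order) can be sketched as follows. Several things here are assumptions made to keep the example short: similarity is measured with `difflib.SequenceMatcher` from the Python standard library as a stand-in for a real pairwise alignment score, clusters are joined by average similarity, and the four sequences and all function names are hypothetical. The final step, aligning the alignments, is not shown.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in for a pairwise alignment score: ratio of matching characters (0..1)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_order(seqs):
    """seqs maps name -> sequence. Returns the list of cluster pairs in joining order."""
    pair_sim = {frozenset(p): similarity(seqs[p[0]], seqs[p[1]])
                for p in combinations(seqs, 2)}

    def linkage(pair):
        a, b = pair  # average similarity between the members of two clusters
        return sum(pair_sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=linkage)  # most similar pair first
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


sequences = {
    "A": "CKHVFCRVCI",
    "B": "MTEYKLVVVG",
    "C": "CKHVFCRVCV",
    "D": "MTEYKLVVLG",
}
for left, right in guide_order(sequences):
    print("align", "+".join(left), "with", "+".join(right))
```

With these four sequences the order reproduces the situation described above: A is joined with C, B with D, and the two resulting alignments are combined last.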




• Automatic multiple alignment

– extend dynamic programming (MSA - Lipman)

• limit: computing power: length and number of sequences (e.g. 2000^8)

– progressive alignment (Feng & Doolittle)

• use “guide tree” (PileUp, ClustalW etc.)

• Dedicated alignment editing program

– Boxshade

– SeaView

– SeqPup (Java)

• Combination (Biology – Computation)

Multiple Sequence Alignment programs

• ClustalW is a general purpose multiple alignment program for DNA or proteins.

• ClustalW is produced by Julie D. Thompson, Toby Gibson of European Molecular Biology Laboratory, Germany and Desmond Higgins of European Bioinformatics Institute, Cambridge, UK.

• Algorithmic: improves the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.

ClustalW

****** MULTIPLE ALIGNMENT MENU ******

1. Do complete multiple alignment now (Slow/Accurate)

2. Produce guide tree file only

3. Do alignment using old guide tree file

4. Toggle Slow/Fast pairwise alignments = SLOW

5. Pairwise alignment parameters

6. Multiple alignment parameters

7. Reset gaps between alignments? = OFF

8. Toggle screen display = ON

9. Output format options

S. Execute a system command

H. HELP

or press [RETURN] to go back to main menu

Your choice:

Running ClustalW

• The final product of a PILEUP run is a set of aligned

sequences, which are stored in a Multiple Sequence File (called .msf by GCG).

This msf file is a text file that can be formatted with

a text editor, but GCG has some dedicated tools for

improving the looks of msf files for easier

interpretation and for publication.

• Consensus sequences can be calculated and the

relationship of each character of each sequence to

the consensus can be highlighted using the

program PRETTY

Formatting Multiple Alignments

../Cursus2003/Flbtw2000/malign-msf.html

• Shading of regions of high homology can be created using

the programs BOXSHADE and PRETTYBOX , but that

goes beyond the scope of this tutorial. (Boxshade:

http://www.ch.embnet.org/software/BOX_form.html)

• In addition to these programs that run on the Alpha, the

output of PILEUP (or CLUSTAL) can be moved by FTP

from your RCR account to a local Mac or PC.

• Since this output is a plain text file, it can be edited with

any word processing program, or imported into any

drawing program to add boldface text, underlining,

shading, boxes, arrows, etc

Formatting Multiple Alignments

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

VTISCTGSSSNIGAG-NHVKWYQQLPG

VTISCTGTSSNIGS--ITVNWYQQLPG

LRLSCSSSGFIFSS--YAMYWVRQAPG

LSLTCTVSGTSFDD--YYSTWVRQPPG

PEVTCVVVDVSHEDPQVKFNWYVDG--

ATLVCLISDFYPGA--VTVAWKADS--

AALGCLVKDYFPEP--VTVSWNSG---

VSLTCLVKGFYPSD--IAVEWWSNG--

An example of Multiple Alignment … immunoglobulin

• Their alignment highlights conserved

residues (one of the cysteines forming the

disulphide bridges, and the tryptophan are

notable)

• conserved regions (in particular, "Q.PG" at

the end of the first 4 sequences), and more

sophisticated patterns, like the dominance of

hydrophobic residues at fragment positions 1

and 3.

• The alternating hydrophobicity pattern is

typical for the surface beta-strand at the

beginning of each fragment. Indeed, multiple

alignments are helpful for protein structure

prediction.
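The conserved cysteine and tryptophan columns can be found programmatically from the eight immunoglobulin fragments. This is a minimal sketch with hypothetical function names; the consensus here is simply the most frequent residue per column, and ties are broken arbitrarily.

```python
from collections import Counter

ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def consensus(rows, gap="-"):
    """Most frequent residue in each column (gaps ignored unless the column is all gaps)."""
    result = []
    for column in zip(*rows):
        residues = [r for r in column if r != gap]
        result.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(result)


def conserved_columns(rows, gap="-"):
    """0-based indices of columns in which every sequence has the same residue."""
    return [i for i, column in enumerate(zip(*rows))
            if len(set(column)) == 1 and column[0] != gap]


print(consensus(ALIGNMENT))
for index in conserved_columns(ALIGNMENT):
    print(f"column {index + 1}: {ALIGNMENT[0][index]} in all sequences")
```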


• Provided the alignment is accurate, the following may be inferred about the secondary structure from a multiple sequence alignment.

The position of insertions and deletions (INDELS) suggests regions where surface loops exist.

Conserved glycine or proline suggests a beta-turn.

A Practical Approach: Interpretation

• Residues with hydrophobic properties conserved at i, i+2, i+4 separated by unconserved or hydrophilic residues suggest surface beta-strands.

A short run of hydrophobic amino acids (4 residues) suggests a buried beta-strand.

Pairs of conserved hydrophobic amino acids separated by pairs of unconserved or hydrophilic residues suggest an alpha-helix with one face packing in the protein core. Likewise, an i, i+3, i+4, i+7 pattern of conserved hydrophobic residues.

A Practical Approach: Interpretation

• Take out noise (GAPS)

• Extra information (structure - function)

• Recursive selection

– first most similar to have an idea about

conserved regions

– manual scan for these in more distant

members then include these

A Practical Approach: Which sequences to use ?


L-align (2 sequences)

SIM (www.expasy.ch)

LALNVIEW is available for UNIX, Mac

and PC on the ExPASy anonymous

FTP server.

very nice TWEAKING tool (70% criteria)

http://www.expasy.ch/

http://us.expasy.org/ftp/pub/lalnview/

Length

P-value

SIM


How can I use NCBI to compare two sequences?

Answer: Use the “BLAST 2 Sequences” program

• Go to http://www.ncbi.nlm.nih.gov/BLAST

• Choose BLAST 2 sequences

• In the program,

[1] choose blastp (protein search) or blastn (for DNA)

[2] paste in your accession numbers (or use FASTA format)

[3] select optional parameters, such as

-- BLOSUM62 matrix is default for proteins; try PAM250 for distantly related proteins

-- gap creation and extension penalties

[4] click “align”

Practical guide to pairwise alignment:

the “BLAST 2 sequences” website

Question #2: How can I use NCBI to compare a sequence to an entire database?

BLAST!

Weblems

W4.1: Align the amino acid sequences of the acetylcholine receptor from human, rat, mouse and dog with

ClustalW

T-Coffee

Dali

MSA

W4.2: Use BoxShade to create a Word file indicating the different conserved residues in colours

W4.3: Perform a local alignment using SIM and Lalign on the same sequences, and with Blast2

W4.4: Do the different methods give different results? What are the default settings they use?

W4.5: How would you identify critical residues for catalytic activity?
