
Multiple Sequence Alignment: Sum-of-Pairs and ClustalW

Ulf Leser

This Lecture

• Multiple Sequence Alignment
• The problem
• Theoretical approach: Sum-of-Pairs scores
• Practical approach: ClustalW

Ulf Leser: Bioinformatics, Summer Semester 2011

Multiple Sequence Alignment

• We now align multiple (k>2) sequences
– Note: BLAST, too, aligns only two sequences
• Why?
– Imagine k sequences of the promoter region of genes, all regulated by the same transcription factor f. Which subsequence within the k sequences is recognized by f?
– Imagine k sequences of proteins that bind to DNA. Which subsequence of the k sequences codes for the part of the proteins that performs the binding?
• General
– We want to know the common part(s) in k sequences
– “Common” does not mean identical
– This part can be anywhere within the sequences

Definition

• Definition
– A multiple sequence alignment (MSA) of k strings si, 1≤i≤k, is a table of k rows and l columns (l ≥ max(|si|)), such that
• Row i contains the sequence of si, with an arbitrary number of blanks inserted at arbitrary positions
• Every symbol of every si stands in exactly one column
• No column contains only blanks

Sequences:

AACGTGATTGAC
TCGAGTGCTTTACAGT
GCCGTGCTAGTCG
TTCAGTGGACGTGGTA
GGTGCAGACC

One possible MSA (5 rows, 20 columns):

AACGTGATTGA____C____
_TCGAGT__GCTTTACAGT_
GCCGTGCTAGT____C_G__
TTCA_GTG_GACGTG__GTA
G__GTGCA_GAC___C____
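The three conditions of the definition can be checked mechanically. The following is a minimal sketch, not part of the lecture: the function name `is_valid_msa` and the use of `_` as the blank symbol are assumptions made for illustration.

```python
GAP = "_"  # blank symbol, as written in the example alignments


def is_valid_msa(rows, sequences):
    """Check the MSA definition for `rows` against the unaligned `sequences`."""
    if not rows or len(rows) != len(sequences):
        return False
    width = len(rows[0])
    # The MSA is a table: all rows have the same number of columns.
    if any(len(row) != width for row in rows):
        return False
    # Row i is exactly s_i with blanks inserted at arbitrary positions.
    if any(row.replace(GAP, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # No column consists of blanks only.
    return all(any(row[c] != GAP for row in rows) for c in range(width))


sequences = ["AACGTGATTGAC", "TCGAGTGCTTTACAGT", "GCCGTGCTAGTCG",
             "TTCAGTGGACGTGGTA", "GGTGCAGACC"]
msa = ["AACGTGATTGA____C____",
       "_TCGAGT__GCTTTACAGT_",
       "GCCGTGCTAGT____C_G__",
       "TTCA_GTG_GACGTG__GTA",
       "G__GTGCA_GAC___C____"]
print(is_valid_msa(msa, sequences))
```

Note that l can be much larger than the longest sequence: the only upper limit comes from the rule that no column may be entirely blank.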

Good MSA

• We are searching for good (optimal) MSAs
• Defining “optimal” here is not as simple as in the k=2 case
• Intuition
– All sequences had a common ancestor and diverged through evolution
– We want to assume as few evolutionary events as possible
– Thus, we want few columns (~ few indels)
– Thus, we want homogeneous columns (~ few replacements)

[Figure: two alternative MSAs of the same five sequences]


What Should we Count?

• For two sequences
– We scored each column using a scoring matrix
– We searched for the alignment with the maximal total score
• But how do we score a column with 5*T, 3*A, 1*_?
– We would need an exponentially big scoring matrix
• Alternative: Sum-of-Pairs Score
– We score an entire MSA
– We score the alignment of each pair of sequences in the usual way
– We aggregate over all pairs to score the MSA
– We need a clever algorithm to find the MSA with the best score

Formally

• Definition
– Let M be an MSA for the set S of k sequences S={s1,...,sk}
– The alignment of si with sj induced by M is generated as follows
• Remove from M all rows except i and j
• Remove all columns that contain only blanks
– The sum-of-pairs score (sop) of M is the sum of all pairwise induced alignment scores
– The optimal MSA for S wrt. sop is the MSA with the best sop-score over all possible MSAs for S (the highest for similarity scores, the lowest for cost-based scores such as the edit costs used in the rest of this lecture)

Example

Costs: d/i = 1, r = 1, m = 0

AAGAA_A
AT_AATG
CTG_G_G
sop-score: 14

AAGAA_A
_ATAATG
C_TGG_G
sop-score: 7 + 5 + 4 = 16
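The sum-of-pairs definition translates almost directly into code. The sketch below is an illustration under the unit costs of the example (indel 1, replacement 1, match 0); the names `pair_cost` and `sop_score` are hypothetical. Scoring a blank/blank pair with 0 has the same effect as removing all-blank columns from the induced pairwise alignment.

```python
from itertools import combinations

GAP = "_"


def pair_cost(x, y, indel=1, replace=1, match=0):
    """Cost of one column of an induced pairwise alignment."""
    if x == GAP and y == GAP:
        return 0  # this column is removed in the induced alignment
    if x == GAP or y == GAP:
        return indel
    return match if x == y else replace


def sop_score(msa, **costs):
    """Sum of the induced pairwise alignment costs over all pairs of rows."""
    return sum(
        pair_cost(x, y, **costs)
        for row_a, row_b in combinations(msa, 2)
        for x, y in zip(row_a, row_b)
    )


# First example MSA: the pairwise costs are 4, 5 and 5.
print(sop_score(["AAGAA_A", "AT_AATG", "CTG_G_G"]))  # 14
```

For k rows and l columns there are k(k-1)/2 pairs with l columns each, so scoring a given MSA takes O(k²·l) steps. The hard part is finding the best MSA, not scoring one.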

• Given an MSA over k sequences of length l – how complex is it to compute its sop-score?

• How do we find the best MSA?

Analogy

• Think of the k=2 case
• Every alignment is a path through the matrix
• The three possible directions (down, right, down-right) correspond to the three possible constellations in a column (XX, X_, _X)
• With growing paths, we align growing prefixes of both sequences

Analogy

• Assume k=3
• Think of a 3-dimensional cube with the three sequences giving the values in each dimension
• Now, we have paths aligning growing prefixes of three sequences
• Every column has seven possible constellations (XXX, XX_, X_X, _XX, X__, _X_, __X)
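The constellations are simply all ways to put either a symbol (X) or a blank (_) into each of the k rows, except the all-blank column. A small sketch (the function name `constellations` is made up for this illustration):

```python
from itertools import product


def constellations(k):
    """All column patterns for k sequences: X = symbol, _ = blank."""
    return ["".join(c) for c in product("X_", repeat=k) if "X" in c]


for k in (2, 3, 4):
    print(k, len(constellations(k)))  # 2**k - 1 patterns each
```

Each constellation corresponds to one direction in which a path can leave a cell of the k-dimensional cube.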

Dynamic Programming in two Dimensions

• We compute the best possible alignment score d(i,j) for every pair of prefixes (lengths i,j) using the following formula

$$
d(i,j) = \min
\begin{cases}
d(i,j-1) + 1 & \text{insertion in one sequence} \\
d(i-1,j) + 1 & \text{insertion in the other sequence} \\
d(i-1,j-1) + t(i,j) & \text{match or mismatch}
\end{cases}
$$
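As a reminder of the two-sequence case, the recurrence can be filled in row by row. This is a minimal sketch that assumes unit costs, i.e. t(i,j) = 0 for equal symbols and 1 otherwise; the function name `edit_distance` is chosen for this illustration.

```python
def edit_distance(a, b):
    """Cost of the best global alignment of a and b under unit costs."""
    # d[i][j] = best cost for aligning the prefixes a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i  # a[:i] against the empty prefix: i insertions
    for j in range(1, len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            t = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i][j - 1] + 1,      # insertion in one sequence
                d[i - 1][j] + 1,      # insertion in the other sequence
                d[i - 1][j - 1] + t,  # match or mismatch
            )
    return d[len(a)][len(b)]


print(edit_distance("ACGT", "AGT"))  # 1
```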

Dynamic Programming in three Dimensions

• We compute the best possible alignment score d(i,j,k) for every triple of prefixes (lengths i,j,k) using the following formula

$$
d(i,j,k) = \min
\begin{cases}
d(i-1,j-1,k-1) + c_{ij} + c_{ik} + c_{jk} & \text{three (mis)matches} \\
d(i-1,j-1,k) + c_{ij} + 2 & \text{one (mis)match, two insertions} \\
d(i-1,j,k-1) + c_{ik} + 2 & \\
d(i,j-1,k-1) + c_{jk} + 2 & \\
d(i-1,j,k) + 2 & \text{two insertions} \\
d(i,j-1,k) + 2 & \\
d(i,j,k-1) + 2 &
\end{cases}
$$

Let cij = 0, if S1(i) = S2(j), else 1
Let cik = 0, if S1(i) = S3(k), else 1
Let cjk = 0, if S2(j) = S3(k), else 1

All Possible Steps

Predecessor cell and the column it appends (rows 1 / 2 / 3, here for the symbols G, A, C):

• d(i-1,j-1,k-1): G / A / C
• d(i,j-1,k-1): - / A / C
• d(i,j,k-1): - / - / C
• d(i,j-1,k): - / A / -
• d(i-1,j,k): G / - / -
• d(i-1,j-1,k): G / A / -
• d(i-1,j,k-1): G / - / C

Concrete Examples

Coming from d(i,j-1,k); appended column: 1 - / 2 A / 3 -
• Best sop-score for d(i,j-1,k) is known
• We want to compute d(i,j,k)
• This requires aligning one symbol with two blanks (blank/blank does not count)
• d(i,j,k) = d(i,j-1,k) + 2

Coming from d(i-1,j,k-1); appended column: 1 G / 2 - / 3 C
• Best sop-score for d(i-1,j,k-1) is known
• We want to compute d(i,j,k)
• This requires aligning a blank with S1(i) and with S3(k), and aligning S1(i) with S3(k)
• d(i,j,k) = d(i-1,j,k-1) + 2 + cik

Initialization

• We need to start somewhere
• Of course, we have d(0,0,0)=0
• Aligning in one dimension: d(i,0,0)=2*i
– Ditto for d(0,j,0), d(0,0,k)
• Aligning in two dimensions: d(i,j,0)= …
– Let da,b(i,j) be the alignment score for Sa[1..i] with Sb[1..j]
– d(i,j,0) = d1,2(i,j) + (i+j)
– d(i,0,k) = d1,3(i,k) + (i+k)
– d(0,j,k) = d2,3(j,k) + (j+k)

Algorithm

initialize matrix d;
for i := 1 to |S1|
  for j := 1 to |S2|
    for k := 1 to |S3|
      if (S1(i) = S2(j)) then cij := 0; else cij := 1;
      if (S1(i) = S3(k)) then cik := 0; else cik := 1;
      if (S2(j) = S3(k)) then cjk := 0; else cjk := 1;
      d1 := d[i-1,j-1,k-1] + cij + cik + cjk;
      d2 := d[i-1,j-1,k] + cij + 2;
      d3 := d[i-1,j,k-1] + cik + 2;
      d4 := d[i,j-1,k-1] + cjk + 2;
      d5 := d[i-1,j,k] + 2;
      d6 := d[i,j-1,k] + 2;
      d7 := d[i,j,k-1] + 2;
      d[i,j,k] := min(d1, d2, d3, d4, d5, d6, d7);
    end for;
  end for;
end for;
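A runnable version of the three-sequence algorithm, including the initialization of the three faces of the cube, might look as follows. It is a sketch under the unit costs used in this lecture; the names `pair_table` and `sop_distance_3` are hypothetical, and only the optimal score is computed (no traceback).

```python
def pair_table(a, b):
    """d[i][j] = unit-cost alignment score of the prefixes a[:i] and b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j
            else:
                d[i][j] = min(d[i][j - 1] + 1, d[i - 1][j] + 1,
                              d[i - 1][j - 1] + int(a[i - 1] != b[j - 1]))
    return d


def sop_distance_3(s1, s2, s3):
    """Best (minimal) sum-of-pairs cost of an MSA of three sequences."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    d12, d13, d23 = pair_table(s1, s2), pair_table(s1, s3), pair_table(s2, s3)
    d = [[[0] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    # Faces of the cube: pairwise score plus one blank per aligned symbol.
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            d[i][j][0] = d12[i][j] + i + j
    for i in range(n1 + 1):
        for k in range(n3 + 1):
            d[i][0][k] = d13[i][k] + i + k
    for j in range(n2 + 1):
        for k in range(n3 + 1):
            d[0][j][k] = d23[j][k] + j + k
    # Interior cells: minimum over the seven predecessor corners.
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            for k in range(1, n3 + 1):
                cij = int(s1[i - 1] != s2[j - 1])
                cik = int(s1[i - 1] != s3[k - 1])
                cjk = int(s2[j - 1] != s3[k - 1])
                d[i][j][k] = min(
                    d[i - 1][j - 1][k - 1] + cij + cik + cjk,
                    d[i - 1][j - 1][k] + cij + 2,
                    d[i - 1][j][k - 1] + cik + 2,
                    d[i][j - 1][k - 1] + cjk + 2,
                    d[i - 1][j][k] + 2,
                    d[i][j - 1][k] + 2,
                    d[i][j][k - 1] + 2,
                )
    return d[n1][n2][n3]


print(sop_distance_3("AAGAAA", "ATAATG", "CTGGG"))
```

The three input strings are the ungapped rows of the earlier sum-of-pairs example, so the result cannot be larger than the score 14 of the MSA shown there.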

Bad News – Complexity

• For 3 sequences of length n
– There are n^3 cells in the cube
– For each cell (top-left-front corner), we need to look at 7 corners
– Together: O(7*n^3) computations
• For k sequences of length n
– There are n^k cell corners in the cube
– For each corner, we need to look at 2^k-1 other corners
– Together: O(2^k * n^k) computations
• Actually, the problem is NP-complete
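To get a feeling for these numbers, the count (2^k - 1) * n^k can simply be evaluated. The helper `dp_work` below is a hypothetical name, and the sequence length n = 100 is an arbitrary choice for illustration.

```python
def dp_work(k, n):
    """Predecessor look-ups of the k-dimensional DP: (2^k - 1) * n^k."""
    return (2 ** k - 1) * n ** k


for k in range(2, 7):
    print(k, dp_work(k, 100))
```

Every additional sequence multiplies the work by roughly 2n, which is why the exact algorithm is only usable for very few sequences.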


Scoring an MSA

• Let’s take one step back
• What happened during evolution?

[Figure: phylogenetic tree rooted at CTTGCA, passing through GTTGCA, GTTTCA, GTTTTCA, and GTATTTCA, with the five leaves CTTGCA, GTTGACA, GTTGTTA, GTATTTCT, and GTATTTGA]

MSA of the leaves:

CT_TGC_A
GT_TGACA
GT_TGTTA
GTATTTCT
GTATTTGA

• Real number of events: 8
• sop-score: 2+3+6+6+2+…
– Everything is counted multiple times

Different Scoring Function

• If we knew the phylogenetic tree of the k sequences
– Align every parent with all its children
– Aggregate all alignment scores
– This gives the “real” number of evolutionary operations
• But: Finding the true phylogenetic tree requires an MSA
– Not covered here
• Use a heuristic: ClustalW
– Higgins, D. G. and Sharp, P. M. (1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer." Gene 73(1): 237-44.
– Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Res 22(22): 4673-80.

ClustalW

• Main idea
– Compute a “good enough” phylogeny – the guide tree
– Use the guide tree to iteratively align small MSAs to larger MSAs
• Starting from single sequences
– To escape the curse of dimensionality, compute the MSA iteratively
• “Progressive” MSA
• Does not necessarily find the best solution
• Needs a fast method to align two MSAs
• Works quite well in practice
• Many other, newer (better) proposals
– DAlign, T-Coffee, HMMT, PRRT, MULTALIGN, ...

Step 1: Compute the Guide Tree

• Compute all pairwise alignments and store their scores in a matrix M
– M[i,j] = score of aligning si with sj, interpreted as a distance (smaller values mean more similar sequences)
• Compute the guide tree using hierarchical clustering
– Choose the smallest M[i,j]
– Let si and sj form a new (next) branch of the tree
– Compute the distance from the ancestor of si and sj to all other sequences as the average of the distances to si and sj
• Set M’ = M
• Delete rows and columns i and j
• Add a new column and row (ij)
• For all k≠ij: M’[ij,k] = (M[i,k] + M[j,k]) / 2
– Iterate until M’ has only one column / row

Sketch

[Figure: stepwise clustering of seven sequences A–G; in every step the two closest entries are merged and the matrix loses one row and one column]

Merge order: (B,D)→a, (E,F)→b, (A,b)→c, (C,G)→d, (d,c)→e, (a,e)→f

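Example

     A    B    C    D    E
A        17   59   59   77
B             37   61   53
C                  13   41
D                       21

Smallest entry: M[C,D] = 13, so C and D are merged into CD

     A    B    E    CD
A        17   77   59
B             53   49
E                  31

Smallest entry: M[A,B] = 17, so A and B are merged into AB

     E    CD   AB
E        31   65
CD            54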
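The clustering procedure of Step 1 can be sketched in a few lines. The function name `guide_tree` and the nested-tuple tree representation are assumptions of this illustration. The distances are those of the example; the D–E value 21 does not appear in the extracted table and is inferred here from the averaging rule, since (41 + x) / 2 = 31 gives x = 21.

```python
def guide_tree(names, dist):
    """Average-distance hierarchical clustering; returns a nested tuple tree."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    clusters = list(names)
    while len(clusters) > 1:
        # Choose the pair of clusters with the smallest distance.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        rest = [c for c in clusters if c != a and c != b]
        merged = (a, b)
        # Distance of the new node to every other cluster: plain average.
        for c in rest:
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = rest + [merged]
    return clusters[0]


DIST = {
    ("A", "B"): 17, ("A", "C"): 59, ("A", "D"): 59, ("A", "E"): 77,
    ("B", "C"): 37, ("B", "D"): 61, ("B", "E"): 53,
    ("C", "D"): 13, ("C", "E"): 41,
    ("D", "E"): 21,  # inferred value, see the text above
}
print(guide_tree("ABCDE", DIST))
```

The merge order is C with D, then A with B, then E with CD, and finally AB with the rest. As noted under "Issues", ClustalW itself builds its guide tree with neighbor-joining, so this sketch follows the simplified procedure of this slide and not the actual tool.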

Example

Align C with D:

C PADKTNVKAAWGKVGAHAGEYGA
D AADKTNVKAAWSKVGGHAGEYGA

Align A with B:

A PEEKSAVTALWGKVNVDEYGG
B GEEKAAVLALWDKVNEEEYGG

Align E with (C,D):

C PADKTNVKAAWG_KVGAHAGEYGA
D AADKTNVKAAWS_KVGGHAGEYGA
E AA__TNVKTAWSSKVGGHAPA__A

Align (A,B) with (C,D,E):

A PEEKSAV_TALWG_KVN__VDEYGG
B GEEKAAV_LALWD_KVN__EEEYGG
C PADKTNVKAA_WG_KVGAHAGEYGA
D AADKTNVKAA_WS_KVGGHAGEYGA
E AA__TNVKTA_WSSKVGGHAPA__A

Once a gap, always a gap

Step 2: Progressive MSA

• Pairwise alignment in the order given by the guide tree
• Aligning an MSA M1 with an MSA M2
– Use the usual (global) alignment algorithm
– To score a column, compute the average score over all pairs of symbols in this column
• Example

M1:
A …P…
B …G…
C …P…

M2:
D …A…
E …A…
F …Y…

Score of this column: (2*s(P,A)+s(P,Y)+2*s(G,A)+s(G,Y)+2*s(P,A)+s(P,Y)) / 9
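The column score used when aligning two MSAs is the average over all cross pairs of symbols. The sketch below reproduces the example column; the function `column_score` is a hypothetical name, and the substitution values in `TOY_SCORES` are made up for illustration and are not taken from a real scoring matrix. Blanks inside the columns are not treated specially here.

```python
from itertools import product


def column_score(col1, col2, s):
    """Average of s(x, y) over all pairs with x from col1 and y from col2."""
    pairs = list(product(col1, col2))
    return sum(s(x, y) for x, y in pairs) / len(pairs)


# Made-up substitution scores, symmetric by construction.
TOY_SCORES = {
    frozenset("PA"): -1, frozenset("PY"): -3,
    frozenset("GA"): 0, frozenset("GY"): -3,
}


def s(x, y):
    return TOY_SCORES[frozenset((x, y))]


# Column P, G, P of M1 against column A, A, Y of M2:
# (2*s(P,A) + s(P,Y) + 2*s(G,A) + s(G,Y) + 2*s(P,A) + s(P,Y)) / 9
print(column_score("PGP", "AAY", s))
```

With this column score plugged in as the match/mismatch term, the usual two-dimensional dynamic programming runs unchanged over the columns of M1 and M2.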

Issues

• There is a lot to say about whether hierarchical clustering actually computes the “correct” tree
– See phylogenetic algorithms, “Algorithms in Bioinformatics”
• Clustal-W actually uses a different, more accurate algorithm called “neighbor-joining”
• Clustal-W is fast (O(k^2*log(k)))
– For k sequences; plus cost for computing alignments, especially M
• Idea behind progressive alignment
– Find strong signals (highly conserved blocks) first
– Outliers are added last
– Increases the chances that conserved blocks survive
– Several improvements to this scheme are known

Problems with progressive MSA

Source: Cedric Notredame, 2001
