Seminar in structural bioinformatics Multiple structural alignment of proteins By Elad Kaspani

Seminar in Seminar in structural structural

bioinformaticsbioinformatics

Multiple structural alignment Multiple structural alignment of proteins of proteins

By Elad Kaspani

Multiple structural alignmentMultiple structural alignment

OutlineOutline

IntroductionIntroduction What is Multiple structural alignment?What is Multiple structural alignment? Why do we need Multiple structural Why do we need Multiple structural

alignment?alignment? Pairwise Vs. Multiple structural alignmentPairwise Vs. Multiple structural alignment

MASS - Multiple structural alignment MASS - Multiple structural alignment by secondary structuresby secondary structures Problem definitionProblem definition General strategyGeneral strategy Algorithm descriptionAlgorithm description

Outline ContOutline Cont..

MASS - Multiple structural alignment MASS - Multiple structural alignment by secondary structuresby secondary structures Algorithm outlineAlgorithm outline ComplexityComplexity Results DiscussionResults Discussion

Summary & ConclusionsSummary & Conclusions

IntroductionIntroduction

Proteins sharing a common Proteins sharing a common substructure may have a similar substructure may have a similar function.function.

What is Multiple structural What is Multiple structural alignment ?alignment ?

Discussion – we already have Discussion – we already have pairwise alignment, isn’t that pairwise alignment, isn’t that enough?enough?

Pairwise Vs. Multiple structural Pairwise Vs. Multiple structural alignmentalignment

We have many algorithms We have many algorithms pairwise pairwise structural alignment structural alignment tasktask

Only a few methods are available for Only a few methods are available for aligning multiple structuresaligning multiple structures

Most of them are based on series of Most of them are based on series of pairwise comparisons pairwise comparisons SSAPm (SSAPm (Taylor et al., 1994) Prism Prism (Yang and Honig, 2000b) STAMP (Russell and Barton, 1992)

What do we wantWhat do we want??

Classification of existing and newly Classification of existing and newly discovered proteinsdiscovered proteins

Gaining insights into evolutionary Gaining insights into evolutionary relations between proteinsrelations between proteins

Detecting motifs common to a group Detecting motifs common to a group of proteins that share a certain of proteins that share a certain functionfunction

Structure prediction algorithmsStructure prediction algorithms

What’s wrong with What’s wrong with methods based on series methods based on series of pairwise comparisons of pairwise comparisons ??????

Multiple structural alignmentMultiple structural alignment

These methods are limited!!!These methods are limited!!! In each pairwise comp. , the only In each pairwise comp. , the only

information is about the two moleculesinformation is about the two molecules alignments optimal for the whole set can alignments optimal for the whole set can

be disregardedbe disregarded dynamic programming dynamic programming disadvantage - disadvantage -

dependent on the sequence order of the dependent on the sequence order of the polypeptide chainpolypeptide chain

We can’t see the woods We can’t see the woods

WHAT DO WE DO THEN?????????????WHAT DO WE DO THEN?????????????

multiple structural alignment by multiple structural alignment by secondary structures secondary structures

MASSMASS

MASSMASS

Considers all the given structures at the Considers all the given structures at the same timesame time

Exploiting the secondary structure Exploiting the secondary structure representation - reduced time representation - reduced time complexitycomplexity

Does not require that all the input Does not require that all the input molecules be alignedmolecules be aligned

Capable of detecting structural motifs Capable of detecting structural motifs shared only by a subset of the moleculesshared only by a subset of the molecules

MASSMASS

Can find non-sequential and even non-Can find non-sequential and even non-topological structural motifstopological structural motifs

Suitable for a broad range of Suitable for a broad range of applicationsapplications

filter noisy resultsfilter noisy results highly efficient and robusthighly efficient and robust Other multiple-based methodsOther multiple-based methods

(Escalier (Escalier et al.et al., 1988), 1988) MUSTA (Leibowitz MUSTA (Leibowitz et al.et al., 2001), 2001) MultiProt (Shatsky MultiProt (Shatsky et al.et al., 2002), 2002)

Secondary structure elements (SSE)Secondary structure elements (SSE)

Basic termsBasic terms

rigid transformationrigid transformation Q - a subsetQ - a subset T (Q) T (Q) ==R(Q) R(Q) + + t t where where R R is a 3x3 rotation is a 3x3 rotation

matrix and matrix and t t is a translation vectoris a translation vector εε-congruent-congruent

For For εε>0, >0, find two largest subsets of the input find two largest subsets of the input sets, sets, P P and and QQ, and a rigid transformation, , and a rigid transformation, TT, so , so that that distance(P, T (Q)) < distance(P, T (Q)) < εε

How do we measure How do we measure distance?distance? RMSDRMSD

Problem DefinitionProblem Definition

The pairwise case:The pairwise case: given two proteins, represented by a set given two proteins, represented by a set

of points in 3D spaceof points in 3D space each point is associated with an atom’s each point is associated with an atom’s

positionposition find the largest set that is congruent to find the largest set that is congruent to

two subsets of points from each proteintwo subsets of points from each protein In computational geometry - In computational geometry - largest largest

common point set common point set (LCP) problem(LCP) problem


The multiple case:The multiple case: given a collection of given a collection of m m point sets, point sets, find the largest set of points, of which find the largest set of points, of which

an an εε-congruent copy appears in each -congruent copy appears in each of the input setsof the input sets

Unfortunately, it’s NP-hard.....Unfortunately, it’s NP-hard..... We want not only the largest set of We want not only the largest set of

points, but also smaller common points, but also smaller common substructuressubstructures


The multiple subset case:The multiple subset case: find solutions where only a subset of the find solutions where only a subset of the

input proteins is well alignedinput proteins is well aligned this complicates the problem ! (why?)this complicates the problem ! (why?)

number of subsets is exponentialnumber of subsets is exponential trade-off between the size of the subset trade-off between the size of the subset

and the size of its core (match list)and the size of its core (match list) scoring function (core size – L, proteins scoring function (core size – L, proteins

# -k) f(l,k) = k# -k) f(l,k) = k2)(L.

The The algorithm :algorithm :

MethodMethod

Input :Input : a set of a set of m m proteins proteins PP11, P, P22, . . . , P, . . . , Pmm.. For each proteinFor each protein

the sequence of the 3D coordinates of atomsthe sequence of the 3D coordinates of atoms assignment of SSE types to each residueassignment of SSE types to each residue

Output :Output : The multiple alignments with the largest The multiple alignments with the largest

cores, according to the scoring function.cores, according to the scoring function.

General strategyGeneral strategy

We want multiple alignments with at We want multiple alignments with at least two SSEsleast two SSEs

Bases – ordered pairs of SSEs whose Bases – ordered pairs of SSEs whose εε--congruent copies appear in several congruent copies appear in several proteinsproteins

We look for a set of We look for a set of εε-congruent bases -congruent bases {{bb11, b, b22, . . . , b, . . . , bkk }, from proteins }, from proteins PPii11, , PPii22, . . . , P, . . . , Pikik respectively.respectively.

First base (First base (bb11) is our ) is our pivotpivot

General strategy – contGeneral strategy – cont..

Compute all the k − 1 rigid Compute all the k − 1 rigid transformations between this base and transformations between this base and the othersthe others

Result - (TResult - (T1212, T, T1313, . . . , T, . . . , T1k1k) defines ) defines multiple alignment between Pmultiple alignment between Pi1i1, P, Pi2i2, . , P, . , Pikik

The core may contain more then one baseThe core may contain more then one base we will get several alignments with almost we will get several alignments with almost

the same transformationsthe same transformations (one alignment per base in the core)(one alignment per base in the core)

General strategy – contGeneral strategy – cont..

Cluster the initial multiple base Cluster the initial multiple base alignmentsalignments

Merge the alignment. the core of the Merge the alignment. the core of the new alignment is the union of the cores new alignment is the union of the cores of the original alignments.of the original alignments. We get smaller set of multiple alignmentsWe get smaller set of multiple alignments

Extend the clustered alignmentsExtend the clustered alignments Find additional matching residuesFind additional matching residues

Give a score to each alignmentGive a score to each alignment Report the highest scoring alignmentsReport the highest scoring alignments

Algorithm outline

Algorithm outline - stage 1Algorithm outline - stage 1

Representation of Representation of secondary secondary structure elements:structure elements: Axis representation Axis representation

for SSEsfor SSEs The The least squares least squares

line line from all the from all the Cα Cα atomsatoms

Direction & length Direction & length determined by determined by protein structureprotein structure

Algorithm outline – stage 2Algorithm outline – stage 2

Detection of multiple base Detection of multiple base alignments:alignments: Use Use Geometric Hashing to detect bases Geometric Hashing to detect bases

whose whose εε-congruent copies appear in -congruent copies appear in several proteinsseveral proteins

Each base has Each base has fingerprintfingerprint invariant to a 3D rigid transformationinvariant to a 3D rigid transformation the types of the two SSEsthe types of the two SSEs the angle between their axial vectorsthe angle between their axial vectors the midpoint-to-midpoint distancethe midpoint-to-midpoint distance their line distancetheir line distance

Base fingerprintBase fingerprint


Almost-congruent bases have similar Almost-congruent bases have similar fingerprintsfingerprints the types of their SSEs are the samethe types of their SSEs are the same the difference between their midpoint-the difference between their midpoint-

to-midpoint and line distances is up to to-midpoint and line distances is up to 1.5 Å1.5 Å

difference between their angles is up to difference between their angles is up to 0.3 radians0.3 radians

reside close to each other in the gridreside close to each other in the grid


For each grid bin, extract all the For each grid bin, extract all the bases of the bin and of adjacent binsbases of the bin and of adjacent bins

Group them together in the same Group them together in the same base bucketbase bucket

Base bucket - stores bases in Base bucket - stores bases in columns according to the protein columns according to the protein they belong tothey belong to Bases derived from the same protein are Bases derived from the same protein are

stored in the same columnstored in the same column

Base bucketBase bucket

Almost-congruent bases are stored in the same base bucket

Stage 2 contStage 2 cont..

A collection of almost-congruent bases, A collection of almost-congruent bases, each belonging to a different column each belonging to a different column induces a local multiple alignment induces a local multiple alignment between the respective proteinsbetween the respective proteins core consists of at least two SSEscore consists of at least two SSEs

One basis is selected as a pivotOne basis is selected as a pivot rest of the bases are superimposed on itrest of the bases are superimposed on it Selection of the pivot may influence the Selection of the pivot may influence the

alignmentalignment Optional – try each base as pivotOptional – try each base as pivot

Stage 2 contStage 2 cont..

Multiple alignment is defined by an Multiple alignment is defined by an underlying set of pairwise alignmentsunderlying set of pairwise alignments

For each base bucket we compute all For each base bucket we compute all the alignments between two bases the alignments between two bases taken from two different columnstaken from two different columns find the transformation between two find the transformation between two

bases that aligns the maximal number bases that aligns the maximal number of atoms with minimal RMSDof atoms with minimal RMSD

Cα atomic level Cα atomic level

Stage 3 - ClusteringStage 3 - Clustering

For pair of proteins that share more then For pair of proteins that share more then one baseone base We get more alignments with almost the same We get more alignments with almost the same

transformation, but a different local SSE coretransformation, but a different local SSE core Cluster all the local base alignments to find the Cluster all the local base alignments to find the

ones with similar transformationsones with similar transformations merge them into a new global alignmentmerge them into a new global alignment The match list (core) of the new global alignment The match list (core) of the new global alignment

union of the original local match listsunion of the original local match lists its transformation is the one that aligns the SSEs with its transformation is the one that aligns the SSEs with

minimal RMSDminimal RMSD

Stage 4 - Global extensionStage 4 - Global extension

Now the core of each pairwise alignment is a Now the core of each pairwise alignment is a set of SSEsset of SSEs

Then we extend these alignments by finding Then we extend these alignments by finding additional matching residuesadditional matching residues

The The residues not necessarily belong to SSEsresidues not necessarily belong to SSEs We want to extend the cores of these We want to extend the cores of these

alignments by detecting corresponding alignments by detecting corresponding Cα Cα atomsatoms

We want to transform the second protein, so We want to transform the second protein, so that it is fully superimposed onto the pivot that it is fully superimposed onto the pivot proteinprotein

Stage 4 - Global extensionStage 4 - Global extension

Detect in linear time close pairs of C Detect in linear time close pairs of C atoms, one atom from each protein atoms, one atom from each protein

These atom pairs are added to the These atom pairs are added to the alignment’s match listalignment’s match list

transformation of the alignment is transformation of the alignment is refined by employing the Least-refined by employing the Least-Squares Fitting methodSquares Fitting method

Stage 5 – Filtering & ScoringStage 5 – Filtering & Scoring Computing the best global multiple Computing the best global multiple

alignmentsalignments What are the best global multiple What are the best global multiple

alignments? alignments? Number of aligned molecules Vs. core sizeNumber of aligned molecules Vs. core size core size Vs. size of the smallest moleculecore size Vs. size of the smallest molecule

number of possible multiple alignments number of possible multiple alignments defined by the base buckets is defined by the base buckets is exponentialexponential

We do not compute all of themWe do not compute all of them

Stage 5 – Filtering & ScoringStage 5 – Filtering & Scoring

Heuristic solution:Heuristic solution: For each BB compute the set of best multiple For each BB compute the set of best multiple

alignments recursively over the colomnsalignments recursively over the colomns For a set of multiple base alignments, For a set of multiple base alignments,

obtained by last stage obtained by last stage (b(b11, . . . , b, . . . , bkk)) Check if there is a base, Check if there is a base, bbkk+1+1, from the , from the

current column that improve the alignment’s current column that improve the alignment’s scorescore

Core(bCore(b11, . . . , b, . . . , bkk+1+1) ) = =

Core(bCore(b11, . . . , b, . . . , bkk))∩∩Core(bCore(b11, b, bkk+1+1))

Stage 5 – Filtering & ScoringStage 5 – Filtering & Scoring

Our scoring function Our scoring function Core size – LCore size – L Proteins number - kProteins number - k f(l,k) = k f(l,k) = k Report the highest scoring Report the highest scoring

alignmentsalignments Finish !Finish !

. ( )2L

ComplexityComplexity

Worst case complexity: (i) m is the number of proteins (ii) k is the number of residues in an SSE (iii) s and n are the number of SSEs and

the number of residues found in each protein respectively.

n ~ 300, k ~ 10, s ~ 15 The number of bases for each protein is

O(s 2)


For each pair of proteins we construct, cluster and extend O(s 4) pairwise alignments.

This results in O(m 2(s4k3 +s8 log s +s4n)) time where O(m2) is the number of ways of pairing two proteins

In practice, the complexity is much smaller we only construct the pairwise alignments

defined by the BBs and the clustering reduces their number even more


The number of evaluated multiple alignments is linear in the number of bases Each base can be a pivot for only one multiple

alignment We have O(ms2) bases It takes O(ms2n) time to construct a single

multiple alignment and O(m2s4n) time to construct all of them

Running time for intire algorithm is bounded by O(m2s4(k3 + s2 log s + n)), but experiments show that the actual running time is significantly lower

Algorithm outline (reminder)

Results and Results and DiscussionDiscussion

Experiment 1Experiment 1

Example 1 - Detection of subset Example 1 - Detection of subset alignments and their use for structural alignments and their use for structural classificationclassification

We have used MASS to align a set of We have used MASS to align a set of 12 structures from two families:12 structures from two families: Cofilin-like (CL)Cofilin-like (CL) Gelsolin-like (GL)Gelsolin-like (GL)

The two families are related The two families are related structurally but not sequentiallystructurally but not sequentially


The 12-molecule ensemble contains:The 12-molecule ensemble contains: four CL structuresfour CL structures eight GLeight GL

The running time of MASS on this The running time of MASS on this ensemble was 36 sec.ensemble was 36 sec. (Pentium 4 1800 MHz processor)(Pentium 4 1800 MHz processor)

Experiment 1: core Vs. # MoleculesExperiment 1: core Vs. # Molecules

Experiment 1: ResultsExperiment 1: Results

)A (The structural alignment of all 12 proteins of the ensemble. (B) A subset alignment between only the eight GL proteins.

Experiment 1: ResultsExperiment 1: Results

)C (A subset alignment between only the four CL structures. (D) A subset alignment between only three out of the four CL structures.

Results DiscussionResults Discussion

As expected, the maximal core size As expected, the maximal core size decreases as the number of aligned decreases as the number of aligned molecules increasesmolecules increases

The dependence is not linear:The dependence is not linear: Large decrease between three to four Large decrease between three to four

moleculesmolecules Between four to five moleculesBetween four to five molecules Between eight to nine moleculesBetween eight to nine molecules


Non-topological motif detectionNon-topological motif detection The ensembles share a common SSE The ensembles share a common SSE

motif, but different topology.motif, but different topology. In topological motifs, the order and In topological motifs, the order and

the direction of the corresponding the direction of the corresponding SSEs along the polypeptide chain are SSEs along the polypeptide chain are conserved while in non-topological conserved while in non-topological they are not.they are not.


Helix bundle ensembleHelix bundle ensemble: The ten proteins : The ten proteins in this ensemble belong to four different in this ensemble belong to four different folds and six different superfamiliesfolds and six different superfamilies

Running time: 48 seconds.Running time: 48 seconds. Also aligned by MUSTAAlso aligned by MUSTA

MASS detected two additional conserved MASS detected two additional conserved αα--helices.(why ?)helices.(why ?)

MASS is secondary structure orientedMASS is secondary structure oriented Directed to find solutions that contain more Directed to find solutions that contain more

SSEsSSEs

Common core is shown by assigning a different color to each conserved helix.

The schematic TOPS representationThe schematic TOPS representation

Triangles represent strands and circles helices. Corresponding secondary structure regions are drawn in the same color. As one can see the solution is non-topological.

Large-scale structural Large-scale structural alignmentsalignments

MASS can be applied on the order of tens MASS can be applied on the order of tens of proteins in practical running times on a of proteins in practical running times on a standard PCstandard PC

Three SCOP ensembles:Three SCOP ensembles: (i) Serine proteases: all structures from the (i) Serine proteases: all structures from the

‘Prokaryotic trypsin-like serine protease’ SCOP ‘Prokaryotic trypsin-like serine protease’ SCOP family (68 molecules)family (68 molecules)

(ii) PK beta barrel: all structures from the (ii) PK beta barrel: all structures from the ‘Pyruvate kinase beta-barrel domain’ SCOP ‘Pyruvate kinase beta-barrel domain’ SCOP family (66 molecules);family (66 molecules);

(iii) Unrelated proteins (80 proteins)(iii) Unrelated proteins (80 proteins)

Large-scale structural alignmentsLarge-scale structural alignments


The results show that the running The results show that the running time is influenced by:time is influenced by: (i) the number of molecules(i) the number of molecules (ii) the average molecular size (and the (ii) the average molecular size (and the

average number of SSEs in a molecule)average number of SSEs in a molecule) (iii) the structural variance among the (iii) the structural variance among the

moleculesmolecules How do you think they influence the How do you think they influence the

running time???running time???


The first two parameters are expected –The first two parameters are expected –they increase the running time as they they increase the running time as they growgrow

The more structurally variable is the The more structurally variable is the ensemble, the shorter the running time ensemble, the shorter the running time is ! (?) is ! (?)

Why?Why? the more structurally homogeneous is the the more structurally homogeneous is the

input, more SSE bases are stored in the input, more SSE bases are stored in the same grid bin.same grid bin.

Summary & Conclusions - MASSSummary & Conclusions - MASS

Pairwise comparisons are not enoughPairwise comparisons are not enough Novel method for aligning multiple protein Novel method for aligning multiple protein

structures and detecting their shared corestructures and detecting their shared core Capable of detecting also cores common Capable of detecting also cores common

only to subsets of the input (proteins)only to subsets of the input (proteins) exploits a secondary structure exploits a secondary structure

representation of proteins.representation of proteins. Many noisy solutions are filtered outMany noisy solutions are filtered out

Highly efficient capable of aligning tens of Highly efficient capable of aligning tens of protein moleculesprotein molecules

Summary & Conclusions - MASSSummary & Conclusions - MASS

Disregards the sequence order of SSEs Disregards the sequence order of SSEs along the polypeptide chainalong the polypeptide chain Can find non-sequential and non-topological Can find non-sequential and non-topological

structural motifs.structural motifs. Advantage over dynamic-programming based Advantage over dynamic-programming based

methodsmethods MASS program can be run in two modes:MASS program can be run in two modes:

(i) using SSE information only (reduced running (i) using SSE information only (reduced running time) (when will we use this option?)time) (when will we use this option?)

(ii) using both SSE and atomic information(ii) using both SSE and atomic information

Questions?Questions?

Lecture was based on article:Lecture was based on article:MASS: multiple structural alignment

bysecondary structuresBy O. Dror , H. Benyamini, R. Nussinov,

and H. Wolfson(School of Computer Science, Tel Aviv

University, 2003 )

Documents

Seminar in structural bioinformatics Multiple structural alignment of proteins By Elad Kaspani