View
220
Download
1
Tags:
Embed Size (px)
Citation preview
Seminar in Seminar in structural structural
bioinformaticsbioinformatics
Multiple structural alignment Multiple structural alignment of proteins of proteins
By Elad Kaspani
Multiple structural alignmentMultiple structural alignment
OutlineOutline
IntroductionIntroduction What is Multiple structural alignment?What is Multiple structural alignment? Why do we need Multiple structural Why do we need Multiple structural
alignment?alignment? Pairwise Vs. Multiple structural alignmentPairwise Vs. Multiple structural alignment
MASS - Multiple structural alignment MASS - Multiple structural alignment by secondary structuresby secondary structures Problem definitionProblem definition General strategyGeneral strategy Algorithm descriptionAlgorithm description
Outline ContOutline Cont..
MASS - Multiple structural alignment MASS - Multiple structural alignment by secondary structuresby secondary structures Algorithm outlineAlgorithm outline ComplexityComplexity Results DiscussionResults Discussion
Summary & ConclusionsSummary & Conclusions
IntroductionIntroduction
Proteins sharing a common Proteins sharing a common substructure may have a similar substructure may have a similar function.function.
What is Multiple structural What is Multiple structural alignment ?alignment ?
Discussion – we already have Discussion – we already have pairwise alignment, isn’t that pairwise alignment, isn’t that enough?enough?
Pairwise Vs. Multiple structural Pairwise Vs. Multiple structural alignmentalignment
We have many algorithms We have many algorithms pairwise pairwise structural alignment structural alignment tasktask
Only a few methods are available for Only a few methods are available for aligning multiple structuresaligning multiple structures
Most of them are based on series of Most of them are based on series of pairwise comparisons pairwise comparisons SSAPm (SSAPm (Taylor et al., 1994) Prism Prism (Yang and Honig, 2000b) STAMP (Russell and Barton, 1992)
What do we wantWhat do we want??
Classification of existing and newly Classification of existing and newly discovered proteinsdiscovered proteins
Gaining insights into evolutionary Gaining insights into evolutionary relations between proteinsrelations between proteins
Detecting motifs common to a group Detecting motifs common to a group of proteins that share a certain of proteins that share a certain functionfunction
Structure prediction algorithmsStructure prediction algorithms
What’s wrong with What’s wrong with methods based on series methods based on series of pairwise comparisons of pairwise comparisons ??????
Multiple structural alignmentMultiple structural alignment
These methods are limited!!!These methods are limited!!! In each pairwise comp. , the only In each pairwise comp. , the only
information is about the two moleculesinformation is about the two molecules alignments optimal for the whole set can alignments optimal for the whole set can
be disregardedbe disregarded dynamic programming dynamic programming disadvantage - disadvantage -
dependent on the sequence order of the dependent on the sequence order of the polypeptide chainpolypeptide chain
We can’t see the woods We can’t see the woods
WHAT DO WE DO THEN?????????????WHAT DO WE DO THEN?????????????
multiple structural alignment by multiple structural alignment by secondary structures secondary structures
MASSMASS
MASSMASS
Considers all the given structures at the Considers all the given structures at the same timesame time
Exploiting the secondary structure Exploiting the secondary structure representation - reduced time representation - reduced time complexitycomplexity
Does not require that all the input Does not require that all the input molecules be alignedmolecules be aligned
Capable of detecting structural motifs Capable of detecting structural motifs shared only by a subset of the moleculesshared only by a subset of the molecules
MASSMASS
Can find non-sequential and even non-Can find non-sequential and even non-topological structural motifstopological structural motifs
Suitable for a broad range of Suitable for a broad range of applicationsapplications
filter noisy resultsfilter noisy results highly efficient and robusthighly efficient and robust Other multiple-based methodsOther multiple-based methods
(Escalier (Escalier et al.et al., 1988), 1988) MUSTA (Leibowitz MUSTA (Leibowitz et al.et al., 2001), 2001) MultiProt (Shatsky MultiProt (Shatsky et al.et al., 2002), 2002)
Secondary structure elements (SSE)Secondary structure elements (SSE)
Basic termsBasic terms
rigid transformationrigid transformation Q - a subsetQ - a subset T (Q) T (Q) ==R(Q) R(Q) + + t t where where R R is a 3x3 rotation is a 3x3 rotation
matrix and matrix and t t is a translation vectoris a translation vector εε-congruent-congruent
For For εε>0, >0, find two largest subsets of the input find two largest subsets of the input sets, sets, P P and and QQ, and a rigid transformation, , and a rigid transformation, TT, so , so that that distance(P, T (Q)) < distance(P, T (Q)) < εε
How do we measure How do we measure distance?distance? RMSDRMSD
Problem DefinitionProblem Definition
The pairwise case:The pairwise case: given two proteins, represented by a set given two proteins, represented by a set
of points in 3D spaceof points in 3D space each point is associated with an atom’s each point is associated with an atom’s
positionposition find the largest set that is congruent to find the largest set that is congruent to
two subsets of points from each proteintwo subsets of points from each protein In computational geometry - In computational geometry - largest largest
common point set common point set (LCP) problem(LCP) problem
Problem DefinitionProblem Definition
The multiple case:The multiple case: given a collection of given a collection of m m point sets, point sets, find the largest set of points, of which find the largest set of points, of which
an an εε-congruent copy appears in each -congruent copy appears in each of the input setsof the input sets
Unfortunately, it’s NP-hard.....Unfortunately, it’s NP-hard..... We want not only the largest set of We want not only the largest set of
points, but also smaller common points, but also smaller common substructuressubstructures
Problem DefinitionProblem Definition
The multiple subset case:The multiple subset case: find solutions where only a subset of the find solutions where only a subset of the
input proteins is well alignedinput proteins is well aligned this complicates the problem ! (why?)this complicates the problem ! (why?)
number of subsets is exponentialnumber of subsets is exponential trade-off between the size of the subset trade-off between the size of the subset
and the size of its core (match list)and the size of its core (match list) scoring function (core size – L, proteins scoring function (core size – L, proteins
# -k) f(l,k) = k# -k) f(l,k) = k2)(L.
The The algorithm :algorithm :
MethodMethod
Input :Input : a set of a set of m m proteins proteins PP11, P, P22, . . . , P, . . . , Pmm.. For each proteinFor each protein
the sequence of the 3D coordinates of atomsthe sequence of the 3D coordinates of atoms assignment of SSE types to each residueassignment of SSE types to each residue
Output :Output : The multiple alignments with the largest The multiple alignments with the largest
cores, according to the scoring function.cores, according to the scoring function.
General strategyGeneral strategy
We want multiple alignments with at We want multiple alignments with at least two SSEsleast two SSEs
Bases – ordered pairs of SSEs whose Bases – ordered pairs of SSEs whose εε--congruent copies appear in several congruent copies appear in several proteinsproteins
We look for a set of We look for a set of εε-congruent bases -congruent bases {{bb11, b, b22, . . . , b, . . . , bkk }, from proteins }, from proteins PPii11, , PPii22, . . . , P, . . . , Pikik respectively.respectively.
First base (First base (bb11) is our ) is our pivotpivot
General strategy – contGeneral strategy – cont..
Compute all the k − 1 rigid Compute all the k − 1 rigid transformations between this base and transformations between this base and the othersthe others
Result - (TResult - (T1212, T, T1313, . . . , T, . . . , T1k1k) defines ) defines multiple alignment between Pmultiple alignment between Pi1i1, P, Pi2i2, . , P, . , Pikik
The core may contain more then one baseThe core may contain more then one base we will get several alignments with almost we will get several alignments with almost
the same transformationsthe same transformations (one alignment per base in the core)(one alignment per base in the core)
General strategy – contGeneral strategy – cont..
Cluster the initial multiple base Cluster the initial multiple base alignmentsalignments
Merge the alignment. the core of the Merge the alignment. the core of the new alignment is the union of the cores new alignment is the union of the cores of the original alignments.of the original alignments. We get smaller set of multiple alignmentsWe get smaller set of multiple alignments
Extend the clustered alignmentsExtend the clustered alignments Find additional matching residuesFind additional matching residues
Give a score to each alignmentGive a score to each alignment Report the highest scoring alignmentsReport the highest scoring alignments
Algorithm outline
Algorithm outline - stage 1Algorithm outline - stage 1
Representation of Representation of secondary secondary structure elements:structure elements: Axis representation Axis representation
for SSEsfor SSEs The The least squares least squares
line line from all the from all the Cα Cα atomsatoms
Direction & length Direction & length determined by determined by protein structureprotein structure
Algorithm outline – stage 2Algorithm outline – stage 2
Detection of multiple base Detection of multiple base alignments:alignments: Use Use Geometric Hashing to detect bases Geometric Hashing to detect bases
whose whose εε-congruent copies appear in -congruent copies appear in several proteinsseveral proteins
Each base has Each base has fingerprintfingerprint invariant to a 3D rigid transformationinvariant to a 3D rigid transformation the types of the two SSEsthe types of the two SSEs the angle between their axial vectorsthe angle between their axial vectors the midpoint-to-midpoint distancethe midpoint-to-midpoint distance their line distancetheir line distance
Base fingerprintBase fingerprint
Algorithm outline – stage 2Algorithm outline – stage 2
Almost-congruent bases have similar Almost-congruent bases have similar fingerprintsfingerprints the types of their SSEs are the samethe types of their SSEs are the same the difference between their midpoint-the difference between their midpoint-
to-midpoint and line distances is up to to-midpoint and line distances is up to 1.5 Å1.5 Å
difference between their angles is up to difference between their angles is up to 0.3 radians0.3 radians
reside close to each other in the gridreside close to each other in the grid
Algorithm outline – stage 2Algorithm outline – stage 2
For each grid bin, extract all the For each grid bin, extract all the bases of the bin and of adjacent binsbases of the bin and of adjacent bins
Group them together in the same Group them together in the same base bucketbase bucket
Base bucket - stores bases in Base bucket - stores bases in columns according to the protein columns according to the protein they belong tothey belong to Bases derived from the same protein are Bases derived from the same protein are
stored in the same columnstored in the same column
Base bucketBase bucket
Almost-congruent bases are stored in the same base bucket
Stage 2 contStage 2 cont..
A collection of almost-congruent bases, A collection of almost-congruent bases, each belonging to a different column each belonging to a different column induces a local multiple alignment induces a local multiple alignment between the respective proteinsbetween the respective proteins core consists of at least two SSEscore consists of at least two SSEs
One basis is selected as a pivotOne basis is selected as a pivot rest of the bases are superimposed on itrest of the bases are superimposed on it Selection of the pivot may influence the Selection of the pivot may influence the
alignmentalignment Optional – try each base as pivotOptional – try each base as pivot
Stage 2 contStage 2 cont..
Multiple alignment is defined by an Multiple alignment is defined by an underlying set of pairwise alignmentsunderlying set of pairwise alignments
For each base bucket we compute all For each base bucket we compute all the alignments between two bases the alignments between two bases taken from two different columnstaken from two different columns find the transformation between two find the transformation between two
bases that aligns the maximal number bases that aligns the maximal number of atoms with minimal RMSDof atoms with minimal RMSD
Cα atomic level Cα atomic level
Stage 3 - ClusteringStage 3 - Clustering
For pair of proteins that share more then For pair of proteins that share more then one baseone base We get more alignments with almost the same We get more alignments with almost the same
transformation, but a different local SSE coretransformation, but a different local SSE core Cluster all the local base alignments to find the Cluster all the local base alignments to find the
ones with similar transformationsones with similar transformations merge them into a new global alignmentmerge them into a new global alignment The match list (core) of the new global alignment The match list (core) of the new global alignment
union of the original local match listsunion of the original local match lists its transformation is the one that aligns the SSEs with its transformation is the one that aligns the SSEs with
minimal RMSDminimal RMSD
Stage 4 - Global extensionStage 4 - Global extension
Now the core of each pairwise alignment is a Now the core of each pairwise alignment is a set of SSEsset of SSEs
Then we extend these alignments by finding Then we extend these alignments by finding additional matching residuesadditional matching residues
The The residues not necessarily belong to SSEsresidues not necessarily belong to SSEs We want to extend the cores of these We want to extend the cores of these
alignments by detecting corresponding alignments by detecting corresponding Cα Cα atomsatoms
We want to transform the second protein, so We want to transform the second protein, so that it is fully superimposed onto the pivot that it is fully superimposed onto the pivot proteinprotein
Stage 4 - Global extensionStage 4 - Global extension
Detect in linear time close pairs of C Detect in linear time close pairs of C atoms, one atom from each protein atoms, one atom from each protein
These atom pairs are added to the These atom pairs are added to the alignment’s match listalignment’s match list
transformation of the alignment is transformation of the alignment is refined by employing the Least-refined by employing the Least-Squares Fitting methodSquares Fitting method
Stage 5 – Filtering & ScoringStage 5 – Filtering & Scoring Computing the best global multiple Computing the best global multiple
alignmentsalignments What are the best global multiple What are the best global multiple
alignments? alignments? Number of aligned molecules Vs. core sizeNumber of aligned molecules Vs. core size core size Vs. size of the smallest moleculecore size Vs. size of the smallest molecule
number of possible multiple alignments number of possible multiple alignments defined by the base buckets is defined by the base buckets is exponentialexponential
We do not compute all of themWe do not compute all of them
Stage 5 – Filtering & ScoringStage 5 – Filtering & Scoring
Heuristic solution:Heuristic solution: For each BB compute the set of best multiple For each BB compute the set of best multiple
alignments recursively over the colomnsalignments recursively over the colomns For a set of multiple base alignments, For a set of multiple base alignments,
obtained by last stage obtained by last stage (b(b11, . . . , b, . . . , bkk)) Check if there is a base, Check if there is a base, bbkk+1+1, from the , from the
current column that improve the alignment’s current column that improve the alignment’s scorescore
Core(bCore(b11, . . . , b, . . . , bkk+1+1) ) = =
Core(bCore(b11, . . . , b, . . . , bkk))∩∩Core(bCore(b11, b, bkk+1+1))
Stage 5 – Filtering & ScoringStage 5 – Filtering & Scoring
Our scoring function Our scoring function Core size – LCore size – L Proteins number - kProteins number - k f(l,k) = k f(l,k) = k Report the highest scoring Report the highest scoring
alignmentsalignments Finish !Finish !
. ( )2L
ComplexityComplexity
Worst case complexity: (i) m is the number of proteins (ii) k is the number of residues in an SSE (iii) s and n are the number of SSEs and
the number of residues found in each protein respectively.
n ~ 300, k ~ 10, s ~ 15 The number of bases for each protein is
O(s 2)
ComplexityComplexity
For each pair of proteins we construct, cluster and extend O(s 4) pairwise alignments.
This results in O(m 2(s4k3 +s8 log s +s4n)) time where O(m2) is the number of ways of pairing two proteins
In practice, the complexity is much smaller we only construct the pairwise alignments
defined by the BBs and the clustering reduces their number even more
ComplexityComplexity
The number of evaluated multiple alignments is linear in the number of bases Each base can be a pivot for only one multiple
alignment We have O(ms2) bases It takes O(ms2n) time to construct a single
multiple alignment and O(m2s4n) time to construct all of them
Running time for intire algorithm is bounded by O(m2s4(k3 + s2 log s + n)), but experiments show that the actual running time is significantly lower
Algorithm outline (reminder)
Results and Results and DiscussionDiscussion
Experiment 1Experiment 1
Example 1 - Detection of subset Example 1 - Detection of subset alignments and their use for structural alignments and their use for structural classificationclassification
We have used MASS to align a set of We have used MASS to align a set of 12 structures from two families:12 structures from two families: Cofilin-like (CL)Cofilin-like (CL) Gelsolin-like (GL)Gelsolin-like (GL)
The two families are related The two families are related structurally but not sequentiallystructurally but not sequentially
Experiment 1Experiment 1
The 12-molecule ensemble contains:The 12-molecule ensemble contains: four CL structuresfour CL structures eight GLeight GL
The running time of MASS on this The running time of MASS on this ensemble was 36 sec.ensemble was 36 sec. (Pentium 4 1800 MHz processor)(Pentium 4 1800 MHz processor)
Experiment 1: core Vs. # MoleculesExperiment 1: core Vs. # Molecules
Experiment 1: ResultsExperiment 1: Results
)A (The structural alignment of all 12 proteins of the ensemble. (B) A subset alignment between only the eight GL proteins.
Experiment 1: ResultsExperiment 1: Results
)C (A subset alignment between only the four CL structures. (D) A subset alignment between only three out of the four CL structures.
Results DiscussionResults Discussion
As expected, the maximal core size As expected, the maximal core size decreases as the number of aligned decreases as the number of aligned molecules increasesmolecules increases
The dependence is not linear:The dependence is not linear: Large decrease between three to four Large decrease between three to four
moleculesmolecules Between four to five moleculesBetween four to five molecules Between eight to nine moleculesBetween eight to nine molecules
Experiment 2Experiment 2
Non-topological motif detectionNon-topological motif detection The ensembles share a common SSE The ensembles share a common SSE
motif, but different topology.motif, but different topology. In topological motifs, the order and In topological motifs, the order and
the direction of the corresponding the direction of the corresponding SSEs along the polypeptide chain are SSEs along the polypeptide chain are conserved while in non-topological conserved while in non-topological they are not.they are not.
Experiment 2Experiment 2
Helix bundle ensembleHelix bundle ensemble: The ten proteins : The ten proteins in this ensemble belong to four different in this ensemble belong to four different folds and six different superfamiliesfolds and six different superfamilies
Running time: 48 seconds.Running time: 48 seconds. Also aligned by MUSTAAlso aligned by MUSTA
MASS detected two additional conserved MASS detected two additional conserved αα--helices.(why ?)helices.(why ?)
MASS is secondary structure orientedMASS is secondary structure oriented Directed to find solutions that contain more Directed to find solutions that contain more
SSEsSSEs
Common core is shown by assigning a different color to each conserved helix.
The schematic TOPS representationThe schematic TOPS representation
Triangles represent strands and circles helices. Corresponding secondary structure regions are drawn in the same color. As one can see the solution is non-topological.
Large-scale structural Large-scale structural alignmentsalignments
MASS can be applied on the order of tens MASS can be applied on the order of tens of proteins in practical running times on a of proteins in practical running times on a standard PCstandard PC
Three SCOP ensembles:Three SCOP ensembles: (i) Serine proteases: all structures from the (i) Serine proteases: all structures from the
‘Prokaryotic trypsin-like serine protease’ SCOP ‘Prokaryotic trypsin-like serine protease’ SCOP family (68 molecules)family (68 molecules)
(ii) PK beta barrel: all structures from the (ii) PK beta barrel: all structures from the ‘Pyruvate kinase beta-barrel domain’ SCOP ‘Pyruvate kinase beta-barrel domain’ SCOP family (66 molecules);family (66 molecules);
(iii) Unrelated proteins (80 proteins)(iii) Unrelated proteins (80 proteins)
Large-scale structural alignmentsLarge-scale structural alignments
Results DiscussionResults Discussion
The results show that the running The results show that the running time is influenced by:time is influenced by: (i) the number of molecules(i) the number of molecules (ii) the average molecular size (and the (ii) the average molecular size (and the
average number of SSEs in a molecule)average number of SSEs in a molecule) (iii) the structural variance among the (iii) the structural variance among the
moleculesmolecules How do you think they influence the How do you think they influence the
running time???running time???
Results DiscussionResults Discussion
The first two parameters are expected –The first two parameters are expected –they increase the running time as they they increase the running time as they growgrow
The more structurally variable is the The more structurally variable is the ensemble, the shorter the running time ensemble, the shorter the running time is ! (?) is ! (?)
Why?Why? the more structurally homogeneous is the the more structurally homogeneous is the
input, more SSE bases are stored in the input, more SSE bases are stored in the same grid bin.same grid bin.
Summary & Conclusions - MASSSummary & Conclusions - MASS
Pairwise comparisons are not enoughPairwise comparisons are not enough Novel method for aligning multiple protein Novel method for aligning multiple protein
structures and detecting their shared corestructures and detecting their shared core Capable of detecting also cores common Capable of detecting also cores common
only to subsets of the input (proteins)only to subsets of the input (proteins) exploits a secondary structure exploits a secondary structure
representation of proteins.representation of proteins. Many noisy solutions are filtered outMany noisy solutions are filtered out
Highly efficient capable of aligning tens of Highly efficient capable of aligning tens of protein moleculesprotein molecules
Summary & Conclusions - MASSSummary & Conclusions - MASS
Disregards the sequence order of SSEs Disregards the sequence order of SSEs along the polypeptide chainalong the polypeptide chain Can find non-sequential and non-topological Can find non-sequential and non-topological
structural motifs.structural motifs. Advantage over dynamic-programming based Advantage over dynamic-programming based
methodsmethods MASS program can be run in two modes:MASS program can be run in two modes:
(i) using SSE information only (reduced running (i) using SSE information only (reduced running time) (when will we use this option?)time) (when will we use this option?)
(ii) using both SSE and atomic information(ii) using both SSE and atomic information
Questions?Questions?
Lecture was based on article:Lecture was based on article:MASS: multiple structural alignment
bysecondary structuresBy O. Dror , H. Benyamini, R. Nussinov,
and H. Wolfson(School of Computer Science, Tel Aviv
University, 2003 )