Pair-wise Structural Comparison using DALILite Software of DALI

Rajalekshmy Usha

Overview

History
Protein Structure Comparison
Comparison Algorithm
Input and Output Interface
Demo on the software
Analysis of the Result
References

History

Earliest resources (1970s) were sequence data
  Pioneered by Dayhoff

Structural classification databases appeared in the mid-1990s
  Structural data is sparse
  PDB (Protein Data Bank) has 39,464 structural entries to date
  NCBI (National Center for Biotechnology Information) has over 12 million entries of sequence data

Popular structural classifications of proteins:
  Structural Classification of Proteins (SCOP)
  Distance Matrix Alignment (DALI)
  CATH
  Others are DDBase, 3Dee and DaliDD (Dali Domain Database)

Protein Structure Comparison

Popularized by Liisa Holm and Chris Sander (1993)

DALI
  Created by Liisa Holm
  Completely automated
  Uses distance matrices
  The Dali server is too large and complex to be installed at external sites
  DaliLite is the standalone version of the search engine of the Dali server

Why use structural data?
  The 3D structures of proteins have been conserved over evolutionary time
  This leads to interesting evolutionary observations and to the prediction of structure and function

Comparison Algorithm

Exhaustive, all-against-all 3D structure comparison
  Helps to understand the distribution of known structures in shape space
  Uses protein structures from the PDB

Uses distance matrices
  Start from the three-dimensional coordinates of each protein's residues (i.e., the Cα atoms)
  Compute the pair-wise distances between the residue centers (a 2D representation of the 3D structure)
  Overlay the distance matrices (contact maps) of the two structures and shift them horizontally and vertically
  Overlap along the diagonal represents similar backbone conformations (secondary structure)
  Off-diagonal similarity represents tertiary structure similarity
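The distance-matrix representation described above can be sketched in a few lines. This is a minimal illustration, not DaliLite's code: the function name `distance_matrix` and the toy coordinates are assumptions made for the example, and real input would come from the Cα records of a PDB file.

```python
import math

def distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise C-alpha distances (in Å)."""
    n = len(ca_coords)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])  # Euclidean distance
            dm[i][j] = dm[j][i] = d                    # the matrix is symmetric
    return dm

# Hypothetical C-alpha coordinates for a four-residue fragment (made-up values).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
toy_dm = distance_matrix(toy_coords)

for row in toy_dm:
    print(" ".join(f"{d:5.2f}" for d in row))
```

Because the matrix depends only on distances within one structure, it does not change when the protein is rotated or translated, which is why two matrices can be compared without superimposing the structures first.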

Underlying Algorithms

Branch and Bound Search
  Finds the optimal alignment
  Uses distance matrices
  The matrices are collapsed into regions of overlap (sub-matrices) of fixed size
  The sub-matrices are stitched together if they overlap with neighboring fragments
  Uses the similarity score

Monte Carlo Optimization Algorithm
  Optimizes the alignment

Understanding the Formula Used

Similarity Score (Holm and Sander, 1993)
  $S = \sum_{i \in core} \sum_{j \in core} \left( \theta - \frac{\Delta_{ij}}{d_{ij}} \right) \omega(d_{ij})$
  core is the set of structurally equivalent residue pairs between proteins A and B
  $\Delta_{ij}$ is the deviation between the intramolecular Cα-Cα distances of (iA,jA) and (iB,jB), taken relative to their arithmetic mean $d_{ij}$
  $\theta$ is the similarity threshold, set empirically to 0.2
  $\omega$ is the envelope function, $\omega(d) = \exp(-d^2 / r^2)$, where $r = 20$ Å
  A high score means a good fit
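The score can be made concrete with a short sketch. It assumes the score is a double sum over core pairs of (θ - Δ/d) · ω(d), following Holm and Sander (1993), and that diagonal terms (i = j) contribute θ. The function names and the toy distance matrices are hypothetical.

```python
import math

THETA = 0.2   # similarity threshold from the slide
R = 20.0      # envelope radius in Å from the slide

def pair_term(d_a, d_b):
    """Contribution of one pair of intramolecular distances to the score."""
    d_mean = (d_a + d_b) / 2.0
    if d_mean == 0.0:
        return THETA  # assumption: a residue paired with itself has zero deviation
    delta = abs(d_a - d_b)
    return (THETA - delta / d_mean) * math.exp(-(d_mean ** 2) / R ** 2)

def similarity_score(dm_a, dm_b, core):
    """Sum pair_term over all pairs of equivalences in core.

    core is a list of (iA, iB) tuples: residue iA of protein A is
    structurally equivalent to residue iB of protein B.
    """
    return sum(pair_term(dm_a[ia][ja], dm_b[ib][jb])
               for ia, ib in core
               for ja, jb in core)

# Made-up distance matrices (Å) for two three-residue fragments.
dm_a = [[0.0, 3.8, 6.0], [3.8, 0.0, 3.8], [6.0, 3.8, 0.0]]
dm_b = [[0.0, 3.8, 9.0], [3.8, 0.0, 3.8], [9.0, 3.8, 0.0]]
core = [(0, 0), (1, 1), (2, 2)]

score_same = similarity_score(dm_a, dm_a, core)  # structure against itself
score_diff = similarity_score(dm_a, dm_b, core)  # one distance stretched
print(score_same, score_diff)
```

Pairs whose distances differ by more than 20% of their mean contribute negatively, and the envelope function down-weights long-range pairs, so the distorted fragment scores lower than the self-comparison.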

Branch and Bound Search

Consider only nongapped segment pairs
  This reduces the complexity of structure alignment
  Natural segmentation uses the secondary structure elements of the query structure, e.g. α helices and β strands

Diagonal lines represent the nongapped segment pairings
  Pairings are between segments of the query structure (horizontal) and the proteins being aligned to it (vertical)

Compute an alignment score (similarity score) within the segments and between the segments

Split the search space into smaller subsets of candidate pairings (matrices)

Compute an upper bound on the sum-of-pairs score for each subset

The subset with the highest bound is the most promising candidate to contain the optimal alignment and is explored first
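The bounding idea can be illustrated with a toy best-first branch and bound. This is a simplified sketch, not DaliLite's implementation. It places one segment at a time instead of splitting sets of candidate pairings, and the scoring function `toy_pair_score`, the candidate placements, and all other names are hypothetical.

```python
import heapq
from itertools import count

def branch_and_bound(n_segments, candidates, pair_score):
    """Pick one placement per segment maximizing the sum-of-pairs score.

    candidates[s] lists the allowed placements of segment s.
    pair_score(s, p, t, q) scores placement p of segment s against placement q
    of segment t; s == t gives the within-segment score.
    """
    def bound(fixed):
        m = len(fixed)
        # Exact score of the segments placed so far.
        total = sum(pair_score(s, fixed[s], t, fixed[t])
                    for s in range(m) for t in range(s, m))
        # Optimistic (upper-bound) estimate for terms with an unplaced segment.
        for t in range(m, n_segments):
            total += max(pair_score(t, q, t, q) for q in candidates[t])
            for s in range(t):
                options_s = [fixed[s]] if s < m else candidates[s]
                total += max(pair_score(s, p, t, q)
                             for p in options_s for q in candidates[t])
        return total

    tie = count()  # tie-breaker so the heap never compares placement tuples
    heap = [(-bound(()), next(tie), ())]
    while heap:
        neg_bound, _, fixed = heapq.heappop(heap)  # subset with highest bound
        if len(fixed) == n_segments:
            return -neg_bound, fixed  # the bound is exact for a full alignment
        for p in candidates[len(fixed)]:
            child = fixed + (p,)
            heapq.heappush(heap, (-bound(child), next(tie), child))
    return None

def toy_pair_score(s, p, t, q):
    """Made-up scores: consistent placements are rewarded, others penalized."""
    if s == t:
        return 1.5 if (s == 2 and p == 5) else 1.0
    return 2.0 if p == q else -1.0

def total_score(placement):
    n = len(placement)
    return sum(toy_pair_score(s, placement[s], t, placement[t])
               for s in range(n) for t in range(s, n))

toy_candidates = [[0, 5], [0, 5], [0, 5]]
best_score, best_placement = branch_and_bound(3, toy_candidates, toy_pair_score)
print(best_score, best_placement)
```

Because every bound is at least as large as the true score of any alignment inside its subset, the first complete alignment taken from the queue cannot be beaten by anything still waiting in it.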

Branch and Bound Search

Image source: Holm L., Park J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics 16, 567

Monte Carlo Optimization Algorithm

A basic move is made
  The move is random
  The probability of accepting a move is $p = e^{\beta (S' - S)}$, where $S'$ is the new score, $S$ is the old score and $\beta$ is a parameter
  A move involves the addition or deletion of residue equivalence assignments

Two basic modes of operation
  Expansion mode
    The alignment is extended incrementally using overlapping contact patterns
    Extend the alignment by including all pairs of matching contact patterns that share the same residue pair (iA, iB)
    Adding a new fragment requires tentative removal of inconsistent previous equivalence assignments
    The removal is permanent
  Trimming mode
    Removal of fragments that give a net negative contribution to the similarity score
    Done after the 1st expansion cycle and after every 5 subsequent expansion cycles
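The acceptance rule can be sketched as a Metropolis-style loop. This is a toy illustration and makes several assumptions: the probability exp(β(S' - S)) is capped at 1 for improving moves, each move toggles a single residue equivalence, and each equivalence contributes an additive, made-up amount to the score (the real Dali score is not additive per pair). The expansion and trimming modes are not modeled.

```python
import math
import random

def acceptance_probability(old_score, new_score, beta):
    """p = exp(beta * (S' - S)), capped at 1 for moves that improve the score."""
    if new_score >= old_score:
        return 1.0
    return math.exp(beta * (new_score - old_score))

def monte_carlo(contrib, beta=1.0, steps=2000, seed=0):
    """Toggle residue equivalences at random; return the best set seen.

    contrib maps a hypothetical equivalence (iA, iB) to a made-up additive
    score contribution.
    """
    rng = random.Random(seed)
    pairs = sorted(contrib)
    current, score = set(), 0.0
    best, best_score = set(), 0.0
    for _ in range(steps):
        pair = rng.choice(pairs)  # random basic move
        delta = -contrib[pair] if pair in current else contrib[pair]
        if rng.random() < acceptance_probability(score, score + delta, beta):
            current ^= {pair}     # add or delete the equivalence
            score += delta
            if score > best_score:
                best, best_score = set(current), score
    return best, best_score

# Made-up contributions: two favorable equivalences and one unfavorable.
toy_contrib = {(0, 0): 1.5, (1, 1): 2.0, (2, 7): -0.5}
best_set, best_total = monte_carlo(toy_contrib)
print(sorted(best_set), best_total)
```

A larger β makes the search greedier, since moves that lower the score are accepted less often. A smaller β lets the alignment escape local optima at the cost of slower convergence.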

The Monte Carlo Optimization

Image source: Holm L., Park J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics 16, 567

Thick black line indicates the optimum found after branch and bound algorithm

Red dashed line indicates final alignment after Monte Carlo Optimization

DaliLite Database Search Input Interface

DaliLite Database Server Output

DaliLite Database Server Output : 2

DaliLite Pair wise Comparison Input Interface

Statistical Analysis of the Result

Z-score: $Z = (X - \mu) / \sigma$
  $X$ is the raw score to be standardized
  $\sigma$ is the standard deviation
  $\mu$ is the mean
  Pairs with a Z-score < 2.0 are considered structurally dissimilar

RMSD (Root Mean Square Deviation): $\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \delta_i^2}$
  Measures the average distance between the backbones of the superimposed proteins
  $\delta_i$ is the distance between the $i$-th of $N$ pairs of equivalent Cα atoms

Sequence Identity
  Percentage of identical amino acids over all structurally equivalent residues
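The three statistics can be computed as follows. This is a minimal sketch using the standard textbook definitions, Z = (X - μ)/σ and RMSD as the root of the mean squared distance. The function names and the numbers in the example are made up, and how DaliLite estimates μ and σ is not shown here.

```python
import math

def z_score(x, mu, sigma):
    """Standardize a raw score x given the mean mu and standard deviation sigma."""
    return (x - mu) / sigma

def rmsd(distances):
    """Root mean square of the distances between N equivalent C-alpha pairs."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def sequence_identity(seq_a, seq_b, core):
    """Percentage of identical amino acids over the structurally equivalent pairs.

    core is a list of (iA, iB) index pairs into seq_a and seq_b.
    """
    same = sum(1 for ia, ib in core if seq_a[ia] == seq_b[ib])
    return 100.0 * same / len(core)

# Made-up example values.
print(z_score(30.0, 10.0, 5.0))                      # standardized score
print(rmsd([3.0, 4.0]))                              # distances in Å
print(sequence_identity("ACDE", "ACDF", [(0, 0), (1, 1), (2, 2), (3, 3)]))
```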

DaliLite Output

DaliLite Output : 2 – cont’d

DaliLite Output : 3 – cont’d

Demo on Using DaliLite

http://www.ebi.ac.uk/dali/index.html

1CDK and 1CJA
1CDK:A    1CJA:A

1CDK is a cAMP-dependent protein kinase and 1CJA is an actin-fragmin kinase

Image source : PDB.org

1CPC and 1KTP
1CPC:A    1KTP:A

1CPC and 1KTP belong to the same phycocyanin family (light harvesting protein complex); both have six helices sequentially aligned.

Image source : PDB.org

References

Holm L., Sander C. (1993) Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology 233(1): 123-138

Holm L., Park J. (2000) DaliLite workbench for protein structure comparison. Bioinformatics 16: 566-567

Holm L., Sander C. (1996) Mapping the protein universe. Science 273: 595-602

Bourne P.E., Weissig H. Structural Bioinformatics. Wiley-Liss, Hoboken, New Jersey

http://wikipedia.org
