Randomized Algorithms for Three Dimensional Protein Structures Comparison

Yaw-Ling Lin

Dept Computer Sci and Info Engineering,

Providence University, Taiwan

E-mail: yllin@pu.edu.tw

WWW: http://www.cs.pu.edu.tw/~yawlin

Outline

• Introduction

• Protein Structures

• 3D structure comparisons

• Algorithms

• Benchmarking

• Comparing with other systems

• Future Works

Introduction

What are proteins?

• Structural framework (keratin, collagen)

• Transport and storage of small molecules (hemoglobin)

• Transmit information (hormones, receptors)

• Antibodies

• Blood clotting factors

• Enzymes

The protein is created in the cell as a unique sequence of amino acids.

Sequence (ACMVLLCEVEKYP…) → folding → Structure → Function?

The function of 40-50% of the new proteins is unknown.

About protein sequences are known today (non-redundant database).

This number keeps rapidly growing (large scale sequencing projects).

Background and Problem definition

Understanding biological function is important for:

• Study of fundamental biological processes

• Drug design

• Genetic engineering

What bioinformatics can do for us?

Drug Discovery

• Target Identification

– Which protein to inhibit?

• Lead discovery & optimization

– What sort of molecule will bind to this protein?

• Toxicology

– Side effects, target specificity

• Pharmacokinetics

– Metabolization and transport

Drug Development Life Cycle

Timeline axis: 0 to 16 years

• Discovery (2 to 10 years)

• Preclinical Testing (lab and animal testing)

• Phase I (20-30 healthy volunteers, used to check for safety and dosage)

• Phase II (100-300 patient volunteers, used to check for efficacy and side effects)

• Phase III (1000-5000 patient volunteers, used to monitor reactions to long-term drug use)

• FDA Review & Approval

• Post-Marketing Testing

$600-700 Million!

7 – 15 Years!

With the aid of bioinformatics

• Drug lead screening: 5,000 to 10,000 compounds screened

• 250 lead candidates in preclinical testing

• 5 drug candidates enter clinical testing; 80% pass Phase I

• 30% pass Phase II

• 80% pass Phase III

• One drug approved by the FDA

Drug Lead Screening & Docking

Complementarity:

• Shape

• Chemical

• Electrostatic

Protein Structures

Levels of structure in proteins

Myoglobin structure

Myoglobin structure contd.

Myoglobin in solution

Three dimensional structures of cytochrome c, lysozyme and ribonuclease

PDB file format

Protein Structures

Rasmol-Structure

PDB: 101M

PDB: 2DHB

Rasmol-Group

PDB: 101M

PDB: 2DHB

Structural classifications

• SCOP http://scop.mrc-lmb.cam.ac.uk/scop/

• CATH http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html

• FSSP http://www.ebi.ac.uk/dali/fssp/fssp.html

Structure comparison algorithms

• Dali

• CE

• Structal

• VAST

Contact matrix and the Dali method

Contact matrix: an $n \times n$ matrix, where $n$ = number of residues

$$d(i, j) = \mathrm{distance}(c_i, c_j)$$

where $c_i$ and $c_j$ are residues #i and #j.

Idea: Similar structures have similar contact matrices
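The idea can be illustrated with a minimal sketch, assuming each residue is represented by a single 3D point. The helper names `distance_matrix` and `matrix_difference` and the toy coordinates are hypothetical and are not part of Dali itself; the sketch only shows that a rigidly moved copy of a structure keeps the same distance matrix.

```python
import math

def distance_matrix(coords):
    """Return the n x n matrix of pairwise Euclidean distances."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def matrix_difference(d1, d2):
    """Mean absolute difference between two equally sized distance matrices."""
    n = len(d1)
    total = sum(abs(d1[i][j] - d2[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)

# Three hypothetical residue positions and a rigidly translated copy.
a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
b = [(x + 10.0, y - 2.0, z + 1.0) for x, y, z in a]

da = distance_matrix(a)
db = distance_matrix(b)
# The two matrices agree although the coordinates differ.
print(matrix_difference(da, db))
```

Because the matrix depends only on internal distances, no superposition is needed before comparing two structures this way.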

From distance map to structural similarities

• Imagine a transparent distance map of one protein put on top of the map of another protein (Liisa Holm and Chris Sander, J. Mol. Biol. 233):

– Matching patches centered on the diagonal correspond to matching secondary structures.

– Matches of short distances off the diagonal correspond to tertiary conformations.

– Similarity score

Unmatched residues do not contribute to the score.


DALI algorithm outline

• Step 1: Consider all possible pairs of 6x6 submatrices of the contact matrices. Such matrices are small enough that the problem can be solved optimally.

• Step 2: Assemble the alignments from step 1. Method: Monte Carlo algorithm.

CE (Shindyalov & Bourne, Protein Eng. 1998)

Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path

• Define an alignment fragment pair (AFP) as a continuous segment of protein A aligned against a continuous segment of protein B (without gaps).

• An alignment is a path of AFPs such that for every two consecutive AFPs there may be gaps inserted into either A or B, but not into both. That is, for every two consecutive AFPs $i$ and $i+1$:

$$p^A_{i+1} = p^A_i + m \ \text{ and } \ p^B_{i+1} = p^B_i + m$$

or

$$p^A_{i+1} > p^A_i + m \ \text{ and } \ p^B_{i+1} = p^B_i + m$$

or

$$p^A_{i+1} = p^A_i + m \ \text{ and } \ p^B_{i+1} > p^B_i + m$$

where $p^A_i$ is the starting position of AFP $i$ in protein A, $p^B_i$ is its starting position in protein B, and $m$ is the AFP length.

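CE: What is a “good” AFP?

Define the distance between two different AFPs $i$ and $j$ as:

$$D_{ij} = \frac{1}{m} \sum_{k=1}^{m} \left| d^A(p^A_i + k - 1,\ p^A_j + m - k) - d^B(p^B_i + k - 1,\ p^B_j + m - k) \right|$$

where $d^A(p, q)$ represents the distance between the alpha carbon atoms at positions $p$ and $q$ in protein A.

If you already have $n-1$ AFPs and consider adding the $n$-th AFP, do so only if:

$$(1)\quad D_{nn} < D_0$$

$$(2)\quad \frac{1}{n-1} \sum_{i=0}^{n-1} D_{in} < D_1$$

$$(3)\quad \frac{1}{n^2} \sum_{i=0}^{n} \sum_{j=0}^{n} D_{ij} < D_1$$

where $D_0$ and $D_1$ are distance thresholds.

[Figure: AFPs i and j in Protein A and Protein B]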
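The AFP distance can be sketched as follows, assuming each protein is given as a list of alpha carbon coordinates and that starting positions are 0-based list indices (the slide's positions may be 1-based). The function name `afp_distance` and the toy chains are hypothetical and only illustrate the formula.

```python
import math

def afp_distance(coords_a, coords_b, pa_i, pb_i, pa_j, pb_j, m):
    """Mean absolute difference of m inter-residue distances between AFP i and AFP j.

    For k = 1..m, residue (p_i + k - 1) of AFP i is paired with residue
    (p_j + m - k) of AFP j, in protein A and in protein B respectively.
    """
    total = 0.0
    for k in range(1, m + 1):
        d_a = math.dist(coords_a[pa_i + k - 1], coords_a[pa_j + m - k])
        d_b = math.dist(coords_b[pb_i + k - 1], coords_b[pb_j + m - k])
        total += abs(d_a - d_b)
    return total / m

# Hypothetical straight chains: a, a translated copy b, and a stretched copy c.
a = [(3.8 * i, 0.0, 0.0) for i in range(8)]
b = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in a]
c = [(4.8 * i, 0.0, 0.0) for i in range(8)]

same = afp_distance(a, b, 0, 0, 4, 4, 2)       # identical geometry
stretched = afp_distance(a, c, 0, 0, 4, 4, 2)  # geometry differs
print(same, stretched)
```

A small distance means the two AFPs sit in the same relative arrangement in both proteins, which is what the acceptance conditions test against the thresholds.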

CE (cont.)

1. Select an initial AFP.

2. Build an alignment path by incrementally adding “good” AFPs that satisfy the conditions of paths.

3. Repeat step (2) until the proteins are completely matched, or until no good AFPs remain.

4. To assess the significance of the alignment, compare it to the alignments of random pairs of structures, and compute the Z-score based on the RMSD and number of gaps in the final alignment.

[Figure: Protein A and Protein B]

Structal (Levitt & Gerstein, PNAS 1998)

An initial equivalence is chosen, based on matching the ends of the two structures.

Repeat until convergence:

• Superimpose the two structures so as to minimize the RMS, given the equivalence

• Given the superposition, calculate the distances dij between any atom i in the first protein and any atom j in the second protein

• Transform distances into similarities $s_{ij} = M / [1 + (d_{ij}/d_0)^2]$, where $M = 20$ and $d_0 = 2.24$ Å

• Apply dynamic programming to define a new set of equivalences
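The last two steps of the loop can be sketched as below. The similarity transform uses the constants stated above; the dynamic programming is a plain global alignment with a linear gap penalty, and the value `gap=10.0` is an assumption made for this illustration, as are the function names. The superposition step is not shown.

```python
def similarity(d, M=20.0, d0=2.24):
    """Turn a distance d (in angstroms) into a similarity score."""
    return M / (1.0 + (d / d0) ** 2)

def align(sim, gap=10.0):
    """Global alignment over a similarity matrix; returns (score, matched pairs)."""
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sim[i - 1][j - 1],
                              score[i - 1][j] - gap,
                              score[i][j - 1] - gap)
    # Trace back to recover the new set of equivalences.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs

# Hypothetical distances between atoms of two already superimposed 3-atom chains.
dist = [[3.8 * abs(i - j) for j in range(3)] for i in range(3)]
sim = [[similarity(d) for d in row] for row in dist]
total, equivalences = align(sim)
print(total, equivalences)
```

In the full method, the returned equivalences would feed the next superposition, and the loop repeats until they stop changing.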

Structal (cont)

1) Alignment fixed

2) Superimpose to minimize RMS

3) Calculate distances between all atoms

4) Use dynamic prog. to find the best set of equivalences

5) Superimpose given the new alignment

6) Recalculate distances between all atoms

Approach based on comparing secondary structure arrangement

Motivation:

• Folds are often defined as arrangements of secondary structure elements (SSEs).

• Why not compare the arrangement of SSEs rather than going down to the atomic level?

1EJ9: Human topoisomerase

VAST - graph theoretical approach

• http://www2.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

• Perform the comparison on the level of secondary structures and not residues.

• Treat each secondary structure as a vector whose direction and length correspond to the direction and length of the secondary structure. Attributes of such a vector include the type of secondary structure, number of residues, etc.

• For each pair of secondary structures, provide a way of describing their relative spatial position: distance, angle, etc.

• VAST finds a maximal subset of secondary structures that are in the same relative positions in the compared protein structures and in the same order within the structure.
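A minimal sketch of the vector representation, assuming each secondary structure element is described by hypothetical start and end points in 3D. The helper names and the choice of midpoint distance as the distance measure are assumptions for illustration and are not taken from VAST.

```python
import math

def sse_vector(start, end):
    """Vector pointing from the start to the end of a secondary structure element."""
    return tuple(e - s for s, e in zip(start, end))

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def midpoint_distance(sse1, sse2):
    """Distance between the midpoints of two elements given as (start, end)."""
    mid1 = tuple((s + e) / 2.0 for s, e in zip(*sse1))
    mid2 = tuple((s + e) / 2.0 for s, e in zip(*sse2))
    return math.dist(mid1, mid2)

# Hypothetical helix along x and strand along z.
helix = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
strand = ((0.0, 5.0, 0.0), (0.0, 5.0, 8.0))

angle = angle_deg(sse_vector(*helix), sse_vector(*strand))
separation = midpoint_distance(helix, strand)
print(angle, separation)
```

Pairs of elements from two proteins can then be called compatible when their angle and distance agree within some tolerance, which is the kind of relation a subset search needs.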

SCOP

Structural classification of proteins with 5 level hierarchy:

• Domains: the individual entries

• Family: homologous proteins with significant sequence similarity

• Superfamily: protein families that share weak sequence similarity but with conserved functional residues (e.g. in active sites); believed to be evolutionarily related

• Fold: protein superfamilies that share the same fold (not necessarily due to common evolutionary ancestry)

• Class: all-alpha, all-beta, alpha/beta, alpha+beta, membrane proteins, small proteins

The classification is based on manual analysis by experts (Dr. Alexey Murzin).

As of May 2002: 7 main classes, 686 folds, 1073 superfamilies, 1827 families.

CATH

Structural classification of proteins with 5 level hierarchy:

• Protein chains: the individual entries

• Homologous superfamily: proteins with highly similar structures and functions.

• Topology: clusters according to the topological connections and numbers of secondary structures.

• Architecture: describes the gross orientation of secondary structures, independent of connectivities (assigned manually).

• Class: derived from secondary structure content; assigned automatically for more than 90% of protein structures.

The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

As of Jan 2002: 8 main classes, 46 architectures, 1453 topologies, more than 2000 superfamilies.

FSSP

Structural classification of proteins into a tree hierarchy:

Protein domains: the individual entries (defined using the algorithm of Holm and Sander 1994)

Start with all-vs-all structure comparison of protein domains

Domains are clustered automatically into clusters using the single linkage algorithm based on the z-scores of the structure similarity scores

3242 families of more than 30,000 structures as of June 2002

Algorithms

• Measurement: rmsd.

• Pair atoms of two structures by minimum bipartite matching.

• Fix one structure, and keep several 3-D orientations of the other.

• Randomly perturb these orientations, and shift to better positions until converging.

• Report the best rmsd score and orientation.
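The measurement and pairing steps can be sketched as below. This is a stand-in under stated assumptions: the matching is found by brute force over all permutations (usable only for tiny inputs, and not the LEDA routine the system uses), both structures have the same number of atoms, and the matching cost is the sum of squared distances so that the chosen pairing minimizes rmsd. Function names and coordinates are hypothetical.

```python
import itertools
import math

def rmsd(a, b, pairing):
    """Root mean square deviation over paired atoms (i in a, j in b)."""
    squared = sum(math.dist(a[i], b[j]) ** 2 for i, j in pairing)
    return math.sqrt(squared / len(pairing))

def best_matching(a, b):
    """Brute-force minimum-cost perfect matching between two equal-size point sets."""
    best_cost, best_pairs = None, None
    for perm in itertools.permutations(range(len(b))):
        pairs = list(enumerate(perm))
        cost = sum(math.dist(a[i], b[j]) ** 2 for i, j in pairs)
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, pairs
    return best_pairs

# Hypothetical structure and a reordered copy shifted by 0.1 along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
b = [(0.0, 1.0, 0.1), (0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]

pairing = best_matching(a, b)
score = rmsd(a, b, pairing)
print(pairing, score)
```

In the randomized search, this score would be recomputed for each candidate orientation of the second structure, and only improving orientations would be kept.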

INIT-S(N)

[Figure panels: N=4, N=6, N=8, N=12, N=20]

INIT-S(N)

MB-Align Algorithm

MB-Align Descriptions

3D Transformation

• 3D rotation is done around a rotation axis

• Fundamental rotations: about the x, y, or z axes

• Positive rotation: counter-clockwise rotation (when you look down the negative axis)

3D Transformation

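• Rotation about Z

$$x' = x\cos\theta - y\sin\theta$$

$$y' = x\sin\theta + y\cos\theta$$

$$z' = z$$

$$R_z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

• OpenGL - glRotatef(θ, 0, 0, 1)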

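Rotation about Y (z → x, x → y, y → z)

$$z' = z\cos\theta - x\sin\theta$$

$$x' = z\sin\theta + x\cos\theta$$

$$y' = y$$

$$R_y(\theta) = \begin{pmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$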

3D Transformation

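Rotation about X (y → x, z → y, x → z)

$$y' = y\cos\theta - z\sin\theta$$

$$z' = y\sin\theta + z\cos\theta$$

$$x' = x$$

$$R_x(\theta) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$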
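The three fundamental rotations can be checked numerically with a short sketch. It uses 3x3 matrices (the homogeneous row and column of the 4x4 form are dropped because no translation is involved); the function names are hypothetical.

```python
import math

def rot_z(theta):
    """Rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(theta):
    """Rotation about the y axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_x(theta):
    """Rotation about the x axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def apply(matrix, point):
    """Multiply a 3x3 matrix by a 3D point."""
    return tuple(sum(matrix[r][k] * point[k] for k in range(3)) for r in range(3))

quarter = math.pi / 2.0
x_to_y = apply(rot_z(quarter), (1.0, 0.0, 0.0))  # x axis turns toward y
z_to_x = apply(rot_y(quarter), (0.0, 0.0, 1.0))  # z axis turns toward x
y_to_z = apply(rot_x(quarter), (0.0, 1.0, 0.0))  # y axis turns toward z
print(x_to_y, z_to_x, y_to_z)
```

Each quarter turn carries one axis onto the next in the cyclic order x → y → z → x, which matches the substitution notes on the slides.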

3D Transformation

• Arbitrary rotation axis (rx, ry, rz)

• glRotatef(angle, rx, ry, rz)

So, which way is a positive rotation?
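One way to answer this numerically is to build the axis-angle rotation matrix and apply it to a test point. The sketch below uses the standard axis-angle (Rodrigues) form with the angle in degrees, as glRotatef takes it; the function names are hypothetical and no OpenGL call is made.

```python
import math

def axis_rotation(angle_deg, rx, ry, rz):
    """3x3 rotation matrix for a right-handed turn of angle_deg about (rx, ry, rz)."""
    norm = math.sqrt(rx * rx + ry * ry + rz * rz)
    x, y, z = rx / norm, ry / norm, rz / norm
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]

def apply(matrix, point):
    """Multiply a 3x3 matrix by a 3D point."""
    return tuple(sum(matrix[r][i] * point[i] for i in range(3)) for r in range(3))

# A positive quarter turn about z carries the x axis onto the y axis.
quarter_turn = apply(axis_rotation(90.0, 0, 0, 1), (1.0, 0.0, 0.0))
# A positive 120 degree turn about (1, 1, 1) cycles x -> y -> z.
third_turn = apply(axis_rotation(120.0, 1, 1, 1), (1.0, 0.0, 0.0))
print(quarter_turn, third_turn)
```

Both results follow the right-hand rule: with the thumb along the axis, the fingers curl in the direction of positive rotation.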

[Figure: rotation axis (rx, ry, rz)]

3D Transformation

Rotation

Rotation Matrix

The orientation vector is perturbed to its neighborhood.

Perturbation

r, the normal vector.
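A perturbation of this kind might be sketched as follows. This is only an illustration under assumptions: the neighborhood is taken to be all unit vectors within a maximum angle of r, and the names `perturb` and `max_angle_deg` are hypothetical; the slides do not specify the sampling scheme used.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return tuple(a / norm for a in v)

def perturb(r, max_angle_deg, rng):
    """Return a unit vector at most max_angle_deg away from the direction r."""
    r = normalize(r)
    while True:
        # Random direction perpendicular to r (retry in the degenerate case).
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        along = sum(a * b for a, b in zip(g, r))
        u = [a - along * b for a, b in zip(g, r)]
        if math.sqrt(sum(a * a for a in u)) > 1e-9:
            break
    u = normalize(u)
    phi = math.radians(rng.uniform(0.0, max_angle_deg))
    # Tilt r toward u by the angle phi.
    return tuple(math.cos(phi) * a + math.sin(phi) * b for a, b in zip(r, u))

rng = random.Random(0)
samples = [perturb((0.0, 0.0, 1.0), 10.0, rng) for _ in range(100)]
print(samples[0])
```

Shrinking `max_angle_deg` as the search proceeds would narrow the neighborhood, a common choice in randomized local search, though whether the system does so is not stated here.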

Perturbation Algorithm

MB-Align Algorithm

System Implementations

• OS: Linux/Red Hat 7.2 running on a Pentium-4 2800 MHz CPU with 1 GB RAM.

• Bioperl – PDB file format conversion

• Rotation/perturbation/integration – C programs

• Minimum bipartite matching – LEDA

• Rmsd – PROFIT

Benchmarking

Benchmarks

Efficiencies of Strategies

Global dice: share common dice for each

Local dice: have a dice for each

The End.
