Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Computational Methods for Structural Bioinforamtics
and Computational Biology (3)
(Protein geometry and topology: volume and surfaces)
Jie Liang 梁 杰
Molecular and Systems Computational Bioengineering Lab (MoSCoBL)Department of Bioengineering
University of Illinois at Chicago上海交通大学系统医学研究院
上海生物信息技术研究中心
E-mail: [email protected]/~jliang
Dragon Star Short CourseSuzhou University, June 14 – June 18, 2009
Today’s Lecture
Space filling structures of proteins: volume and surface models,
Geometric constructs and algorithms: Voronoi diagram, Delaunaytriangulation, and alpha shapeAlgorithmsA bit of topology
Application in proteins packing and function prediciton
Today’s Lecture
Space filling structures of proteins: volume and surface models,
Geometric constructs and algorithms: Voronoi diagram, Delaunaytriangulation, and alpha shapeAlgorithmsA bit of topology
Application in proteins packing and function prediciton
Different structural models of proteins
Volumetric and surface models
Backbone centric viewSecondary structure, tertiary fold, side chain packing
But ligand and substrate sees differently!
We are interested in things like binding surfaces
Volumetric and surface modelsMuch more complicated, as there could be 10,000 atoms.
GDP Binding Pockets
Ras 21 Fts Z
Functional Voids and Pockets
Space-filling Model of Protein
The shape of a protein is complexProperties determined by distribution of charge density,
Chemical bonds transfer charges from one atom to anotherIsosurface of electron density depend on locations of atoms and interactions
X-ray scattering pattern are due to these distributions.
Space Filling model: Idealized modelAtom approximated by balls, difference between bonded and nonbondedregions ignored.
interlocking sphere model, fused ball model
Amenable for modeling and fast computation
Ball radius: many choices, eg. van der Waals radii
Mathematical Model: Union of ball model
For a molecule M of n atoms, the i-thatom is a ball bi, with center at zi ∈ R3
bi
≡
{x| x
∈
R3, |x-zi
| <= ri
},
parameterized by (zi
, ri
).
Molecuel M is formed by the union of a finite number n of such balls defining the set B:
M = U B = U i=1n
{bi
}
Creates a space-filling body corresponding to the volume of the union of the excluded volume
When taken vdw radii, the boundary ∂ U {B} is the van der Waals surface.
(Edelsbrunner, 1995; see also Liang
et al, 1998)
Solvent Accessible surface model
Solvent accessible surface (SA model):
Solvent: modeled as a ball
The surface generated by rolling the solvent ball along the van der Waals atoms.
Same as vdw model, but with inflated radii by that of the solvent radius
( Lee and Richards, 1974)
Molecular Surface Model
Molecular Surface Model (MS):
The surface rolled out by the front of the solvent ball.
Also called Connolly’s surface.
More on molecular surface model
( Michael Connolly, http://www.netsci.org/Science/Compchem/feature14 e.html )
Elementary Surface Pieces: SA
SA: the boundary surface is formed by three elementary pieces:
Convex sphereical surface pieces, arcs or curved line segments
Formed by two intersecting spheres
VertexIntersection point of three spheres
The whole surface: stitching of these three elementary pieces.
a
Vdw
surface: Shrunken version ofSA surface by 1.4 A
Elementary Surface Pieces: MS
MS: three different elementary pieces:
Convex sphereical surface pieces,
Concave toroidal surface pieces
Concave spheric surfaceThe latter two are also called “Re-entrant surface”
The whole surface: stitching of these three elementary pieces.
b
Relationship between different surface models
c
vdW and SA surfaces.
SA and MS surfaces:Shrink or expand atoms.
SA
MS
Vertex
concave spheric
surface piece
Arcsconcave toroidal
surface piece
Conv. surfade
Smaller conv
surface
SA and MS: Combinatorically equivalent
Homotopy equivalent
But, different metric properties!SA: void of 0-volume ---- MS: void of 4πr3/3
Today’s Lecture
Space filling structures of proteins: volume and surface models,
Geometric constructs and algorithms: Voronoi diagram, Delaunaytriangulation, and alpha shapeAlgorithmsA bit of topology
Application in proteins packing and function prediciton
Computing protein geometry
It is easy to conceptualize different surface models
But how to compute them?Topological properties
Metric properties (size measure)
Need:Geometric constructs
Mathematical structure
Algorithms
Geometric Constructs: Voronoi
Diagram
A point set S of atom centers in R3
The Voronoi region / Voronoi cell of an atom of an atom bi with center zi∈R3 :
Vi
= { x ∈
R3
| |x-zi
| <= |x-zj
|, zj
∈
S }
All points that are closer to (or as close as to) bi than any other balls bj
Alternative view:
Bisector plane has equal distance to both atoms, and forms a half space for bi.
Half space of bi with each of the other balls bj
Intersection of the half spaces forms the Voronoi cell, and is a convex region
Weighted Voronoi diagram
In reality, atoms have different radius
Cannot use Euclidean distance and bisector plane.
Use Power Distance:
πi
(x) ≡
||x-zi
||2
- ri2
which replaces the Euclidean distance
||x-zi
|| It is the length of thetangent line segment
Delaunay Triangulation
Convex hull of point set S:The smallest convex space contain all points of S.
It is formed by intersection of halfplanes, and is a convex polytope.
Delaunay triangulation:uniquely tesselate/tile up the space of the convex hull of a point set with tetrahedra, together with their triangles, edges, and vertices
(triangles instead of tetrahedra in 2D)
Dual relationship between Voronoi
and Delaunay
These two geometric constructs look very different!
In fact, they are dual to each other
Reflect the same combinatorial structures
The mapping process.
Delaunay triangulation:Formed by gluing 4 types of primitive elements:
Vertices: just atom centers
Edges: connecting atom centers zi and zj iff the intersection vi Å vj of their Voronoi regions vi and vj is not empty:
vi
Å
vj
≠
∅
Triangle: connecting atom centers zi , zj and zk iff
vi
Å
vj
Å
vk
≠
∅.
Tetrahedron: connecting atom centers zi , zj, zk and zl iff
vi
Å
vj
Å
vk
Å
vl
≠
∅.
They form Delaunay Complex.
What are vi Å vj ? vi Å vj Å vk ? and vi Å vj Å vk Å vl ?
Computing Delaunay Triangulation
Circumcircle of a Delaunaytriangle abc:
The unique circle passing through a, b, and c.Its center: the corresponding Voronoivertex vi Å vj Å vk
Empty circumcircle of triangle abc:
Contains no other points in S.
Circumcircle Claim (for triangles): Let S ⊂ R2 be finite, and let a, b, c ∈ S be three points. abc
is a Delaunay
triangle iff
the circumcircle
of abc
is empty
a
b
c
d
a
2 flip
a
b
c
d
acd
is
abc
is notan emptycircle
Supporting Circle
Supporting Circle Claim (for edges):
Let S ⊂
R2
be finite,
and let a, b ∈
S be two points.
ab
is a Delaunay
edge iff
there is an empty
circle that passes through a and b.
a
b
c
d
a
Not ab
here
Locally Delaunay
A triangulation K triangulates point set S if the triangles decompose the convex hull of S and the set of vertices is SAn edge ab ∈ K is locally Delaunay if
It belongs to only one triangle, and therefore bound the convex hullIt belongs to two triangles, abc and abd, and d lies outside the circumcircle of abc
Note: A locally Delaunay edge is not necessarily an edge of the Delaunaytriangulation.
Delaunay Lemma
If every edge is locally Delaunay, then all edges are Delaunya edges
Delaunay Lemma: If every edge of K is locally Delaunay, then K is the Delaunay triangulation of S
General idea: Take an arbitrary triangulation Go through each of the triangles,
make corrections using “flips" discussed below if a specific triangle contains an edge is not locally Delaunay.
Primitive Operation: Edge FlipIf ab is not locally Delaunay, then the union of the two triangles abc ∪ abd is a convex quadrangle acbd, and edge cd is locally Delaunay. We can replace edge ab by edge cd. This is called an edge-flip.
Also called 2-to-2 flip, as two old triangles are replaced by two new triangles.
Recursively check each boundary edge of the quadrangle abcd to see if it is also locally Delaunay after replacing ab by cd.
If not, recursively edge-flip it.
Data structure: use a stack.
2−to−2 flip
a
b
c
d
a
b
c
d
a
Incremental Algorithm
A finite set of points S = (z1, z2, …, zn)Start with a large auxiliary triangle containing all points.Insert the points one by one
Maintaining a Delaunay triangulation Di upon insertion of point zi
Search for the triangle τi-1 containing this new point.Add z_i and split τi-1 into 3 smaller triangles
1-to-3 flip, as it replaces one old triangle with three new triangles
Check if each of the three edges in τi-1 still satises the locally Delaunay requirement.
If not, perform a recursive edge-flip.
1−to−3 flip
b
Incremental Algorithm
Delaunay Tetrahedrization
More complicated in R3, but same basic idea:
Locate a tetrahedron that contains the newly inserted point.
“Locally Delaunay” replaced by “Locally convex”
Other flips than “2-to-2” flips
Sequentially adding points become important
Excellent expected performance.
(Edelsbrunner
and Shah, 1996)
Computing voronoi Diagram
Easy when Delaunay triangulation is available.Mathematical duality
Compute all of the Voronoi vertices, edges, and planar faces from the Delaunay tetrahedra, triangles, and edges. Because one point zi may be an vertex of many Delaunaytetrahedra, the Voronoi region of zi therefore may contain many Voronoi vertices, edges, and planar faces.
The efficient quad-edge data structure can be used for software implementation
A bit of topology: simplices
To encode the intersection relationship
0-simplices σ0: vertices, covex hull of 1-point
1-simplices σ1 : edgescovex hull of 2 points
2-simplices σ2 : trianglescovex hull of 3 points
3-simplices σ3 : tetrahedroncovex hull of 4 points
Tetrahedron is the most complicatedsimplex in R3
These can be glue tomodel arbitrailycomplicated molecule
Simplicial Complex
We can glue the simplicies together properly to form more complicated structuresSimplicial complex: the collection of faces of a finite number of simplicies
Any two are either disjoint, or meet in a common face
Formally, simplices form a simplicial complex K
K = {σ|I|-1
| Å i ∈
Ι
Vι
≠
∅ }
Simplicial complex from simplices iffσ
∈
K Æ
ι≤σ
→ ι
∈
K,
σ, ν
∈
K → σ
Å
ν
≤
σ, ν
Examples
Alpha Complex
Grow or shrink balls by a parameter α
For a ball bi of radius ri, its modified radius at a particular α is:
ri
(α) = (ri2+α)1/2
Shrunk when -ri<α <0,What happens if α < 0 and |α| > ri?
Simplex occurs when intersection appears.Collect simplices at different α as we increase α from -∞ to +∞
a b c
d e f
( Mucke
& Edelsbrunner, 1994, ACM Tran Graph )
Alpha complex and dual complex
At any specific α value, we have a dual simplicial complex or alpha complex:
When all atoms take incremented radius ri + rs, we have dual complex K0 of the protein molecule.
When α is sufficiently large, all simplicies are collected, and we have the Delaunay complex
A series of simplicial complex at different α values.
A series of 2D simplicial complexes (alpha shapes).
Each faithfully represents the geometric and topological property of the protein molecule at a particular resolution parametrized
by
the α
value
Equivalent View
For dual complex or alpha shape K0 at α=0,
Take a subset of the simplices iff the corresponding intersections of Voronoi cells overlap with the body of the union of the balls:
K0
= {σ|I|-1
| Åi
∈
I
Vi Å
∪ B ≠
∅
}
Alpha shape:The underlying space of alpha complex.
A Guide Map for Computing Geometric Properties
Combinatorial Information
A Guide Map for Computing Geometric Properties
Re-entrant surfaces in MS Model :
Concave spherical patch: boundary triangles in alpha shape
Delaunay
edge: vi
Å
vj
Å
vk
≠
∅
.Alpha edge: (vi
Å
vj
Å
vk
)
Å
∪
B ≠
∅
The line (segment) vi
Å
vj
Å
vk
intersect with SA of ∪ B:
at a surface vertex where three balls overlapCorresponds to a concave spherical surface patch in the molecular surface
c
A Guide Map for Computing Geometric Properties
Re-entrant surfaces in MS Model :
Concave toroidal surfaceboundary edges in alpha shape:
The Voronoi plane vi Å vjcoincides with the intersecting plane when two atoms meet
intersect with SA of ∪ B at an arc (line segment)Corresponds to a toroidal concave surface patch
Remaining part of the surface: convex
correspond to the vertices in alpha shape, That is, the atoms on the boundary of the alpha shape
c
A Guide Map for Computing Geometric Properties
Number of concave spheric pieces = number of boundary triangles in alpha shape
Number of toroidal pieces = number of boundary edges
Seem to scale roughly to O(n):due to constraints in bond lengths, bond angles, and excluded volume
Geometric structure from alpha shape
Mucke & Edelsbrunner, 1994, ACM Tran GraphEdelsbrunner, Facello, Liang, 1998, Disc Appl MathLiang, Edelsbrunner, Woodward, 1998, Protein Sci
Guide map for topological properties: Voids in proteins
Computing Metric Properties
Volume of a protein molecule in SA model
A grossly incorrect method: Sum the volume of individual atoms
Problem: over-exaggeration, because neglecting volume overlaps.
Explicit correction: Direct Inclusion-Exclusion
When two balls overlap, subtract intersection
When three balls overlap: Oversubtraction!
Need to add back volume of 3-intersection
Continues when thereare 4-
or 5-volume
overlaps.
(Edelsbrunner, 95)
Inclusion-Exclusion Principle
Corrected volume V(B) for a set of balls B is:
V(B) = ∑
vol(Å
T) > 0, T ⊂
{B} (-1)dim(T)-1
vol(Å
T),
Same as the Gauss-Bonnet Theorem combinatorically
Straightforward application of this does not work!Volume overlap can be 7-8 degrees,
Difficult to keep track, and difficult to compute volume overalps:
how large is the k-volume overlap of which one of the 7 \choose k or 8 \choose k overlapping atoms for all of k=2, …, 7?
(Edelsbrunner, 95)
( 1 )
How to avoid high-degree overlaps?
Insight: in R3, overlaps of five or more balls at a time can always be reduced to a ``+'' or a ``-'' signed combination of overlaps of four or fewer balls2-body, 3-body, and 4-body terms in Eqn 1 enter the formula iff there exist in the dual simplicial complex K0 of the molecule
the corresponding edge σij connecting the two balls (1-simplex), triangles σijk spanning the three balls (2-simplex), tetrahedron σijkl cornered on the four balls (3-simplex)
(Edelsbrunner, 95)
Simplified Inclusion-Exclusion Formula for Molecules
Volume for a set of balls B:
V(B) = ∑
σi ∈
K
vol
(bi
) -
∑
σij
∈
K
vol
(bi
Å
bj
) +
∑
σijk
∈
K
vol
(bi
Å
bj
Å
bk
) -
∑
σijkl
∈
K
vol
(biÅ
bj
Å
bk
Å
bl
)
Same idea for the calculation of surface area of molecules.
An Example in 2D
Total area of the union of disks (2D) is:
Atotal
= (A1
+ A2
+ A3
+ A4
)
- (A12
+ A23
+ A24
+ A34
)
+ A234
Here we:add the area of bi
if the corresponding vertex σ
i
∈
K0
,
subtract the area of bi
Å
bj
if σ
ij
∈
K0
add the area of bi
Å
bj
Å
bk
if σ
ijk
∈
K0
b1
b2
b3
b4
b1
b2
b3
b4
An Example in 2D
Without the guidance of the alpha complex:
Atotal
=
(A1
+ A2
+ A3
+ A4
)
- (A12
+ A13
+ A14
+ A23
+ A24
+ A34
)
+ ( A123 + A124 + A134 + A234 )
- A1234
But six of them are canceling redundant terms:
A13 = A123, A14 = A124, A134 = A1234
Computing them would be wasteful.
b1
b2
b3
b4
b1
b2
b3
b4
No Redundant Term
The redundant terms do not occur if we follow the alpha complex:
The part of the Voronoi regions contained in the respective balls for the redundant terms do not intersect: eg. A123 , A124 , and A134
The corresponding edges and triangles do not enter the alpha complex
In R2, we have terms of at most three disk intersections:
triangles in the alpha complex,
In R3, the most complicated terms are intersections of four balls:
tetrahedra in the alpha complex
b1
b2
b3
b4
Short Inclusion-Exclusion Term
Even faster computation with shorter expressions of area and volume
Use fractional coefficients in the formulas: Angles at the ball centers relative to its neighbor in the alphacomplex
Measured as fractions of circles or spheres and normalized between 0 and 1.
Atotal
= (1 · A1
+ φ2
· A2
+ φ3
· A3
+ φ4
· A4
)
− (1 · A12
+1/2 · A23
+ 1/2 · A24
+1/2 · A34
)
+ A
A: area of the triangles in alpha complex.b1
b2
b3
b4
Algorithm for Metric Size Calculation
Let V and A denote Volume and area of a molecule
Volume of ball intersections
Sector: solid angle x vol
of ball
Wedge: dihedral angle x
(Vcap
(i,j) + Vcap
(j,i) )
Pawn:
What about area calculations?
How to compute metric properties of MS models?
Hint: The MS model has the same structure as the SA model, but different primitive geometric objects
Concave spheric piece
Concave toroidal piece
Voids and pockets in proteins
Concave regions on protein surfaces
Shape complimentarity important for molecular recognition
Binding frequently occurs in p0ckets and voids
Eg. enzymes
Computing voids and pockets
Consider Delaunay tetrahedra not included in alpha complex.
Repeatedly merge those that share a 2-simplex
Obtain discrete set of empty tetrahedra
Voids: completely isolated
Pockets: connected to outside with a constricted mouth
Shallow depression: connected,
Size calculation: inclusion and exclusion.
Computing pockets in proteins
Edelsbrunner et al, 1998, Disc Appl MathLiang, Edelsbrunner, Woodward, 1998, Protein Sci
Today’s Lecture
Space filling structures of proteins: volume and surface models,
Geometric constructs and algorithms: Voronoi diagram, Delaunaytriangulation, and alpha shapeAlgorithmsA bit of topology
Application in proteins packing and function prediciton
Voids and Pockets in Soluble Proteins
“Protein interior is solid-like, tightly packed like a jig-saw puzzle”
High packing density (Richards, 1977)
Low compressibility (Gavish, Gratoon, and Harvey, 1983)
Many voids and pockets.At least 1 water molecule; 15/100 residues.
(Liang & Dill, 2001, Bioph J)
Proteins are not optimized by evolution to eliminate voids.
Protein dictated by generic compactness constraint related to nc.
Surfaces with unknown functional roles
http://cast.engr.uic.edu
Enzyme Functional Site Prediction
Where are the functional surfaces located ?
What are the key residues (active site residues) in the functional surfaces ?
Which mutations to make?
Our strategyGeometric
Computing( precomupted
in castPdatabase)
Locating
Key residues (red)4~5 residues
Proteinstructure~200 residues
FunctionalSurface (green)
Identifying
Structure is the only input!!
Validation study: large-scale prediction of functional surface
Data Set (~700 structures).
10-fold cross validation.
Average accuracy of predicting functional surfaces of proteins is 91.2%. Tseng and Liang, Annals of BME, 2007
The Universe of Protein Structures
Human genome: 3 billion nucleotides;
Number of genes: 20,000 –25,000; Protein families: 10,000; Number of folds: 1,000s
Currently in PDB: about 1,000 folds
Comparative modeling: needs a structural template with sequence identities > 30-35%
eg. ~50% of ORFs and ~18% of residues of S. cerevisiae genome
Structural Genomics: populating each fold with 4-5 structures
One for each superfamily at 30-35% sequence identities.
Fold of a novel gene can be identified
Its structure can then be interpolated by comparative modeling.
All β α/β
(from SCOP)
Predicting and characterizing protein functions
Important, but challenging tasks:Needs > 60-70% sequence identity.
Fold prediction: >20-30% sequence identity.
Proteins from structural genomics often are of unknown functions.
Sequence homologs are often hypothetical proteins.
(Rost, 02, JMB; Tian & Skolnick, 03, JMB)
Another Way: How to identify biologically important pockets and voids from random ones?
By Homology:
Assessing Local Sequence and Shape Similarity
(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
Binding Site Pocket: Sparse Residues, Long Gaps
ATP Binding: cAMP
Dependet
Protein Kinase
(1cdk) and Tyr
Protein Kinase
c-src
(2src)
1cdk.A 49LGTGSFGRVMLVKHKETGNHFAMKILDKQKVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNL YMVMEYVPGGEMFSHLRRIGRFSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFG FAKRVKGRTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR FPSHFSSDLKDLLRNLLQVDLTKRFGNLKDGVNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSN F327
1cdk.A_p 49LGTGSFGRV A K V
MEYV E K EN L TD F
2src.m 273LGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVAD404
2src.m_p 273LGQGCFGEV A K V TEYM GS D D R AN L AD
Low overall sequence identity: 13 %
High Sequence Similarity of Pocket Residues
1cdk.A LGTGSFGRVAKVMEYV---EKENLTDF 24
2src.m LGQGCFGEVAKVTEYMGSDDRANLAD- 26
** *.**.**** **: :: **:*
1cdkcAMP Dependent Protein Kinase
2srcTyr Protein Kinase c-src
High sequence identity: 51 %
Sequence Similarity of Surface Pockets
Similarity detection:Dynamic programming SSEARCH (Pearson, 1998)
Order Dependent Sequence Pattern.
Statistics of Null Model:Gapless local alignment: Extreme Value Distribution
(Altschul & Karlin, 90)
Alignment with gaps: (Altschul, Bundschuh, Olsen & Hwa, 01)
Statistical Significance !
Approximation with EVD distribution (Pearson, 1998, JMB)
Kolmogorov-Smirnov Test:
Estimate K and λ parameters.
Estimation of E-value:
Estimate p value of observed Smith-Waterman score by EVD.
allall )( NpNNpE d ⋅≤−⋅=
)exp(1)'(
,ln'xexSp
KmnSS−−−=≥
−= λ
(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
Shape Similarity Measure
cRMSD (coordinate root mean square distance)
oRMSD (Orientational RMSD):Place a unit sphere S2 at center of mass x0 ∈ R3
Map each residue x ∈ R3 to a unit vector on S2 :
f :
x
= (x, y, z)T
u
= (x -
x0
) / || x -
x0
||
Measuring RMSD between two sets of unit vectors.
(cf. uRMSD by Kedem and Chew, 2002)
Surprising Surface Surprising Surface SimilaritySimilarity
Retroviral proteaseRetroviral proteaseFamilyFamily
All All ββClassClass
Binds polyBinds poly--peptide substrate acetylpeptide substrate acetyl--pepstatinpepstatin
FoldFold
HIVHIV--1 Protease 1 Protease ((5hvp5hvp))
PocketPocket
Acid proteasesAcid proteases
CATHCATH
Hsp90Hsp90FamilyFamily
αα++ββClassClass
Binds protein segment geldanamycinBinds protein segment geldanamycin
FoldFold
Heat Shock Protein 90 Heat Shock Protein 90 ((1yes1yes))
PocketPocket
αα//β β sandwhichsandwhich
CATHCATH
••Conserved residues both important in Conserved residues both important in polypeptide binding polypeptide binding
•• Both pockets undergo conformational Both pockets undergo conformational changes upon binding changes upon binding
How to capture the evolutionary signal due to biological function?
(Tseng and Liang, 2006, Mol Biol Evo, 23:421; Tseng et al, 2009, J. Mol.Biol)
Strong selection pressures that increase the overall fitness of proteins.
But may not be related with biological function:
Structural constraints
Protein stability
Folding kinetics
Isolating selection pressure due to biological function:
Unsolved problem!
Deriving Scoring Matrix from Evolutionary History
A scoring matrix is critical:
Determines similarity between residues and hence statistical significance.
Derived from evolutionary history of proteins sharing the same function.
Existing approach: PAM and BLOSUM heuristics.
Position specific weight matrix.
Entropy/relative entropy for full proteins or domains.
Our approach: Evolution: Explicit phylogenetic tree.
Model: Continuous time Markov process.
Geometry: Evolution of only residues located in the binding region.
Bayesian Markov chain Monte Carlo.
A R N D C Q E G H I L K M F P S T W Y V
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
The Active Pocket [ValidPairs: 39]
(a)
A R N D C Q E G H I L K M F P S T W Y V
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
The rest of Surface [ValidPairs: 177]
(b)
A R N D C Q E G H I L K M F P S T W Y V
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
Interior [ValidPairs: 190]
(c)
A R N D C Q E G H I L K M F P S T W Y V
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
Surface [ValidPairs: 187]
(d)
Evolutionary rates of binding sites and other regions are different
Residues on protein functional surface experience different selection pressure.
Estimated substitution rate matrices of amylase:
• Functional surface residues.
• The remaining surface, • The interior residues• All surface residues.
Example 1: Finding alpha amylase by matching pocket surfaces
Challenging:– amylases often have low overall sequence identity (<25%).
–1bag, pocket 60; B. subtilis–14 sequences, none with structures, 2 are hypothetical
–1bg9; Barley–9 sequences, none with structures.
Criteria for declaring similar functional
surface to a matched surface
Search >2million surfaces with a template surface.
Shapes have to be very similar:p-value for cRMSD: < 10-3 .
Customized scoring matrices of 300 different time intervals.
The most similar surface has nmax of matrices capable of finding this homologous surface.
Declare a hit if >1/3 nmax of matrices give positive results.
Results for Amylase
• 1bag: found 58 PDB structures.
• 1bg9: found 48 PDB structures.
Altogether: 69
All belong to amylase (EC 3.2.1.1)
Query: B. subtilis Barley1bag 1bg9
Hits: human1b2y 1u2y22% 23%
0y)
False Positive Rate
Tru
ePosi
tive
Rate
(b)
Helmer-Citterich, M et al (BMC Bioinformat. 2005)Russell RB. (JMB 2003)Sternberg MJ Skolnick, JLichtarg, O (JMB2003)Ben-Tal, N and Pupko, T ( ConSurf )
• 110 protein families• Each points on the curve corresponds to p-values of various cRMSD cutoffs• Accuracy ~92% (EBI: 75%)
1
False positive rateTN FP
TN FP TN FPTrue positive rate
TPTP FN
≡
− =+ +
≡
+
Laskowski, Thornton, JMB, 05, 351:614-26
EBICATH domain, not function
Large Scale Prediction of Protein Functions
False Positive Rate (1-specificity)0.0 0.2 0.4 0.6 0.8 1.0
Tru
e Po
sitiv
e R
ate
(sen
sitiv
ity)
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
0.0 0.2 0.4 0.6 0.8 1.0
MC
C
0.78
0.80
0.82
0.84
0.86
0.88
(Tseng et al, 2009)
Orphan protein structures without functional annotation
Orphan proteins from Structural Genomics.
No known functions.
Often sequences homologs are hypothetical proteins.
Our tasks:Identify the functional pocket for the
structure.
Predict protein function.
Inferring biological functions of protein BioH
NP_991531
1A7U
1BRT
1HKH
1A88
1A8S
1A8Q
1IUN
1J1I
ZP_00217495
ZP_00221914
ZP_00194950
NP_822724
ZP_00031039
ZP_00032388
NP_069699
ZP_00081046
NP_273326
NP_618510
NP_871588
NP_778087
NP_249193
NP_790345
ZP_00086843
NP_904047
NP_298645
NP_796527
1M33
0.1 substitutions/site
99
100
94
68
100
93
91
60
62
70
85
44
63
82
97
66
99
53
93
98
100
49
85
90
85
The phylogenetic tree of 28 sequences related to BioH. Many are hypothetical genes.
The candidate binding pocket (CASTp id=35) of BioH (1m33) and a similar functional surface detected from (1p0p, E.C. 3.1.1.8)
Protein of unknownfunctions from structuralgenomics
A Probablistic Model of Enzyme Function
Enzymes have diverse reactivites:Some are very specific
Some can react with a range of substrates
An E.C. number alone is inadequate
Our approach: a probabilistic model
3.1.1.8 3.1.4.17 3.4.21.7 3.4.23.22
0.00
0.05
0.10
0.15
0.20
0.25
0.30
Enzyme Classification Number
pro
bab
ility
BioH
Significant hits predicted for BioH:E.C. 3.1.1.8, p = 0.31, cholinesteraseE.C. 3.4.23.22, p = 0.26, aspartic endopeptidaseE.C. 3.4.21.7, p = 0.24, serine endopeptidaseE.C. 3.1.4.17, p = 0.20, phosphodiesterase
Experimental Results:Carboxyesterase E.C. 3.1.1.1 highLipase E.C. 3.1.1.3 lowThioesterase E.C. 3.1.1.5 lowAminopeptidease E.C. 3.4.11.5 low
Summary
Space filling structures of proteins: volume and surface models,
Geometric constructs and algorithms: Voronoi diagram, Delaunaytriangulation, and alpha shapeAlgorithmsA bit of topology
Application in proteins packing and function prediciton
References
H. Edelsbrunner and N. Shah. Incremental topological flipping works for regular triangulations. Algorithmica 15(1996), 223-241.
L. Guibas and J. Stolfi. Primitives for the manipulation of general subdivisions and the computation of Voronoi diagrams. ACM Transactions on Graphiques, 4:74-123, 1985