Topic 10: Chapter 20, Gu and Bourne “Structural Bioinformatics”
What is a domain
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx
reasonable region of complexity
A protein domain is not well defined (to say the least), which makes it difficult to identify domain boundaries
General Considerations:
- compact, semi-independent units (close to spherical shape) *
- interactions between domains are weak (small contact)
- identifiable hydrophobic core (interface is more hydrophilic) **
- β-sheet is best preserved
* Wetlaufer DB. PNAS 1973; 70:697-701
** Swindells MB. Protein Science 1995; 4:103-112
Protein Domain
Multi-domain Proteins
Redfern et al, PloS Computational Biology, 2007
Approximately 50% of proteins are multi-domain (data from 2005). The fraction could be as high as 80% in eukaryotes
From Wikipedia…
• A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently (not likely now) of the rest of the protein chain.
• Each domain forms (formed?) a compact three-dimensional structure and often can be independently stable and folded.
• Many proteins consist of several structural domains.
• One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions.
• Domains vary in length from between about 25 amino acids up to 500 amino acids in length. The shortest domains such as zinc fingers are stabilized by metal ions or disulfide bridges. (Is a single zinc finger really a domain?)
• Domains often form functional units, such as the calcium-binding EF-hand domain of calmodulin. (Is a single EF-hand really a domain?)
• Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins. (Sometimes.)
EF-hands (domain or motif?)
Calmodulin
The EF-hand is another common structural element. In fact, the protein calmodulin has four of them.
What about a zinc finger?
Zinc finger
From Wikipedia: A zinc finger is a small protein structural motif that is characterized by the coordination of one or more zinc ions in order to stabilize the fold.
Quick aside about zinc fingers
Repeat proteins
Ankyrin
Adding to the Complexity, Discontinuous Domains
Redfern OC. et al, PloS Computational Biology, 2007
N-terminal C-terminal
SCOP Classification:
33844 px c.56.5.4 d1cg2a1 1cg2 A:26-213,A:327-414
39360 px d.58.19.1 d1cg2a2 1cg2 A:214-326
About 20% of multidomain proteins are not contiguous in sequence
Domain identification
• Any structure left unclassified by the sequence-based methods is divided into its constituent domains (when appropriate). The domains are then resubmitted to the sequence and structure comparison protocols discussed previously.
• While there are many automatic domain identification algorithms, most result in significant numbers of incorrect assessments (20-30% incorrect).
• This is mainly due to the fact that there is no unique answer to the question, “What is a domain?” For example, one could easily envision various domain classification schemes based on sequence, phylogeny, and/or structure.
• Structure-based approaches are based on straightforward structural concepts: namely that (globular) proteins have hydrophobic cores, and that these cores should constitute a (semi)independent folding nucleus.
• Thus the automated methods attempt to (maximize, minimize) (intra, inter)-domain contacts.
• What about non-globular (i.e., intrinsically disordered or integral) proteins???
Domain identification
Most automated domain identification methods are primarily based on this premise. However, as you might expect, there are myriad ways to implement such an idea.
ADH
Early work applies only to single-segment domains: Crippen, 1978; Nemethy & Scheraga, 1979; Lesk & Rose, 1981; Rashin, 1981.
Current methods for multi-segment domains mostly use heuristics and approximations: Holm & Sander, 1994; Siddiqui & Barton, 1995; Swindells, 1995…
Automatic Domain Partition Methods
Note: the focus here is structural domain partition. While structure-based domain assignment is not a trivial problem, domain prediction from sequences is even more difficult. Any advances in sequence-based domain prediction will greatly improve protein structure prediction. ?
The general approach
Basic principle for domain partition: inter-residue interactions are denser within domains than between domains
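To make the principle above concrete, here is a minimal sketch that counts residue-residue contacts within and between domains for a given assignment. The function name `contact_counts`, the one-point-per-residue representation, and the 8 Å cutoff are assumptions made for illustration; they are not taken from any of the published methods.

```python
from itertools import combinations
from math import dist

def contact_counts(coords, labels, cutoff=8.0):
    """Count residue pairs in contact within and between domains.

    coords: one (x, y, z) point per residue (an assumed simplification)
    labels: domain label per residue
    cutoff: hypothetical contact distance in angstroms
    """
    intra = inter = 0
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= cutoff:
            if labels[i] == labels[j]:
                intra += 1
            else:
                inter += 1
    return intra, inter

# Toy chain: two tight clusters of three residues with a thin interface.
coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0),
          (10, 0, 0), (13, 0, 0), (10, 3, 0)]
labels = ["A", "A", "A", "B", "B", "B"]
intra, inter = contact_counts(coords, labels)
print(intra, inter)
```

A good partition is one where the intra-domain count is large relative to the inter-domain count; the toy assignment above has that property.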
Top-down vs. Bottom-up
Top-down: start with the entire structure and proceed through iterative partitions into smaller units.
Bottom-up: define very small structural units and assemble them into domains.
• Over the years, an amazing array of approaches has been put forward to solve the domain ID problem.
• In spite of very different overall approaches, an interesting observation has been made: most algorithms correctly ID 70-80% of domains within structures, but fail on the others due to complexity within some multi-domain proteins.
• The # of boundaries is either over-predicted, leading to too many domains (overcut), or under-predicted, leading to too few domains (undercut).
• Thus, the problem remaining is not “where does the boundary of the domain fall?”, but rather “is the identified boundary real?”
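The overcut and undercut error types described above can be tallied with a few lines of code. This is only a sketch: `classify_cut` is a hypothetical helper, and the domain counts are invented for illustration.

```python
from collections import Counter

def classify_cut(n_predicted, n_reference):
    """Compare a predicted domain count with a reference (expert) count."""
    if n_predicted > n_reference:
        return "overcut"
    if n_predicted < n_reference:
        return "undercut"
    return "same count"

# Hypothetical (predicted, reference) domain counts for five chains.
pairs = [(2, 2), (3, 2), (1, 2), (4, 4), (1, 3)]
tally = Counter(classify_cut(p, r) for p, r in pairs)
print(tally)
```

Note that a matching domain count does not by itself show that the boundaries are placed correctly.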
How do automatic methods work?
Input: 3D-coordinates of chain
Step 1: Make domains
- Bottom-up approach: make domains by putting together primitive units of secondary structure
- Top-down approach: make domains by partitioning chain into smaller units
Step 2: Evaluate each potential domain using set of parameters (accept or reject given assignment)
Parameters involved:
- Maximize hydrophobic core of the unit
- Maximize compactness of the unit
- Find mechanical hinge points between units
- Minimize interface area between units
- Minimum size of unit
- Maximize globularity
- Minimize cutting through secondary structures
- Maximum number of discontinuous fragments within the domain
Output: Predicted domains
Two steps of algorithm design:
Step A: Train the algorithm
- compare predicted domain assignments to “correct” domain assignments
- tune parameters until the best level of prediction is achieved
Step B: Validate the performance
- run the algorithm on an independent set of data
- report % of correctly partitioned proteins
Use expert data for domain assignments
A problem: different algorithms use assignments from different experts for training and validation.
Algorithms will reflect the same propensities toward domain assignments as the expert method they rely upon.
More seriously, there is no good objective way to compare the performance of different methods, as each uses a different dataset for validation.
Issues in Protein Domain Partition
• Compactness (contacts/#of residues……)
• Minimum domain size (35 amino acids [AA], 40AA…?)
• Minimum size to be considered for partition (80AA…?)
• Integrity of secondary structures (Is it ok to break a β-sheet?)
• Most programs use a top-down approach; what are the stopping criteria?
CATH Domain Classification
Use both automatic and manual techniques
If a chain has high sequence identity (80%) and structural similarity (SSAP score >= 80) with a protein chain X that has been classified in CATH, use the boundaries of X.
Otherwise, apply several domain partition programs: 1. DETECTIVE (Swindells, 1995), 2. PUU (Holm & Sander, 1994), 3. DOMAK (Siddiqui and Barton, 1995).
If there is no consensus, assign manually.
Differences
WARNING: Even though each method has about 70-80% accuracy on benchmark tests, the methods disagree substantially on both the number of domains and the domain boundaries.
In CATH, if consensus is not found within a tolerance of 10 residues, the domains are manually assigned (right).
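A consensus check with a 10-residue tolerance could look roughly like the sketch below. It assumes each method reports a list of boundary positions for contiguous domains, which is a simplification; the function names and the example boundaries are hypothetical, and the actual CATH procedure may differ in detail.

```python
from itertools import combinations

def boundaries_agree(a, b, tolerance=10):
    """True if both lists have the same number of boundaries and each
    corresponding pair differs by at most `tolerance` residues."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= tolerance for x, y in zip(sorted(a), sorted(b)))

def has_consensus(predictions, tolerance=10):
    """True if every pair of methods agrees within the tolerance."""
    return all(boundaries_agree(a, b, tolerance)
               for a, b in combinations(predictions.values(), 2))

# Invented boundary positions (residue numbers) for illustration only.
agree = {"DETECTIVE": [120], "PUU": [126], "DOMAK": [118]}
disagree = {"DETECTIVE": [120], "PUU": [126, 240], "DOMAK": [118]}
print(has_consensus(agree), has_consensus(disagree))
```

In the second case one method predicts an extra boundary, so the chain would go to manual assignment.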
DOMAK (Siddiqui and Barton, 1995):
split value = (intA/extAB)*(intB/extAB)
intA (intB): the number of internal contacts in A (B) (contact: heavy atoms within 5 Å)
extAB: the number of contacts between A and B
DETECTIVE (Swindells, 1995): hydrophobic core determination
PUU (protein unfolding units, Holm & Sander, 1994): harmonic model to describe inter-domain dynamics
DomainParser (Xu, 2000): graph algorithm, network flow
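The DOMAK split value listed above is simple enough to compute directly once the contact counts are known. The sketch below takes the three counts as inputs; the contact numbers are invented, and returning infinity when A and B share no contacts is an assumption made here to avoid division by zero, not something stated in the source.

```python
def split_value(int_a, int_b, ext_ab):
    """DOMAK-style split value: (intA/extAB) * (intB/extAB).

    int_a, int_b: internal contact counts of parts A and B
    ext_ab: contact count between A and B
    """
    if ext_ab == 0:
        return float("inf")  # assumed convention: no interface at all
    return (int_a / ext_ab) * (int_b / ext_ab)

# Invented counts: a thin interface versus a heavily packed one.
print(split_value(400, 300, 20))   # well-separated parts give a large value
print(split_value(400, 300, 200))  # tightly packed parts give a small value
```

A larger split value means the two parts are internally dense relative to their interface, which is the behavior expected of separate domains.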
Automatic Domain Partition Methods
DomainParser
• DomainParser (Xu et al, Bioinformatics 2000) uses a graph-theoretic algorithm for the decomposition of a multi-domain protein into individual structural domains.
• The underlying principle used is that residue-residue contacts are denser within a domain than between domains.
• The decomposition problem is recast as a network flow problem, in which each residue is represented as a node of a network and each residue-residue contact is represented as an edge with a particular capacity, depending on the type of the contact.
• A two-domain decomposition problem is solved by finding a cut of the network that minimizes the total cross-edge capacity (minimum cut).
• To deal with networks with non-unique minimum cuts, the algorithm finds all cuts that achieve the minimum cross-edge capacity.
• A recent analysis of four automatic methods put DomainParser (marginally) at the top (Holland et al, JMB, 2006) --- In fact, 3/4 were nearly equal depending on the evaluation criterion.
Domain Partition as a Network Flow Problem
[Figure: domain partition mapped onto network flow; the domain interface corresponds to the bottleneck]
Basic idea: identify the bottleneck
Xu et al, Bioinformatics 2000
Guo et al, NAR 2003
Note: there is now a DomainParser 2
DomainParser
Domain identification is recast as a network flow problem. That is, the method attempts to divide the network into two interconnected parts in such a way that the edge capacity across the division is minimized. (Note: each edge can carry a different weight, or capacity.)
Intuitively, this translates into finding the bottleneck within the network.
The algorithm works by systematically removing nodes until domain separation is maximized.
There is a second (post-processing) step that checks the validity of the domain boundaries using commonsense metrics like compactness, radius of gyration, number of non-contiguous segments per domain, and distribution of domain sizes.
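One of the post-processing metrics mentioned above, the radius of gyration, can be computed as in this minimal sketch. It treats every point as having equal mass, which is an assumption; the function name and toy coordinates are illustrative only.

```python
from math import sqrt

def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid
    (all points weighted equally, an assumed simplification)."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    total = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                for x, y, z in coords)
    return sqrt(total / n)

# Four points on a unit circle around the origin.
square = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
print(radius_of_gyration(square))
```

For a candidate domain, a small radius of gyration relative to its residue count indicates a compact unit, while an extended fragment gives a larger value.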
Because the method is based on topology, it is very fast. It also scales well: O(nm²), where n = # of nodes and m = # of edges.
[Figure: flow network showing source, sink, nodes, edges, and edge capacities]
Maximum Flow/Minimum Cut (bottleneck)
Algorithm to solve this problem: Ford-Fulkerson Method
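As an illustration of the maximum flow/minimum cut idea, the sketch below uses the Edmonds-Karp variant of the Ford-Fulkerson method (augmenting paths found by breadth-first search) on a small undirected graph. It is not the DomainParser implementation; the function name `min_cut`, the edge-list input format, and the toy graph are assumptions for illustration.

```python
from collections import deque

def min_cut(edges, source, sink):
    """Return (cut value, set of nodes on the source side).

    edges: list of (u, v, capacity) for an undirected graph
    """
    residual = {}
    for u, v, c in edges:
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        # An undirected edge can carry flow in either direction.
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v][u] = residual[v].get(u, 0) + c

    def bfs_parents():
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            # No augmenting path left: reachable nodes form the source side.
            return flow, set(parents)
        # Find the smallest residual capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        # Push that much flow along the path.
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Two tightly connected triangles joined by one weak edge (c-d).
edges = [("a", "b", 5), ("b", "c", 5), ("a", "c", 5),
         ("d", "e", 5), ("e", "f", 5), ("d", "f", 5),
         ("c", "d", 1)]
value, side = min_cut(edges, "a", "f")
print(value, sorted(side))
```

The cut found here separates the two triangles across the weak edge, which is the bottleneck in this toy network.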
We need to construct a graph first……
Model Building for Domain Partition
Node: Residue (C)
Capacity: Packing
Source/sink: Extreme points
Issues:
• Compactness
• Minimum domain size
• Integrity of secondary structures
• When to stop
Find the bottleneck
Capacity between residues A/B (based on Holm & Sander 1994):
(1) If atom distance <= 4.0 Å, ++1;
(2) If backbone contact, ++5;
(3) If across a β-sheet, ++12;
(4) If within a β-strand, ++1000.
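The capacity rules above can be read as additive increments to an edge weight. The sketch below encodes that reading; the function signature is hypothetical, and applying rule (1) once per atom pair within 4.0 Å is an assumption, since the slide does not say whether the increment is per atom pair or per residue pair.

```python
def edge_capacity(close_atom_pairs, backbone_contact, across_sheet, within_strand):
    """Capacity of the edge between two residues.

    close_atom_pairs: number of atom pairs within 4.0 angstroms
                      (per-pair counting is an assumed reading of rule 1)
    backbone_contact, across_sheet, within_strand: booleans for rules 2-4
    """
    capacity = close_atom_pairs  # rule 1: +1 per close atom pair
    if backbone_contact:
        capacity += 5            # rule 2
    if across_sheet:
        capacity += 12           # rule 3
    if within_strand:
        capacity += 1000         # rule 4
    return capacity

print(edge_capacity(3, True, False, False))
print(edge_capacity(1, True, False, True))
```

The very large increment for residues within the same strand makes a minimum cut through a strand effectively impossible, and the increment across a sheet discourages splitting sheets.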
Capacity and Extreme Points
Two farthest residues
perpendicular to the axis
Source
Sink (sampling)
Preserve β-sheet structure
Use multiple extreme points
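One simple way to pick a pair of extreme points is to take the two residues that are farthest apart, as in the sketch below. This is only an illustration of the "two farthest residues" idea; the function name and coordinates are hypothetical, and the sampling of multiple extreme points mentioned above is not shown.

```python
from itertools import combinations
from math import dist

def farthest_pair(coords):
    """Return indices (i, j) of the two points with the largest separation."""
    return max(combinations(range(len(coords)), 2),
               key=lambda pair: dist(coords[pair[0]], coords[pair[1]]))

# Toy coordinates: residues 0 and 2 lie at opposite ends.
coords = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (2, 2, 0)]
source, sink = farthest_pair(coords)
print(source, sink)
```

The brute-force search is quadratic in the number of residues, which is acceptable for a single chain.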
* Violate compact globular requirement
Domains have very simple and/or extended structure (DomainParser 1 domain)
1aaya 1zmec 6prch
Assignments by DomainParser vs. SCOP
DomainParser preserves β-sheet (DomainParser 1 domain)
undercut
Structurally correct decomposition by DomainParser (DomainParser: 2 domains)
SCOP treats them as single domain proteins, functional consideration or ?
Assignments by DomainParser vs. SCOP
2liv 2adma
overcut
Holland, et al, JMB, 2006
Experts: CATH, SCOP, AUTHORS
Domain Assignments by DomainParser
DomainParser tends to undercut large multi-domain proteins
Holland, et al, JMB, 2006
Summary of Performance Comparison
Holland, et al, JMB, 2006
But PDP (Protein Domain Parser) is the winner
Holland, et al, JMB, 2006
• PDP is a recursive top-down algorithm that makes either: (1.) a single cut producing two contiguous domains or (2.) a double cut, where the cuts are at least 35 residues apart and within 8 Å of each other.
• The best cut is selected using criteria of minimum contacts between resulting domains, normalized by the size of the domains.
• The algorithm continues to recursively partition each of the resulting domains until a stopping condition is met.
• During the post-processing step, the number of contacts between resulting domains is evaluated and domains with a high level of contacts are merged together. Very small domains (below 35 residues) are discarded.
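The single-cut step described above can be sketched as a scan over all cut positions, scoring each by the contacts that cross the cut. The source says only that the contact count is "normalized by the size of the domains"; dividing by the product of the two sizes is an assumption made here, not PDP's published formula. The function name, the contact-matrix input, and the reuse of 35 residues as the minimum size are likewise illustrative. Double cuts, recursion, and merging are not shown.

```python
def best_single_cut(contacts, min_size=35):
    """Find the single cut with the fewest normalized crossing contacts.

    contacts: symmetric n x n matrix; contacts[i][j] is the number of
              contacts between residues i and j
    Returns (cut position k, score), splitting residues [0, k) from [k, n),
    or None if the chain is too short to cut.
    """
    n = len(contacts)
    best = None
    for k in range(min_size, n - min_size + 1):
        crossing = sum(contacts[i][j] for i in range(k) for j in range(k, n))
        score = crossing / (k * (n - k))  # assumed normalization
        if best is None or score < best[1]:
            best = (k, score)
    return best

# Toy chain of 8 residues: two fully connected blocks joined by one contact.
n = 8
contacts = [[0] * n for _ in range(n)]
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i < j:
                contacts[i][j] = contacts[j][i] = 1
contacts[3][4] = contacts[4][3] = 1
print(best_single_cut(contacts, min_size=2))
```

On the toy chain the best cut falls between residues 3 and 4, where only one contact crosses.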
But PDP (Protein Domain Parser) is the winner
• Based on the criterion of correct number of assigned domains, PDP appears to be the most accurate method (85% correct) followed by NCBI (83%), DomainParser (77%), and PUU (74%).
• DomainParser is the most accurate on structures with few domains. However, it tends to under-cut many structures (4.5% over-cut, 18.5% under-cut).
• NCBI, on the other hand, shows a balance between over-cut and under-cut types of errors (9.9% over-cut, 7.6% under-cut).
• The performance of PDP is consistently superior to other methods; it is particularly impressive on chains with a larger number of domains: the method correctly assigns four out of five five-domain chains and is the only method to correctly assign a six-domain chain. In general, the performance of NCBI is very similar to that of PDP, both overall and in its profile; its assignment of four-domain chains is superior to that of PDP, but NCBI fails to correctly assign most of the five-domain chains and both of the six-domain chains.
Summary of Performance Comparison
Some insights from looking at automatic domain assignments:
Maximizing the ratio of intra- to inter-domain contacts is a chief principle in algorithmic assignments and works well for ‘standard’ cases. As more complex structures are solved, more cases of ‘unusual’ architecture are uncovered. These tend to defy our basic rules.
It is possible to include more parameters and tune them better to avoid some obvious cases of overcuts:
- penalize splitting secondary structure elements (some cutting of secondary structures is essential to obtain the ‘correct’ domain, but this feature should be carefully balanced)
- penalize domains consisting of too many short fragments (excessive fragmentation may result in very compact, but biologically unfeasible domains)
- improve the ability to recognize ‘classical’ folds (this will improve recognition of very small and very large domains for which contact density may be misleading)
Best practices: use a consensus approach