Topic 10 Chapter 20, Du and Bourne “Structural Bioinformatics”


What is a domain

Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

reasonable region of complexity



There is no precise definition of a protein domain (to say the least), which makes domain boundaries difficult to identify

General Considerations: - compact, semi-independent units (close to spherical shape) *

- interactions between domains are weak (small contact)

- identifiable hydrophobic core (interface is more hydrophilic) **

- β-sheet is best preserved

* Wetlaufer DB. PNAS 1973; 70:697-701
** Swindells MB. Protein Science 1995; 4:103-112

Protein Domain

Multi-domain Proteins

Redfern et al, PloS Computational Biology, 2007

Approximately 50% of proteins are multi-domain (data from 2005). The fraction could be as high as 80% in eukaryotes

From Wikipedia…

• A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently (not likely now) of the rest of the protein chain.

• Each domain forms (formed?) a compact three-dimensional structure and often can be independently stable and folded.

• Many proteins consist of several structural domains.

• One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions.

• Domains vary in length from about 25 amino acids up to 500 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. (Is a single zinc finger really a domain?)

• Domains often form functional units, such as the calcium-binding EF-hand domain of calmodulin. (Is a single EF-hand really a domain?)

• Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins. (Sometimes.)

EF-hands (domain or motif?)

Calmodulin

The EF-hand is another common structural element. In fact, the protein calmodulin has four of them.

What about a zinc finger?

Zinc finger

From Wikipedia: A zinc finger is a small protein structural motif that is characterized by the coordination of one or more zinc ions in order to stabilize the fold.

Quick aside about zinc fingers

Repeat proteins

Ankyrin

Adding to the Complexity, Discontinuous Domains

Redfern OC. et al, PloS Computational Biology, 2007

N-terminal C-terminal

SCOP Classification:

33844 px c.56.5.4 d1cg2a1 1cg2 A:26-213,A:327-414
39360 px d.58.19.1 d1cg2a2 1cg2 A:214-326

About 20% of multidomain proteins contain domains that are not contiguous in sequence
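A discontinuous domain shows up directly in the SCOP region string as more than one residue range. The following is a minimal sketch of reading such strings; the function names are hypothetical, and the parser assumes simple `chain:start-end` segments with non-negative residue numbers and no insertion codes.

```python
def parse_scop_region(region):
    """Parse a region such as 'A:26-213,A:327-414' into (chain, start, end) tuples.

    Assumes non-negative residue numbers and no insertion codes.
    """
    segments = []
    for part in region.split(","):
        chain, span = part.split(":")
        start, end = span.split("-")
        segments.append((chain, int(start), int(end)))
    return segments


def is_discontinuous(region):
    """A domain is discontinuous if it is built from more than one segment."""
    return len(parse_scop_region(region)) > 1


# The two 1cg2 domains listed above
n_and_c_terminal = "A:26-213,A:327-414"   # d1cg2a1
inserted_domain = "A:214-326"             # d1cg2a2
print(parse_scop_region(n_and_c_terminal), is_discontinuous(n_and_c_terminal))
print(parse_scop_region(inserted_domain), is_discontinuous(inserted_domain))
```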

Domain identification

• Any structure left unclassified by the sequence-based methods is divided into its constituent domains (when appropriate). The domains are then resubmitted to the sequence and structure comparison protocols discussed previously.

• While there are many automatic domain identification algorithms, most result in significant numbers of incorrect assessments (20-30% incorrect).

• This is mainly due to the fact that there is no unique answer to the question, “What is a domain?” For example, one could easily envision various domain classification schemes based on sequence, phylogeny, and/or structure.

• Structure-based approaches are based on straightforward structural concepts: namely that (globular) proteins have hydrophobic cores, and that these cores should constitute a (semi)independent folding nucleus.

• Thus the automated methods attempt to (maximize, minimize) (intra, inter)-domain contacts.

• What about non-globular (i.e., intrinsically disordered or integral) proteins???

Domain identification

Most automated domain identification methods are primarily based on this premise. However, as you might expect, there are myriad ways to implement such an idea.

ADH

Early works only apply to single-segment domains:
Crippen, 1978; Nemethy & Scheraga, 1979; Lesk & Rose, 1981; Rashin, 1981.

Current methods for multi-segment domains mostly use heuristics and approximations:

Holm & Sander, 1994; Siddiqui & Barton, 1995; Swindells, 1995…

Automatic Domain Partition Methods

Note: the focus here is structural domain partition. While structure-based domain assignment is not a trivial problem, domain prediction from sequences is even more difficult. Any advances in sequence-based domain prediction will greatly improve protein structure prediction. ?

The general approach

Basic principle for domain partition: inter-residue interactions are denser within domains than between domains

Top-down vs. Bottom-up

Top-down: start with the entire structure and proceed through iterative partitions into smaller units.

Bottom-up: define very small structural units and assemble them into domains.

• Over the years, an amazing array of approaches has been put forward to solve the domain ID problem.

• In spite of very different overall approaches, an interesting observation has been made: most algorithms correctly ID 70-80% of domains within structures, but fail on the others due to complexity within some multi-domain proteins.

• The number of boundaries is either over-predicted, leading to too many domains (overcut), or under-predicted, leading to too few domains (undercut).

• Thus, the problem remaining is not “where does the boundary of the domain fall?”, but rather “is the identified boundary real?”
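The overcut/undercut bookkeeping described above can be sketched as follows. This is only an illustration: the function names and the example counts are made up, and a matching domain count does not by itself mean the boundaries are correct.

```python
from collections import Counter


def classify_cut(n_predicted, n_reference):
    """Compare a predicted domain count with the reference (expert) count."""
    if n_predicted > n_reference:
        return "overcut"
    if n_predicted < n_reference:
        return "undercut"
    return "same count"


def tally(pairs):
    """Count each outcome over (n_predicted, n_reference) pairs."""
    return Counter(classify_cut(p, r) for p, r in pairs)


# Hypothetical (predicted, reference) domain counts for four chains
example = [(2, 2), (3, 2), (1, 2), (1, 3)]
summary = tally(example)
print(summary)
```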

How do automatic methods work?

[Flowchart: 3D coordinates of chain → Step 1 → Step 2 → Predicted domains]

Step 1: Make domains
• Top-down approach: make domains by partitioning the chain into smaller units
• Bottom-up approach: make domains by putting together primitive units of secondary structure

Step 2: Evaluate each potential domain using a set of parameters (accept or reject a given assignment)

Parameters involved:
• Maximize hydrophobic core of the unit
• Maximize compactness of the unit
• Find mechanical hinge points between units
• Minimize interface area between units
• Minimum size of unit
• Maximize globularity
• Minimize cutting through secondary structures
• Maximum number of discontinuous fragments within the domain




Two steps of algorithm design:

Step A: Train the algorithm
• Compare predicted domain assignments to “correct” domain assignments
• Tune parameters until the best level of prediction is achieved

Step B: Validate the performance
• Run the algorithm on an independent set of data
• Report % of correctly partitioned proteins

Use expert data for domain assignments

A problem: different algorithms use assignments from different experts for training and validation.

Algorithms will reflect the same propensities toward domain assignments as the expert method they rely upon.

More seriously, there is no good objective way to compare the performance of different methods, as each uses a different dataset for validation.




Issues in Protein Domain Partition

• Compactness (contacts/#of residues……)

• Minimum domain size (35 amino acids [AA], 40AA…?)

• Minimum size to be considered for partition (80AA…?)

• Integrity of secondary structures (Is it ok to break a β-sheet?)

• Most programs use a top-down approach; what are the stopping criteria?

CATH Domain Classification

Use both automatic and manual techniques

If a chain has high sequence identity (80%) and structural similarity (SSAP score >= 80) with a protein chain X that has been classified in CATH, use the boundaries of X.

Otherwise, apply several domain partition programs:
1. DETECTIVE (Swindells, 1995)
2. PUU (Holm & Sander, 1994)
3. DOMAK (Siddiqui and Barton, 1995)

If there is no consensus, assign manually.

Differences

WARNING: Even though each method has about 70-80% accuracy on benchmark tests, the methods often disagree substantially on both the number of domains and the domain boundaries.

In CATH, if consensus is not found within a tolerance of 10 residues, the domains are manually assigned (right).
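A consensus test with a 10-residue tolerance might look like the sketch below. The exact rule CATH applies is not given here, so this is an assumption: the methods are taken to agree if they report the same number of boundaries and, for each boundary, all reported positions lie within the tolerance of one another. The function name and the example positions are hypothetical.

```python
def boundaries_agree(boundary_sets, tolerance=10):
    """Check whether several methods agree on domain boundaries.

    boundary_sets: one list of boundary residue positions per method.
    Agreement (assumed rule): same number of boundaries, and for each
    boundary the spread across methods is at most `tolerance` residues.
    """
    sorted_sets = [sorted(b) for b in boundary_sets]
    if len({len(b) for b in sorted_sets}) != 1:
        return False  # methods disagree on the number of domains
    return all(max(col) - min(col) <= tolerance for col in zip(*sorted_sets))


# Hypothetical boundaries reported by three programs for one chain
print(boundaries_agree([[120], [125], [118]]))    # spread of 7 residues
print(boundaries_agree([[120], [135], [118]]))    # spread of 17 residues
print(boundaries_agree([[120], [120, 250]]))      # different domain counts
```

When the function returns False, the chain would go to manual assignment in the workflow described above.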

DOMAK (Siddiqui and Barton, 1995): split value = (intA/extAB) * (intB/extAB)
  intA (intB): the number of internal contacts in A (B) (contact: heavy atoms within 5 Å)
  extAB: the number of contacts between A and B

DETECTIVE (Swindells, 1995): hydrophobic core determination

PUU (protein unfolding units, Holm & Sander, 1994): harmonic model to describe inter-domain dynamics

DomainParser (Xu et al., 2000): graph algorithm (network flow)
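The DOMAK split value listed above can be illustrated with a short sketch. It is not DOMAK's code: the function names are hypothetical, and it assumes that a "contact" is counted once per residue pair having any heavy-atom pair within 5 Å, which the slide does not specify.

```python
import math
from itertools import combinations


def residue_contacts(atoms_by_residue, cutoff=5.0):
    """Return residue pairs (i, j), i < j, with any heavy-atom pair within cutoff (Å).

    atoms_by_residue: dict mapping residue index -> list of (x, y, z) heavy atoms.
    """
    contacts = set()
    for i, j in combinations(sorted(atoms_by_residue), 2):
        if any(math.dist(a, b) <= cutoff
               for a in atoms_by_residue[i] for b in atoms_by_residue[j]):
            contacts.add((i, j))
    return contacts


def split_value(contacts, part_a, part_b):
    """(intA / extAB) * (intB / extAB) for a trial split into parts A and B."""
    int_a = sum(1 for i, j in contacts if i in part_a and j in part_a)
    int_b = sum(1 for i, j in contacts if i in part_b and j in part_b)
    ext_ab = sum(1 for i, j in contacts
                 if (i in part_a and j in part_b) or (i in part_b and j in part_a))
    if ext_ab == 0:
        return math.inf  # no contacts between the parts at all
    return (int_a / ext_ab) * (int_b / ext_ab)


# Toy contact set: two tight triplets joined by a single contact (3, 4)
toy = {(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)}
print(split_value(toy, {1, 2, 3}, {4, 5, 6}))
```

A large split value means many contacts inside each part and few between them, which is the signature of a plausible domain boundary.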

Automatic Domain Partition Methods

DomainParser

• DomainParser (Xu et al, Bioinformatics 2000) uses a graph-theoretic algorithm for the decomposition of a multi-domain protein into individual structural domains.

• The underlying principle used is that residue-residue contacts are denser within a domain than between domains.

• The decomposition problem is recast as a network flow problem, in which each residue is represented as a node of a network and each residue-residue contact is represented as an edge with a particular capacity, depending on the type of the contact.

• A two-domain decomposition problem is solved by finding a cut of the network that minimizes the total cross-edge capacity (minimum cut).

• To deal with networks with non-unique minimum cuts, the algorithm finds all cuts that achieve the minimum cross-edge capacity.

• A recent analysis of four automatic methods put DomainParser (marginally) at the top (Holland et al, JMB, 2006). In fact, 3/4 were nearly equal depending on the evaluation criterion.

Domain Partition as a Network Flow Problem

[Figure: domain partition drawn as a network flow problem; the interface between domains is the bottleneck]

Basic idea: identify the bottleneck

Xu et al, Bioinformatics 2000
Guo et al, NAR 2003

Note: there is now a DomainParser 2

DomainParser

Domain identification is recast as a network flow problem: the method attempts to divide the network into two interconnected parts in such a way that the edge capacity across the division is minimized. (Note: each edge can carry a different weight, or capacity.)

Intuitively, this translates into finding the bottleneck within the network.

The algorithm works by systematically removing nodes until domain separation is maximized.

There is a second (post-processing) step that checks the validity of the domain boundaries using commonsense metrics like compactness, radius of gyration, number of non-contiguous segments per domain, and distribution of domain sizes.

Because the method is based on topology, it is very fast, and it scales very well: O(nm²), where n = # of nodes and m = # of edges.
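One of the post-processing metrics mentioned above, the radius of gyration, is easy to compute from residue coordinates. The sketch below uses an unweighted definition over one point per residue (for example Cα atoms); whether DomainParser weights atoms or uses a different point set is not stated here, so treat those choices as assumptions.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points.

    Defined as the root-mean-square distance of the points from their centroid.
    """
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((c[k] - centroid[k]) ** 2 for k in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


# Two points 2 Å apart: each lies 1 Å from the centroid
print(radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]))
```

For a fixed number of residues, a smaller radius of gyration indicates a more compact candidate domain.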

[Figure: flow network with labeled source, sink, nodes, edges, and edge capacities]

Maximum Flow/Minimum Cut (bottleneck)

Algorithm to solve this problem: Ford-Fulkerson Method
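A minimal sketch of the maximum-flow/minimum-cut computation, using the Edmonds-Karp variant of the Ford-Fulkerson method (augmenting paths found by breadth-first search). The function name and graph encoding are illustrative assumptions and not DomainParser's actual implementation; the sketch also assumes the source and sink are different nodes.

```python
from collections import defaultdict, deque


def min_cut(edges, source, sink):
    """Return (cut_capacity, source_side_nodes) for an undirected weighted graph.

    edges: iterable of (u, v, capacity). Assumes source != sink.
    """
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        residual[u][v] += cap  # an undirected edge carries capacity both ways
        residual[v][u] += cap

    def reachable_with_parents():
        """BFS over edges with remaining capacity; maps node -> parent."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = reachable_with_parents()
        if sink not in parents:
            # No augmenting path left: reachable nodes form the source side
            return flow, set(parents)
        # Find the bottleneck capacity along the path, then push that much flow
        bottleneck = float("inf")
        v = sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Toy network: two tightly connected triangles joined by one weak edge (2, 3)
toy_edges = [(0, 1, 5), (1, 2, 5), (0, 2, 5),
             (3, 4, 5), (4, 5, 5), (3, 5, 5),
             (2, 3, 1)]
print(min_cut(toy_edges, source=0, sink=5))
```

In the toy network the weak edge is the bottleneck, so the cut separates the two triangles, which is the behavior one wants from a two-domain decomposition.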

We need to construct a graph first…

Node: Residue (Cα)
Capacity: Packing
Source/sink: Extreme points

Model Building for Domain Partition

Issues:
• Compactness
• Minimum domain size
• Integrity of secondary structures
• When to stop

Find the bottleneck

Capacity between residues A/B (based on Holm & Sander 1994):
(1) If atom distance <= 4.0 Å, add 1;
(2) If backbone contact, add 5;
(3) If across a β-sheet, add 12;
(4) If within a β-strand, add 1000.
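The additive capacity rules above can be written as a small function. This is a sketch under assumptions: the function name and arguments are hypothetical, and rule (1) is read as adding 1 for each atom pair within 4.0 Å, which the slide does not make explicit.

```python
def edge_capacity(n_close_atom_pairs, backbone_contact=False,
                  across_sheet=False, within_strand=False):
    """Capacity of the edge between two residues, following the four rules above.

    n_close_atom_pairs: number of atom pairs within 4.0 Å (assumed: +1 per pair).
    """
    capacity = n_close_atom_pairs      # rule 1
    if backbone_contact:
        capacity += 5                  # rule 2
    if across_sheet:
        capacity += 12                 # rule 3
    if within_strand:
        capacity += 1000               # rule 4
    return capacity


print(edge_capacity(3))
print(edge_capacity(2, backbone_contact=True, across_sheet=True))
print(edge_capacity(1, within_strand=True))
```

The very large weight for residues within the same strand makes a minimum cut through a β-strand effectively impossible, while the weight of 12 only discourages cutting across a sheet.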

Capacity and Extreme Points

Two farthest residues perpendicular to the axis

Source

Sink (sampling)

Preserve β-sheet structure

Use multiple extreme points

* Violate compact globular requirement

Domains have very simple and/or extended structure (DomainParser 1 domain)

1aaya 1zmec 6prch

Assignments by DomainParser vs. SCOP

DomainParser preserves β-sheet (DomainParser 1 domain)

undercut

Structurally correct decomposition by DomainParser (DomainParser: 2 domains)

SCOP treats them as single domain proteins, functional consideration or ?

Assignments by DomainParser vs. SCOP

2liv 2adma

overcut

Holland, et al, JMB, 2006
Experts: CATH, SCOP, AUTHORS

Domain Assignments by DomainParser

DomainParser tends to undercut large multi-domain proteins

Holland, et al, JMB, 2006

Summary of Performance Comparison


But PDP (Protein Domain Parser) is the winner


• PDP is a recursive top-down algorithm that makes either: (1.) a single cut producing two contiguous domains or (2.) a double cut, where the cuts are at least 35 residues apart and within 8 Å of each other.

• The best cut is selected using criteria of minimum contacts between resulting domains, normalized by the size of the domains.

• The algorithm continues to recursively partition each of the resulting domains until a stopping condition is met.

• During the post-processing step, the number of contacts between resulting domains is evaluated and domains with a high level of contacts are merged together. Very small domains (below 35 residues) are discarded.
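The single-cut step described above can be sketched as a scan over candidate cut positions. This is an illustration, not PDP itself: the function name is hypothetical, the normalization (contacts divided by the product of the two piece sizes) is a stand-in because the slide does not give PDP's exact formula, and using 35 residues as the minimum piece size during the scan is also an assumption.

```python
def best_single_cut(contacts, n_residues, min_size=35):
    """Find the single cut with the fewest size-normalized cross contacts.

    contacts: set of residue index pairs (i, j) with 0 <= i < j < n_residues.
    A cut at position c splits the chain into residues [0, c) and [c, n_residues).
    Returns (cut_position, score), or None if the chain is too short to split.
    """
    best = None
    for cut in range(min_size, n_residues - min_size + 1):
        crossing = sum(1 for i, j in contacts if i < cut <= j)
        # Assumed normalization: divide by the product of the piece sizes
        score = crossing / (cut * (n_residues - cut))
        if best is None or score < best[1]:
            best = (cut, score)
    return best


# Toy 8-residue chain: two dense 4-residue blocks joined by one contact (3, 4)
toy_contacts = {(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3),
                (4, 5), (5, 6), (6, 7), (4, 6), (5, 7), (4, 7),
                (3, 4)}
print(best_single_cut(toy_contacts, 8, min_size=2))
```

A recursive version would call the same routine on each resulting piece until a stopping condition is met, then merge pieces that share too many contacts.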



• Based on the criterion of correct number of assigned domains, PDP appears to be the most accurate method (85% correct) followed by NCBI (83%), DomainParser (77%), and PUU (74%).

• DomainParser is the most accurate on structures with few domains. However, it tends to under-cut many structures (4.5% over-cut, 18.5% under-cut).

• NCBI, on the other hand, shows a balance between over-cut and under-cut types of errors (9.9% over-cut, 7.6% under-cut).

• The performance of PDP is consistently superior to that of the other methods; it is particularly impressive on chains with larger numbers of domains: the method correctly assigns four out of five five-domain chains and is the only method to correctly assign a six-domain chain. The performance of NCBI is very similar to that of PDP, both overall and in its profile; its assignment of four-domain chains is superior to that of PDP, but NCBI fails to correctly assign most of the five-domain chains and both of the six-domain chains.

Summary of Performance Comparison

Some insights from looking at automatic domain assignments:

Maximizing the ratio of intra- to inter-domain contacts is a chief principle in algorithmic assignments and works well for ‘standard’ cases. As more complex structures are solved, more cases of ‘unusual’ architecture are uncovered. These tend to defy our basic rules.

It is possible to include more parameters and tune them better to avoid some obvious cases of overcuts:

• penalize splitting secondary structure elements (some cutting of secondary structures is essential to obtain the ‘correct’ domain, but this feature should be carefully balanced)

• penalize domains consisting of too many short fragments (excessive fragmentation may result in very compact, but biologically unfeasible domains)

• improve the ability to recognize ‘classical’ folds (this will improve recognition of very small and very large domains for which contact density may be misleading)




http://pdomains.sdsc.edu

Best practices: use a consensus approach

