Download pdf - Remote Template Detectionubio.bioinfo.cnio.es/Cursos/EstructuraUAM2009/TempDetectBM13.pdf · If no similar PDB template exists … If no template can be found for a domain by BLAST,

Remote Template DetectionRemote Template Detection

Michael TressMichael Tress

CNIOCNIO

Simplified Protein StructureSimplified Protein Structure

Prediction Flow ChartPrediction Flow Chart

Simplified Protein StructureSimplified Protein Structure

Prediction Flow ChartPrediction Flow Chart

We know structure and sequence are related …

Early comparisons of the 3D structures of

homologous proteins showed that 3D

structure is closely related to protein

sequence.

For example, these two structures

have only 20% sequence identity.

Now we know that even proteinswith minimal sequence similarity

may have a similar fold (backbone

structure).

… and that structure is conserved

So we can make structure predictions even with

remote homology

Structural space(more limited)

Sequence space(more diverse)

Comparative Modelling Target

Remotely Homologous Target

But the fact that structure is highly

conserved means that we can even predict

structure based on remotely similar

structural templates from.

If no similar PDB template exists …

If no template can be found for a domain by BLAST,

then the process of structure prediction becomes a

lot more complicated.

If we are lucky we can use more complex sequence

search methods (PSIBLAST, FFAS, HMMs) or fold

recognition techniques.

If these methods fail we will have to use fold

recognition methods to find a remotely related

template.

If fold recognition methods fail to find a template, the

protein can still be modelled ab initio.

Remember, once sequence identity between the target sequence and the nearest

template falls below 25 or 30%, homology models are likely to be less accurate.

Pairwise sequence search methods detect folds when sequence similarity is high,

but are very poor at detecting relationships that have less than 20% identity.

Detecting Remote Templates

Advanced Sequence-Based and Hybrid Techniques

PSIBLAST and Hidden Markov ModelsProfile methods use multiple alignments (profiles) to search for

templates. PSI-BLAST generates multiple alignments from BLASTsearches, Hidden Markov model methods build profiles using models of

probability.

These methods are more powerful because the more sequences thatyou can use in a search, the more likely you are to find evolutionary

distant relatives.

META-PROFILES and PROFILE-PROFILE SearchesBoth these methods are based on profiles.

Meta-profile methods improve on PSI-BLAST by including the

coincidence of real and predicted secondary structure and accessibility.

Adding structural information to the profiles helps find distant but

structurally similar homologues.

Profile-profile searches build profiles for both the target sequences

AND the templates from the protein data bank. The extra evolutionary

information in the template profiles improves the power of the search.

Hybrid and Sequence-Based Servers

SAM T06 - www.cse.ucsc.edu/research/compbio/HMM-apps/T06-query.html

The query is checked against a library of hidden Markov models. This is a sequencebased technique, but it does use secondary structure information.

Meta-BASIC - at bioinfo.pl

Meta-BASIC is based on consensus alignments of profiles. It combines sequence

profiles with predicted secondary structure and uses several scoring systems and

alignment algorithms.

FFAS – ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl

FFAS03 is a profile-profile alignment method, which takes advantage of theevolutionary information in both query and template sequences.

HHPRED – toolkit.tuebingen.mpg.de/hhpred

HHpred is based on the pairwise comparison of profile hidden Markov models. It

searches domain databases, like Pfam or SMART instead of single sequences and is

meant to be as easy as BLAST, but more sensitive at finding remote homologues.

Fold Recognition

When fold recognition methods were first

developed it was thought that they could detect

analogous, proteins – those that were

structurally similar but with no evolutionary

relationship.

Fold Recognition





relationship.

In fact most of these predictions were later

shown to be homologous (have an evolutionary

relationship) once advanced sequence

comparison methods, such as PSI-BLAST, were

developed.

Fold Recognition





relationship.





developed.

Fold recognition methods are used because they

allow you to find more distant templates.

Fold Recognition





relationship.





developed.



Fold recognition techniques find templates thatsequence based methods cannot because theyuse structural information as well as sequencesimilarity to evaluate templates.

MREYKLVVLGSGGVGKSALTVQFVQGIFVDEYDPTIEDSYRKQVEVDCQQCMLEILDTAGTEQFTAMRDLYMKNGQGFALVYSITAQSTFNDLQDLREQILRVKDTEDVPMILVGNKCDLEDERVVGKEQG

QNLARQWCNCAFLESSAKSKINVNEIFYDLVRQINR

?

Fold Recognition





relationship.





developed.



In fold recognition we are asking the

opposite to the question we asked

earlier: “starting from known protein

structures, can I fit my sequence onto

them?”

Fold recognition techniques find templates thatsequence based methods cannot because theyuse structural information as well as sequencesimilarity to evaluate templates.

Fold recognition methods have built in fold libraries

Fold recognition methods work by superimposing the target onto a database of

known 3D structures (folds) and evaluating the sequence-fold alignments.

Each method has its own non-redundant database of folds to save calculation

time.

Here you start with the query sequence …

Align the sequence with all

folds from the library.

(Fold recognition programs

use a library of folds …)

Evaluate the fit between

the target and all the

structures

Fold recognition –

a summary

Instead of aligning two sequences …LKRPMT ...

Evaluate whether this would be a

good structure (secondary

structure, accessibility, contacts)

FRIPLN ...

L

R

K

P

Take the template structure …

F

I

R

P

R

P

I

L

Align the target sequence with

the template structure …

Align again …

Evaluation Sequence-Structure Fit

Using Scoring Functions

Scoring functions for evaluating the sequence-structure fit include:

Coincidence of real and predicted secondary structure/accessibility

Evolutionary information (from aligned structures and sequences)

Pair potentials

Solvation energy

Similarity between the known and predicted residue environments

Bowie et al. (1991) used “environments” to describe each residue position

in a protein structure. Eighteen distinct environments were defined by

measuring the area of side chain that was buried, the fraction that was

exposed to polar atoms, and local secondary structure.

The structural databases were used to build a scoring matrix from the

preferences of the 20 amino acids for the 18 environment classes.

Fold recognition methods pre-assign one of these 18 environments to each

residue position in each of the templates in their fold libraries.

When the target sequence is aligned with the fold, a score can be

calculated from the scoring matrix for each of the aligned positions. The

final environment score will be the sum of the probabilities for each residue.

R

P

I

L

1

2

3

4

5

6

7

A Fold Recognition Example

- 3D-PSSM

1. The Fold Library

PSI-BLAST and a non-redundant database are

used to create sequence profiles for each of thefolds in the library.

Structurally similar folds and their profiles aremerged to form a “3D-profile”.

2. Secondary Structure and Solvation

Secondary structure is assigned to each foldbased on the annotation in the STRIDE

database.

Each residue is also assigned a solvationpotential, calculated as the ratio between the

surface area accessible to solvent and its overallsurface area.

Solvation potential is divided into 21 bins,ranging from 0% (buried) to 100% (exposed).

A Fold Recognition Example

- 3D-PSSM

3. Query Sequences

Query sequences have their secondary structurepredicted by PSI-Pred.

PSI-BLAST profiles are also generated for the

query sequence to allow “bi-directional scoring”.

4. Fold recognition

The target sequence is aligned to folds from the

fold library using the 3D-FSSP dynamicprogramming algorithm.

Sequence-structure alignments are scored by a

range of 1D and 3D profiles, secondary structurematching , solvation potentials and residue

equivalence.

Fold Recognition Servers

3D-PSSM - www.sbg.bio.ic.ac.uk/~3dpssm/

Based on sequence profiles, solvatation potentials and secondary structure.

mGenTHREADER - www.psipred.net/

Combines profiles and sequence-structure alignments. A neural network-

based jury system calculates the final score based on solvation and pair

potentials.

RAPTOR - software.bioinformatics.uwaterloo.ca/~raptor/

Best-scoring server in CAFASP3 competition in 2002. ACE server (based

on Raptor) best FR server in CASP6. You have to ask to use it first ...

SPARKS - http://sparks.informatics.iupui.edu/

Top server in CASP 6. Sequence, secondary structure Profiles And

Residue-level Knowledge-based Score for fold recognition

Consensus Fold Recognition

No one method can hope to correctly identify every fold. In fact the best

predictions are often when server predictions agree.

Human experts have recognised this, and generally they are better at fold

prediction than machines.

Human experts usually use several different fold recognition methods and

predict folds after evaluating all the results (not just the top hits) from a

range of methods.

So why not produce an algorithm that mimics the human experts?

The first consensus server, Pcons, sent the target sequence to six publicly

available fold recognition web servers.

Predictions were structurally superimposed and evaluated for their

similarity. The best model was predicted from similarity to other predicted

models.

Consensus Fold

Recognition Servers

3D Jury - http://bioinfo.pl/meta/

3D Jury is a consensus predictor that utilizes the results of fold

recognition servers, such as FFAS, 3D-PSSM, FUGUE and

mGenTHREADER, and uses a jury system to select alignments and

templates. Models are built with Modeller.

GeneSilico - http://genesilico.pl/meta/

A gateway to various methods for protein structure prediction. Domains

are identified by HmmPfam, and there are several methods for secondary

and tertiary structure (FR) prediction. Consensus predictions are made

with the Pcons consensus server and you can also send a subset of

alignments to the FRankenstein server.

Pcons - www.sbc.su.se/~arne/pcons/

Pcons was the first consensus server for fold recognition. It has been

relaunched recently.

•Rather than using previously solved structures, ab initio methods build 3D

models from physical principles such as energy functions, or try to mimic

protein folding.

•Ab initio methods work best on very small proteins. They require vast

computational resources and the physical basis of protein structural

stability and the necessary energy functions are not fully understood.

•De novo modelling is a form of ab initio modelling that uses protein

fragments to build models that are evaluated by physical principles and

statistical properties. De novo modelling has had some notable successes.

Again this works best on smaller proteins.

Ab initio and de novo protein modelling

De Novo Servers

ROBETTA - http://robetta.bakerlab.org/

ROBETTA makes both ab initio and template-based predictions. It

detects fragments with BLAST, FFAS03, or 3DJury and uses

fragment insertion and assembly.

FoldPRO - http://www.igb.uci.edu/?page=tools&subPage=psss

A server that attempts to assemble fragments in a similar way to

Robetta.