Remote Template Detection
Michael Tress
CNIO
Simplified Protein Structure Prediction Flow Chart
We know structure and sequence are related …
Early comparisons of the 3D structures of
homologous proteins showed that 3D
structure is closely related to protein
sequence.
For example, these two structures
have only 20% sequence identity.
Now we know that even proteins with minimal sequence similarity
may have a similar fold (backbone
structure).
… and that structure is conserved
So we can make structure predictions even with
remote homology
Structural space (more limited)
Sequence space (more diverse)
Comparative Modelling Target
Remotely Homologous Target
But because structure is highly conserved,
we can even predict structure from
remotely similar structural templates.
If no similar PDB template exists …
If no template can be found for a domain by BLAST,
then the process of structure prediction becomes a
lot more complicated.
If we are lucky we can use more complex sequence
search methods (PSI-BLAST, FFAS, HMMs) or fold
recognition techniques.
If these methods fail we will have to use fold
recognition methods to find a remotely related
template.
If fold recognition methods fail to find a template, the
protein can still be modelled ab initio.
Remember, once sequence identity between the target sequence and the nearest
template falls below 25 or 30%, homology models are likely to be less accurate.
Pairwise sequence search methods detect folds when sequence similarity is high,
but are very poor at detecting relationships that have less than 20% identity.
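The identity thresholds above can be checked with a few lines of code. This is a minimal sketch, assuming the two sequences are already aligned and use "-" for gaps. The function names and the exact cut-offs (30% and 20%, taken from the figures quoted above) are illustrative choices, not a standard.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def template_regime(identity: float, reliable: float = 30.0, remote: float = 20.0) -> str:
    """Rough label for a target-template pair (illustrative cut-offs)."""
    if identity >= reliable:
        return "comparative modelling"
    if identity >= remote:
        return "less accurate homology model"
    return "remote template search"
```

For example, the aligned fragments LKRPMT and FRIPLN share one residue in six, which puts them in the range where pairwise searches are unreliable.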
Detecting Remote Templates
Advanced Sequence-Based and Hybrid Techniques
PSI-BLAST and Hidden Markov Models
Profile methods use multiple alignments (profiles) to search for
templates. PSI-BLAST generates multiple alignments from BLAST searches; hidden Markov model methods build profiles using models of
probability.
These methods are more powerful because the more sequences that you can use in a search, the more likely you are to find evolutionarily
distant relatives.
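The idea of a profile can be sketched as a position-specific scoring matrix built from the columns of a multiple alignment. This is a simplified illustration, not the PSI-BLAST or HMM algorithm: it assumes a uniform background frequency, a flat pseudocount, and an ungapped query of the same length as the alignment. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log(freq / background)
        profile.append(column)
    return profile


def score_sequence(profile, sequence):
    """Sum the column scores for an ungapped sequence of standard residues."""
    return sum(column[aa] for column, aa in zip(profile, sequence))
```

A sequence that matches the conserved columns scores well even if it is not identical to any single member of the alignment, which is why adding more sequences helps.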
META-PROFILES and PROFILE-PROFILE Searches
Both these methods are based on profiles.
Meta-profile methods improve on PSI-BLAST by including the
coincidence of real and predicted secondary structure and accessibility.
Adding structural information to the profiles helps find distant but
structurally similar homologues.
Profile-profile searches build profiles for both the target sequences
AND the templates from the protein data bank. The extra evolutionary
information in the template profiles improves the power of the search.
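One simple way to compare a target profile column with a template profile column is the dot product of their residue frequency vectors. The sketch below shows that idea only. It is not the scoring scheme of any particular server, and the pseudocount and function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column, pseudocount=1.0):
    """Residue frequencies for one alignment column given as a string."""
    observed = sum(1 for aa in column if aa in AMINO_ACIDS)
    total = observed + pseudocount * len(AMINO_ACIDS)
    return [(column.count(aa) + pseudocount) / total for aa in AMINO_ACIDS]


def column_similarity(freqs_a, freqs_b):
    """Dot product of two frequency vectors: higher means more alike."""
    return sum(p * q for p, q in zip(freqs_a, freqs_b))
```

Two hydrophobic columns such as "LLLI" and "LLIL" score higher against each other than "LLLI" does against an acidic column such as "DDEE", even though no single pair of sequences is being compared.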
Hybrid and Sequence-Based Servers
SAM T06 - www.cse.ucsc.edu/research/compbio/HMM-apps/T06-query.html
The query is checked against a library of hidden Markov models. This is a sequence-based technique, but it does use secondary structure information.
Meta-BASIC - at bioinfo.pl
Meta-BASIC is based on consensus alignments of profiles. It combines sequence
profiles with predicted secondary structure and uses several scoring systems and
alignment algorithms.
FFAS – ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl
FFAS03 is a profile-profile alignment method, which takes advantage of the evolutionary information in both query and template sequences.
HHpred – toolkit.tuebingen.mpg.de/hhpred
HHpred is based on the pairwise comparison of profile hidden Markov models. It
searches domain databases, such as Pfam or SMART, instead of single sequences and is
meant to be as easy to use as BLAST, but more sensitive at finding remote homologues.
Fold Recognition
When fold recognition methods were first
developed it was thought that they could detect
analogous proteins – those that were
structurally similar but with no evolutionary
relationship.
In fact most of these predictions were later
shown to be homologous (have an evolutionary
relationship) once advanced sequence
comparison methods, such as PSI-BLAST, were
developed.
Fold recognition methods are used because they
allow you to find more distant templates.
Fold recognition techniques find templates that sequence-based methods cannot because they use structural information as well as sequence similarity to evaluate templates.
MREYKLVVLGSGGVGKSALTVQFVQGIFVDEYDPTIEDSYRKQVEVDCQQCMLEILDTAGTEQFTAMRDLYMKNGQGFALVYSITAQSTFNDLQDLREQILRVKDTEDVPMILVGNKCDLEDERVVGKEQG
QNLARQWCNCAFLESSAKSKINVNEIFYDLVRQINR
?
In fold recognition we are asking the
opposite to the question we asked
earlier: “starting from known protein
structures, can I fit my sequence onto
them?”
Fold recognition methods have built-in fold libraries
Fold recognition methods work by superimposing the target onto a database of
known 3D structures (folds) and evaluating the sequence-fold alignments.
Each method has its own non-redundant database of folds to save calculation
time.
Here you start with the query sequence …
Align the sequence with all
folds from the library.
(Fold recognition programs
use a library of folds …)
Evaluate the fit between
the target and all the
structures
Fold recognition – a summary
Instead of aligning two sequences (LKRPMT ... with FRIPLN ...) …
Take the template structure …
Align the target sequence with the template structure …
Evaluate whether this would be a good structure (secondary structure, accessibility, contacts)
Align again …
Evaluating Sequence-Structure Fit Using Scoring Functions
Scoring functions for evaluating the sequence-structure fit include:
Coincidence of real and predicted secondary structure/accessibility
Evolutionary information (from aligned structures and sequences)
Pair potentials
Solvation energy
Similarity between the known and predicted residue environments
Bowie et al. (1991) used “environments” to describe each residue position
in a protein structure. Eighteen distinct environments were defined by
measuring the area of side chain that was buried, the fraction that was
exposed to polar atoms, and local secondary structure.
The structural databases were used to build a scoring matrix from the
preferences of the 20 amino acids for the 18 environment classes.
Fold recognition methods pre-assign one of these 18 environments to each
residue position in each of the templates in their fold libraries.
When the target sequence is aligned with the fold, a score can be
looked up in the scoring matrix for each of the aligned positions. The
final environment score is the sum of these per-residue scores.
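The environment scoring described above amounts to a table lookup followed by a sum. The sketch below uses a made-up table with two environment classes and five residues purely for illustration. A real table would cover all 18 environment classes and 20 amino acids, and the values here are invented.

```python
# Hypothetical toy scores: NOT the published Bowie et al. values.
ENV_SCORES = {
    "buried_helix": {"L": 1.2, "I": 1.0, "K": -1.5, "R": -1.3, "P": -0.8},
    "exposed_coil": {"L": -0.9, "I": -1.0, "K": 0.8, "R": 0.7, "P": 0.6},
}


def environment_score(target, template_envs, table=ENV_SCORES, unknown=0.0):
    """Sum the residue-in-environment scores over aligned positions.

    target        -- target residues, one per aligned template position
    template_envs -- environment class pre-assigned to each template position
    unknown       -- score used for residues missing from the toy table
    """
    if len(target) != len(template_envs):
        raise ValueError("target and template must be aligned position by position")
    return sum(table[env].get(aa, unknown) for aa, env in zip(target, template_envs))
```

With the template environments buried, exposed, exposed, buried, the fragment LKRI places hydrophobic residues in the buried positions and scores well, while KLIR places them in the exposed positions and scores poorly.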
A Fold Recognition Example - 3D-PSSM
1. The Fold Library
PSI-BLAST and a non-redundant database are
used to create sequence profiles for each of the folds in the library.
Structurally similar folds and their profiles are merged to form a “3D-profile”.
2. Secondary Structure and Solvation
Secondary structure is assigned to each fold based on the annotation in the STRIDE
database.
Each residue is also assigned a solvation potential, calculated as the ratio between the
surface area accessible to solvent and its overall surface area.
Solvation potential is divided into 21 bins, ranging from 0% (buried) to 100% (exposed).
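The binning step can be sketched as follows. The slide gives only the number of bins and the range. The even 5% spacing (bin centres at 0%, 5%, ..., 100%) and the function names are assumptions made for this illustration.

```python
def relative_accessibility(accessible_area: float, total_area: float) -> float:
    """Ratio of solvent-accessible to overall surface area, clamped to [0, 1]."""
    if total_area <= 0:
        raise ValueError("total surface area must be positive")
    ratio = accessible_area / total_area
    return min(max(ratio, 0.0), 1.0)


def solvation_bin(rel_acc: float, n_bins: int = 21) -> int:
    """Map a relative accessibility in [0, 1] to the nearest of n_bins bins.

    Assumes evenly spaced bin centres, so 21 bins correspond to 5% steps.
    """
    if not 0.0 <= rel_acc <= 1.0:
        raise ValueError("relative accessibility must lie between 0 and 1")
    return int(rel_acc * (n_bins - 1) + 0.5)
```

A residue with 30 of its 120 square ångströms exposed has a relative accessibility of 25% and falls in bin 5 under this scheme.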
3. Query Sequences
Query sequences have their secondary structure predicted by PSI-Pred.
PSI-BLAST profiles are also generated for the
query sequence to allow “bi-directional scoring”.
4. Fold recognition
The target sequence is aligned to folds from the
fold library using the 3D-FSSP dynamic programming algorithm.
Sequence-structure alignments are scored by a
range of 1D and 3D profiles, secondary structure matching, solvation potentials and residue
equivalence.
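The alignment step can be illustrated with a generic dynamic programming sketch. This is a plain Needleman-Wunsch global alignment with a linear gap penalty, not the algorithm used by 3D-PSSM. The `position_score` callback is a hypothetical stand-in for whatever combination of profile, secondary structure and solvation terms a real method uses.

```python
def align_to_fold(target, n_positions, position_score, gap=-1.0):
    """Best global alignment score of a target sequence against a fold.

    target         -- target amino acid sequence
    n_positions    -- number of residue positions in the template fold
    position_score -- function (residue, position_index) -> score
    gap            -- linear penalty applied per gapped position
    """
    rows, cols = len(target) + 1, n_positions + 1
    best = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = i * gap  # target residues aligned to nothing
    for j in range(1, cols):
        best[0][j] = j * gap  # template positions left unfilled
    for i in range(1, rows):
        for j in range(1, cols):
            best[i][j] = max(
                best[i - 1][j - 1] + position_score(target[i - 1], j - 1),
                best[i - 1][j] + gap,   # gap in the template
                best[i][j - 1] + gap,   # gap in the target
            )
    return best[-1][-1]
```

Running the same function over every fold in the library and ranking the scores gives the basic shape of a fold recognition search.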
Fold Recognition Servers
3D-PSSM - www.sbg.bio.ic.ac.uk/~3dpssm/
Based on sequence profiles, solvation potentials and secondary structure.
mGenTHREADER - www.psipred.net/
Combines profiles and sequence-structure alignments. A neural network-
based jury system calculates the final score based on solvation and pair
potentials.
RAPTOR - software.bioinformatics.uwaterloo.ca/~raptor/
Best-scoring server in CAFASP3 competition in 2002. ACE server (based
on Raptor) best FR server in CASP6. You have to ask to use it first ...
SPARKS - http://sparks.informatics.iupui.edu/
Top server in CASP6. The name stands for Sequence, secondary structure Profiles And
Residue-level Knowledge-based Score for fold recognition.
Consensus Fold Recognition
No one method can hope to correctly identify every fold. In fact the best
predictions are often those on which several servers agree.
Human experts have recognised this, and generally they are better at fold
prediction than machines.
Human experts usually use several different fold recognition methods and
predict folds after evaluating all the results (not just the top hits) from a
range of methods.
So why not produce an algorithm that mimics the human experts?
The first consensus server, Pcons, sent the target sequence to six publicly
available fold recognition web servers.
Predictions were structurally superimposed and evaluated for their
similarity. The best model was predicted from similarity to other predicted
models.
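The selection rule described above (choose the model most similar to the others) can be sketched in a few lines. This is an illustration of the idea only, not the Pcons implementation. Here each model is represented by a string of per-residue states and compared position by position, a toy stand-in for a real structural superposition score.

```python
def fraction_identical(model_a: str, model_b: str) -> float:
    """Toy similarity: fraction of positions with the same per-residue state."""
    if len(model_a) != len(model_b):
        raise ValueError("models must cover the same residues")
    return sum(a == b for a, b in zip(model_a, model_b)) / len(model_a)


def pick_consensus(models: dict, similarity=fraction_identical) -> str:
    """Return the name of the model with the highest mean similarity to the rest."""
    if len(models) < 2:
        raise ValueError("need at least two models to form a consensus")

    def mean_similarity(name):
        others = [other for other in models if other != name]
        return sum(similarity(models[name], models[o]) for o in others) / len(others)

    return max(models, key=mean_similarity)
```

A model that several servers roughly agree on wins over an outlier, even if the outlier had the best score from its own server.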
Consensus Fold Recognition Servers
3D Jury - http://bioinfo.pl/meta/
3D Jury is a consensus predictor that utilizes the results of fold
recognition servers, such as FFAS, 3D-PSSM, FUGUE and
mGenTHREADER, and uses a jury system to select alignments and
templates. Models are built with Modeller.
GeneSilico - http://genesilico.pl/meta/
A gateway to various methods for protein structure prediction. Domains
are identified by HmmPfam, and there are several methods for secondary
and tertiary structure (FR) prediction. Consensus predictions are made
with the Pcons consensus server and you can also send a subset of
alignments to the FRankenstein server.
Pcons - www.sbc.su.se/~arne/pcons/
Pcons was the first consensus server for fold recognition. It has been
relaunched recently.
Ab initio and de novo protein modelling
• Rather than using previously solved structures, ab initio methods build 3D
models from physical principles such as energy functions, or try to mimic
protein folding.
• Ab initio methods work best on very small proteins. They require vast
computational resources and the physical basis of protein structural
stability and the necessary energy functions are not fully understood.
• De novo modelling is a form of ab initio modelling that uses protein
fragments to build models that are evaluated by physical principles and
statistical properties. De novo modelling has had some notable successes.
Again this works best on smaller proteins.
De Novo Servers
ROBETTA - http://robetta.bakerlab.org/
ROBETTA makes both ab initio and template-based predictions. It
detects fragments with BLAST, FFAS03, or 3D Jury and uses
fragment insertion and assembly.
FoldPRO - http://www.igb.uci.edu/?page=tools&subPage=psss
A server that attempts to assemble fragments in a similar way to
Robetta.