Protein Analysis - Part 2
Bioinformatic tools for identification and characterization of proteins
Part 1
- Similarity searches
- Motif (pattern and profile) searches
- Protein domain databases
- Primary structure analysis
Part 2
- Secondary structure prediction
- Tertiary structure and modelling
- Proteomics
Secondary structure - alpha-Helix
Properties of the α-helix.
The structure repeats itself every 5.4 Å along the helix axis, i.e. the α-helix has a pitch of 5.4 Å. α-helices have 3.6 amino acid residues per turn, i.e. a helix 36 amino acids long would form 10 turns.
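The arithmetic above can be checked with a short sketch. The two constants come from the slide; the function name `helix_geometry` is only an illustrative choice.

```python
PITCH_ANGSTROM = 5.4     # rise along the helix axis per full turn (from the text)
RESIDUES_PER_TURN = 3.6  # residues per turn (from the text)


def helix_geometry(n_residues):
    """Return (number of turns, axial length in Å) for an ideal alpha-helix."""
    turns = n_residues / RESIDUES_PER_TURN
    length = turns * PITCH_ANGSTROM
    return turns, length


turns, length = helix_geometry(36)
rise_per_residue = PITCH_ANGSTROM / RESIDUES_PER_TURN  # 1.5 Å per residue
print(f"{turns:.1f} turns, {length:.1f} Å, {rise_per_residue:.2f} Å per residue")
```

The rise per residue (pitch divided by residues per turn) follows directly from the two numbers given on the slide.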
Secondary Structure - β-Sheet
The β-sheet structure
In a β-sheet two or more polypeptide chains run alongside each other and are linked in a regular manner by hydrogen bonds between the main chain C=O and N-H groups. Therefore all hydrogen bonds in a β-sheet are between different segments of polypeptide. This contrasts with the α-helix, where all hydrogen bonds involve the same element of secondary structure.
Secondary structure - Reverse turns
A reverse turn is a region of the polypeptide having a hydrogen bond from one main chain carbonyl oxygen to the main chain N-H group 3 residues along the chain (i.e. O(i) to N(i+3)). Helical regions are excluded from this definition, and turns between β-strands form a special class of turn known as the β-hairpin.
How can secondary structures be predicted
- Statistics or stereochemical principles of amino acids
- Multiple alignments: the conserved structures are (mostly) buried in the core, the variable regions lie outside
- Solvent accessibility patterns correspond to specific secondary structures
- Replace expert knowledge by neural networks
How are the secondary structures detected in a PDB file
The figure below shows the three main chain torsion angles of a polypeptide. These are phi (φ), psi (ψ), and omega (ω).
Omega is fixed because the peptide bond is planar.
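As an illustration of how such torsion angles can be computed from atomic coordinates, here is a minimal sketch using only the standard library. The function name `dihedral` and the test coordinates are illustrative assumptions, not part of the original slides. For residue i, phi is the torsion C(i-1)-N(i)-CA(i)-C(i), psi is N(i)-CA(i)-C(i)-N(i+1), and omega is CA(i)-C(i)-N(i+1)-CA(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) defined by four points, about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not from a real structure): the central bond lies along z.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

With a planar trans peptide bond, omega computed this way would come out close to 180 degrees, which is why only phi and psi are needed to describe the backbone conformation.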
[Figure: plot with regions labelled alpha and beta]
How are the secondary structures detected in PDB
Hydrogen bonds (3₁₀ helix: i, i+3; alpha helix: i, i+4; etc.)
Criteria used to identify a hydrogen bond:
• the proton-acceptor distance is less than 2.4 Angstroms and the angle between the proton-donor bond and the line connecting the donor and acceptor atoms is less than 35 degrees (e.g., see Berndt et al., 1993).
• the proton-acceptor distance is less than 2.5 Angstroms and the angle is between +/- 90 and 180 degrees (Baker & Hubbard, 1984).
• the energy defined by an electrostatic potential function is less than a cutoff value (E < -0.5 kcal/mol; see Kabsch & Sander, 1983), allowing a distance of up to 5.2 Angstroms and a misalignment of up to 63 degrees at the ideal length (2.9 Å).
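The energy criterion in the last bullet can be sketched as follows. The functional form and constants (partial charges 0.42e and 0.20e, factor 332) are those of the Kabsch & Sander electrostatic model. The function names and the toy coordinates below are illustrative assumptions: a collinear N-H...O=C arrangement with an N...O distance of 2.9 Å.

```python
import math

Q1, Q2 = 0.42, 0.20   # partial charges (units of e) on C=O and N-H
F = 332.0             # conversion factor giving kcal/mol for distances in Å
CUTOFF = -0.5         # kcal/mol, the cutoff quoted in the text


def hbond_energy(c, o, n, h):
    """Electrostatic energy (kcal/mol) between a C=O group and an N-H group."""
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)


def is_hbond(c, o, n, h):
    """True if the energy falls below the cutoff."""
    return hbond_energy(c, o, n, h) < CUTOFF


# Toy geometry on the x axis: C=O is 1.23 Å, H...O is 1.9 Å, N-H is 1.0 Å.
c, o, h, n = (0, 0, 0), (1.23, 0, 0), (3.13, 0, 0), (4.13, 0, 0)
e_ideal = hbond_energy(c, o, n, h)

# The same groups moved 10 Å further apart no longer count as bonded.
h_far, n_far = (13.13, 0, 0), (14.13, 0, 0)
e_far = hbond_energy(c, o, n_far, h_far)
print(f"ideal: {e_ideal:.2f} kcal/mol, far: {e_far:.3f} kcal/mol")
```

An energy cutoff of this kind is more tolerant of imperfect geometry than the pure distance-and-angle criteria in the first two bullets.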
Secondary structure prediction
Best programs in the CASP2 and CASP3 contests:
• DSC (King and Sternberg, ICRF, London; imp. HUSAR)
• PHD (B. Rost, Columbia.edu)
• PSIPRED (D. Jones, Warwick, UK)
also available:
• Predator (Argos, EMBL)
• Foldclass (HUSAR)
• GORIV (IBCP.FR)
• HNN (IBCP.FR)
• NNSSP (Solovyev, Sanger Center)
Algorithm PHD
multiple alignments of homologous proteins, looked up through database searching
2 neural networks trained by backpropagation, each with a single hidden layer
multi-level-system (also using solvent plots, transmembrane domain predictions,…)
Neural Networks - learning with a neural network
[Figure: a training set and a test set; a sequence window (GLCRVLLKP) is mapped to the three output states Helix, Sheet, Coil]
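Neural Networks 2 - learning with a neural network
Sequence information (one alignment column per row):
...
AAA
AA.
LLL
AAG
CCS
...
Profile (percent per alignment column):
         A    C    L    G    S   in  del
AAA    100    0    0    0    0    0    0
AA.    100    0    0    0    0    0   33
LLL      0    0  100    0    0    0    0
....
Input: (20+2) values per position × window
Output states: H, E, C
First neural network: sequence to structure
Second neural network: structure to structure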
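The profile rows shown on this slide can be reproduced with a small sketch. It assumes, as the example rows suggest, that residue percentages are taken over the non-gap residues of a column and that "del" is the percentage of sequences with a gap at that position. The "in" value cannot be derived from a single column, so it is left at 0 here as a placeholder. The function name `column_profile` is hypothetical, and this is not the PHD implementation itself.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
GAP = "."                             # gap symbol used in the slide's example


def column_profile(column):
    """Return a dict with 20 residue percentages plus 'in' and 'del' (22 values)."""
    n_sequences = len(column)
    residues = [aa for aa in column if aa != GAP]
    profile = {}
    for aa in AMINO_ACIDS:
        if residues:
            profile[aa] = round(100 * residues.count(aa) / len(residues))
        else:
            profile[aa] = 0
    profile["in"] = 0  # placeholder: insertions are not visible in one column
    profile["del"] = round(100 * (n_sequences - len(residues)) / n_sequences)
    return profile


for col in ["AAA", "AA.", "LLL", "AAG", "CCS"]:
    p = column_profile(col)
    shown = {k: v for k, v in p.items() if v}  # print only non-zero entries
    print(col, shown)
```

Concatenating these 22 values for every position in the sequence window gives the (20+2) × window input vector mentioned on the slide.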
Secondary structure prediction PHD
Expected average accuracy > 72% for the three states helix, strand and loop (Rost & Sander, PNAS, 1993, 90, 7558-7562; Rost & Sander, JMB, 1993, 232, 584-599; Rost & Sander, Proteins, 1994, 19, 55-72; evaluation of accuracy).
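A per-residue three-state accuracy of this kind is commonly computed as the percentage of residues whose predicted state matches the observed one. A minimal sketch follows; the function name and the two short example strings are made up for illustration and are not taken from the cited papers.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example: H = helix, E = strand, C = coil/loop.
predicted = "HHHHCCEEEE"
observed = "HHHCCCEEEC"
accuracy = three_state_accuracy(predicted, observed)
print(f"{accuracy:.1f}% of residues predicted correctly")
```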
Algorithm DSC
Sequences in homologous alignments are described by:
• residue conformational propensities
• sequence edge effects
• moments of hydrophobicity
• position of insertions and deletions in aligned homologous sequences
• moments of conservation
• auto-correlation
• residue ratios
• secondary-structure feedback effects
• filtering
The learning method is standard linear discrimination.
Dataset: 496 proteins
Secondary structure prediction DSC
DSC (Ross D. King & Michael J.E. Sternberg (Protein Science 5:2298-2310, 1996))
Input should be an alignment (also single sequence possible).
If input is a set of multiply aligned homologous sequences, DSC has an overall per residue three-state accuracy of ~70%
Algorithm PSIPRED
PSI-BLAST is used to obtain a profile, i.e. information about conserved regions and insertions/deletions
2 neural networks, each with a single hidden layer, like PHD
Secondary structure prediction PSIPRED
Jones, D. T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292:195-202.
Accuracy ca. 74%
Secondary structure prediction - Evaluation
Ross D. King et al. (2000) Protein Engineering, 13, 15-19.
Is it better to combine predictions?
Yes!
Combined NNSSP, PHD, DSC and Predator.
Secondary structure prediction - Evaluation
Results
CASP3 test proteins
Method            Accuracy (%)
NNSSP             70.5
PHD               70.3
DSC               70.2
PREDATOR          68.3
Combined methods:
Simple vote       73.3
Linear Discr.     72.7
Neural Network    73.3
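One way to read "simple vote" is a per-residue majority vote over the individual predictions. The sketch below follows that reading. The tie-breaking rule (fall back to coil) and the four example strings are assumptions made for illustration, not details taken from King et al.

```python
from collections import Counter


def simple_vote(predictions, tie_state="C"):
    """Combine equal-length H/E/C prediction strings by majority vote per position."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        state, count = ranked[0]
        # Assumed rule: if two states share the top count, fall back to tie_state.
        if len(ranked) > 1 and ranked[1][1] == count:
            state = tie_state
        consensus.append(state)
    return "".join(consensus)


# Hypothetical predictions from four methods for a 4-residue stretch.
methods = ["HHEC", "HHCC", "HECC", "CECC"]
combined = simple_vote(methods)
print(combined)
```

In this toy case the second position is split 2:2 between helix and strand, so the assumed tie rule assigns coil.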
NPS - Network Protein Sequence Analysis
http://pbil.ibcp.fr/
TIBS 2000 March Vol. 25, No 3 [291]:147-150
Combet C., Blanchet C., Geourjon C. and Deléage G.
Besides other tools: consensus protein secondary structure prediction. Programs used:
• SOPM (Geourjon and Deléage, 1994)
• SOPMA (Geourjon and Deléage, 1995)
• HNN (Guermeur, 1997)
• MLRC (Guermeur et al., 1999)
• DPM (Deléage and Roux, 1987)
• DSC (King and Sternberg, 1996)
• GOR I (Garnier et al., 1978)
• GOR III (Gibrat et al., 1987)
• GOR IV (Garnier et al., 1996)
• PHD (Rost and Sander, 1993)
• PREDATOR (Frishman and Argos, 1996)
• SIMPA96 (Levin, 1997)
JPRED - Consensus secondary structure prediction (http://jura.ebi.ac.uk:8888)
Cuff J. A., Clamp M. E., Siddiqui A. S., Finlay M., Barton G. J., Jpred: A Consensus Secondary Structure Prediction Server, Bioinformatics, 14:892-893 (1998)
Input: single sequence in RAW or PIR format, or multiple sequence alignment in MSF or BLC format
Output: 3-state secondary structure prediction, in coloured HTML, PS, Java, ASCII output
Prediction methods:
• PHD (Rost and Sander, 1993)
• DSC (King and Sternberg, 1996)
• PREDATOR (Frishman and Argos, 1996)
• NNSSP (Salamov, A. A. & Solovyev, V. V., 1995)
• MULPRED (Barton (1994), unpublished)
• ZPRED (Zvelebil et al., 1987)
• JNET (Cuff J. and Barton G. J., 1999)
• JNETsolacc
• COILS (Lupas, A., 1996)
• MULTICOIL (Wolf E., Kim P. S., Berger B., 1997)
• PHDhtm (Rost and Sander, 1993)
Secondary structure prediction - Example
PDB:1bnk and NNSSP
1 50
MVTPALQMKKPKQFCRRMGQKKQRPARAGQPHSSSDAAQAPAEQPHSSSD
CCCCCCCCCCCCCHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
51 100
AAQAPCPRERCLGPPTTPGPYRSIYFSSPKGHLTRLGLEFFDQPAVPLAR
CCCCCCCCCCCCCCCCCCCCCEEEEECCCCCCCCCCCCCCCCCCCHHHHH
101
AFLGQVLVRRLPNGTELRGRIVETEAYLGPEDEAAHSRGGRQTPRNRGMF
HHHCCEEEEECCCCEEEEEEEEEEEECCCCCCHHHCCCCCCCCCCCCEEE
151
MKPGTLYVYIIYGMYFCMNISSQGDGACVLLRALEPLEGLETMRHVRSTL
ECCCCEEEEEEECEEEEEEEECCCCCHEEEEECCCCCCHHHHHHHHHHCC
201
RKGTASRVLKDRELCSGPSKLCQALAINKSFDQRDLAQDEAVWLERGPLE
CCCEEEEEECCCCCCCCCHHHHHHHCCCCCCCCCCCCCCCCEEEECCCCC
251
PSEPAVVAAARVGVGHAGEWARKPLRFYVRGSPWVSVVDRVAEQDTQA
CCCCCEEEEEEEECCCCCCCCCCCEEEEEECCCCEEEEECHHHCCCCC
Secondary structure records of PDB:1bnk:
HELIX 88-91, 95-102, 138-140, 146-150, 190-197, 211-213,
HELIX 218-224, 229-231, 265-271, 290-293
SHEET 242-245, 106-110, 116-122, 184-188, 156-161,
SHEET 165-173, 178-183, 123-127,
SHEET 274-279, 256-260
Secondary structure prediction - Example
PDB:1bnk and NPS (1)
                  10        20        30        40        50        60        70
                   |         |         |         |         |         |         |
bnkxxx0   MVTPALQMKKPKQFCRRMGQKKQRPARAGQPHSSSDAAQAPAEQPHSSSDAAQAPCPRERCLGPPTTPGP
DPM       ccechhhhhtcchhhhhhccccchchhhctctttcchhhhhhhtctttcchhhhccchhhcctccccctc
DSC       ccccccccccchhhhhhhhhcccccccccccccccccccccccccccccccccccccccccccccccccc
GOR4      ccccccccccccchhhhhcccccccccccccccchhhhhccccccccccccccccccccccccccccccc
HNNC      cccccccccchhhhhhhhcccccccccccccccccccccccccccccccccccccccccccccccccccc
PHD       ccchhhccccccccccccccccccccccccccccccccccccccccccchhhcccccccccccccccccc
Predator  ccccccccccccchhhhhcccccccccccccccccccccccccccccccccccccccccccccccccccc
SIMPA96   cccchhhhhhhhhhhhhhcccccccccccccccccccccccccccccccccccccccccccccccccccc
SOPM      eecchhhcccthhhhhhhtccccccccccccccccccccccccccccccccccccccccccccccccccc
Sec.Cons. cccc??ccccc?hhhhhhcccccccccccccccccccccccccccccccccccccccccccccccccccc

                  80        90       100       110       120       130       140
                   |         |         |         |         |         |         |
bnkxxx0   YRSIYFSSPKGHLTRLGLEFFDQPAVPLARAFLGQVLVRRLPNGTELRGRIVETEAYLGPEDEAAHSRGG
DPM       ccceeecttccccehhchhhhhhchhchhhhhheeeeeehctcccchhcheehhhhhctccchhhttttc
DSC       cceeeecccccccccccccccccchhhhhhhhhhhhhhhhhhcccccceeeeeeeecccccchhhhhccc
GOR4      ceeeeecccccccccccccccccccchhhhhhhhhhhhhcccccccccchhhhhhhcccchhhhhhhccc
HNNC      cceeeecccccchhheehhhcchhhhhhhhhhccchehhecccccceceeeeeeecccccccchhhhccc
PHD       eeeeeeccchhhccccccccccchhhhhhhhhhhhhhhhhccccceccceeeeeeecccccchhhhcccc
Predator  ceeeeecccccccccccccccccccchhhhhhhhhhheeeccccccccceeeecccccccccccccccccc
SIMPA96   ceeeeecccccceeeccccccccchhhhhhhhhhhhhhhhccccccccceeeecccccccchhhhhcccc
SOPM      ceeeeeccttcceeeeeeeeccccccchhhhhhhhheeecccttccccteeeehhheccccchhhhcttc
Sec.Cons. ceeeeecccccccccccccccccchhhhhhhhhhhhhhhhccccccccceeee???cccccchhhhcccc
Secondary structure prediction - Example
PDB:1bnk and NPS (2)
                 150       160       170       180       190       200       210
                   |         |         |         |         |         |         |
bnkxxx0   RQTPRNRGMFMKPGTLYVYIIYGMYFCMNISSQGDGACVLLRALEPLEGLETMRHVRSTLRKGTASRVLK
DPM       ccctttcchhhtccceeeeeeeeeeeeeeettttccheehhhhhhchhchhhhhheeeeehccchchehh
DSC       cccccceeeecccccceeeeeccceeeeecccccccchhhhhhhhhhhhhhhhhhhhhhccccccccccc
GOR4      cccccccceeccccceeeeeeeeeeeeeeeeecccchhhhhhhccchhhhhhhhhhhhhccccchhhhhh
HNNC      ccccccccceccccceeeeeehhheeeecccccccchhhhhhhhccccchhhhhhhhhhccccccheeec
PHD       cccccccceeeccceeeeeeeeeeeeeeeeeecccchhheecccccccchhhhhhhhhhhcccccccccc
Predator  cccccccccccccceeeeeeeeeeeeeeeecccccchhhhhhhhhhccccchhhhhhhhhccccchhhhh
SIMPA96   cccccccccccccceeeeeeecceeeeeccccccccceehhhhhccchhhhhhhhhhhhhhccccchhhh
SOPM      cccccccceeecttceeeeeeeeeeeeeeecccccchheehhhhhhhhhhhhhhhhhhhhhttchhhhhh
Sec.Cons. cccccccceeccccceeeeeeeeeeeeeeecccccchhhhhhhh?c?h?hhhhhhhhhhhccccc?hhhh

                 220       230       240       250       260       270       280
                   |         |         |         |         |         |         |
bnkxxx0   DRELCSGPSKLCQALAINKSFDQRDLAQDEAVWLERGPLEPSEPAVVAAARVGVGHAGEWARKPLRFYVR
DPM       hhhhtctccchhhhhhhtcththhhhhhhhhhhhhhccctcchchehhhhhecectcchhhhhchheeee
DSC       cccccccccchhhhhhhhhccchhhhhhhhhhhhhcccccccccceeeeecccccccccchhccceeeec
GOR4      cccccccccchhhhhhhccchhhhhhhhhhhhhhhcccccccchhhhhhhhhccccccccccccceeeee
HNNC      cccccccchhhhhhhhhccccccchhhhhhhhhhhcccccccchhhhhhhhhccccchhhhccceeeeee
PHD       cccccccchhhhhhhhhccccccchhhhhcceeeeccccccccchhhhhhhcccccccccccccceeeec
Predator  hhcccccccchhhhhhhhcccchhhhhhhhhhhhhcccccccchhhhhhhhhhccccccccccceeeeee
SIMPA96   cccccccchhhhhhhhhcccccccccchhhhhhhccccccccchhhhhhhhhcccccccccccceeeeec
SOPM      htccccccchhhhhhhhhhccchhhhhhhhheeecccccccccchhhhhhhetcccccccccccceeeet
Sec.Cons. ccccccccc?hhhhhhhccccchhhhhhhhhhhhhcccccccc?hhhhhhhhccccccccccccceeeee
Fold classes
Notation for fold super classes: 1:all-alpha 2:alpha*beta 3:alpha+beta 4:all-beta
Notation for fold-classes (names as in Pascarella & Argos, 1992): 1:gap 2:cytc 3:hmr 4:wrp 5:ca_bind 6:globin 7:lzm 8:crn 9:cyp 10:ac_prot 11:pap 12:256b 13:hoe 14:sns 15:ferredox 16:cpp 17:pgk 18:xia 19:kinase 20:binding 21:tln 22:barrel 23:inhibit 24:pti 25:plasto 26:cts 27:rdx 28:plipase 29:virus 30:virus_prot 31:cpa 32:dfr 33:igb 34:il 35:fxc 36:sbt 37:gcr 38:tox 39:wga 40:eglin 41:ltn 42:s_prot 43:membrane 44:nbd
Fold class prediction - FoldClass
FoldClass (HUSAR) predicts protein fold classes and protein domains from sequence data. The predictions are generated by artificial neural networks (Reczko, M. and Bohr, H. Nucl. Ac. Res. 22: 3616-3619 (1994)).
This program predicts:
• a specific overall fold-class,
• a super fold-class with respect to secondary structure content and spatial distribution,
• optionally, a profile of possible fold-classes along the sequence.
Fold class prediction - Threader2.3
Jones DT. THREADER: Protein sequence threading by double dynamic programming, Comp. methods in Mol Biol. New York: Elsevier 1998.
http://globin.bio.warwick.ac.uk/~jones/threader.html
Algorithm:
• A library of unique protein domain folds is derived from PDB
• The test sequence is optimally fitted to all folds (allowing insertions/deletions)
• The energy of each possible fit is calculated by summing interactions and solvation parameters
• The lowest energy fold is taken
Output: Fold class, domain folds
Special secondary structure programs in HUSAR
Transmembrane regions - TMHMM
Helix-turn-helix elements - HTHscan
Coiled-coil regions - Coilscan
Amphipathic helices - Amphi, Net, Wheel
Globular and nonglobular regions - SEG
Protein simulation and modelling
Protein modelling:
• Sequence -> Model structure, homology modelling, threading
• Modelling ligands into an active site
• Docking
Protein simulation:
• Molecular dynamics - follow the thermal motions of the structure with time
• Prediction and information about reaction mechanisms
• Prediction of binding energies, pKa's, spectra, etc.
Requirements for protein modelling and simulation
• Structural data for proteins from the PDB (Brookhaven Protein Data Bank)
• Potential energy for any protein conformation - potential energy function (PEF)
Protein modelling method
I have a protein sequence: can I predict its structure?
Homology modelling
Quick and easy!
Use the SWISS-MODEL server: HTTP://www.expasy.ch/swissmod/SWISS-MODEL.html
SWISS-MODEL is an Automated Protein Modelling Server running at the GlaxoWellcome Experimental Research in Geneva, Switzerland.
Disclaimer
The result of any modelling procedure is NON-EXPERIMENTAL and MUST be considered with care. This is especially true since there is no human intervention during model building.
New 3D modelling server Geno3d: HTTP://geno3d-pbil.ibcp.fr/
Swiss Model steps
1. Identification of modelling templates: BLAST or FASTA against the sequences of PDB.
2. Alignment of the target sequence with the template sequence(s).
3. Framework construction: averaging the position of each atom in the target sequence, based on the location of the corresponding atoms in the template.
4. Building the nonconserved loops, the backbone and the side chains.
5. Model refinement: energy minimisation.
Energy Minimisation - Start
Calculate the potential energy for a given molecule (atom coordinates):
set of nuclear positions of all atoms = R
Energy Minimisation - Method
We move the molecule so as to reduce its potential energy.
There are several routines to do this:
- Steepest descent
- Conjugate gradient
- and more
Unfortunately no technique can guarantee to find the global energy minimum of a complex problem (although simulated annealing is a partial solution).
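A minimal steepest descent sketch illustrates both the method and the caveat about the global minimum. The one-dimensional double-well energy below is a toy function chosen for illustration, not a real potential energy function. The step size, tolerance and function names are likewise assumptions.

```python
def steepest_descent(gradient, r, step=0.01, tol=1e-8, max_iter=100_000):
    """Move coordinates r against the gradient until the gradient is (almost) zero."""
    r = list(r)
    for _ in range(max_iter):
        g = gradient(r)
        if max(abs(gi) for gi in g) < tol:
            break
        r = [ri - step * gi for ri, gi in zip(r, g)]
    return r


def toy_energy(r):
    """Toy double-well 'energy' with one deep and one shallow minimum."""
    x = r[0]
    return x**4 - 3 * x**2 + x


def toy_gradient(r):
    """Analytical derivative of toy_energy."""
    x = r[0]
    return [4 * x**3 - 6 * x + 1]


# Two different starting structures end up in two different minima.
from_left = steepest_descent(toy_gradient, [-2.0])
from_right = steepest_descent(toy_gradient, [2.0])
print(f"start -2.0 -> x = {from_left[0]:.3f}, E = {toy_energy(from_left):.3f}")
print(f"start +2.0 -> x = {from_right[0]:.3f}, E = {toy_energy(from_right):.3f}")
```

The run started at +2.0 stops in the shallower well because steepest descent only ever moves downhill. This is the local-minimum problem noted above.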
Modelling Programs
WHATIF, INSIGHTII, ...
GROMOS, DISCOVER, ...
Model
SWISS-3DIMAGE is an image database which strives to provide high quality pictures of biological macromolecules with known three-dimensional structure. The database contains mostly images of experimentally elucidated structures, but also provides views of well accepted theoretical protein models. The images are provided in several useful formats; both mono and stereo pictures are generally available.
Viewers: Rasmol, Kinemage, ...
Applicability of model structures
1. Models which are based on incorrect alignments between target and template sequences. Such alignment errors generally reside in the inaccurate positioning of insertions and deletions. It is however often possible to correct such errors by producing several models based on alignment variants and by selecting the most "sensible" solution. Nevertheless, such models are often useful as the errors are not located in the area of interest, e.g. a conserved active site.
2. Models based on correct alignments are much better, but their accuracy can still be medium to low. They are very useful tools for the design of mutagenesis experiments, but of very limited assistance during detailed ligand binding studies.
3. The last category of models comprises all those which were built based on templates which share a high degree of sequence identity (> 70%) with the target. They have proven useful during drug design projects.
Molecule Simulation - Molecular Dynamics
- The starting point for most simulations is the experimental crystal or NMR structure.
- This is energy minimized and solvated in a box of water.
- The system is heated (high energy state).
- Equilibration and simulation for 1 nanosecond.
The detailed atomic motions are usually unimportant. What really matters are the "ensemble average" properties, i.e. what happens on average (MD is in fact chaotic, with sensitive dependence on initial conditions, like the weather!).
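To give a feel for what "following the motions with time" means numerically, here is a minimal sketch of one common integration scheme, velocity Verlet. It is applied to a single particle on a harmonic spring. The system, units, time step and function names are all illustrative assumptions, and the slides do not say which integrator the listed programs use.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations of motion; returns the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


K, MASS = 1.0, 1.0  # toy spring constant and mass (arbitrary units)


def spring_force(x):
    return -K * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * K * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=spring_force, mass=MASS, dt=0.01, n_steps=1000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
mean_kinetic = sum(0.5 * MASS * v * v for _, v in traj) / len(traj)
print(f"E(start) = {e_start:.6f}, E(end) = {e_end:.6f}, <E_kin> = {mean_kinetic:.3f}")
```

The time average of the kinetic energy is an example of the kind of averaged property the slide refers to. The individual positions along the trajectory matter much less.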
Molecular Dynamics - Disadvantage
A disadvantage of conventional molecular dynamics procedures is that they can only tackle motions with a relatively short time scale - a few nanoseconds is the approximate upper limit with current computers.
What is Proteomics?
Proteomics
Proteomics is the analysis of the proteome of a given cell or tissue.
Proteome
The proteome consists of all proteins which are expressed from a certain genome or in a certain tissue under certain conditions. The proteome changes with the aging or development of a cell or tissue; unlike the genome, it is not static.
Proteomics
1. Differences in protein expression depending on time
2. Differences in protein expression depending on tissue
3. Differences in protein expression depending on organism
2D-Gels > Swiss 2D database (www.expasy.ch)
Metabolic databases: Kegg (www.genome.ad.jp/kegg)
Protein analysis