COT 6930 HPC and Bioinformatics
Protein Structure Prediction
Xingquan Zhu
Dept. of Computer Science and Engineering
[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype, annotated with the associated databases: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]
Outline
• Protein Structure
  • Why structure
• How to predict protein structure
  • Experimental methods
  • Computational methods (predictive methods)
• Protein Structure Prediction
  • Secondary structure prediction (2D)
    • Machine learning methods for protein secondary structure prediction
  • Tertiary structure prediction (3D)
    • Ab initio
    • Homology modeling
Proteins
Proteins play a crucial role in virtually all biological processes with a broad range of functions.
The activity of an enzyme or the function of a protein is governed by the three-dimensional structure.
Protein Structure is Hierarchical
Protein Structure Video
http://www.youtube.com/watch?v=lijQ3a8yUYQ
Primary Structure: Sequence
The primary structure of a protein is the amino acid sequence
Protein Structure Prediction Problem
Protein structure prediction
• Predict protein 3D structure from (amino acid) sequence
• One step closer to useful biological knowledge
• Sequence → secondary structure → 3D structure → function
Why Predict Structure?
Structure determines function
Molecular function
Structure is more conserved than sequence
Goals:
1. Predict structure from sequence
2. Predict function based on structure
3. Predict function based on sequence
Why predict structure: Structure is more conserved than sequence
28% sequence identity
Why predict structure: Can Label Proteins by Dominant Structure
SCOP: Structural Classification Of Proteins
Why predict structure: Large number of proteins vs. relatively small number of folds
Small number of unique folds found in practice
• 90% of proteins fall into < 1000 folds; estimated ~4000 total folds
http://www.rcsb.org/pdb/home/home.do
As of 02/05/2008: 48,878 structures
Examples of Fold Classes
How to Predict Protein Structure
A related biological question: what are the factors that determine a structure?
• Energy
• Kinematics
How can we determine structure?
• Experimental methods
  • X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
  • Limitations: protein size, need for crystallized proteins
• Computational methods (predictive methods)
  • 2-D structure (secondary structure)
  • 3-D structure (tertiary structure)
Geometry of Protein Structure
[Figure: protein backbone geometry, with the rotatable bonds labeled]
Inter-atomic Forces
Covalent bond (short range, very strong)
• Binds atoms into molecules / macromolecules
Hydrogen bond (short range, strong)
• Binds two polar groups (hydrogen + electronegative atom)
Disulfide bond / bridge (short range, very strong)
• Covalent bond between sulfhydryl (sulfur + hydrogen) groups
Hydrophobic / hydrophilic interaction (weak)
• Hydrogen bonding with H2O in solution
Van der Waals interaction (very weak)
• Nonspecific electrostatic attractive force
Types of Inter-atomic Forces
Quick Overview of Energy
Bond                         Strength (kcal/mole)
H-bonds                      3-7
Ionic bonds                  10
Hydrophobic interactions     1-2
Van der Waals interactions   1
Disulfide bridge             51
Protein Folding Animation
http://www.youtube.com/watch?v=fvBO3TqJ6FE
http://www.youtube.com/watch?v=swEc_sUVz5I
Two Related Problems in Structure Prediction
Directly predicting protein structure from the amino acid sequence has proved elusive.
Two sub-problems
• Secondary Structure Prediction
• Tertiary Structure Prediction
Secondary Structure Prediction (2D)
For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), t (others).
amino acid sequence
Secondary structure sequence
As of 2000, the accuracy of secondary structure prediction methods is nearly 80%.
Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
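An accuracy figure like the one above is usually reported per residue: the fraction of positions where the predicted state equals the observed one (often called Q3 for three states). A minimal sketch follows. The function name `q3_accuracy`, the ASCII stand-ins `a`, `b`, `t` for the three states, and the toy strings are illustrative assumptions, not data from the lecture.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy example (made up): a = helix, b = strand, t = other.
observed = "ttaaaaaattbbbbtt"
predicted = "ttaaaaatttbbbtbt"
print(f"Q3 = {q3_accuracy(predicted, observed):.4f}")
```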
PSSP: Protein Secondary Structure Prediction
Three Generations
• Based on statistical information of single amino acids
• Based on local amino acid interactions (segments). Typically a segment contains 11-21 amino acids
• Based on evolutionary information from homologous sequences
Secondary Structure preferences for Amino Acids
The normalized frequencies for each conformation were calculated from the fraction of residues of each amino acid that occurred in that conformation, divided by this fraction for all residues.
Random occurrence of a particular amino acid in a conformation would give a value of unity. A value greater than unity indicates a preference for a particular type of secondary structure.
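The normalization described above can be sketched in a few lines: for each (amino acid, conformation) pair, divide the fraction of that amino acid found in the conformation by the fraction of all residues found in it. The function name `propensities` and the tiny two-letter data set are assumptions made only for illustration.

```python
from collections import Counter


def propensities(sequence: str, states: str) -> dict:
    """Normalized frequency of each (amino acid, conformation) pair.

    Value = (fraction of this amino acid in the conformation)
            / (fraction of all residues in the conformation).
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    total = len(sequence)
    aa_counts = Counter(sequence)
    state_counts = Counter(states)
    pair_counts = Counter(zip(sequence, states))
    return {
        (aa, st): (n / aa_counts[aa]) / (state_counts[st] / total)
        for (aa, st), n in pair_counts.items()
    }


# Made-up data: 4 alanines (3 in helix 'a') and 4 glycines (1 in helix 'a').
table = propensities("AAAAGGGG", "aaatattt")
for key in sorted(table):
    print(key, round(table[key], 2))
```

A value above 1 (here alanine in state `a`) indicates a preference for that conformation; a value below 1 indicates avoidance.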
Machine learning methods for Protein Secondary Structure Prediction
• Introduction to classification
• Generalize protein secondary structure prediction as a machine learning problem
• Introduction to Neural Network
Classification and Classifiers
Given a database table DB with a set of attribute values and a special attribute C, called a class label.
Example:
A1  A2  A3  A4  C
1   1   m   g   Tumor
0   1   v   g   Normal
1   0   m   b   Normal
Classification and Classifiers
An algorithm is called a classification algorithm if it uses the data to build a set of patterns
• Decision rules or decision trees, etc.
• Those patterns are structured in such a way that we can use them to classify unknown sets of objects (unknown records).
For that reason (because of the goal), a classification algorithm is often simply called a classifier.
Classifier Example
Classification and Classifiers
Building a classifier consists of two phases: training and testing.
• In both phases we use data (a training data set and a disjoint test data set) for which the class labels are known for ALL of the records.
• The training data set is used to create patterns (rules, trees, or to train a neural network).
• The created patterns are evaluated on the test data, whose classification is known.
• The measure of a trained classifier's accuracy is called predictive accuracy.
Predictive Accuracy Evaluation
The main methods of predictive accuracy evaluations are:
• Re-substitution (N ; N)
• Holdout (2N/3 ; N/3)
• x-fold cross-validation (N-N/x ; N/x)
• Leave-one-out (N-1 ; 1)
where N is the number of instances in the dataset and each pair gives (training set size ; test set size)
The process of building and evaluating a classifier is also called supervised learning or, more recently when dealing with large databases, a classification method in data mining
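The evaluation schemes listed above differ only in how the N instances are divided into training and test indices. The sketch below generates those index sets. The function names are hypothetical, and no shuffling is done, which a real evaluation would normally add.

```python
def holdout_split(n: int, train_fraction: float = 2 / 3):
    """Holdout: first part for training, the rest for testing (no shuffling)."""
    cut = round(n * train_fraction)
    indices = list(range(n))
    return indices[:cut], indices[cut:]


def k_fold_splits(n: int, k: int):
    """Yield (train, test) index lists for k-fold cross-validation.

    With k == n this is leave-one-out.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


train, test = holdout_split(9)
print("holdout:", len(train), len(test))
for train, test in k_fold_splits(9, 3):
    print("3-fold:", sorted(train), test)
print("leave-one-out rounds:", sum(1 for _ in k_fold_splits(9, 9)))
```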
Classification Models: Different Classifiers
Typical classification models
• Decision Trees (ID3, C4.5)
• Nearest Neighbors
• Support Vector Machines
• Neural Networks
Most of the best classifiers for PSSP are based on the neural network model
Demonstration
How to generalize protein secondary structure prediction as a machine learning problem?
• Use a sliding window to move along the amino acid sequence
• Each window denotes an instance
• Each amino acid inside the window denotes an attribute
• The known secondary structure of the central amino acid is the class label
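The sliding-window formulation above can be sketched as follows. The window width, the `-` padding character used so that residues near the ends also get a full window, and the toy sequence are all assumptions chosen for illustration.

```python
def sliding_windows(sequence: str, structure: str, window: int = 13, pad: str = "-"):
    """Turn a labelled sequence into (window, class label) instances.

    Each instance is the window of residues centred on one position; the class
    label is the known secondary structure state of that central residue.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    if window % 2 == 0:
        raise ValueError("window width must be odd so it has a central residue")
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [(padded[i:i + window], label) for i, label in enumerate(structure)]


# Toy example with a width-3 window; a = helix, t = other.
for attributes, label in sliding_windows("MKVLA", "ttaat", window=3):
    print(attributes, "->", label)
```

Each character of the window string is one attribute; before feeding a neural network these would still need a numeric encoding.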
How to generalize protein secondary structure prediction as a machine learning problem?
• A set of “examples” is generated from sequences with known secondary structures
• The examples form a training set
• Build a neural network classifier
• Apply the classifier to a sequence with unknown secondary structure
Introduction to Neural Network
What is an artificial Neural Network?
• An extremely simplified model of the brain
• Essentially a function approximator
  • Transforms inputs into outputs to the best of its ability
Introduction to Neural Network
Composed of many “neurons” that co-operate to perform the desired function
How do Neural Networks Work?
• A neuron (perceptron) is a single-layer NN
• The output of a neuron is a function of the weighted sum of the inputs plus a bias
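As a small illustration of the statement above, the sketch below computes a neuron's output as an activation applied to the weighted sum of the inputs plus a bias. The threshold activation and the hand-picked weights (which make the neuron act like a logical AND) are assumptions used purely for illustration.

```python
from typing import Callable, Sequence


def step(net: float) -> int:
    """Threshold activation: 1 when the net input is non-negative, else 0."""
    return 1 if net >= 0 else 0


def neuron_output(inputs: Sequence[float], weights: Sequence[float], bias: float,
                  activation: Callable[[float], float] = step) -> float:
    """Apply the activation to (weighted sum of inputs + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have equal length")
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)


# Hand-picked (assumed) weights: the neuron fires only when both inputs are 1.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron_output(x, weights=(1.0, 1.0), bias=-1.5))
```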
Activation Function
Binary activation function
• f(x) = 1 if x >= 0
• f(x) = 0 otherwise
The most common sigmoid function used is the logistic function
• f(x) = 1/(1 + e^(-x))
• The calculation of derivatives is important for neural networks, and the logistic function has a very convenient derivative
• f'(x) = f(x)(1 - f(x))
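The derivative identity for the logistic function can be checked numerically. The sketch below compares f(x)(1 - f(x)) with a central finite difference; the helper names and the step size `h` are assumptions.

```python
import math


def logistic(x: float) -> float:
    """Logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_derivative(x: float) -> float:
    """Closed-form derivative f'(x) = f(x) * (1 - f(x))."""
    fx = logistic(x)
    return fx * (1.0 - fx)


def numeric_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 1.5):
    print(x, logistic_derivative(x), numeric_derivative(logistic, x))
```

The closed form is what makes the logistic function convenient: the derivative is obtained from the already computed output without any extra exponentials.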
Where Do The Weights Come From?
The weights in a neural network are the most important factor in determining its function
Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function
Supervised Training
• Supplies the neural network with inputs and the desired outputs
• Response of the network to the inputs is measured
• The weights are modified to reduce the difference between the actual and desired outputs
Perceptron Example
Simplest neural network with the ability to learn
• Made up of only input neurons and output neurons
• Output neurons use a simple threshold activation function
• In basic form, can only solve linear (linearly separable) problems
• Limited applications
Perceptron Example
Perceptron weight updating
• If the output is not correct, the weights are adjusted according to the formula:
• w_new = w_old + η·(desired - output)·input, where η is the learning rate
Assume the given instance is {(1,0,1), 0}, i.e. input (1,0,1) with desired output 0
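The update rule can be traced on the instance {(1,0,1), 0}. The slide does not give starting weights or a learning rate, so the values below are assumptions chosen so that the first output is wrong; the bias is omitted for brevity.

```python
def predict(weights, inputs):
    """Threshold unit: output 1 if the weighted sum is non-negative, else 0."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= 0 else 0


def update(weights, inputs, desired, learning_rate):
    """One application of w_new = w_old + rate * (desired - output) * input."""
    output = predict(weights, inputs)
    return [w + learning_rate * (desired - output) * x
            for w, x in zip(weights, inputs)]


inputs, desired = (1, 0, 1), 0     # instance from the slide
weights = [0.5, 0.2, 0.8]          # assumed starting weights
learning_rate = 0.5                # assumed learning rate

history = [weights]
while predict(history[-1], inputs) != desired:
    history.append(update(history[-1], inputs, desired, learning_rate))

for step_number, w in enumerate(history):
    print(step_number, [round(v, 2) for v in w], "->", predict(w, inputs))
```

Only the weights attached to non-zero inputs change; the middle weight stays fixed because its input is 0.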
Multi-Layer Feedforward NN
An extension of the perceptron
• Multiple layers
  • The addition of one or more “hidden” layers in between the input and output layers
• Activation function is not simply a threshold
  • Usually a sigmoid function
• A general function approximator
  • Not limited to linear problems
Information flows in one direction
• The outputs of one layer act as inputs to the next layer
Multi-Layer Feedforward NN Example
XOR problem
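To see why a hidden layer helps with XOR, the sketch below wires up a 2-2-1 feedforward network by hand. The weights are hand-picked assumptions (one hidden unit acts as OR, the other as AND), and threshold units are used instead of sigmoids only to keep the arithmetic easy to follow.

```python
def step(net: float) -> int:
    """Threshold activation: 1 when the net input is non-negative, else 0."""
    return 1 if net >= 0 else 0


def xor_network(x1: int, x2: int) -> int:
    """Two hidden units feed one output unit; information flows forward only."""
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)    # fires if at least one input is 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only if both inputs are 1
    return step(1.0 * h_or - 1.0 * h_and - 0.5)  # OR but not AND


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

No single threshold unit can produce this truth table, because the two classes are not linearly separable in the input space.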
Back-propagation
Searches for weight values that minimize the total error of the network over the set of training examples
• Forward pass: compute the outputs of all units in the network, and the error of the output layer.
• Backward pass: the network error is used for updating the weights (credit assignment problem).
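The two passes can be sketched for a tiny 2-2-1 sigmoid network with squared error E = 0.5 (output - target)^2. The network size, the error function, the starting weights, and the learning rate are all assumptions chosen for illustration, not the lecture's configuration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(params, x):
    """Forward pass: hidden activations, then the single output."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output


def error(params, x, target):
    return 0.5 * (forward(params, x)[1] - target) ** 2


def backprop(params, x, target):
    """Backward pass: propagate the output error back to every weight."""
    w1, b1, w2, b2 = params
    hidden, output = forward(params, x)
    delta_out = (output - target) * output * (1.0 - output)
    delta_hid = [delta_out * w * h * (1.0 - h) for w, h in zip(w2, hidden)]
    grad_w1 = [[d * xi for xi in x] for d in delta_hid]
    grad_w2 = [delta_out * h for h in hidden]
    return grad_w1, delta_hid, grad_w2, delta_out


def gradient_step(params, grads, rate):
    """Move every weight a small step against its gradient."""
    w1, b1, w2, b2 = params
    g_w1, g_b1, g_w2, g_b2 = grads
    return ([[w - rate * g for w, g in zip(r, gr)] for r, gr in zip(w1, g_w1)],
            [b - rate * g for b, g in zip(b1, g_b1)],
            [w - rate * g for w, g in zip(w2, g_w2)],
            b2 - rate * g_b2)


params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)  # assumed
x, target = [1.0, 0.0], 1.0
before = error(params, x, target)
params_after = gradient_step(params, backprop(params, x, target), rate=0.1)
print(before, error(params_after, x, target))
```

Each weight's gradient is the error signal (delta) of the unit it feeds multiplied by the activation it carries, which is how credit for the output error is assigned to individual weights.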
NN for Protein Secondary Structure Prediction
Ab initio Prediction
Sampling the global conformation space
• Lattice models / Discrete-state models
• Molecular Dynamics
Picking native conformations with an energy function
• Solvation model: how protein interacts with water
• Pair interactions between amino acids
Lattice String Folding
HP model: main modeled force is hydrophobic attraction
• Amino acids are classified into two types: hydrophobic (H) or polar (P)
NP-hard in both 2-D square and 3-D cubic lattices
• Constant-factor approximation algorithms exist
• Not so relevant biologically
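A conformation in the 2-D HP model can be scored by counting hydrophobic contacts. The sketch below assumes the common convention of -1 for every pair of H residues that are lattice neighbours but not adjacent in the chain; the slide does not state the scoring, so treat this convention and the function name `hp_energy` as assumptions.

```python
def hp_energy(sequence: str, coords) -> int:
    """Energy of a 2-D square-lattice conformation in the HP model.

    sequence: string over {'H', 'P'}; coords: one (x, y) lattice point per residue.
    """
    if len(sequence) != len(coords):
        raise ValueError("need exactly one coordinate per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    index_at = {point: i for i, point in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


folded = [(0, 0), (1, 0), (1, 1), (0, 1)]      # chain bent into a square
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]    # straight chain
print(hp_energy("HPPH", folded), hp_energy("HPPH", extended))
```

Evaluating one conformation is cheap; the hardness result concerns the search over all self-avoiding walks for the lowest-energy one.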
Lattice String Folding
Energy Minimization
Many forces act on a protein
• Hydrophobic: the inside of the protein wants to avoid water
  • Hydrophobic molecules associate with each other in water as if the water molecules repelled them, much like oil/water separation.
• Packing: atoms can't be too close, nor too far away
  • van der Waals interactions
• Bond angle/length constraints
• Long distance, e.g.
  • Electrostatics & hydrogen bonds
  • Disulphide bonds
  • Salt bridges
Can calculate all of these forces, and minimize
• Intractable in general case, but can be useful
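Molecular Dynamics (MD)
In molecular dynamics simulation, we simulate the motions of atoms as a function of time according to Newton's equation of motion. The equations for a system consisting of N atoms can be written as

$$m_i \frac{d^2 \mathbf{r}_i(t)}{dt^2} = \mathbf{F}_i(t), \qquad i = 1, 2, \ldots, N. \tag{1}$$

Here, $\mathbf{r}_i$ and $m_i$ represent the position and mass of atom $i$, and $\mathbf{F}_i(t)$ is the force on atom $i$ at time $t$. $\mathbf{F}_i(t)$ is given by

$$\mathbf{F}_i = -\nabla_i V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \tag{2}$$

where $V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$ is the potential energy of the system, which depends on the positions of the N atoms in the system. $\nabla_i$ is

$$\nabla_i = \frac{\partial}{\partial x_i}\,\mathbf{i} + \frac{\partial}{\partial y_i}\,\mathbf{j} + \frac{\partial}{\partial z_i}\,\mathbf{k}. \tag{3}$$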
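Newton's equations are integrated numerically in small time steps. The sketch below uses the velocity Verlet scheme, a common choice that the slides do not name, on a one-dimensional harmonic potential V = 0.5 k x^2 so that the force -dV/dx = -k x is easy to check. The function names, the potential, and the step size are assumptions.

```python
def harmonic_force(positions, k=1.0):
    """Force from V = 0.5 * k * x^2 for each coordinate: F = -dV/dx = -k * x."""
    return [-k * x for x in positions]


def velocity_verlet(positions, velocities, masses, force_fn, dt, n_steps):
    """Integrate m * d2r/dt2 = F(r) with the velocity Verlet scheme."""
    r, v = list(positions), list(velocities)
    f = force_fn(r)
    for _ in range(n_steps):
        # Half-step velocity update with the current forces.
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]
        # Full-step position update.
        r = [ri + dt * vi for ri, vi in zip(r, v)]
        # Recompute forces at the new positions, then finish the velocity update.
        f = force_fn(r)
        v = [vi + 0.5 * dt * fi / mi for vi, fi, mi in zip(v, f, masses)]
    return r, v


# One particle, unit mass and spring constant, released from x = 1 at rest.
r, v = velocity_verlet([1.0], [0.0], [1.0], harmonic_force, dt=0.001, n_steps=6283)
total_energy = 0.5 * v[0] ** 2 + 0.5 * r[0] ** 2
print(r[0], v[0], total_energy)
```

After roughly one period (2π time units) the particle is back near its starting point and the total energy stays close to its initial value of 0.5, which is a basic sanity check for an integrator.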
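Energy Functions used in Molecular Simulation

$$V_{total} = \sum_{bonds} K_b (r - r_0)^2 + \sum_{angles} K_\theta (\theta - \theta_0)^2 + \sum_{dihedrals} K_\phi \left[ 1 + \cos(n\phi) \right] + \sum_{H\text{-}bonds} \left[ \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right] + \sum_{van\ der\ Waals\ pairs\ i,j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right] + \sum_{electrostatic\ pairs\ i,j} \frac{q_i q_j}{\epsilon\, r_{ij}}$$

Terms, in order: bond stretching term, angle bending term, dihedral term, H-bonding term, van der Waals term, electrostatic term.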
The most time demanding part.
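The van der Waals and electrostatic terms are sums over pairs of atoms, so their cost grows roughly with the square of the number of atoms, which is why non-bonded terms typically dominate the running time. The sketch below evaluates both pair sums, assuming the common 12-6 and Coulomb forms with a single A, B and dielectric constant shared by all pairs; real force fields use per-pair parameters and distance cutoffs.

```python
import math
from itertools import combinations


def n_pairs(n_atoms: int) -> int:
    """Number of distinct atom pairs: N * (N - 1) / 2."""
    return n_atoms * (n_atoms - 1) // 2


def nonbonded_energy(coords, charges, a=1.0, b=1.0, dielectric=1.0):
    """Return (van der Waals, electrostatic) energies summed over all pairs.

    van der Waals pair term:  A / r^12 - B / r^6
    electrostatic pair term:  q_i * q_j / (dielectric * r)
    """
    vdw = 0.0
    elec = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        vdw += a / r ** 12 - b / r ** 6
        elec += charges[i] * charges[j] / (dielectric * r)
    return vdw, elec


# Two opposite unit charges 2 length units apart (made-up parameters).
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(nonbonded_energy(coords, charges=[1.0, -1.0], a=4096.0, b=64.0))
print(n_pairs(1000), "pairs for 1000 atoms")
```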
Homology-based Prediction
Align query sequence with sequences of known structure, usually >30% similar
Superimpose the aligned sequence onto the structure template, according to the computed sequence alignment
Perform local refinement of the resulting structure in 3D
90% of new structures submitted to PDB in the past three years have similar folds in PDB
The number of unique structural folds is small (possibly a few thousand)
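Template selection in the first step depends on how similar the query is to a sequence of known structure. One simple measure is percent identity over the aligned, non-gap positions, sketched below. The function names, the gap character, the example alignment, and the use of identity as the similarity measure are assumptions; other similarity definitions are also in use.

```python
def percent_identity(aligned_query: str, aligned_template: str, gap: str = "-") -> float:
    """Percent of identical residues among positions aligned without a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(q, t) for q, t in zip(aligned_query, aligned_template)
               if q != gap and t != gap]
    if not aligned:
        return 0.0
    matches = sum(q == t for q, t in aligned)
    return 100.0 * matches / len(aligned)


def usable_template(aligned_query: str, aligned_template: str,
                    threshold: float = 30.0) -> bool:
    """Apply the rule of thumb that a template should be more than ~30% similar."""
    return percent_identity(aligned_query, aligned_template) > threshold


# Made-up pairwise alignment with one gap in each sequence.
query = "ACDE-FGH"
template = "ACNEQF-H"
print(round(percent_identity(query, template), 1), usable_template(query, template))
```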
Homology-based Prediction
Raw model
Loop modeling
Side chain placement
Refinement
Homology-based Prediction