
Class 7: Protein Secondary Structure

Protein Structure

Amino-acid chains can fold to form 3-dimensional structures

Proteins are sequences that have a (more or less) stable 3-dimensional configuration

Why Is Structure Important?

The structure a protein takes is crucial for its function

  • Forms "pockets" that can recognize an enzyme substrate

  • Positions the side chains of specific groups so that they co-locate and form areas with the desired chemical/electrical properties

  • Creates firm structures such as collagen, keratins, fibroins

Determining Structure

X-ray and NMR methods make it possible to determine the structure of proteins and protein complexes

These methods are expensive and difficult

  • Processing a single protein can take several months of work

A centralized database (PDB) contains all solved protein structures

  • XYZ coordinates of atoms within a specified precision

  • ~11,000 solved structures

Structure is Sequence Dependent

Experiments show that for many proteins, the 3-dimensional structure is a function of the sequence

  • Force the protein to lose its structure by introducing agents that change the environment

  • After the protein is put back in water, the original conformation/activity is restored

However, for complex proteins, there are cellular processes that "help" in folding

Secondary Structure

α-helix

β-strands

α-Helix

Single protein chain

Shape maintained by intramolecular H bonding between -C=O and H-N-

Hydrogen Bonds in α-Helices

β-Strands Form Sheets

Parallel

Anti-parallel

These sheets are held together by hydrogen bonds across strands


Angular Coordinates

Secondary structures force specific angles between residues

Ramachandran Plot

We can relate angles to types of structures


Define "secondary structure"

3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE

DSSP EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT

DSSP = Database of Secondary Structure in Proteins

DSSP symbols

H = helix: backbone angles (-50,-60) and H-bonding pattern (i -> i+4)

E = extended strand: backbone angles (-120,+120) with beta-sheet H-bonds (parallel/anti-parallel are not distinguished)

B = beta-bridge (isolated backbone H-bonds)

S = bend (a region of high backbone curvature)

T = beta-turn (specific sets of angles and one i -> i+3 H-bond)

G = 3-10 helix or turn (i, i+3 H-bonds)

I = Pi-helix (i, i+5 H-bonds) (rare!)

_ = unclassified (written L in 3-state predictions). None of the above: generic loop, or beta-strand with no regular H-bonding.
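The 3-state predictions discussed later need the DSSP symbols collapsed to three classes. The slides do not say which reduction was used, so the mapping below is an assumption (a commonly used convention: all helix types to H, strand and bridge to E, everything else to L). A minimal sketch:

```python
# Assumed reduction from DSSP symbols to three states (not given in the slides).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",  # alpha, 3-10 and pi helices
    "E": "E", "B": "E",            # extended strand and isolated bridge
}


def reduce_to_three_states(dssp: str) -> str:
    """Map every DSSP symbol to H, E or L; unlisted symbols become L."""
    return "".join(DSSP_TO_3STATE.get(symbol, "L") for symbol in dssp)


print(reduce_to_three_states("EEEE_SS_EEEE_GGT"))
```

Other reductions exist (for example, treating G as loop), and the choice changes the reported accuracy slightly.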

Prediction of Secondary Structure

Input: amino-acid sequence

Output: annotation sequence of three classes:

  • alpha

  • beta

  • other (sometimes called coil/turn)

Measure of success: percentage of residues that were correctly labeled

Accuracy of 3-state predictions

Q3-score = % of 3-state symbols that are correct

Measured on a "test set"

Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested.

Best methods:

PHD (Burkhard Rost) -- 72-74% Q3

Psi-pred (David T. Jones) -- 76-78% Q3

True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL
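The Q3 score is a per-residue match count. The sketch below assumes both strings are already in the same 3-state alphabet (H, E, L) and have equal length; the two short strings are made-up illustrations, not data from the slides.

```python
def q3_score(true_ss: str, predicted_ss: str) -> float:
    """Percentage of positions where the predicted state equals the true state."""
    if len(true_ss) != len(predicted_ss):
        raise ValueError("sequences must have the same length")
    if not true_ss:
        raise ValueError("sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_ss, predicted_ss))
    return 100.0 * correct / len(true_ss)


# Hypothetical 12-residue example: 10 of 12 positions agree.
true_ss = "EEEELLLLHHHH"
pred_ss = "EEELLLLLHHHL"
print(f"Q3 = {q3_score(true_ss, pred_ss):.1f}%")
```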


What can you do with a secondary structure prediction?

(1) Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand.

(2) Find out whether a helix or strand is extended/shortened in the homolog.

(3) Model a large insertion or terminal domain

(4) Aid tertiary structure prediction

Statistical Methods

From the PDB database, calculate the propensity for a given amino acid to adopt a certain ss-type

$$P_\alpha^i = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500

p(α,aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000

P = 500 / (4,000/10) = 1.25

Used in Chou-Fasman algorithm (1974)
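The propensity is the joint frequency of (structure type, amino acid) divided by the product of the two marginal frequencies. The sketch below reproduces the alanine example from the counts on the slide; the function and parameter names are illustrative.

```python
def propensity(n_aa_in_state: int, n_aa: int, n_state: int, n_total: int) -> float:
    """Propensity = p(state, aa) / (p(state) * p(aa)), estimated from raw counts."""
    p_joint = n_aa_in_state / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_state * p_aa)


# Counts from the slide: Ala in helix = 500, Ala = 2,000, helix = 4,000, residues = 20,000.
print(round(propensity(500, 2_000, 4_000, 20_000), 2))
```

A value above 1 means the amino acid occurs in that structure type more often than chance would predict.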

Chou-Fasman: Initiation

        T    S    P    T    A    E    L    M    R    S    T    G
P(H)   69   77   57   69  142  151  121  145   98   77   69   57

Identify regions where 4 out of 6 consecutive residues have P(H) > 1.00 (the table lists P(H) × 100): an "alpha-helix nucleus"
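The initiation rule can be sketched as a sliding-window scan. This is a minimal illustration using the P(H) values from the slide (scaled by 100, so the threshold is 100); the function name and the choice to report window start indices are assumptions.

```python
SEQUENCE = "TSPTAELMRSTG"
P_HELIX = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]  # slide values, x100


def find_helix_nuclei(p_helix, window=6, min_formers=4, threshold=100):
    """Return start indices of windows where at least min_formers values exceed threshold."""
    nuclei = []
    for start in range(len(p_helix) - window + 1):
        formers = sum(1 for p in p_helix[start:start + window] if p > threshold)
        if formers >= min_formers:
            nuclei.append(start)
    return nuclei


for start in find_helix_nuclei(P_HELIX):
    print(start, SEQUENCE[start:start + 6])
```

Overlapping windows that qualify describe the same nucleus and would normally be merged before propagation.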

Chou-Fasman: Propagation

        T    S    P    T    A    E    L    M    R    S    T    G
P(H)   69   77   57   69  142  151  121  145   98   77   69   57

Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.
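A minimal sketch of the propagation step, starting from one nucleus on the slide's sequence. The exact placement of the four-residue window is an assumption here: the sketch tests the four residues ending at (or starting at) the candidate residue and stops when their average falls below the threshold.

```python
P_HELIX = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]  # slide values, x100


def extend_helix(p_helix, start, end, window=4, threshold=100):
    """Grow the inclusive segment [start, end] while the edge window averages >= threshold."""
    n = len(p_helix)
    # Extend to the right: test the four residues ending at the candidate end + 1.
    while end + 1 < n:
        edge = p_helix[max(0, end + 2 - window):end + 2]
        if sum(edge) / len(edge) < threshold:
            break
        end += 1
    # Extend to the left: test the four residues starting at the candidate start - 1.
    while start - 1 >= 0:
        edge = p_helix[start - 1:start - 1 + window]
        if sum(edge) / len(edge) < threshold:
            break
        start -= 1
    return start, end


# Nucleus assumed to be the six residues at indices 2..7 (P T A E L M).
print(extend_helix(P_HELIX, 2, 7))
```

With these values the segment grows two residues to the right and not at all to the left.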

Chou-Fasman Prediction

Predict as α-helix a segment with

  • E[P_α] > 1.03

  • E[P_α] > E[P_β]

  • Not including proline

Predict as β-strand a segment with

  • E[P_β] > 1.05

  • E[P_β] > E[P_α]

Others are labeled as turns.

(Various extensions appear in the literature)

Achieved accuracy: around 50%

Shortcoming of this method: it ignores the sequence context when predicting from individual amino acids

We would like to use the sequence context as an input to a classifier

There are many ways to address this. The most successful to date are based on neural networks


A Neuron

Artificial Neuron

[Figure: inputs a_1, a_2, ..., a_k, weighted by W_1, W_2, ..., W_k, feed a single unit whose output is $f\left(b + \sum_i W_i a_i\right)$]

• A neuron is a multiple-input -> single-output unit

• W_i = weights assigned to inputs; b = internal "bias"

• f = output function (linear, sigmoid)
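The neuron computes an output function applied to the bias plus the weighted sum of its inputs. A minimal sketch, with made-up weights and inputs chosen only for illustration:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic output function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias, f=sigmoid):
    """Return f(b + sum_i W_i * a_i) for one neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return f(bias + sum(w * a for w, a in zip(weights, inputs)))


# Hypothetical values: the weighted sum cancels the bias, so the sigmoid sees 0.
print(neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], bias=-0.7))
```

Passing `f=lambda x: x` gives the linear output function mentioned on the slide.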

Artificial Neural Network

[Figure: input units a_1, a_2, a_3 connect through weights W_1, W_2, ..., W_k to a hidden layer, which connects to output units o_1, ..., o_m]

• Neurons in hidden layers compute "features" from outputs of previous layers

• Output neurons can be interpreted as a classifier

Example: Fruit Classifier

         Shape    Texture  Weight  Color
Apple    ellipse  hard     heavy   red
Orange   round    soft     light   yellow

Qian-Sejnowski Architecture

[Figure: the input layer encodes the window S_{i-w}, ..., S_i, ..., S_{i+w}; it feeds a hidden layer, which feeds the output units o_α, o_β, o_o]

Each unit applies the logistic function $l(x) = \frac{1}{1 + e^{-x}}$ to a bias plus a weighted sum of its inputs; the predicted label for residue i is $\arg\max_s o_s$.

Neural Network Prediction

A neural network defines a function from inputs to outputs

Inputs can be discrete or continuous valued

In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it

Structure element determined by max(o_α, o_β, o_o)
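The mapping from a window of 2w+1 residues to a label can be sketched as a one-hot encoding followed by two logistic layers and an argmax. Everything below is illustrative: the weights are random and untrained, the hidden-layer size is arbitrary, and positions outside the chain are encoded as all zeros, which is an assumption rather than the original network's convention.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LABELS = ["alpha", "beta", "other"]


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(sequence, i, w):
    """One-hot encode residues i-w .. i+w; positions outside the chain stay all zero."""
    features = []
    for pos in range(i - w, i + w + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1.0
        features.extend(one_hot)
    return features


def layer(inputs, weights, biases):
    """Apply one fully connected logistic layer."""
    return [logistic(b + sum(wt * a for wt, a in zip(row, inputs)))
            for row, b in zip(weights, biases)]


def predict_label(sequence, i, w, hidden_w, hidden_b, out_w, out_b):
    """Label residue i with the class whose output unit is largest."""
    hidden = layer(encode_window(sequence, i, w), hidden_w, hidden_b)
    outputs = layer(hidden, out_w, out_b)
    return LABELS[outputs.index(max(outputs))]


# Untrained random weights, only to show the shapes involved.
rng = random.Random(0)
w, n_hidden = 6, 5
n_inputs = (2 * w + 1) * len(AMINO_ACIDS)
hidden_w = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_b = [0.0] * n_hidden
out_w = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in LABELS]
out_b = [0.0] * len(LABELS)
print(predict_label("TSPTAELMRSTG", 5, w, hidden_w, hidden_b, out_w, out_b))
```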

Training Neural Networks

By modifying the network weights, we change the function

Training is performed by

  • Defining an error score for training pairs <input,output>

  • Performing gradient-descent minimization of the error score

The back-propagation algorithm makes it possible to compute the gradient efficiently

We have to be careful not to overfit the training data
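As a toy illustration of gradient-descent training, the sketch below fits a single logistic neuron to the logical OR function using a squared-error score and per-example updates. The task, learning rate, and epoch count are arbitrary choices for the example; for one neuron, back-propagation reduces to a single application of the chain rule.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_neuron(examples, n_inputs, learning_rate=0.5, epochs=2000):
    """Minimize 0.5 * (output - target)^2 by per-example gradient descent."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            output = logistic(bias + sum(w * a for w, a in zip(weights, inputs)))
            # Chain rule: d(error)/d(pre-activation) for a logistic unit.
            delta = (output - target) * output * (1.0 - output)
            weights = [w - learning_rate * delta * a for w, a in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias


# Toy training pairs <input, output> for logical OR.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(examples, n_inputs=2)
predictions = [round(logistic(bias + sum(w * a for w, a in zip(weights, inputs))))
               for inputs, _ in examples]
print(predictions)
```

Guarding against overfitting would additionally require monitoring the error on data held out from training.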

Smoothing Outputs

The Qian-Sejnowski network assigns each residue a secondary structure by taking max(o_α, o_β, o_o)

Some sequences of secondary structure are impossible

To smooth the output of the network, another layer is applied on top of the three output units for each residue:

  • Neural network

  • Markov model

Success Rate

Variants of the neural network architecture and other methods achieved accuracy of about 65% on unseen proteins

  • Depending on the exact choice of training/test sets

Breaking the 70% Threshold

An innovation that made a crucial difference uses evolutionary information to improve prediction

Key idea: structure is preserved more than sequence

  • Surviving mutations are not random

  • Suppose we find homologues (same structure) of the query sequence

  • The type of replacements at position i during evolution provides us with information about the use of residue i in the secondary structure

Nearest Neighbor Approach

Select a window around the target residue

Perform local alignment to sequences with known structure

  • Choice of alignment weight matrix to match remote homologies

  • Alignment weight takes into account the secondary structure of the aligned sequence

Use max(n_a, n_b, n_c) or max(s_a, s_b, s_c)

Key: scoring measure of evolutionary similarity

PHD Approach

Multi-step procedure:

  • Perform a BLAST search to find local alignments

  • Remove alignments that are "too close"

  • Perform a multiple alignment of the sequences

  • Construct a profile (PSSM) of amino-acid frequencies at each residue

  • Use this profile as input to the neural network

  • A second network performs "smoothing"
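The profile-construction step can be sketched as counting amino acids column by column in a multiple alignment. The three aligned sequences below are invented for illustration, gaps are simply ignored, and no sequence weighting or pseudocounts are applied, so this is a simplification of what a real profile builder does.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment):
    """Return one row per alignment column with the relative frequency of each amino acid."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)  # skips gaps
        total = sum(counts.values())
        profile.append([counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS])
    return profile


# Hypothetical alignment of a query (first row) with two homologues; '-' is a gap.
alignment = ["TSPTAELM", "TSPSAELL", "TAP-AEIM"]
profile = build_profile(alignment)
print(profile[3][AMINO_ACIDS.index("T")], profile[3][AMINO_ACIDS.index("S")])
```

Each row has 20 values, so a window of such rows replaces the one-hot window used by a single-sequence network.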


PHD Architecture

Psi-pred: same idea

(Step 1) Run PSI-Blast --> output sequence profile

(Step 2) 15-residue sliding window = 315 input values (15 positions × 21 values), multiplied by hidden weights in the 1st neural net. Output is 3 values (1 for each state H, E or L) per position.

(Step 3) 45 input values (15 positions × 3 states), multiplied by weights in the 2nd neural network, summed. Output is the final 3-state prediction.

Performs slightly better than PHD
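The input sizes in steps 2 and 3 follow from the window arithmetic: 15 positions × 21 values = 315, and 15 positions × 3 states = 45. The sketch below builds the 315-value input for one residue; treating the 21st value as an "outside the chain" flag is an assumption made for this example, and the flat profile is placeholder data.

```python
WINDOW = 15
N_PROFILE_VALUES = 20  # one value per amino acid
N_VALUES = N_PROFILE_VALUES + 1  # assumed extra flag marking positions outside the chain
SECOND_STAGE_INPUTS = WINDOW * 3  # three first-stage outputs per window position


def window_inputs(profile, i):
    """Concatenate 21 values for each of the 15 positions centred on residue i."""
    half = WINDOW // 2
    features = []
    for pos in range(i - half, i + half + 1):
        if 0 <= pos < len(profile):
            features.extend(list(profile[pos]) + [0.0])
        else:
            features.extend([0.0] * N_PROFILE_VALUES + [1.0])
    return features


# Placeholder profile: 30 residues, uniform frequencies.
profile = [[0.05] * N_PROFILE_VALUES for _ in range(30)]
print(len(window_inputs(profile, 0)), SECOND_STAGE_INPUTS)
```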

Other Classification Methods

Neural networks were used as the classifier in the described methods.

We can apply the same idea with other classifiers, e.g. SVM

Advantages:

  • Effectively avoids overfitting

  • Supplies prediction confidence

SVM based approach

Suggested by S. Hua and Z. Sun (2001).

Multiple sequence alignment from the HSSP database (same as PHD)

Sliding window of w -> 21·w input dimension

Apply SVM with RBF kernel

Multiclass problem:

  • Training: one-against-others (e.g. H/~H, E/~E, L/~L), binary (e.g. H/E)

  • Maximum output score

  • Decision tree method

  • Jury decision method

Decision tree

Tree 1: H / ~H? Yes -> H. No -> E / C? Yes -> E, No -> C.

Tree 2: E / ~E? Yes -> E. No -> C / H? Yes -> C, No -> H.

Tree 3: C / ~C? Yes -> C. No -> H / E? Yes -> H, No -> E.
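Two of the combination schemes can be sketched directly from binary classifier outputs. The scores below are invented, and the sign convention (a positive score means the first-named class) is an assumption about how the binary classifiers report their decisions.

```python
# Second-level question asked by each tree when the root answers "No".
TREE_SECOND = {"H": ("E", "C"), "E": ("C", "H"), "C": ("H", "E")}


def predict_max(one_vs_rest):
    """Pick the class whose one-against-others classifier gives the largest score."""
    return max(one_vs_rest, key=one_vs_rest.get)


def predict_tree(root, one_vs_rest, pairwise):
    """Ask root/~root first; if the answer is No, ask the pairwise classifier."""
    if one_vs_rest[root] > 0:
        return root
    first, second = TREE_SECOND[root]
    return first if pairwise[(first, second)] > 0 else second


# Hypothetical classifier outputs for one residue.
one_vs_rest = {"H": -0.4, "E": 0.7, "C": -0.1}
pairwise = {("E", "C"): 0.3, ("C", "H"): 0.2, ("H", "E"): -0.5}

print(predict_max(one_vs_rest))
print([predict_tree(root, one_vs_rest, pairwise) for root in "HEC"])
```

When the trees disagree, a vote or jury over their answers is one way to settle on a single label.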

Accuracy on CB513 set

Classifier   Q3     QH     QE     QC     SOV
Max          72.9   74.8   58.6   79.0   75.4
Tree1        68.9   73.5   54.0   73.1   72.1
Tree2        68.2   72.0   61.0   69.0   71.4
Tree3        67.5   69.5   46.6   77.0   70.8
NN           72.0   74.7   57.7   77.4   75.0
Vote         70.7   73.0   74.7   76.6   73.2
Jury         73.5   75.2   60.3   79.5   76.2

State of the Art

Both PHD and nearest neighbor get about 72%-74% accuracy

  • Both predicted well in CASP2 (1996)

PSI-Pred is slightly better (around 76%)

Recent trend: combining classification methods

  • Best predictions in CASP3 (1998)

Failures:

  • Long-range effects: S-S bonds, parallel strands

  • Chemical patterns

  • Wrong prediction at the ends of helices/strands