CAP5510 – Bioinformatics Protein Structures


Tamer Kahveci

CISE Department

University of Florida

What and Why?

• Proteins fold into a three-dimensional shape

• Structure can reveal functional information that we cannot find from sequence

• Misfolded proteins can cause diseases

– Sickle cell anemia, mad cow disease

• Used in drug design

Hemoglobin

Normal vs. sickled blood cells

E → V

HIV protease inhibitor

• Understand protein structures

– Primary, secondary, tertiary

• Learn how protein shapes are

– Determined

– Predicted

• Structure comparison (?)

A Protein Sequence

>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]

MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL

DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW

SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY

SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA

QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM

• Basic Amino Acid Structure:

– The side chain, R, varies for each of the 20 amino acids

Amino group

Carboxyl group

Side chain

Amino Acid Composition

The Peptide Bond

• Dehydration synthesis

• Repeating backbone: N–Cα–C–N–Cα–C

– Convention – start at amino terminus and proceed to carboxy terminus

Peptidyl polymers

• A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids.

• We call the units of a protein amino acid residues.

carbonyl carbon

amide nitrogen

Side chain properties

• Carbon does not make hydrogen bonds with water easily – hydrophobic

• O and N are generally more likely than C to h-bond to water – hydrophilic

• We group the amino acids into three general groups:

– Hydrophobic

– Charged (positive/basic & negative/acidic)

– Polar

The Hydrophobic Amino Acids

The Charged Amino Acids

The Polar Amino Acids

More Polar Amino Acids

And then there’s…

Phi (φ) – the angle of rotation about the N–Cα bond.

Psi (ψ) – the angle of rotation about the Cα–C bond.

The planar bond angles and bond lengths are fixed.

Planarity of the Peptide Bond

• Primary structure = the linear sequence of amino acids comprising a protein:

AGVGTVPMTAYGNDIQYYGQVT…

• Secondary structure

– Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the α-helix and the β-sheet

– The location and direction of these periodic, repeating structures is known as the secondary structure of the protein

Primary & Secondary Structure

The Alpha Helix

Properties of the Alpha Helix

60°

• Hydrogen bonds between C=O of residue n and NH of residue n+4

• 3.6 residues/turn

• 1.5 Å/residue rise

• 100°/residue turn
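The per-residue figures above determine the overall geometry of an ideal helix. The following is a minimal sketch of that arithmetic; the function names are illustrative and not part of the lecture, and a perfectly regular helix is assumed.

```python
# Constants taken from the slide; everything else is derived from them.
RESIDUES_PER_TURN = 3.6
RISE_PER_RESIDUE = 1.5  # in Angstrom


def helix_pitch():
    """Distance along the helix axis covered by one full turn (Angstrom)."""
    return RESIDUES_PER_TURN * RISE_PER_RESIDUE


def rotation_per_residue():
    """Angle swept around the axis by each residue (degrees)."""
    return 360.0 / RESIDUES_PER_TURN


def helix_length(n_residues):
    """Approximate axial length of an ideal helix of n_residues (Angstrom)."""
    return n_residues * RISE_PER_RESIDUE


print(round(helix_pitch(), 2))           # one turn rises about 5.4 Angstrom
print(round(rotation_per_residue(), 2))  # about 100 degrees per residue
print(helix_length(20))                  # a 20-residue helix spans about 30 Angstrom
```

The 100°/residue figure on the slide is simply 360° divided by 3.6 residues per turn.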

Properties of α-helices

• 4 – 40+ residues in length

• Often amphipathic or “dual-natured”

– Half hydrophobic and half hydrophilic

• If we examine many α-helices, we find trends…

– Helix formers: Ala, Glu, Leu, Met

– Helix breakers: Pro, Gly, Tyr, Ser

The beta strand (& sheet)

−135° +135°

Properties of beta sheets

• Formed of stretches of 5-10 residues in extended conformation

• Parallel/antiparallel, contiguous/non-contiguous

Anti-Parallel Beta Sheets

Parallel Beta Sheets

Mixed Beta Sheets

Turns and Loops

• Secondary structure elements are connected by regions of turns and loops

• Turns – short regions of non-α, non-β conformation

• Loops – larger stretches with no secondary structure.

– Sequences vary much more than secondary structure regions

Ramachandran Plot

Levels of Protein Structure

• Secondary structure elements combine to form tertiary structure

• Quaternary structure occurs in multienzyme complexes

Protein Structure Example

Beta Sheet

Helix Loop

ID: 12as, 2 chains

Wireframe Ball and stick

Views of a Protein

Views of a protein

Spacefill Cartoon

CPK colors

Carbon = green, black, or grey

Nitrogen = blue

Oxygen = red

Sulfur = yellow

Hydrogen = white

Common Protein Motifs

• Four helical bundle:

Globin domain:

Mostly Helical Folding Motifs

α/β barrel:

α/β Motifs

Open Twisted Beta Sheets

Beta Barrels

Determining the Structure of a Protein

Experimental Methods

• X-ray

• NMR

As of August 2013, the structures of more than 85,000 proteins have been determined

X-Ray Crystallography

Crystals diffract X-rays in regular patterns (Max von Laue, 1912)

Discovery of X-rays (Wilhelm Conrad Röntgen, 1895)

The first X-ray diffraction pattern from a protein crystal (Dorothy Hodgkin, 1934)

X-Ray Crystallography

• Grow millions of protein crystals

– Takes months

• Expose to radiation beam

• Analyze the image with computer

– Average over many copies of images

• PDB

• Not all proteins can be crystallized!

• Nuclear Magnetic Resonance

• Nuclei of atoms vibrate when exposed to an oscillating magnetic field

• Detect vibrations by external sensors

• Computes inter-atomic distances.

• Requires complex analysis. NMR can be used for short sequences (<200 residues)

• More than one model can be derived from NMR.

Determining the Structure of a Protein

Computational Methods

The Protein Folding Problem

• Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the secondary/tertiary/quaternary structure of the resulting protein be?”

• Input: AAVIKYGCAL…

• Output: φ1ψ1, φ2ψ2…

Structure vs. Sequence

• Observation: A protein with the same sequence (under the same circumstances) yields the same shape.

• Protein folds into a shape that minimizes the energy needed to stay in that shape.

• Protein folds in ~10-15 seconds.

Secondary Structure Prediction

Chou-Fasman methods

• Uses statistically obtained Chou-Fasman parameters.

• Each amino acid has

– P(a): alpha

– P(b): beta

– P(t): turn

– f(): additional turn parameter.

Chou-Fasman Parameters

Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142   83    66       0.06   0.076   0.035   0.058
Arginine       R      98    93    95       0.07   0.106   0.099   0.085
Aspartic Acid  D      101   54    146      0.147  0.11    0.179   0.081
Asparagine     N      67    89    156      0.161  0.083   0.191   0.091
Cysteine       C      70    119   119      0.149  0.05    0.117   0.128
Glutamic Acid  E      151   37    74       0.056  0.06    0.077   0.064
Glutamine      Q      111   110   98       0.074  0.098   0.037   0.098
Glycine        G      57    75    156      0.102  0.085   0.19    0.152
Histidine      H      100   87    95       0.14   0.047   0.093   0.054
Isoleucine     I      108   160   47       0.043  0.034   0.013   0.056
Leucine        L      121   130   59       0.061  0.025   0.036   0.07
Lysine         K      114   74    101      0.055  0.115   0.072   0.095
Methionine     M      145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  F      113   138   60       0.059  0.041   0.065   0.065
Proline        P      57    55    152      0.102  0.301   0.034   0.068
Serine         S      77    75    143      0.12   0.139   0.125   0.106
Threonine      T      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     W      108   137   96       0.077  0.013   0.064   0.167
Tyrosine       Y      69    147   114      0.082  0.065   0.114   0.125
Valine         V      106   170   50       0.062  0.048   0.028   0.053

C.-F. Alpha Helix Prediction (1)

• Find P(a) for all letters

• Find 6 contiguous letters, at least 4 of which have P(a) > 100

• Declare these regions as alpha helix

Sequence: A E A T T L C M Q S T Y C Y V

P(a): 142 151 142 83 83 121 70 145 111 77 83 69 70 69 106

P(b): 83 37 83 119 119 130 119 105 110 75 119 147 119 147 170

• Extend in both directions until 4 consecutive letters with P(a) < 100 are found


• Find sum of P(a) (Sa) and sum of P(b) (Sb) in the extended region

– If the region is long enough (>= 5 letters) and Sa > Sb, then declare the extended region as alpha helix
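The three steps above (nucleation, extension, acceptance) can be sketched in a few lines. This is a simplified illustration of the rules as stated on the slides, not a reference Chou-Fasman implementation. The function names are hypothetical, regions are half-open index pairs, and one detail is an assumption: near the ends of the sequence, where fewer than 4 residues remain, extension simply runs to the end.

```python
AMINO_ACIDS = "ARDNCEQGHILKMFPSTWYV"
# P(a) and P(b) values copied from the Chou-Fasman parameter table.
P_ALPHA = dict(zip(AMINO_ACIDS, [142, 98, 101, 67, 70, 151, 111, 57, 100, 108,
                                 121, 114, 145, 113, 57, 77, 83, 108, 69, 106]))
P_BETA = dict(zip(AMINO_ACIDS, [83, 93, 54, 89, 119, 37, 110, 75, 87, 160,
                                130, 74, 105, 138, 55, 75, 119, 137, 147, 170]))


def find_nuclei(seq, table, window=6, min_hits=4, threshold=100):
    """Start indices of windows with at least min_hits residues above threshold."""
    return [i for i in range(len(seq) - window + 1)
            if sum(table[c] > threshold for c in seq[i:i + window]) >= min_hits]


def extend(seq, start, end, table, run=4, threshold=100):
    """Grow the half-open region [start, end) one residue at a time.

    Growth in a direction stops when the next `run` residues all score
    below threshold, or when the end of the sequence is reached (assumption).
    """
    while end < len(seq):
        ahead = seq[end:end + run]
        if len(ahead) == run and all(table[c] < threshold for c in ahead):
            break
        end += 1
    while start > 0:
        behind = seq[max(0, start - run):start]
        if len(behind) == run and all(table[c] < threshold for c in behind):
            break
        start -= 1
    return start, end


def predict_helices(seq):
    """Half-open (start, end) index pairs that pass the length and Sa > Sb test."""
    regions = []
    for i in find_nuclei(seq, P_ALPHA):
        start, end = extend(seq, i, i + 6, P_ALPHA)
        s_a = sum(P_ALPHA[c] for c in seq[start:end])
        s_b = sum(P_BETA[c] for c in seq[start:end])
        if end - start >= 5 and s_a > s_b and (start, end) not in regions:
            regions.append((start, end))
    return regions


print(predict_helices("AEATTLCMQSTYCYV"))  # prints [(0, 9)]
```

On the example sequence the only nucleus is AEATTL; extension to the right stops before STYC, giving the region AEATTLCMQ with Sa = 1048 and Sb = 905.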


C.-F. Beta Sheet Prediction

• Same as alpha helix prediction, with P(a) replaced by P(b)

• Resolving overlapping alpha helix & beta sheet

– Compute sum of P(a) (Sa) and sum of P(b) (Sb) in the overlap.

– If Sa > Sb => alpha helix

– If Sb > Sa => beta sheet
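The overlap rule is a direct comparison of two sums. Below is a minimal sketch under the assumption that predicted regions are half-open (start, end) index pairs; the function name and the "undecided" result for a tie are illustrative choices, since the slides do not say what happens when Sa equals Sb.

```python
AMINO_ACIDS = "ARDNCEQGHILKMFPSTWYV"
# P(a) and P(b) values copied from the Chou-Fasman parameter table.
P_ALPHA = dict(zip(AMINO_ACIDS, [142, 98, 101, 67, 70, 151, 111, 57, 100, 108,
                                 121, 114, 145, 113, 57, 77, 83, 108, 69, 106]))
P_BETA = dict(zip(AMINO_ACIDS, [83, 93, 54, 89, 119, 37, 110, 75, 87, 160,
                                130, 74, 105, 138, 55, 75, 119, 137, 147, 170]))


def resolve_overlap(seq, helix, strand):
    """Decide the label of the residues shared by a helix and a strand region.

    Returns None when the two half-open regions do not overlap.
    """
    lo = max(helix[0], strand[0])
    hi = min(helix[1], strand[1])
    if lo >= hi:
        return None
    s_a = sum(P_ALPHA[c] for c in seq[lo:hi])
    s_b = sum(P_BETA[c] for c in seq[lo:hi])
    if s_a > s_b:
        return "alpha helix"
    if s_b > s_a:
        return "beta sheet"
    return "undecided"  # tie: not specified on the slides


seq = "AEATTLCMQSTYCYV"
print(resolve_overlap(seq, (0, 9), (3, 15)))  # overlap TTLCMQ: Sa = 613, Sb = 702
print(resolve_overlap(seq, (0, 9), (0, 3)))   # overlap AEA: Sa = 435, Sb = 203
```

The strand regions used here are made up for the example; only the helix region (0, 9) follows from the alpha helix rules.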

C.-F. Turn Prediction

• An amino acid is predicted as turn if all of the following hold:

– f(i)*f(i+1)*f(i+2)*f(i+3) > 0.000075

– Avg(P(t)) > 100 over positions i+k, for k=0, 1, 2, 3

– Sum(P(t)) > Sum(P(a)) and Sum(P(t)) > Sum(P(b)) over positions i+k, for k=0, 1, 2, 3

P(a): 142 151 142 83 83 121 70 145 111 77 83 69 70 69 106

P(b): 83 37 83 119 119 130 119 105 110 75 119 147 119 147 170

P(t): 66 74 66 96 96 59 119 60 98 143 96 114 119 114 50

i i+1 i+2 i+3
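The three turn conditions can be checked window by window. The sketch below follows the rules as listed, with the parameter values copied from the table; the function name `is_turn` and the 0-based indexing are assumptions made for the example.

```python
AMINO_ACIDS = "ARDNCEQGHILKMFPSTWYV"
P_ALPHA = dict(zip(AMINO_ACIDS, [142, 98, 101, 67, 70, 151, 111, 57, 100, 108,
                                 121, 114, 145, 113, 57, 77, 83, 108, 69, 106]))
P_BETA = dict(zip(AMINO_ACIDS, [83, 93, 54, 89, 119, 37, 110, 75, 87, 160,
                                130, 74, 105, 138, 55, 75, 119, 137, 147, 170]))
P_TURN = dict(zip(AMINO_ACIDS, [66, 95, 146, 156, 119, 74, 98, 156, 95, 47,
                                59, 101, 60, 60, 152, 143, 96, 96, 114, 50]))
# Positional turn parameters: (f(i), f(i+1), f(i+2), f(i+3)) per amino acid.
F = {
    "A": (0.06, 0.076, 0.035, 0.058), "R": (0.07, 0.106, 0.099, 0.085),
    "D": (0.147, 0.11, 0.179, 0.081), "N": (0.161, 0.083, 0.191, 0.091),
    "C": (0.149, 0.05, 0.117, 0.128), "E": (0.056, 0.06, 0.077, 0.064),
    "Q": (0.074, 0.098, 0.037, 0.098), "G": (0.102, 0.085, 0.19, 0.152),
    "H": (0.14, 0.047, 0.093, 0.054), "I": (0.043, 0.034, 0.013, 0.056),
    "L": (0.061, 0.025, 0.036, 0.07), "K": (0.055, 0.115, 0.072, 0.095),
    "M": (0.068, 0.082, 0.014, 0.055), "F": (0.059, 0.041, 0.065, 0.065),
    "P": (0.102, 0.301, 0.034, 0.068), "S": (0.12, 0.139, 0.125, 0.106),
    "T": (0.086, 0.108, 0.065, 0.079), "W": (0.077, 0.013, 0.064, 0.167),
    "Y": (0.082, 0.065, 0.114, 0.125), "V": (0.062, 0.048, 0.028, 0.053),
}


def is_turn(seq, i):
    """Apply the three turn conditions to the window seq[i:i+4] (0-based i)."""
    window = seq[i:i + 4]
    if len(window) < 4:
        return False
    bend = 1.0
    for k, residue in enumerate(window):
        bend *= F[residue][k]  # residue at offset k contributes its f(i+k) value
    s_t = sum(P_TURN[c] for c in window)
    s_a = sum(P_ALPHA[c] for c in window)
    s_b = sum(P_BETA[c] for c in window)
    return bend > 0.000075 and s_t / 4 > 100 and s_t > s_a and s_t > s_b


seq = "AEATTLCMQSTYCYV"
print(is_turn(seq, 9))  # window STYC: True
print(is_turn(seq, 0))  # window AEAT: False
```

For the window STYC the product of the f values is about 0.000189, the average P(t) is 118, and Sum(P(t)) = 472 exceeds both Sum(P(a)) = 299 and Sum(P(b)) = 460.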

Other Methods for SSE Prediction

• Similarity searching

– Predator

• Markov chain

• Neural networks

– PHD

• ~65% to 80% accuracy

Tertiary Structure Prediction

Forces driving protein folding

• It is believed that hydrophobic collapse is a key driving force for protein folding

– Hydrophobic core

– Polar surface interacting with solvent

• Minimum volume (no cavities)

• Disulfide bond formation stabilizes

• Hydrogen bonds

• Polar and electrostatic interactions

• Simple lattice models (HP-models or Hydrophobic-Polar models)

– Two types of residues: hydrophobic and polar

– 2-D or 3-D lattice

– The only force is hydrophobic collapse

– Score = number of HH contacts

Fold Optimization

Scoring Lattice Models

• H/P model scoring: count noncovalent hydrophobic interactions.

• Sometimes:

– Penalize for buried polar or surface hydrophobic residues
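The basic H/P score can be computed directly from lattice coordinates. This is a minimal sketch for a 2-D square lattice, assuming the fold is given as a list of (x, y) positions in chain order; the function name is illustrative, and the optional penalties mentioned above are not included.

```python
def hp_score(sequence, coords):
    """Count H-H contacts: hydrophobic residues on neighbouring lattice
    sites that are not consecutive in the chain (noncovalent contacts)."""
    position = {site: i for i, site in enumerate(coords)}
    if len(position) != len(coords):
        raise ValueError("fold is not self-avoiding (two residues share a site)")
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                score += 1
    return score


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hp_score("HPPH", square))  # residues 0 and 3 touch: 1
print(hp_score("HPPH", line))    # no contacts in the extended chain: 0
```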

• For smaller polypeptides, exhaustive search can be used

– Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process

• For larger chains, other optimization and search methods must be used

– Greedy, branch and bound

– Evolutionary computing, simulated annealing
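For very short chains the exhaustive search mentioned above is easy to write down. The sketch below enumerates every absolute-direction string on a 2-D square lattice, discards self-intersecting walks, and keeps the fold with the most HH contacts. All names are illustrative, and no symmetry reduction or pruning is attempted, so the cost grows as 4^(n-1).

```python
from itertools import product

MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def walk(moves):
    """Lattice sites visited by an absolute-direction string, or None on a bump."""
    x, y = 0, 0
    coords = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        if (x, y) in coords:
            return None
        coords.append((x, y))
    return coords


def hh_contacts(sequence, coords):
    """Number of H-H pairs on neighbouring sites that are not chain neighbours."""
    position = {site: i for i, site in enumerate(coords)}
    score = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for neighbour in ((x + 1, y), (x, y + 1)):  # count each pair once
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                score += 1
    return score


def best_fold(sequence):
    """Return (score, moves) for a highest-scoring self-avoiding fold."""
    best_score, best_moves = -1, None
    for moves in product("UDLR", repeat=len(sequence) - 1):
        coords = walk(moves)
        if coords is None:
            continue
        score = hh_contacts(sequence, coords)
        if score > best_score:
            best_score, best_moves = score, "".join(moves)
    return best_score, best_moves


score, moves = best_fold("HPPH")
print(score, moves)
```

Even at 15 residues this already means checking 4^14 (about 2.7 × 10^8) direction strings, which is why the slides turn to heuristic search for larger chains.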

Can we use lattice models?

The “hydrophobic zipper” effect

Ken Dill ~ 1997

Representing a lattice model

• Absolute directions

– UURRDLDRRU

• Relative directions

– LFRFRRLLFFL

– Advantage: immediate reversals (UD or RL in absolute directions) cannot occur

– Only three directions: LRF

• What about bumps? LFRRR

– Bad score

– Use a better representation
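A relative-direction string can be decoded into lattice coordinates by tracking the current heading. The sketch below assumes a 2-D square lattice and fixes the first bond to point "up" from the origin, which is a convention chosen for this example rather than something stated on the slides; the function names are also illustrative.

```python
def turn(heading, move):
    """New heading after a relative move: L(eft), R(ight) or F(orward)."""
    dx, dy = heading
    if move == "L":
        return (-dy, dx)
    if move == "R":
        return (dy, -dx)
    if move == "F":
        return (dx, dy)
    raise ValueError(f"unknown move: {move!r}")


def decode_relative(moves, heading=(0, 1)):
    """Coordinates of the chain; the first bond is fixed along `heading`."""
    x, y = heading
    coords = [(0, 0), (x, y)]
    for move in moves:
        heading = turn(heading, move)
        x, y = x + heading[0], y + heading[1]
        coords.append((x, y))
    return coords


def has_bump(coords):
    """True if two residues occupy the same lattice site."""
    return len(set(coords)) != len(coords)


print(has_bump(decode_relative("LFRFRRLLFFL")))  # False: a valid fold
print(has_bump(decode_relative("LFRRR")))        # True: the chain runs into itself
```

In LFRRR the three right turns in a row close a unit square, so the last residue lands on a site that is already occupied.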

Preference-order representation

• Each position has two “preferences”

– If it can’t have either of the two, it will take the “least favorite” path if possible

• Example: {LR},{FL},{RL},{FR},{RL},{RL},{FL},{RF}

• Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
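One way to read the preference-order rule is as a small decoder: at each position try the first preference, then the second, then the remaining direction, and give up only if all three sites are occupied. The sketch below implements that reading on a 2-D square lattice. The first bond is assumed to point "up" from the origin, the names are hypothetical, and each preference pair is assumed to contain two different directions.

```python
def turn(heading, move):
    """New heading after a relative move: L(eft), R(ight) or F(orward)."""
    dx, dy = heading
    if move == "L":
        return (-dy, dx)
    if move == "R":
        return (dy, -dx)
    return (dx, dy)  # "F"


def decode_preferences(prefs, heading=(0, 1)):
    """Decode preference pairs such as ["LR", "FL", ...] into coordinates.

    Returns None when a position has no free site left (a dead end).
    """
    x, y = heading
    coords = [(0, 0), (x, y)]
    occupied = set(coords)
    for first, second in prefs:
        least = ({"L", "F", "R"} - {first, second}).pop()  # least favorite
        for move in (first, second, least):
            new_heading = turn(heading, move)
            site = (x + new_heading[0], y + new_heading[1])
            if site not in occupied:
                heading = new_heading
                x, y = site
                coords.append(site)
                occupied.add(site)
                break
        else:
            return None  # all three directions are blocked
    return coords


ok = decode_preferences(["LR", "FL", "RL", "FR", "RL", "RL", "FL", "RF"])
stuck = decode_preferences(["LF", "FR", "RL", "FL", "RL", "FL", "RF", "RL", "FL"])
print(ok is not None, stuck is None)  # True True
```

Under these assumptions the first example decodes to a valid 10-residue fold, while the second walks into an enclosed site at its last position, where forward, left, and right are all occupied.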

More realistic models

• Higher resolution lattices (45° lattice, etc.)

• Off-lattice models

– Local moves

– Optimization/search methods and φ/ψ representations

• Greedy search

• Branch and bound

• EC, Monte Carlo, simulated annealing, etc.

How to Evaluate the Result?

• Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).

• Theoretical force field: G = G_(van der Waals) + G_(H-bonds) + G_(solvent) + G_(Coulomb)

• Empirical force fields

– Start with a database

– Look at neighboring residues – similar to known protein folds?

Comparative Modeling

1. Identify similar protein sequences from a database of known proteins (BLAST)

2. Find conserved regions by aligning these proteins (CLUSTAL-W)

3. Predict alpha helices and beta sheets from conserved regions, backbone

4. Predict loops

5. Predict side chain positions

6. Evaluate

Threading: Fold recognition

• Given:

– Sequence: IVACIVSTEYDVMKAAR…

– A database of molecular coordinates

• Map the sequence onto each fold

• Evaluate

– Objective 1: improve scoring function

– Objective 2: folding

Folding: still a hard problem

• Levinthal’s paradox

– Consider a 100 residue protein. If each residue can take only 3 positions, there are 3^100 ≈ 5 × 10^47 possible conformations.

– If it takes 10^-13 s to convert from one structure to another, exhaustive search would take 1.6 × 10^27 years.
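The numbers in Levinthal's paradox follow from simple arithmetic. This is a small sketch, assuming a year of 365.25 days; the function name and parameter names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def levinthal_years(n_residues=100, states_per_residue=3, step_seconds=1e-13):
    """Years needed to visit every conformation once, one step at a time."""
    conformations = states_per_residue ** n_residues
    return conformations * step_seconds / SECONDS_PER_YEAR


print(f"{3 ** 100:.2e} conformations")   # about 5.15e+47
print(f"{levinthal_years():.2e} years")  # about 1.63e+27
```

For comparison, the age of the universe is on the order of 10^10 years, so real proteins cannot be folding by unguided exhaustive search.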

Protein Classification

• Class: Similar secondary structure properties

– All alpha, all beta, alpha/beta, alpha+beta

• Fold: major secondary structure similarity.

– Globin like (6 helices, folded leaf, partly opened)

• Superfamily: distant homologs. 25-30% sequence identity.

• Family: close homologs. Evolved from the same ancestor. High identity.
